Chapter 2: Tokenization and Embeddings

2.1 Byte Pair Encoding (BPE), WordPiece, SentencePiece

When we read, we see words. When a computer reads, it sees numbers. The bridge between the two is tokenization: the process of breaking human language into units (tokens) that a model can understand. This fundamental transformation is what enables machines to process and generate text that appears natural to humans.

At first glance, you might think: Why not just split text by spaces and call each word a token? In fact, early natural language processing systems did just that. But words are messy: languages have compounds, rare words, typos, and endless variations. A word-based tokenizer quickly breaks down when it encounters something it has never seen before, like "hyperparameterization". Additionally, many languages don't use spaces between words (like Chinese or Japanese), making word-based tokenization impractical for multilingual applications.

Modern LLMs use subword tokenization. Instead of treating entire words as indivisible units, they break words into smaller chunks that can be recombined. This allows the model to cover a vast vocabulary with fewer tokens, handle new or rare words gracefully, and support multiple languages efficiently. For example, a word like "unfriendliness" might be broken into "un", "friend", "li", "ness" - chunks that the model can recognize individually and then process together for meaning.

The evolution from word-based to subword tokenization has been crucial for scaling language models to billions of parameters while keeping vocabulary sizes manageable (typically between 30,000 to 100,000 tokens). Without subword tokenization, we'd need millions of tokens to cover all possible words in multiple languages, making models computationally infeasible.

In this section, we'll explore the three most widely used subword tokenization techniques: Byte Pair Encoding (BPE), WordPiece, and SentencePiece. These algorithms power almost every major LLM today, from GPT to BERT to LLaMA. Each technique has its own approach to the fundamental challenge of breaking text into meaningful, reusable pieces that balance efficiency and semantic coherence.

Tokenization is the fundamental process that transforms human language into data that machines can process. In this section, we'll explore the three dominant tokenization algorithms that power modern Language Models: Byte Pair Encoding (BPE), WordPiece, and SentencePiece. These algorithms represent different approaches to the same challenge: how to efficiently break text into meaningful units that balance vocabulary size, computational efficiency, and semantic understanding.

Each algorithm has unique strengths that make it suitable for different applications. BPE excels at efficiency and has become the backbone of OpenAI's GPT models. WordPiece, with its probability-based approach, powers Google's BERT family. SentencePiece addresses the challenges of multilingual models where word boundaries may not be clearly defined. Understanding these tokenization methods is crucial because they directly influence how models interpret and generate text, affecting everything from translation quality to the handling of rare words.

By the end of this section, you'll understand how these algorithms work, their practical implementations, and why choosing the right tokenization strategy is a critical design decision when building language models. The tokenization approach can significantly impact a model's performance across languages, domains, and specific tasks - making it an essential concept to master in NLP engineering.

2.1.1 Byte Pair Encoding (BPE)

BPE (Byte Pair Encoding) is one of the simplest yet most powerful tokenization algorithms in modern NLP. It was originally developed for data compression in the 1990s but has been adapted for NLP to great effect. BPE works by iteratively merging the most frequent pairs of characters or character sequences in a corpus, creating a vocabulary that efficiently represents common patterns in language. This iterative approach allows BPE to build a vocabulary that organically captures the statistical regularities of the text it's trained on, making it extremely adaptable to different languages and domains without requiring linguistic expertise.

The key insight behind BPE is that frequently occurring character combinations often represent meaningful linguistic units. For example, common prefixes like "un-" or suffixes like "-ing" appear in many words and can be treated as single tokens. This allows models to understand word structure even when encountering new words.

By breaking words into subword units, BPE strikes an elegant balance between character-level tokenization (which is too granular and loses word structure) and word-level tokenization (which can't handle out-of-vocabulary words). This middle ground approach gives models the ability to process rare, compound, or even misspelled words by decomposing them into familiar subword components.

Furthermore, BPE's data-driven approach means it adapts to the specific domain it's trained on - a BPE tokenizer trained on medical texts will develop different merges than one trained on social media content, reflecting the different vocabulary distributions in these domains.

How BPE Works (step by step):

Start with characters as the initial vocabulary (e.g., individual letters, punctuation). This creates the base layer of tokens from which more complex tokens will be built. For example, in English, you might start with the 26 letters of the alphabet, digits 0-9, and common punctuation marks.
Count how often each pair of adjacent symbols appears together across the entire training corpus. This frequency analysis is crucial as it identifies patterns that occur naturally in language. For instance, in English text, you might find that "th" appears very frequently, while "zq" almost never does.
Merge the most frequent pair into a new token, adding it to the vocabulary. This creates a new, longer token that represents a commonly occurring pattern. For example, if "th" is the most common pair, it becomes a single token, reducing the need to process "t" and "h" separately in words like "the," "this," and "that."
Update the corpus to reflect this merge, replacing all instances of the pair with the new token. This step is critical as it changes the frequency distribution of the remaining pairs. After merging "th," for example, new pairs like "the" might become more prominent in the frequency count.
Repeat steps 2-4 until you reach the desired vocabulary size or a minimum frequency threshold. Each iteration creates increasingly complex tokens that capture common patterns in language. This recursive process might eventually create tokens for common prefixes (like "un-"), suffixes ("-ing"), or even complete common words.

This iterative process gradually builds up a vocabulary that captures meaningful subword units at different granularities, from single characters to full words. The beauty of BPE lies in its data-driven approach – it doesn't require linguistic rules but instead learns patterns directly from the text. This makes it adaptable to any language or domain without manual intervention, while still creating tokens that often align with intuitive linguistic units like morphemes (the smallest meaningful units of language).

Example: Let's tokenize the word "lower" using BPE with a simplified approach to illustrate the process in depth.

BPE begins by breaking down text into its smallest units before building it back up. This iterative merging process creates increasingly complex tokens that represent common patterns in language.

Start: l o w e r (each character is a separate token)Initially, every character is treated as an individual token. This character-level representation gives us maximum flexibility but is inefficient for processing.
Merge most frequent pair (assuming "lo" appears often) → lo w e rThe algorithm identifies that "l" and "o" frequently appear together across the entire corpus. By merging them, we create our first subword unit, reducing token count from 5 to 4.
Merge again (assuming "low" appears often) → low e rIn the next iteration, BPE finds that "lo" and "w" commonly co-occur, forming "low" which is a meaningful semantic unit in English (a complete morpheme). Now we're down to 3 tokens.
Merge again (assuming "er" appears often) → low erFinally, "e" and "r" are merged because this suffix appears frequently across many English words (worker, faster, higher, etc.). We've now compressed the representation to just 2 tokens.
Final representation: low er (two tokens)What started as 5 separate characters has been compressed into 2 meaningful subword units, reflecting common patterns in the language. This compression maintains semantic meaning while dramatically reducing the token count.

The real power of BPE becomes evident when processing new words. For instance, after training, the model learns that "low" and "er" appear often in the training corpus, so it can efficiently handle words like "lowest" or "lowering" even if they never appeared during training:

"lowest" → low est (assuming "est" is a learned token)Here, the model recognizes the base morpheme "low" and the superlative suffix "est" as separate tokens, even though it may never have seen "lowest" during training. This demonstrates how BPE enables compositional understanding of language.
"lowering" → low er ing (assuming "ing" is a learned token)Similarly, a word like "lowering" gets broken down into three meaningful components: the root "low", the comparative suffix "er", and the gerund suffix "ing". Each piece carries semantic information that helps the model understand the full meaning.

This ability to decompose words into meaningful subunits gives BPE-based models remarkable flexibility, allowing them to process vocabulary far beyond what they explicitly saw during training. It's particularly valuable for morphologically rich languages (like Finnish or Turkish) where words can have many variations through suffixes and prefixes.

For example, in Finnish, a single word can express what might require an entire phrase in English. The word "taloissanikinko" (meaning "in my houses too?") would be nearly impossible to process with word-level tokenization unless that exact form appeared in training. But with BPE, it might be split into components like "talo" (house), "issa" (in), "ni" (my), "kin" (too), and "ko" (question marker), allowing the model to understand even this complex construction.

Code Example: Training a toy BPE tokenizer with Hugging Face

Here's a simple implementation:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=200, min_frequency=2)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on a small dataset
corpus = ["low", "lowest", "lower", "newest"]
tokenizer.train_from_iterator(corpus, trainer)

# Encode a word
output = tokenizer.encode("lowering")
print(output.tokens)  # Example output: ['low', 'er', 'ing']

Code breakdown:

1. Importing Libraries
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

This imports the necessary components from the Hugging Face tokenizers library to create, train, and use a BPE tokenizer.

2. Initializing the Tokenizer
tokenizer = Tokenizer(models.BPE())

This creates a new tokenizer using the Byte Pair Encoding (BPE) algorithm.

3. Configuring the Trainer
trainer = trainers.BpeTrainer(vocab_size=200, min_frequency=2)

This sets up a BPE trainer with two important parameters:

vocab_size=200: Limits the maximum vocabulary size to 200 tokens.
min_frequency=2: Only creates tokens from pairs that appear at least twice in the corpus.

4. Setting Pre-tokenization
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

Before BPE is applied, this configures the tokenizer to split text on whitespace, giving the BPE algorithm word-level contexts to work with.

5. Training Data
corpus = ["low", "lowest", "lower", "newest"]

This defines a small training corpus with four words that share some common patterns.

6. Training the Tokenizer
tokenizer.train_from_iterator(corpus, trainer)

This trains the tokenizer on the corpus using the configured BPE trainer.

7. Testing the Tokenizer
output = tokenizer.encode("lowering")

This encodes a new word ("lowering") that wasn't in the training corpus.

8. Displaying Results
print(output.tokens) # Example output: ['low', 'er', 'ing']

This prints the tokens that result from encoding "lowering". The example output shows how BPE might break this word into subword units: 'low', 'er', and 'ing'.

Key Points About This Implementation:

This is a minimal example that demonstrates the core BPE workflow: initialize, train, and encode.
Even with a tiny corpus, the tokenizer can handle unseen words by breaking them into meaningful subword components.
The expected output shows how "lowering" gets split into "low" (which was in the training data), plus common English suffixes "er" and "ing".

Enhanced implementation:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, decoders
import matplotlib.pyplot as plt
import pandas as pd

# Initialize a BPE tokenizer
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))

# Configure the trainer with more options
trainer = trainers.BpeTrainer(
    vocab_size=200,           # Maximum vocabulary size
    min_frequency=2,          # Minimum frequency to create a token
    special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"],  # Special tokens
    show_progress=True,       # Show progress during training
    initial_alphabet=None     # Use default initial alphabet
)

# Configure pre-tokenization (how text is split before BPE)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Configure decoder (how tokens are joined back into text)
tokenizer.decoder = decoders.WordPiece(prefix="##")

# Define a more diverse training corpus
corpus = [
    "low", "lowest", "lower", "lowering", "slowly", "follow", "hollow",
    "below", "fellowship", "yellow", "mellow", "pillow", "newest",
    "newer", "news", "newspaper", "newt", "newton", "newborn"
]

# Train the tokenizer
tokenizer.train_from_iterator(corpus, trainer)

# Print the vocabulary
print("Vocabulary:")
vocab = tokenizer.get_vocab()
sorted_vocab = sorted(vocab.items(), key=lambda x: x[1])
for token, id in sorted_vocab:
    print(f"Token: {token:15} ID: {id}")

# Encode example words
example_words = ["lowering", "lowered", "follower", "newlywed", "slowness"]
print("\nEncoding examples:")
for word in example_words:
    output = tokenizer.encode(word)
    print(f"{word:15} → {output.tokens} (IDs: {output.ids})")

# Visualize token frequencies
plt.figure(figsize=(12, 6))
tokens = [t for t, _ in sorted_vocab if t not in ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]]
ids = [i for t, i in sorted_vocab if t not in ["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"]]
plt.bar(tokens, [len(token) for token in tokens])
plt.title("Token Length Distribution")
plt.xlabel("Token")
plt.ylabel("Length")
plt.xticks(rotation=90)
# plt.show()  # Uncomment to show plot

# Create a function to demonstrate the BPE merge process step by step
def simulate_bpe_merges(word, merges):
    """Simulate BPE merge process on a single word."""
    # Start with characters
    chars = list(word)
    print(f"Initial: {' '.join(chars)}")
    
    # Apply merges in order
    for i, merge in enumerate(merges):
        a, b = merge
        j = 0
        while j < len(chars) - 1:
            if chars[j] == a and chars[j+1] == b:
                chars[j] = a + b
                chars.pop(j+1)
            else:
                j += 1
        print(f"Merge {i+1} ({a}+{b}): {' '.join(chars)}")
    
    return chars

# Example of manually tracing the BPE process
print("\nSimulating BPE merge process for 'lowering':")
# These merges are hypothetical - in practice they'd be learned from data
merges = [('l', 'o'), ('lo', 'w'), ('e', 'r'), ('er', 'i'), ('eri', 'n'), ('erin', 'g')]
final_tokens = simulate_bpe_merges("lowering", merges)

The code example demonstrates a complete implementation of Byte Pair Encoding (BPE) tokenization using the Hugging Face tokenizers library. Let's break down each component:

1. Setup and Initialization

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]")) - Creates a new tokenizer using the BPE algorithm. The unk_token parameter defines a special token to use for characters or sequences not in the vocabulary.

2. Trainer Configuration

The BpeTrainer is configured with several important parameters:

vocab_size=200 - Sets a maximum vocabulary size of 200 tokens. This is an upper bound; the actual vocabulary might be smaller if there aren't enough frequent pairs.
min_frequency=2 - Only creates tokens from pairs that appear at least twice in the corpus. This prevents overfitting to rare sequences.
special_tokens - Adds standard special tokens used in many transformer models:
- [UNK] - Unknown token
- [CLS] - Classification token (used at the start of sequences)
- [SEP] - Separator token (separates different segments)
- [PAD] - Padding token
- [MASK] - Masking token (for masked language modeling)

3. Pre-tokenization and Decoding

tokenizer.pre_tokenizer = pre_tokenizers.Whitespace() - Before BPE is applied, text is split on whitespace. This gives BPE word-level contexts to work with.

tokenizer.decoder = decoders.WordPiece(prefix="##") - Configures how tokens are joined back into text. Using the WordPiece decoder with "##" prefix helps visualize token boundaries.

4. Training Corpus

The corpus is expanded to include more diverse examples with common patterns:

Words with the "low" root: "low", "lowest", "lower", etc.
Words with the "new" root: "newest", "newer", "newborn", etc.
This gives BPE more opportunities to learn meaningful subword patterns.

5. Training and Vocabulary Inspection

tokenizer.train_from_iterator(corpus, trainer) - Trains the tokenizer on our corpus using the configured trainer.

The code then prints the entire vocabulary with token IDs, showing what the model learned.

6. Testing with Example Words

Several test words are encoded to demonstrate how BPE handles both seen and unseen words:

"lowering" - A word from the training corpus
"lowered" - A variation on words in the corpus
"newlywed" - A compound of subwords from the corpus
"slowness" - Tests how different morphological forms are handled

7. Visualization

The code includes a visualization component that shows the distribution of token lengths, which helps understand what kinds of subword units BPE is learning.

8. BPE Process Simulation

The simulate_bpe_merges function provides a step-by-step illustration of how BPE progressively merges character pairs:

It starts with individual characters (e.g., "l o w e r i n g")
It applies merges in sequence (e.g., "l+o" → "lo w e r i n g")
It continues until all possible merges are applied
This simulation helps visualize how tokens are built up from characters

This expanded implementation demonstrates the complete BPE tokenization workflow from initialization to training, testing, and visualization - all key components for understanding how modern language models process text.

2.1.2 WordPiece

WordPiece, developed by Google for machine translation and later used in BERT, is similar to BPE but uses a likelihood-based approach. Instead of just merging the most frequent character pairs, WordPiece merges tokens that maximize the probability of the training data under a language model. This means that WordPiece evaluates potential merges based on how they would improve the overall language model's ability to predict the training corpus.

In practical terms, WordPiece starts with a vocabulary of individual characters and iteratively adds new tokens by combining existing ones. For each potential merge, it calculates how much this merge would increase the likelihood of the training data. The merge that offers the greatest improvement in likelihood is chosen in each iteration. This approach tends to favor merges that create linguistically meaningful units like common prefixes, suffixes, and word stems.

To understand this better, let's look at how WordPiece works step by step:

Initialization: Begin with a vocabulary containing individual characters and special tokens. This forms the foundation upon which the algorithm will build more complex tokens. For example, with English text, this would include a-z, digits, punctuation, and special tokens like [UNK], [PAD], etc.
Training Procedure:
- Calculate the likelihood of the training corpus under the current vocabulary. This involves computing how well the current set of tokens can represent the training data when used in a language model.
- For each possible pair of tokens in the vocabulary, compute how merging them would change the corpus likelihood. This step evaluates the "value" of creating new tokens by combining existing ones.
- Select the merge that maximizes the likelihood improvement. Unlike BPE which simply chooses the most frequent pair, WordPiece selects the pair that most improves the model's ability to predict the training data.
- Add the new merged token to the vocabulary. This expands the model's vocabulary with meaningful units rather than just frequent character combinations.
- Repeat until the target vocabulary size is reached or likelihood improvements fall below a threshold. This iterative process continues until we have a sufficiently powerful vocabulary or diminishing returns set in.
Scoring Function: Uses a language model probability to evaluate each potential merge. The key innovation in WordPiece is this scoring mechanism, which considers how each potential token contributes to modeling the entire corpus, not just local statistics. This leads to more semantically meaningful tokens that capture linguistic patterns.

This differs from BPE in a crucial way: while BPE simply counts frequencies of adjacent pairs, WordPiece considers the global impact of each merge on modeling the entire corpus. This difference becomes particularly important when handling morphologically rich languages where meaningful word parts carry significant semantic information.

For example, in English, WordPiece might quickly learn merges like "in" + "g" → "ing" or "re" + "s" → "res" because these combinations frequently occur in meaningful contexts across many words. However, it might also learn combinations like "dis" + "like" → "dislike" even if they're less frequent than some purely statistical pairs, because "dislike" as a unit helps the model better predict surrounding words in sentences.

Another example that demonstrates WordPiece's strength is how it might handle the word "unwrappable":

A frequency-based approach might break it as "unw" + "rapp" + "able" based purely on character pair countsA frequency-based approach might break it as "unw" + "rapp" + "able" based purely on character pair counts
WordPiece is more likely to produce "un" + "wrap" + "able" because these subwords are more meaningful units that better predict contextWordPiece is more likely to produce "un" + "wrap" + "able" because these subwords are more meaningful units that better predict context

This subtle difference often yields a more efficient vocabulary for tasks like translation, as it captures more semantically meaningful subword units rather than just statistically frequent ones. The likelihood-based approach helps WordPiece create tokens that better align with linguistic structures, potentially improving downstream task performance. In practice, this means models using WordPiece tokenization can often better handle words with common prefixes and suffixes, as well as compound words, even when specific combinations weren't seen during training.

Example: Understanding WordPiece Tokenization in Detail

The word "unhappiness" might be tokenized as:

["un", "##happiness"]

Notice the ## prefix used in BERT's WordPiece tokenizer. It signals that "happiness" is not a standalone word here, but a continuation. This notation is crucial for two reasons:

It preserves information about word boundaries during processing
It allows the model to distinguish between the same sequence appearing at the start of a word versus within/at the end of a word

For instance, "un" as a standalone token has different semantic implications than when it appears as a prefix meaning "not" or "opposite of." Similarly, "happiness" as a complete word differs from "##happiness" as a word segment.

This distinction is important for understanding context. When "un" appears alone, it might be part of various words or phrases like "un-American" or "UN resolution." But when paired with "##happiness," the model knows it's specifically functioning as a negating prefix.

The "##" marker system also helps with disambiguation. For example:

In "understand," the "un" isn't functioning as a negation (it's not the opposite of "derstand")
In "unhappy," the "un" is clearly negating "happy"

By learning these patterns, the model can better grasp the compositional meaning of words it encounters.

This segmentation enables the model to recognize common affixes (like the negative prefix "un-") and root words separately, allowing it to understand relationships between words like "happy," "unhappy," and "unhappiness" even if some forms were rare or absent in training. When decoding/detokenizing, the model knows to join tokens with the "##" prefix directly to the preceding token without adding spaces.

The power of this approach becomes evident when dealing with words the model hasn't seen before. For example, if the model encounters "unremarkableness" for the first time, it might tokenize it as ["un", "##remark", "##able", "##ness"]. Even if this exact word wasn't in the training data, the model can still understand its meaning by recognizing familiar components:

"un" - negation prefix
"##remark" - root word related to "remark"
"##able" - suffix indicating capability
"##ness" - suffix that forms a noun expressing a state or quality

This compositional understanding is what allows modern language models to handle vast vocabularies without explicitly storing every possible word form.

Code Example: Using a WordPiece tokenizer (via Hugging Face BERT)

from transformers import BertTokenizer

# Load pre-trained WordPiece tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Basic tokenization example
simple_word = "unhappiness"
tokens = tokenizer.tokenize(simple_word)
print(f"'{simple_word}' tokenized: {tokens}")  # Output: ['un', '##happiness']

# More complex examples
example_texts = [
    "unremarkableness",
    "antidisestablishmentarianism",
    "She's reading about counterrevolutionaries.",
    "The neurotransmitter affects neuroplasticity."
]

print("\nMore examples of WordPiece tokenization:")
for text in example_texts:
    tokens = tokenizer.tokenize(text)
    print(f"'{text}' tokenized:")
    print(f"  {tokens}")
    print(f"  Token count: {len(tokens)}")
    
# Demonstrate full tokenization pipeline (including special tokens)
sentence = "WordPiece handles unseen words like 'hyperparameterization' effectively."
inputs = tokenizer(sentence, return_tensors="pt")
print(f"\nFull sentence tokenization:")
print(f"Input text: {sentence}")
print(f"Input IDs: {inputs['input_ids'][0].tolist()}")
print(f"Decoded: {tokenizer.decode(inputs['input_ids'][0])}")

# Demonstrate handling of out-of-vocabulary words
oov_word = "supercalifragilisticexpialidocious"
oov_tokens = tokenizer.tokenize(oov_word)
print(f"\nOOV word '{oov_word}' tokenized:")
print(f"  {oov_tokens}")
print(f"  Token count: {len(oov_tokens)}")

Breakdown of the WordPiece Tokenization Code Example:

1. Basic Initialization and Simple Example
- We import the BertTokenizer from the transformers library, which implements WordPiece tokenization.
- We load the pre-trained tokenizer for "bert-base-uncased" which contains a vocabulary of 30,522 tokens learned from a large English corpus.
- The simple example with "unhappiness" shows how WordPiece breaks this into ["un", "##happiness"], recognizing "un" as a common prefix.
2. Complex Word Examples
- The code demonstrates tokenization of increasingly complex words to show how WordPiece handles morphologically rich terms.
- "unremarkableness" would likely be broken into ["un", "##remark", "##able", "##ness"], showing how the algorithm identifies common affixes.
- "antidisestablishmentarianism" demonstrates how very long words get broken into meaningful subword units.
- The sentence examples show how WordPiece handles real-world text with punctuation and multiple words.
3. Full Tokenization Pipeline
- The example shows the complete tokenization process (not just splitting) including:
  - Addition of special tokens ([CLS] at the beginning, [SEP] at the end)
  - Conversion to token IDs (numbers that the model actually processes)
  - Decoding back to text (showing how the process is reversible)
- This demonstrates that tokenization isn't just about splitting text but preparing it in the exact format required by the model.
4. Out-of-Vocabulary (OOV) Handling
- The example with "supercalifragilisticexpialidocious" shows how WordPiece handles words it has never seen before.
- Instead of using a generic [UNK] token for the entire word (which would lose all information), WordPiece breaks it into familiar subwords.
- This demonstrates the key advantage of subword tokenization: the ability to process unlimited vocabulary by compositional understanding.
5. Key Insights from this Example
- WordPiece's use of the "##" prefix clearly marks token positions (word-initial vs. word-internal).
- The tokenizer balances between character-level granularity (which would be too fine) and word-level tokens (which would require an enormous vocabulary).
- For machine learning, this approach creates a manageable vocabulary size while preserving meaningful semantic units.
- The tokenizer maintains enough information for the model to understand morphology (word structure) and reconstruct original text.

2.1.3 SentencePiece

SentencePiece, developed by Google specifically for multilingual models like mT5, represents a significant advancement in tokenization technology. It offers substantially more flexibility and power compared to previous methods such as BPE and WordPiece. What makes SentencePiece truly revolutionary is its fundamental approach to text processing - it treats text as a raw stream of bytes without relying on whitespace as word boundaries. This fundamental difference from earlier approaches enables it to handle any language with equal effectiveness.

This design choice is particularly valuable for languages like Chinese, Japanese, Thai, and Korean, where words aren't separated by spaces and word segmentation itself is a complex linguistic challenge. For example, in Japanese, the sentence "私は東京に住んでいます" (I live in Tokyo) has no spaces between words, making traditional word-based tokenization extremely difficult. SentencePiece handles such cases naturally without requiring specialized preprocessing steps for each language.

Unlike BPE and WordPiece, which typically operate on pre-tokenized text (often assuming words are already separated by spaces), SentencePiece works directly on raw text without any prerequisite tokenization steps. This fundamental difference represents a significant advancement in tokenization technology. SentencePiece treats the entire text as a continuous stream of characters, making no assumptions about word boundaries or language-specific rules. This approach has several key advantages:

It eliminates language-specific pre-processing requirements, making the tokenization pipeline simpler and more universal. Traditional tokenizers often require separate rules for different languages (like word segmentation for Asian languages), while SentencePiece applies the same algorithm universally, drastically simplifying multilingual systems.
It creates a completely reversible tokenization process, allowing perfect reconstruction of the original text without ambiguity. By using special symbols to mark word boundaries (rather than assuming spaces), SentencePiece can precisely reconstruct the original text - critical for tasks like translation where preserving exact formatting matters.
It handles all languages with a unified approach, regardless of their writing system or grammatical structure. This means Japanese, Chinese, English, Arabic, and any other language can be processed through the exact same pipeline without specialized rules, greatly simplifying multilingual model development.
It maintains consistent tokenization across languages in multilingual models, which improves cross-lingual transfer learning and translation quality. When all languages are tokenized with the same approach, the model can more easily identify patterns across languages, facilitating knowledge transfer between high-resource and low-resource languages.
It significantly reduces the need for language-specific engineering efforts when expanding to new languages. Adding support for a new language requires minimal effort - simply include text samples in the tokenizer training data without creating custom preprocessing rules, dramatically accelerating the development of truly multilingual AI systems.

SentencePiece can be used with either BPE or Unigram language models to decide the best token splits. The Unigram language model approach is particularly interesting as it uses a probabilistic model to find the most likely segmentation of text, which often produces more linguistically meaningful tokens.

Unlike BPE's deterministic merging approach, the Unigram method employs statistical modeling to evaluate multiple possible segmentations of a text sequence. This probabilistic foundation allows it to capture more nuanced patterns in language. The Unigram method works by:

Starting with a large vocabulary of potential subword units (often tens or hundreds of thousands of candidates)
Iteratively removing tokens that contribute least to the overall likelihood of the corpus, using a careful pruning strategy that considers both token frequency and their contribution to overall text compression
Using a probabilistic model to select the optimal segmentation from multiple possibilities, where each token has an associated probability in the model
Employing a variant of the Viterbi algorithm (a dynamic programming approach) to find the most probable segmentation of any given text

The mathematical foundation of the Unigram model is based on likelihood maximization. For a sequence of characters, it attempts to find the segmentation that maximizes:

P(x) = ∏ P(xi)

Where xi represents individual tokens in a particular segmentation. This formula captures the idea that the probability of a sequence is the product of the probabilities of its component tokens, assuming independence between tokens.

For example, when tokenizing the word "unbelievable", the Unigram model might consider multiple segmentations:

["un", "believable"] with probability P("un") × P("believable")
["un", "believe", "able"] with probability P("un") × P("believe") × P("able")
["unbelievable"] with probability P("unbelievable")

The model would select whichever has the highest probability according to its trained parameters. This approach allows SentencePiece to adapt to the specific characteristics of each language while maintaining a consistent methodology across all languages, making it the tokenizer of choice for state-of-the-art multilingual language models.

Example in detail:

The Japanese sentence "私は学生です" (I am a student) might be tokenized as:

["▁私", "は", "学", "生", "です"]

Here, the special underscore ▁ (called "meta symbol") indicates a word boundary. This is a critical feature that allows the model to reconstruct the original text without ambiguity. Let's break down what's happening in more detail:

The underscore before "私" (watashi - "I") indicates the start of a new word. This boundary information is crucial because Japanese doesn't use spaces between words in written text.
Each character is tokenized separately, reflecting the character-based nature of Japanese writing. Unlike alphabetic systems where letters combine to form phonetic units, each Japanese character often carries semantic meaning.
The model can learn to group common character sequences like "です" (desu - verb "to be") as single tokens. This demonstrates SentencePiece's ability to identify functional units in language beyond simple character divisions.
The algorithm dynamically determines the optimal granularity for tokenization based on statistical patterns in the training data, not rigid rules.
This approach preserves the logical structure of Japanese text without requiring language-specific preprocessing or word segmentation tools.

For comparison, the same sentence might be processed differently in English. "I am a student" could be tokenized as:

["▁I", "▁am", "▁a", "▁student"]

Notice how every token in the English example has the underscore prefix, while in Japanese only the first token does. This is because:

In English, SentencePiece recognizes the space characters as natural word boundaries and replaces them with the underscore symbol.
In Japanese, only the beginning of the sentence (or after punctuation) would get the underscore, as there are no explicit spaces in the original text.
This allows the tokenizer to handle the fundamental structural differences between languages transparently.

This consistent approach across languages with different writing systems is what makes SentencePiece particularly valuable for multilingual models and translation tasks. The model doesn't need separate tokenization strategies for each language—it learns appropriate segmentation patterns directly from the data, making it exceptionally versatile for processing dozens or even hundreds of languages simultaneously.

Code Example: Training SentencePiece on a toy dataset

import sentencepiece as spm
import numpy as np
import matplotlib.pyplot as plt

# 1. Create a more diverse corpus with multiple languages
with open("multilingual_corpus.txt", "w") as f:
    f.write("I am a student\nI am learning AI\n")  # English
    f.write("私は学生です\n人工知能を勉強しています\n")  # Japanese
    f.write("Yo soy estudiante\nEstoy aprendiendo IA\n")  # Spanish
    f.write("我是学生\n我正在学习人工智能\n")  # Chinese

# 2. Train SentencePiece with more configuration options
spm.SentencePieceTrainer.train(
    input="multilingual_corpus.txt",
    model_prefix="multilingual_model",
    vocab_size=500,  # Larger vocabulary for multilingual support
    character_coverage=0.9995,  # Higher coverage for non-Latin scripts
    model_type="unigram",  # Using the unigram model instead of BPE
    user_defined_symbols=["<mask>", "<cls>", "<sep>"],  # Special tokens for ML tasks
    input_sentence_size=10000,  # Maximum sentences to load
    shuffle_input_sentence=True  # Shuffle sentences for better distribution
)

# 3. Load the trained tokenizer
sp = spm.SentencePieceProcessor()
sp.load("multilingual_model.model")

# 4. Basic tokenization examples across languages
examples = [
    "I am a student learning about AI and machine learning.",
    "私は人工知能について学んでいる学生です。",
    "Yo soy un estudiante que aprende sobre inteligencia artificial.",
    "我是一个学习人工智能的学生。"
]

print("===== Basic Tokenization Examples =====")
for text in examples:
    tokens = sp.encode(text, out_type=str)
    print(f"\nOriginal: {text}")
    print(f"Tokens: {tokens}")
    print(f"Token IDs: {sp.encode(text)}")
    print(f"Decoded: {sp.decode(sp.encode(text))}")
    print(f"Number of tokens: {len(tokens)}")

# 5. Demonstrating reversibility
test_text = "SentencePiece handles multiple languages: English, 日本語, Español, 中文"
encoded = sp.encode(test_text)
decoded = sp.decode(encoded)

print("\n===== Demonstrating Reversibility =====")
print(f"Original: {test_text}")
print(f"Encoded and decoded: {decoded}")
print(f"Matches original: {test_text == decoded}")

# 6. Exploring the vocabulary
print("\n===== Vocabulary Exploration =====")
vocab_size = sp.get_piece_size()
print(f"Vocabulary size: {vocab_size}")

# Show the first 20 tokens in the vocabulary
print("\nFirst 20 tokens in vocabulary:")
for i in range(min(20, vocab_size)):
    piece = sp.id_to_piece(i)
    score = sp.get_score(i)
    print(f"ID: {i}, Token: '{piece}', Score: {score}")

# 7. Visualizing token distribution
test_long = " ".join(examples)
token_ids = sp.encode(test_long)
token_counts = {}

for token_id in token_ids:
    token = sp.id_to_piece(token_id)
    if token in token_counts:
        token_counts[token] += 1
    else:
        token_counts[token] = 1

# Get top 15 tokens by frequency
top_tokens = sorted(token_counts.items(), key=lambda x: x[1], reverse=True)[:15]
tokens, counts = zip(*top_tokens)

print("\n===== Token Frequency Distribution =====")
print(f"Most common tokens: {tokens}")
print(f"With counts: {counts}")

# Plot option (commented out for compatibility)
"""
plt.figure(figsize=(12, 6))
plt.bar(tokens, counts)
plt.title("Top 15 Token Frequencies")
plt.xlabel("Tokens")
plt.ylabel("Frequency")
plt.xticks(rotation=45, ha="right")
plt.tight_layout()
plt.savefig("token_distribution.png")
"""

# 8. Out of vocabulary handling demonstration
rare_text = "supercalifragilisticexpialidocious is an extraordinary word"
print("\n===== OOV Handling =====")
print(f"Original: {rare_text}")
print(f"Tokenized: {sp.encode(rare_text, out_type=str)}")
print(f"Token count: {len(sp.encode(rare_text))}")

Code Breakdown and Explanation:

1. Creating a Multilingual Corpus

The example creates a diverse training corpus with text in multiple languages (English, Japanese, Spanish, and Chinese). This demonstrates SentencePiece's key strength in handling multiple languages with different writing systems within a single tokenizer.

2. Training Configuration

vocab_size=500: Increased from the original 100 to better handle multiple languages.
character_coverage=0.9995: Controls what percentage of characters in the training data should be covered by the model. Higher values ensure rare characters in non-Latin scripts are included.
model_type="unigram": Explicitly uses the Unigram algorithm instead of BPE, which is better for handling multiple languages with different morphological structures.
user_defined_symbols: Adds special tokens that might be needed for specific machine learning tasks like masked language modeling.
shuffle_input_sentence: Ensures the training data is well mixed across languages.

3. Basic Tokenization Examples

The code demonstrates tokenization across four languages, showing:

How the same tokenizer handles different scripts (Latin, Japanese, Chinese)
The output tokens in human-readable form (out_type=str)
The corresponding token IDs used by models
Perfect reconstruction of the original text through decoding
Token count for each example (important for understanding how efficiently different languages are tokenized)

4. Demonstrating Reversibility

This section showcases SentencePiece's perfect reversibility - the ability to decode tokenized text back to the exact original text without loss of information. This is critical for tasks like translation where preserving exact text structure matters.

5. Vocabulary Exploration

The code examines the learned vocabulary by:

Displaying the total vocabulary size
Showing the first 20 tokens with their IDs and scores
The scores represent the log probability of each token in the unigram model

6. Token Distribution Analysis

This section analyzes how tokens are distributed in actual text by:

Counting token frequencies in a mixed-language sample
Identifying the most common tokens across languages
Including (commented) visualization code that would plot these distributions

7. OOV Handling Demonstration

The final section shows how SentencePiece handles out-of-vocabulary (OOV) words like "supercalifragilisticexpialidocious" by breaking them into smaller subword units. This demonstrates SentencePiece's ability to handle any text, even words never seen during training.

This comprehensive example illustrates the key strengths of SentencePiece for multilingual NLP applications:

Language-agnostic tokenization without preprocessing
Perfect reversibility for lossless text handling
Efficient subword segmentation across multiple writing systems
Graceful handling of out-of-vocabulary words
Statistical approach to token selection that adapts to language patterns

2.1.4 Why These Matter

BPE (Byte Pair Encoding)

BPE is fast, simple, and widely used in leading models (e.g., GPT-2, GPT-3). It works by iteratively merging the most frequent character pairs in a corpus, creating new tokens from common sequences.

The algorithm starts with a vocabulary of individual characters and repeatedly combines the most frequently occurring adjacent pairs until it reaches a desired vocabulary size. For example, if "er" frequently appears together in English text, BPE would create a single token representing this character pair.

Let's walk through a simplified example:

Start with character-level tokens: ["h", "e", "l", "l", "o", " ", "w", "o", "r", "l", "d"]
Count frequencies of adjacent pairs: ("h", "e"), ("e", "l"), ("l", "l"), etc.
Merge the most frequent pair, e.g., if ("l", "l") is most frequent: ["h", "e", "ll", "o", " ", "w", "o", "r", "l", "d"]
Repeat until vocabulary size limit is reached or no more frequent pairs exist

This approach efficiently handles common subwords while still being able to break down rare words into smaller components. BPE's simplicity makes it computationally efficient, which is crucial when training on massive datasets. The method provides a good balance between character-level tokenization (which produces too many tokens) and word-level tokenization (which struggles with out-of-vocabulary words).

WordPiece

WordPiece is optimized for probability-based merges, used in BERT and other transformer models. Unlike BPE which merges based on frequency alone, WordPiece selects merges that maximize the likelihood of the training data. This results in a vocabulary that better captures linguistic patterns and can improve model performance on tasks requiring nuanced language understanding.

In practice, WordPiece works similarly to BPE but with a crucial difference in the selection criteria. It uses a language modeling objective to decide which subword units to merge, calculating the likelihood of the training corpus after each potential merge and selecting the one that increases this likelihood the most. This approach can be thought of as "greedy language modeling" - at each step, it chooses the merge that best improves the model's ability to predict text.

For example, in English, WordPiece might prefer to merge "ing" as a single token because this suffix appears in many words and forms a meaningful linguistic unit. Similarly, in German, it might efficiently tokenize compound words by identifying common components.

The algorithm also handles word boundaries differently than BPE. It typically marks the beginning of words with a special character (often "##" in BERT implementations), which helps the model distinguish between the same character sequence appearing at the start versus within a word. This feature is particularly useful for languages where morphology carries important grammatical information.

WordPiece typically produces slightly different tokenization patterns than BPE, especially for morphologically rich languages.

SentencePiece

SentencePiece is a language-agnostic tokenization method designed specifically for multilingual NLP applications. Unlike traditional tokenizers that require language-specific rules, SentencePiece treats the input as a raw stream of Unicode characters without any assumptions about word boundaries or language structure. This fundamental design choice offers several key advantages:

True language-agnosticism: By operating directly on Unicode code points, SentencePiece eliminates the need for language-specific pre-processing like word segmentation or morphological analysis. This makes it equally effective across all human languages.
Seamless handling of scriptio continua: Languages like Japanese, Chinese, and Thai that don't use spaces between words have traditionally required specialized tokenizers. SentencePiece handles these languages natively, learning appropriate segmentation patterns directly from data.
Consistent multilingual representation: When trained on multilingual corpora, SentencePiece develops a shared vocabulary that effectively represents cross-lingual patterns, making it ideal for translation systems and multilingual models that need to handle dozens or hundreds of languages simultaneously.
Perfect reversibility: SentencePiece maintains lossless round-trip conversion between text and tokens. This ensures tokenized text can be perfectly reconstructed without information loss, which is crucial for generation tasks.
Whitespace preservation: Unlike many tokenizers that discard or normalize whitespace, SentencePiece preserves the exact spacing of the original text, treating spaces as regular characters. This enables models to learn proper formatting and layout.
Implementation flexibility: SentencePiece supports both unigram and BPE algorithms within the same framework, allowing researchers to choose the most appropriate method for their specific application while maintaining consistent preprocessing.

Together, these methods form the backbone of tokenization in Large Language Models (LLMs). Without these sophisticated tokenization approaches, modern models would struggle to process the vast diversity of human language efficiently. Let's explore why these methods are so critical:

Critical Role in Model Architecture

Tokenization serves as the first layer of translation between human language and machine understanding. The quality and characteristics of this translation directly affect everything that happens afterward in the model. Poor tokenization can introduce biases, inefficiencies, and limitations that no amount of parameter tuning can fully overcome.

Think of tokenization as the foundation upon which the entire language model is built. Just as a building with a weak foundation will have structural problems regardless of how well-designed its upper floors are, a model with suboptimal tokenization will struggle to reach its full potential despite having sophisticated neural architectures.

This critical role manifests in several key ways:

Tokenization determines what patterns the model can learn. If important linguistic units are split across multiple tokens, the model must work harder to recognize these patterns.
The efficiency of token representation directly impacts computational requirements. Models process text token-by-token, so inefficient tokenization can significantly slow down both training and inference.
Token distribution affects attention mechanisms. Transformer-based models rely on attention to establish relationships between tokens, and the way text is tokenized shapes these relationships.
Language representation is fundamentally shaped by tokenization. Languages with different scripts or structures may be represented with varying degrees of efficiency, potentially creating performance disparities across languages.

The vocabulary size itself represents an important architectural trade-off. Larger vocabularies can capture more linguistic patterns directly but require more parameters in the embedding layer. Smaller vocabularies are more computationally efficient but may require more tokens to represent the same text.

Distinct Advantages of Each Method

BPE (Byte Pair Encoding): Provides computational efficiency by maintaining a relatively small vocabulary (typically 30-50K tokens) while still capturing common subword patterns. This efficiency makes training faster and reduces memory requirements. BPE is particularly effective for European languages with similar alphabets and morphological structures.
WordPiece: Offers probability-aware tokenization that better captures linguistic units. By optimizing for likelihood rather than just frequency, WordPiece can develop a vocabulary that more accurately represents meaningful language components. This probability-based approach helps models better understand the semantic structure of text, improving performance on tasks requiring nuanced language comprehension.
SentencePiece: Enables truly language-agnostic processing by treating all text as Unicode sequences without assumptions about word boundaries. This approach is revolutionary for multilingual models, as it eliminates the need for language-specific preprocessing pipelines. SentencePiece can seamlessly handle languages with different writing systems, word boundary conventions, and morphological structures within a unified framework.

Performance Implications

The choice of tokenization method can significantly impact a model's performance across different dimensions:

Language coverage: Models using BPE might excel at Indo-European languages but struggle with languages using different scripts or linguistic structures. This is because BPE was originally designed with English and similar languages in mind, which share certain characteristics like clear word boundaries and relatively simple morphology. When applied to languages with different writing systems (like Chinese, Japanese, or Thai) or complex morphological structures (like Turkish or Finnish), BPE often creates inefficient tokenization patterns. SentencePiece offers more consistent performance across diverse language families because it treats all text as a raw sequence of Unicode characters without making assumptions about word boundaries, allowing it to learn appropriate segmentation for each language directly from data.
Vocabulary efficiency: Different methods achieve varying levels of compression. A more efficient tokenizer can represent the same information with fewer tokens, reducing computational costs during both training and inference. BPE tends to be efficient for languages it was designed for, but may require more tokens for others. WordPiece's probability-based approach often creates more semantically meaningful tokens, potentially improving efficiency for certain tasks. SentencePiece's language-agnostic approach can be particularly efficient for multilingual content, as it develops a shared vocabulary that captures cross-lingual patterns. This efficiency directly impacts model performance, as longer token sequences require more computational resources and can exceed context window limitations.
Out-of-vocabulary handling: All three methods provide mechanisms to handle previously unseen words, but they differ in how gracefully they manage truly novel constructions or rare words from low-resource languages. BPE handles unknown words by breaking them down into smaller subword units, but may create inefficient representations for certain word types. WordPiece uses its probability-aware approach to create more linguistically informed decompositions. SentencePiece's character-level fallback ensures that any text can be tokenized, even if inefficiently. This becomes especially important when models encounter specialized terminology, proper names from less-represented languages, or deliberately obfuscated text that wasn't present in training data.
Cross-lingual transfer: For multilingual models, the tokenization strategy affects how well knowledge transfers between languages. SentencePiece's language-agnostic approach often facilitates better cross-lingual performance because it creates consistent tokenization patterns across languages, enabling the model to recognize similar linguistic structures even when they appear in different languages. This is particularly valuable for translation tasks, multilingual understanding, and zero-shot learning where knowledge learned in high-resource languages needs to transfer to low-resource ones. Models using more language-specific tokenization approaches may develop separate "subnetworks" for each language, limiting knowledge sharing between them.

Ultimately, the choice of tokenization method represents a crucial architectural decision that shapes a model's capabilities, biases, and performance characteristics across languages and tasks. Recent research continues to explore hybrid approaches and novel tokenization strategies to address the limitations of existing methods.