Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
5. Evaluate Tokenizer Quality
A good tokenizer reduces sequence length and respects domain terms. The quality of your tokenizer will directly impact how efficiently your model processes text and how well it understands domain-specific concepts.
Metrics to check:
- Avg. tokens per sample (lower is better) - This measures how efficiently your tokenizer compresses text. Fewer tokens per sample mean less computation during model training and inference, which translates to faster processing and lower memory requirements.
- Compression ratio = (chars per sample) / (tokens per sample) - This ratio quantifies tokenization efficiency. Higher values indicate better compression: more characters represented by fewer tokens. For example, a 52-character sentence tokenized into 13 tokens has a compression ratio of 4.0 characters per token. For domain-specific text, aim for a higher ratio than a general-purpose tokenizer achieves on the same corpus.
- Term integrity: Are key phrases single tokens? ("plaintiff", "MSA", "Section 2.3") - Domain expertise is often encoded in specialized terminology. When these terms are preserved as single tokens, models can learn their semantic meaning more effectively. Check whether important domain-specific terms remain intact rather than being split into multiple subword tokens.
- Symbol handling: parentheses, section symbols, citations, code-like patterns - Specialized domains often use unique formatting conventions (like "§2.1.3" in legal text or "C3H8O3" in chemistry). A good domain-specific tokenizer should handle these patterns appropriately, either preserving them as single tokens or splitting them consistently in a way that maintains their semantic meaning.
import statistics

def eval_stats(tokenize, samples):
    """Return average token count, average character count, and compression ratio."""
    tok_lens = []
    char_lens = []
    for s in samples:
        ids = tokenize(s)            # token IDs for this sample
        tok_lens.append(len(ids))
        char_lens.append(len(s))     # original character length
    return {
        "avg_tokens": statistics.mean(tok_lens),
        "avg_chars": statistics.mean(char_lens),
        # Floor the denominator to avoid division by zero on empty input
        "compression": statistics.mean(char_lens) / max(1e-6, statistics.mean(tok_lens)),
    }
# Thin wrappers giving both tokenizers a common call signature;
# bpe_tok and sp are the tokenizers trained in the earlier steps.
def tokenize_bpe(s): return bpe_tok.encode(s).ids
def tokenize_sp(s): return sp.encode(s, out_type=int)
samples = [
    "Pursuant to Section 2.3, the agreement is extended.",
    "WHEREAS, the Parties desire to amend the MSA.",
    "Rule 12(b)(6) permits dismissal for failure to state a claim.",
]
print("BPE:", eval_stats(tokenize_bpe, samples))
print("SP :", eval_stats(tokenize_sp, samples))
This step defines a framework for evaluating and comparing tokenizer performance for specialized text domains (like legal documents). Here's a comprehensive breakdown:
Function: eval_stats
This function calculates key statistics to evaluate tokenizer efficiency:
- Takes two parameters: a tokenization function and a list of text samples
- For each sample, it tracks both token length (after tokenization) and character length (original text)
- Returns a dictionary with three metrics:
- "avg_tokens": Average number of tokens per sample
- "avg_chars": Average number of characters per sample
- "compression": The ratio of characters to tokens (higher values indicate better compression)
- Uses the statistics module's mean function to calculate averages (hence the import at the top of the snippet)
- Includes a safety mechanism (max(1e-6, statistics.mean(tok_lens))) to prevent division by zero
Helper Functions
Two small wrapper functions that standardize how different tokenizers are called:
- tokenize_bpe(s): Tokenizes text with the BPE tokenizer and returns token IDs
- tokenize_sp(s): Tokenizes text with the SentencePiece tokenizer and returns token IDs
Test Data
The code defines a list of legal text samples to evaluate tokenizer performance:
- Three representative legal phrases containing domain-specific terminology and formatting:
- "Pursuant to Section 2.3, the agreement is extended."
- "WHEREAS, the Parties desire to amend the MSA."
- "Rule 12(b)(6) permits dismissal for failure to state a claim."
- These samples intentionally include legal jargon (WHEREAS, Pursuant), references (Section 2.3), acronyms (MSA), and complex citations (Rule 12(b)(6))
Evaluation and Output
Finally, the code evaluates both tokenizers on the test samples and prints the results:
- Calls eval_stats with each tokenizer function and the sample texts
- Prints the results with labels "BPE:" and "SP :" (for SentencePiece)
- This allows direct comparison of how efficiently each tokenizer handles the legal text samples
This evaluation is crucial for domain-specific tokenizers because it quantifies how well they compress text while preserving meaningful semantic units. An effective legal tokenizer would ideally keep terms like "MSA" or "Rule 12(b)(6)" as single tokens while achieving good compression overall.
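Counting tokens alone won't tell you whether key terms survive intact. A quick probe, sketched below under the assumption that the bpe_tok and sp objects from the code above are still in scope, is to print each tokenizer's surface pieces for a handful of domain strings (the term list here is purely illustrative):

# Probe term integrity and symbol handling by inspecting surface pieces.
# probe_terms is an illustrative list; substitute your own domain vocabulary.
probe_terms = ["plaintiff", "MSA", "Section 2.3", "Rule 12(b)(6)", "§2.1.3"]

for term in probe_terms:
    bpe_pieces = bpe_tok.encode(term).tokens     # surface tokens, not IDs
    sp_pieces = sp.encode(term, out_type=str)    # SentencePiece pieces
    print(f"{term!r}")
    print(f"  BPE ({len(bpe_pieces)} pieces): {bpe_pieces}")
    print(f"  SP  ({len(sp_pieces)} pieces):  {sp_pieces}")

A count of 1 means the term is a single token; anything larger shows exactly where it fragments.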
If key terms split into many subwords, try increasing vocab_size, tweaking normalization (e.g., keep case), or adding a user vocabulary list to force merges for frequent domain strings.
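If retraining is an option, SentencePiece's user_defined_symbols parameter forces the listed strings to become single pieces. The sketch below is illustrative rather than this project's actual training call: the file names and vocab size are placeholders, so reuse the values from your own training step.

import sentencepiece as spm

# Retrain so frequent domain strings are guaranteed to be single pieces.
# "legal_corpus.txt" and "legal_sp" are hypothetical placeholder paths.
spm.SentencePieceTrainer.train(
    input="legal_corpus.txt",
    model_prefix="legal_sp",
    vocab_size=16000,                # raise this if terms still fragment
    user_defined_symbols=["plaintiff", "MSA", "WHEREAS"],
)

# With the Hugging Face tokenizers library, a post-hoc analogue is:
# bpe_tok.add_tokens(["plaintiff", "MSA", "WHEREAS"])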