Under the Hood of Large Language Models

Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)

6. Add a User Vocabulary (optional but powerful)

Force the tokenizer to treat certain strings as single tokens (a common need for code identifiers, chemical formulas, and finance symbols). This technique is particularly valuable in domain-specific contexts where preserving the integrity of technical terminology is crucial. For example:

  • In legal text: Terms like "plaintiff," "defendant," or citation formats like "Rule 12(b)(6)" should remain intact
  • In chemistry: Chemical formulas like "C6H12O6" or "NH4Cl" should be recognized as single meaningful units
  • In programming: Function names, API calls, or variable naming conventions should be preserved
  • In finance: Stock tickers, financial metrics, or accounting codes benefit from being treated as atomic units

By specifying a user vocabulary, you prevent the tokenizer from breaking these domain-critical terms into subword pieces, which helps maintain their semantic meaning and improves model understanding of domain-specific concepts. This is especially important when these terms have meaning beyond their constituent parts.

from tokenizers import Tokenizer, models, normalizers, pre_tokenizers, trainers

user_vocab = ["plaintiff", "defendant", "MSA", "Rule", "Section", "12(b)(6)"]

# For 🤗 tokenizers you can bake merges in indirectly by seeding the corpus or post-filtering.
# A practical trick: augment the training corpus with many occurrences of user_vocab.
# Assumes `corpus` (the list of raw training lines) and `line_iter` (a file line
# generator) are still in scope from the earlier sections.
with open("data/legal_aug.txt", "w", encoding="utf-8") as f:
    for _ in range(200):
        f.write(" ".join(user_vocab) + "\n")
    for line in corpus:
        f.write(line + "\n")

# Retrain quickly on the augmented frequencies (same setup as Section 2)
bpe_tok2 = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tok2.pre_tokenizer = pre_tokenizers.Whitespace()
bpe_tok2.normalizer = normalizers.NFKC()
trainer2 = trainers.BpeTrainer(
    vocab_size=900,
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
)
bpe_tok2.train_from_iterator(line_iter("data/legal_aug.txt"), trainer2)
print(bpe_tok2.encode("The MSA and Section 2.3 apply under Rule 12(b)(6).").tokens)

This step demonstrates how to create a domain-specific tokenizer with a user vocabulary for legal text using the Hugging Face tokenizers library. Here's a detailed breakdown:

User Vocabulary Definition

The code begins by defining a list of legal terms that should be treated as single tokens rather than being broken down into subwords:
user_vocab = ["plaintiff", "defendant", "MSA", "Rule", "Section", "12(b)(6)"]

Training Data Augmentation

The code uses a clever technique to ensure these special terms are preserved as single tokens:

  • It creates a new augmented training file ("data/legal_aug.txt")
  • Writes 200 lines containing just the user vocabulary terms
  • Appends the original corpus to this augmented file

This technique effectively increases the frequency of these terms in the training data, making it more likely that the BPE algorithm will keep them as single tokens.
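
To see the effect concretely, you can compare how the two tokenizers split the same terms. A minimal check, assuming the Section 2 baseline tokenizer is still in scope under the name bpe_tok (rename to match your own variable):

# Compare the baseline tokenizer with the augmented one on a few user_vocab terms.
# Terms that were split into subwords before should now surface as single tokens.
for term in ["plaintiff", "defendant", "MSA"]:
    print(f"{term:10s} baseline: {bpe_tok.encode(term).tokens}  "
          f"augmented: {bpe_tok2.encode(term).tokens}")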

Tokenizer Configuration and Training

The code then creates and configures a new BPE tokenizer:

  • Initializes a BPE tokenizer with an unknown token "[UNK]"
  • Sets the Whitespace pre-tokenizer, which splits on whitespace and separates runs of punctuation from letters and digits
  • Sets NFKC normalization for Unicode character normalization
  • Configures a BPE trainer with:
    • vocab_size=900 (limits vocabulary to 900 tokens)
    • min_frequency=1 (includes tokens that appear at least once)
    • special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"] (adds special tokens for padding, unknown, beginning of sequence, and end of sequence)
  • Trains the tokenizer on the augmented corpus via the line_iter function

Testing the Tokenizer

Finally, the code tests whether the user vocabulary terms are preserved as single tokens by encoding a test sentence and printing the resulting tokens:
print(bpe_tok2.encode("The MSA and Section 2.3 apply under Rule 12(b)(6).").tokens)
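
One caveat worth checking: whether a punctuation-heavy entry such as "12(b)(6)" survives intact depends on the pre-tokenizer as well as on frequency, because the Whitespace pre-tokenizer separates punctuation from letters and digits before any BPE merges are learned, and merges never cross those boundaries. You can inspect the pre-tokenization of the trained bpe_tok2 directly:

# Show how the Whitespace pre-tokenizer segments the citation before BPE is applied;
# each (piece, offsets) pair marks a boundary that merges cannot cross.
print(bpe_tok2.pre_tokenizer.pre_tokenize_str("Rule 12(b)(6)"))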

This technique is particularly valuable for domain-specific tokenizers where preserving specialized terminology improves model understanding of the domain content.

If you need strict control, SentencePiece supports user-defined symbols via the --user_defined_symbols flag at training time, which guarantees that each listed string is always emitted as a single token.
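
A minimal sketch of that route with the sentencepiece Python package; the input path, model prefix, and vocab size below are placeholders, so adjust them to your corpus (very small corpora may need a lower vocab_size):

import sentencepiece as spm

# user_defined_symbols are always kept as single tokens, in any context,
# so this is a hard guarantee rather than a frequency-based nudge.
spm.SentencePieceTrainer.train(
    input="data/legal.txt",        # placeholder: path to the raw training corpus
    model_prefix="legal_sp",       # placeholder: writes legal_sp.model / legal_sp.vocab
    vocab_size=900,                # lower this if the corpus is very small
    user_defined_symbols=",".join(user_vocab),
)

sp = spm.SentencePieceProcessor(model_file="legal_sp.model")
print(sp.encode("The MSA and Section 2.3 apply under Rule 12(b)(6).", out_type=str))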
