Under the Hood of Large Language Models

Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)

2. Train a BPE Tokenizer (🤗 tokenizers)

We'll train a small Byte-Pair Encoding (BPE) tokenizer using whitespace pre-tokenization. BPE works by iteratively merging the most frequent pairs of characters or subwords in your corpus, building up a vocabulary of tokens that efficiently represent your text. The vocab_size parameter is crucial - it determines how many unique tokens your tokenizer will learn. You should adjust this parameter carefully:

  • Too small: Important domain terms will be split into multiple subword tokens (e.g., "plaintiff" becomes "plain" + "tiff")
  • Too large: You'll have many rare tokens and risk overfitting to your training data

The ideal size allows your most important domain-specific terminology (like "plaintiff", "pursuant", or "jurisdiction") to be represented as single tokens, improving model efficiency when processing these common terms. Note that because the whitespace pre-tokenizer used below also splits at punctuation, a citation such as "Rule 12(b)" will always span several tokens; the single-token goal applies to individual words. Start with a modest size and gradually increase it while monitoring how your key legal terms are tokenized (a small sweep for doing exactly that is sketched at the end of this section).

from pathlib import Path

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers
from tokenizers.processors import TemplateProcessing

# Build a BPE tokenizer
bpe_tok = Tokenizer(models.BPE(unk_token="[UNK]"))
bpe_tok.normalizer = normalizers.Sequence([
    normalizers.NFKC(),  # canonical unicode normalization
])
bpe_tok.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=800,              # start modest; tune later (e.g., 8k-32k for real corpora)
    min_frequency=1,
    special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"]
)

def line_iter(fn):
    with open(fn, "r", encoding="utf-8") as f:
        for line in f:
            yield line.strip()

bpe_tok.train_from_iterator(line_iter("data/legal_demo.txt"), trainer)

# Post-processing to add BOS/EOS if desired
bpe_tok.post_processor = TemplateProcessing(
    single="[BOS] $A [EOS]",
    pair="[BOS] $A [EOS] $B:1 [EOS]:1",
    special_tokens=[("[BOS]", bpe_tok.token_to_id("[BOS]")),
                    ("[EOS]", bpe_tok.token_to_id("[EOS]"))]
)

# Save
Path("artifacts").mkdir(exist_ok=True)
bpe_tok.save("artifacts/legal_bpe.json")

This step implements a Byte-Pair Encoding (BPE) tokenizer using the 🤗 tokenizers library. Here's a comprehensive breakdown of what it does:

First, the code imports necessary modules:

  • From the tokenizers library: Tokenizer, models, trainers, pre_tokenizers, normalizers
  • TemplateProcessing from tokenizers.processors for handling special tokens

Then it initializes a BPE tokenizer:

  • Creates a tokenizer with BPE model, specifying "[UNK]" as the unknown token
  • Sets up NFKC normalization, which performs canonical Unicode normalization to standardize character representations
  • Configures a whitespace pre-tokenizer, which splits text on whitespace before applying BPE
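To see what these two components do on their own, both expose string-level helpers; a small illustration (the sample strings here are ours, not from the corpus):

from tokenizers import normalizers, pre_tokenizers

# NFKC folds compatibility characters, e.g. the "ﬁ" ligature becomes "fi"
print(normalizers.NFKC().normalize_str("ﬁling pursuant to Rule 12(b)(6)"))

# Whitespace() splits on whitespace AND at punctuation boundaries, so a
# citation like "12(b)(6)" is already broken apart before BPE ever runs
print(pre_tokenizers.Whitespace().pre_tokenize_str("Rule 12(b)(6)"))
# e.g. [('Rule', (0, 4)), ('12', (5, 7)), ('(', (7, 8)), ('b', (8, 9)), ...]

This is also why punctuation-heavy citations can never collapse into a single token under this setup, regardless of vocab_size.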

Next, it configures the trainer for the BPE algorithm:

  • Sets vocabulary size to 800 tokens (a modest starting point that can be increased for real corpora)
  • Sets minimum frequency to 1, meaning any token that appears at least once will be considered
  • Defines special tokens: "[PAD]", "[UNK]", "[BOS]", "[EOS]" for padding, unknown tokens, beginning of sequence, and end of sequence respectively
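Because the special tokens are listed first, the trainer assigns them the lowest IDs; since the tokenizer has already been trained in the code above, a quick check looks like this:

print(bpe_tok.get_vocab_size())           # vocabulary actually learned (may be below the 800 cap on a tiny corpus)
for tok in ["[PAD]", "[UNK]", "[BOS]", "[EOS]"]:
    print(tok, bpe_tok.token_to_id(tok))  # typically 0, 1, 2, 3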

The code defines a helper function line_iter that:

  • Takes a filename as input
  • Opens the file and yields each line with whitespace stripped
  • This creates an iterator over the text corpus for efficient processing

Then it trains the tokenizer:

  • Uses train_from_iterator with the line iterator pointing to "data/legal_demo.txt"
  • Applies the BPE trainer configuration defined earlier
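As an aside, the same training can be driven directly from file paths via Tokenizer.train, with no iterator helper; the iterator form in the main snippet is used because it generalizes to any Python source of strings (streamed datasets, database rows, and so on):

# Alternative to the iterator-based call above (equivalent here; no need to run both)
bpe_tok.train(["data/legal_demo.txt"], trainer)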

After training, it sets up post-processing:

  • Configures a template for adding special tokens to sequences
  • For single sequences: Adds "[BOS]" at the start and "[EOS]" at the end
  • For sequence pairs: Adds appropriate markers for both sequences
  • Maps special token strings to their IDs in the vocabulary
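A minimal check that the template is applied (the sentences are illustrative):

single = bpe_tok.encode("The motion is denied.")
print(single.tokens[0], single.tokens[-1])  # '[BOS]' ... '[EOS]'

pair = bpe_tok.encode("Motion to dismiss.", "Motion granted.")
print(pair.tokens)    # [BOS] <first sentence> [EOS] <second sentence> [EOS]
print(pair.type_ids)  # 0s for the first segment, 1s for the second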

Finally, it saves the tokenizer:

  • Creates an "artifacts" directory if it doesn't exist
  • Saves the tokenizer to "artifacts/legal_bpe.json"
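The saved JSON can be reloaded later with Tokenizer.from_file and, if you plan to use it with 🤗 Transformers, wrapped in PreTrainedTokenizerFast (this wrapper assumes the transformers package is installed):

from tokenizers import Tokenizer

reloaded = Tokenizer.from_file("artifacts/legal_bpe.json")
assert reloaded.encode("plaintiff").ids == bpe_tok.encode("plaintiff").ids

# Optional: expose it through the Hugging Face tokenizer interface
from transformers import PreTrainedTokenizerFast
hf_tok = PreTrainedTokenizerFast(
    tokenizer_file="artifacts/legal_bpe.json",
    pad_token="[PAD]", unk_token="[UNK]",
    bos_token="[BOS]", eos_token="[EOS]",
)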

Quick sanity check:

test = "The plaintiff moves under Rule 12(b)(6) to dismiss."
enc = bpe_tok.encode(test)
print(enc.tokens)     # inspect splits on "plaintiff", "Rule", "12(b)(6)"
print(len(enc.ids))   # token count

This code block performs a quick validation of the BPE tokenizer we just trained. Let's walk through it line by line:

First, it defines a test sentence containing legal terminology: "The plaintiff moves under Rule 12(b)(6) to dismiss." This example was carefully chosen because it contains domain-specific terms that we want our tokenizer to handle properly.

Next, it encodes this test sentence using our BPE tokenizer with enc = bpe_tok.encode(test). This converts the raw text into tokens according to the BPE algorithm and vocabulary we trained.

The print(enc.tokens) line outputs the actual tokens produced by the encoding. This is particularly useful for checking how the tokenizer handles important legal terms like "plaintiff" and "Rule": ideally these whole words come out as single tokens rather than being split into smaller subwords. The citation "12(b)(6)" will always be several tokens here, because the whitespace pre-tokenizer also splits at punctuation; what you want to inspect is whether its pieces are segmented cleanly and consistently.

Finally, print(len(enc.ids)) displays the total number of tokens generated from the test sentence. This helps evaluate the tokenizer's efficiency: fewer tokens mean a more compact representation for downstream models.
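One simple way to turn this into a number you can track across runs is tokens per whitespace-separated word (sometimes called fertility); a rough sketch, subtracting the two special tokens added by the post-processor:

def tokens_per_word(tok, text):
    # Exclude the [BOS]/[EOS] added by the post-processor
    n_tokens = len(tok.encode(text).ids) - 2
    return n_tokens / len(text.split())

print(round(tokens_per_word(bpe_tok, test), 2))  # lower is better; ~1.0 means mostly whole words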

This validation step is crucial for assessing whether our tokenizer is effectively capturing the linguistic patterns specific to legal text. If important legal terms are being split into multiple tokens, we might need to increase the vocabulary size or inject a user-defined vocabulary, as described later in the document.
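If you want to automate that judgment, a small sweep over vocab_size that reports how many pieces each key term splits into is usually enough. A sketch, assuming the same corpus file and a hand-picked (illustrative) term list:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers, normalizers

KEY_TERMS = ["plaintiff", "pursuant", "jurisdiction", "dismiss"]  # illustrative list

def train_bpe(vocab_size):
    tok = Tokenizer(models.BPE(unk_token="[UNK]"))
    tok.normalizer = normalizers.NFKC()
    tok.pre_tokenizer = pre_tokenizers.Whitespace()
    trainer = trainers.BpeTrainer(
        vocab_size=vocab_size, min_frequency=1,
        special_tokens=["[PAD]", "[UNK]", "[BOS]", "[EOS]"],
    )
    tok.train(["data/legal_demo.txt"], trainer)
    return tok

for size in (400, 800, 1600):
    tok = train_bpe(size)
    splits = {term: len(tok.encode(term).tokens) for term in KEY_TERMS}
    print(size, splits)  # a term is a single token once its count reaches 1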
