Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
3. Train a SentencePiece Tokenizer (Unigram or BPE)
SentencePiece is a tokenization library that's particularly robust for handling multilingual data and languages without clear word boundaries (like Japanese, Chinese, or Thai). Unlike the BPE tokenizer we just created, SentencePiece treats the input as a raw stream of Unicode characters without relying on language-specific pre-processing.
Key advantages of SentencePiece include:
- It works directly on raw text without requiring language-specific segmentation
- It treats whitespace as a regular character, making it suitable for languages without spaces
- It maintains reversibility, allowing perfect reconstruction of the original text
- It supports both Unigram and BPE algorithms under the same framework
For this project, we'll use the Unigram model, which differs from BPE by scoring candidate segmentations probabilistically rather than applying deterministic merge rules. The Unigram model starts with a large candidate vocabulary and iteratively prunes it to maximize the likelihood of the training corpus, which helps it keep frequent domain-specific terms as single tokens.
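Before the actual training code, a toy sketch can make that probabilistic idea concrete: each candidate segmentation is scored by the product of its pieces' unigram probabilities, and the highest-scoring split wins. The pieces and probabilities below are invented purely for illustration - this is not how SentencePiece estimates or stores its model.

import math

# Made-up unigram probabilities for a handful of candidate pieces.
piece_prob = {"sublease": 0.004, "sub": 0.02, "lease": 0.03, "s": 0.05, "ub": 0.001}

# Three ways to segment the word "sublease".
candidates = [
    ["sublease"],           # keep the whole word as one piece
    ["sub", "lease"],       # split into two common pieces
    ["s", "ub", "lease"],   # a heavily fragmented split
]

def log_likelihood(pieces):
    # Log of the product of the piece probabilities.
    return sum(math.log(piece_prob[p]) for p in pieces)

for c in candidates:
    print(c, round(log_likelihood(c), 2))
print("chosen:", max(candidates, key=log_likelihood))

Under these toy numbers the whole-word segmentation wins, which is exactly the behavior we want for frequent domain terms. With that intuition in place, here is the actual training code: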
import sentencepiece as spm

# Write the corpus (defined earlier) to a plain-text file, one sample per line
with open("data/legal_sp.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

# Train the tokenizer; the .model and .vocab files are written under artifacts/
spm.SentencePieceTrainer.train(
    input="data/legal_sp.txt",
    model_prefix="artifacts/legal_sp",
    vocab_size=800,
    character_coverage=1.0,  # lower for non-Latin heavy corpora (e.g., 0.9995 for CJK)
    model_type="unigram",    # or "bpe"
    bos_id=1, eos_id=2, unk_id=0, pad_id=3,
)
This step implements a SentencePiece tokenizer training process for the legal text. Let’s break it down comprehensively:
First, the code imports the SentencePiece library: import sentencepiece as spm
Next, it creates a text file to store the corpus data for training:
- It opens a file named "data/legal_sp.txt" in write mode with UTF-8 encoding
- It iterates through each line in the corpus (a collection of legal text samples defined earlier)
- Each line is written to the file with a newline character appended
Then comes the core of the code - training the SentencePiece tokenizer. The SentencePieceTrainer.train() method is called with several parameters:
- input="data/legal_sp.txt": specifies the input text file containing the corpus
- model_prefix="artifacts/legal_sp": sets the prefix for the output model files (the trainer will create .model and .vocab files)
- vocab_size=800: limits the vocabulary to 800 tokens - a modest size appropriate for this demo
- character_coverage=1.0: ensures all characters in the corpus are covered; as the comment notes, this should be lower (e.g., 0.9995) for corpora heavy in non-Latin scripts such as Chinese, Japanese, or Korean
- model_type="unigram": uses the Unigram algorithm (the probabilistic approach) rather than BPE, which the comment lists as the alternative
- bos_id=1, eos_id=2, unk_id=0, pad_id=3: assigns the special token IDs - unknown (0), beginning-of-sequence (1), end-of-sequence (2), and padding (3) (the short check after this list verifies these assignments)
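As a quick follow-up to the parameter list above, here is a minimal sketch, assuming training has already completed and written its files under artifacts/, that confirms both output files exist and that the special token IDs came out as configured:

import os
import sentencepiece as spm

# The trainer writes <model_prefix>.model and <model_prefix>.vocab
for path in ["artifacts/legal_sp.model", "artifacts/legal_sp.vocab"]:
    print(path, "exists:", os.path.exists(path))

sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
print("vocab size:", sp.get_piece_size())  # expected: 800
print("unk/bos/eos/pad ids:", sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # expected: 0 1 2 3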
This setup suits legal text well for the same reasons listed earlier: SentencePiece works directly on the raw Unicode stream without language-specific pre-processing, treats whitespace as a regular character, stays fully reversible, and the probabilistic Unigram model can keep frequent specialized terms as whole pieces.
The result will be a trained SentencePiece tokenizer that can efficiently tokenize legal text while preserving domain-specific terms.
Check splits:
sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
print(sp.encode("WHEREAS, the Parties amend the MSA.", out_type=str))
Here's a detailed breakdown:
First, the code creates a SentencePiece processor object by loading a previously trained model:
sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
The SentencePieceProcessor is initialized with the model file "artifacts/legal_sp.model" that was created during the training process. This model contains all the vocabulary and rules needed for tokenization.
Next, the code encodes a sample legal text and prints the result:
print(sp.encode("WHEREAS, the Parties amend the MSA.", out_type=str))
This line does several things:
- It takes a legal phrase "WHEREAS, the Parties amend the MSA." which contains domain-specific terminology (like "WHEREAS" and "MSA" which are common in legal documents)
- The encode() method converts this text into tokens according to the SentencePiece model
- The out_type=str parameter specifies that the output should be a list of string tokens rather than their numerical IDs
- The result is printed to allow visual inspection of how the tokenizer splits the legal text
This is a crucial validation step to check whether the tokenizer properly handles domain-specific terminology. For example, ideally terms like "MSA" (which likely stands for Master Service Agreement in this context) would be preserved as single tokens rather than being split into individual letters.
The code serves as a "quick sanity check" to evaluate how well the trained SentencePiece model handles legal terminology and formatting, which is part of the broader process of evaluating tokenizer quality.
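Beyond eyeballing the printed pieces, a few more checks are easy to script. The sketch below reuses the same model file and (i) confirms the encode/decode round trip, (ii) asks whether "MSA" and "WHEREAS" are stored as single word-initial pieces (piece_to_id returns the unknown ID for out-of-vocabulary pieces), and (iii) uses the Unigram model's sampling mode to show that alternative segmentations also carry probability mass. Whether the domain terms actually survive as single pieces depends on the corpus and the 800-token budget, so treat the expected results as assumptions to verify rather than guarantees.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
text = "WHEREAS, the Parties amend the MSA."

# (i) Reversibility: decoding the encoded IDs should reproduce the input
ids = sp.encode(text)
print("round-trip ok:", sp.decode(ids) == text)

# (ii) Are the domain terms single pieces? "▁" marks a word-initial piece.
for piece in ["▁MSA", "▁WHEREAS"]:
    print(piece, "in vocab:", sp.piece_to_id(piece) != sp.unk_id())

# (iii) Unigram models can sample alternative segmentations (subword regularization)
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))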