Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
3. Train a SentencePiece Tokenizer (Unigram or BPE)
SentencePiece is a tokenization library that's particularly robust for handling multilingual data and languages without clear word boundaries (like Japanese, Chinese, or Thai). Unlike the BPE tokenizer we just created, SentencePiece treats the input as a raw stream of Unicode characters without relying on language-specific pre-processing.
Key advantages of SentencePiece include:
- It works directly on raw text without requiring language-specific segmentation
- It treats whitespace as a regular character, making it suitable for languages without spaces
- It maintains reversibility, allowing perfect reconstruction of the original text
- It supports both Unigram and BPE algorithms under the same framework
For this project, we'll use the Unigram model, which differs from BPE by scoring candidate segmentations probabilistically rather than applying deterministic merge rules. The Unigram model starts with a large candidate vocabulary and iteratively prunes it to maximize the likelihood of the training corpus, which helps it keep frequent domain-specific terms as single tokens.
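Before the actual training code, a toy sketch can make that probabilistic idea concrete: each candidate segmentation is scored by the product of its pieces' unigram probabilities, and the highest-scoring split wins. The pieces and probabilities below are invented purely for illustration - this is not how SentencePiece estimates or stores its model.

import math

# Made-up unigram probabilities for a handful of candidate pieces.
piece_prob = {"sublease": 0.004, "sub": 0.02, "lease": 0.03, "s": 0.05, "ub": 0.001}

# Three ways to segment the word "sublease".
candidates = [
    ["sublease"],           # keep the whole word as one piece
    ["sub", "lease"],       # split into two common pieces
    ["s", "ub", "lease"],   # a heavily fragmented split
]

def log_likelihood(pieces):
    # Log of the product of the piece probabilities.
    return sum(math.log(piece_prob[p]) for p in pieces)

for c in candidates:
    print(c, round(log_likelihood(c), 2))
print("chosen:", max(candidates, key=log_likelihood))

Under these toy numbers the whole-word segmentation wins, which is exactly the behavior we want for frequent domain terms. With that intuition in place, here is the actual training code: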
import sentencepiece as spm

# Write the corpus (defined earlier) to a plain-text file, one sample per line
with open("data/legal_sp.txt", "w", encoding="utf-8") as f:
    for line in corpus:
        f.write(line + "\n")

# Train the tokenizer; the .model and .vocab files are written under artifacts/
spm.SentencePieceTrainer.train(
    input="data/legal_sp.txt",
    model_prefix="artifacts/legal_sp",
    vocab_size=800,
    character_coverage=1.0,  # lower for non-Latin heavy corpora (e.g., 0.9995 for CJK)
    model_type="unigram",    # or "bpe"
    bos_id=1, eos_id=2, unk_id=0, pad_id=3,
)
This step implements a SentencePiece tokenizer training process for the legal text. Let’s break it down comprehensively:
First, the code imports the SentencePiece library: import sentencepiece as spm
Next, it creates a text file to store the corpus data for training:
- It opens a file named "data/legal_sp.txt" in write mode with UTF-8 encoding
- It iterates through each line in the corpus (a collection of legal text samples defined earlier)
- Each line is written to the file with a newline character appended
Then comes the core of the code - training the SentencePiece tokenizer. The SentencePieceTrainer.train() method is called with several parameters:
- input="data/legal_sp.txt": specifies the input text file containing the corpus
- model_prefix="artifacts/legal_sp": sets the prefix for the output model files (the trainer will create .model and .vocab files)
- vocab_size=800: limits the vocabulary to 800 tokens - a modest size appropriate for this demo
- character_coverage=1.0: ensures all characters in the corpus are covered; as the comment notes, this should be lower (e.g., 0.9995) for corpora heavy in non-Latin scripts such as Chinese, Japanese, or Korean
- model_type="unigram": uses the Unigram algorithm (the probabilistic approach) rather than BPE, which the comment lists as the alternative
- bos_id=1, eos_id=2, unk_id=0, pad_id=3: assigns the special token IDs - unknown (0), beginning-of-sequence (1), end-of-sequence (2), and padding (3) (the short check after this list verifies these assignments)
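As a quick follow-up to the parameter list above, here is a minimal sketch, assuming training has already completed and written its files under artifacts/, that confirms both output files exist and that the special token IDs came out as configured:

import os
import sentencepiece as spm

# The trainer writes <model_prefix>.model and <model_prefix>.vocab
for path in ["artifacts/legal_sp.model", "artifacts/legal_sp.vocab"]:
    print(path, "exists:", os.path.exists(path))

sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
print("vocab size:", sp.get_piece_size())  # expected: 800
print("unk/bos/eos/pad ids:", sp.unk_id(), sp.bos_id(), sp.eos_id(), sp.pad_id())  # expected: 0 1 2 3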
This setup suits legal text well for the same reasons listed earlier: SentencePiece works directly on the raw Unicode stream without language-specific pre-processing, treats whitespace as a regular character, stays fully reversible, and the probabilistic Unigram model can keep frequent specialized terms as whole pieces.
The result will be a trained SentencePiece tokenizer that can efficiently tokenize legal text while preserving domain-specific terms.
Check splits:
sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
print(sp.encode("WHEREAS, the Parties amend the MSA.", out_type=str))
Here's a detailed breakdown:
First, the code creates a SentencePiece processor object by loading a previously trained model:
sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
The SentencePieceProcessor is initialized with the model file "artifacts/legal_sp.model" that was created during the training process. This model contains all the vocabulary and rules needed for tokenization.
Next, the code encodes a sample legal text and prints the result:
print(sp.encode("WHEREAS, the Parties amend the MSA.", out_type=str))
This line does several things:
- It takes a legal phrase "WHEREAS, the Parties amend the MSA." which contains domain-specific terminology (like "WHEREAS" and "MSA" which are common in legal documents)
- The encode() method converts this text into tokens according to the SentencePiece model
- The out_type=str parameter specifies that the output should be a list of string tokens rather than their numerical IDs
- The result is printed to allow visual inspection of how the tokenizer splits the legal text
This is a crucial validation step to check whether the tokenizer properly handles domain-specific terminology. For example, ideally terms like "MSA" (which likely stands for Master Service Agreement in this context) would be preserved as single tokens rather than being split into individual letters.
The code serves as a "quick sanity check" to evaluate how well the trained SentencePiece model handles legal terminology and formatting, which is part of the broader process of evaluating tokenizer quality.
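Beyond eyeballing the printed pieces, a few more checks are easy to script. The sketch below reuses the same model file and (i) confirms the encode/decode round trip, (ii) asks whether "MSA" and "WHEREAS" are stored as single word-initial pieces (piece_to_id returns the unknown ID for out-of-vocabulary pieces), and (iii) uses the Unigram model's sampling mode to show that alternative segmentations also carry probability mass. Whether the domain terms actually survive as single pieces depends on the corpus and the 800-token budget, so treat the expected results as assumptions to verify rather than guarantees.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file="artifacts/legal_sp.model")
text = "WHEREAS, the Parties amend the MSA."

# (i) Reversibility: decoding the encoded IDs should reproduce the input
ids = sp.encode(text)
print("round-trip ok:", sp.decode(ids) == text)

# (ii) Are the domain terms single pieces? "▁" marks a word-initial piece.
for piece in ["▁MSA", "▁WHEREAS"]:
    print(piece, "in vocab:", sp.piece_to_id(piece) != sp.unk_id())

# (iii) Unigram models can sample alternative segmentations (subword regularization)
for _ in range(3):
    print(sp.encode(text, out_type=str, enable_sampling=True, alpha=0.1, nbest_size=-1))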