Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
7. Save, Load, and Version
Always version your artifacts so experiments are reproducible. This means:
- Assigning unique version numbers or identifiers to each tokenizer you create
- Documenting the training parameters, corpus statistics, and preprocessing steps used
- Storing both the tokenizer model and its metadata in a consistent location
- Using version control systems (like Git) to track changes to your tokenizer code and configuration
- Creating comprehensive changelogs that explain modifications between versions
This practice ensures that you can recreate exact experimental conditions later, compare results fairly across tokenizer versions, and share your work so that others can reliably build on it.
import json
import os

# Make sure the output directory exists before saving
os.makedirs("artifacts", exist_ok=True)

# Save the trained BPE tokenizer (vocabulary, merges, and configuration) to one JSON file
bpe_tok2.save("artifacts/legal_bpe_v2.json")

# Record how this version was produced, next to the model artifact
meta = {
    "vocab_size": 900,
    "normalization": "NFKC",
    "pretokenizer": "Whitespace",
    "domain": "legal",
    "notes": "Augmented with user vocab to protect key terms."
}
with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
This step demonstrates how to save a trained BPE tokenizer and its metadata for future use. Here's a detailed breakdown:
Saving the Tokenizer Model
The save call writes the trained BPE tokenizer (bpe_tok2) to a JSON file:
bpe_tok2.save("artifacts/legal_bpe_v2.json")
This creates a file containing all the information needed to reconstruct the tokenizer, including the vocabulary, merges, and configuration.
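The section heading also covers loading, so here is a minimal sketch of reloading that file later; it assumes bpe_tok2 is a Tokenizer from the Hugging Face tokenizers library, as used in the earlier training steps:
from tokenizers import Tokenizer

# Rebuild the exact tokenizer from the saved JSON artifact
reloaded_tok = Tokenizer.from_file("artifacts/legal_bpe_v2.json")
Because the vocabulary, merges, normalizer, and pre-tokenizer all live in the single JSON file, no retraining or extra configuration is needed at load time.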
Creating Metadata
The code then creates a Python dictionary containing important metadata about the tokenizer:
- vocab_size: 900 - The size of the vocabulary used during training
- normalization: "NFKC" - The Unicode normalization method applied
- pretokenizer: "Whitespace" - The pre-tokenization strategy used
- domain: "legal" - The specific domain this tokenizer was trained for
- notes: "Augmented with user vocab to protect key terms" - Additional information about how the tokenizer was trained
Saving the Metadata
Finally, the code saves this metadata to a separate JSON file:
with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
This uses Python's json module to write the metadata dictionary to a file with readable formatting (indent=2).
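Reading the metadata back is just as simple with json.load; a short sketch for when you want to inspect or compare versions programmatically:
import json

with open("artifacts/legal_bpe_v2.meta.json") as f:
    loaded_meta = json.load(f)

print(loaded_meta["domain"], loaded_meta["vocab_size"])  # legal 900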
Purpose and Importance
This versioning approach is crucial for:
- Reproducibility: Anyone can recreate the exact tokenizer with the same parameters
- Documentation: The metadata provides context about how the tokenizer was created
- Tracking: The "v2" suffix in the filename marks this as the second version, so earlier versions can be kept and compared side by side
- Provenance: The notes field explains that this tokenizer was specially augmented with user vocabulary
This practice follows the guidance at the top of this section: "Always version your artifacts so experiments are reproducible."
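As a final sanity check, it is worth confirming that the reloaded tokenizer reproduces the in-memory one exactly. A minimal sketch, using a hypothetical sample sentence:
from tokenizers import Tokenizer

sample = "The lessee shall indemnify the lessor against all claims."

# Tokenize with the live tokenizer and with the one reloaded from disk
original_tokens = bpe_tok2.encode(sample).tokens
reloaded_tokens = Tokenizer.from_file("artifacts/legal_bpe_v2.json").encode(sample).tokens

assert original_tokens == reloaded_tokens, "Reloaded tokenizer diverges from the original"
print(reloaded_tokens)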