Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
7. Save, Load, and Version
Always version your artifacts so experiments are reproducible. This means:
- Assigning unique version numbers or identifiers to each tokenizer you create
- Documenting the training parameters, corpus statistics, and preprocessing steps used
- Storing both the tokenizer model and its metadata in a consistent location
- Using version control systems (like Git) to track changes to your tokenizer code and configuration
- Creating comprehensive changelogs that explain modifications between versions
This practice ensures that you can recreate exact experimental conditions later, compare results fairly across tokenizer versions, and share your work so that others can reliably build on it.
import json
import os

# Make sure the output directory exists before saving
os.makedirs("artifacts", exist_ok=True)

# Save the trained BPE tokenizer (vocabulary, merges, and configuration) to one JSON file
bpe_tok2.save("artifacts/legal_bpe_v2.json")

# Record how this version was produced, next to the model artifact
meta = {
    "vocab_size": 900,
    "normalization": "NFKC",
    "pretokenizer": "Whitespace",
    "domain": "legal",
    "notes": "Augmented with user vocab to protect key terms."
}
with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
This step demonstrates how to save a trained BPE tokenizer and its metadata for future use. Here's a detailed breakdown:
Saving the Tokenizer Model
The save call writes the trained BPE tokenizer (bpe_tok2) to a JSON file:
bpe_tok2.save("artifacts/legal_bpe_v2.json")
This creates a file containing all the information needed to reconstruct the tokenizer, including the vocabulary, merges, and configuration.
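The section heading also covers loading, so here is a minimal sketch of reloading that file later; it assumes bpe_tok2 is a Tokenizer from the Hugging Face tokenizers library, as used in the earlier training steps:
from tokenizers import Tokenizer

# Rebuild the exact tokenizer from the saved JSON artifact
reloaded_tok = Tokenizer.from_file("artifacts/legal_bpe_v2.json")
Because the vocabulary, merges, normalizer, and pre-tokenizer all live in the single JSON file, no retraining or extra configuration is needed at load time.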
Creating Metadata
The code then creates a Python dictionary containing important metadata about the tokenizer:
- vocab_size: 900 - The size of the vocabulary used during training
- normalization: "NFKC" - The Unicode normalization method applied
- pretokenizer: "Whitespace" - The pre-tokenization strategy used
- domain: "legal" - The specific domain this tokenizer was trained for
- notes: "Augmented with user vocab to protect key terms" - Additional information about how the tokenizer was trained
Saving the Metadata
Finally, the code saves this metadata to a separate JSON file:
with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)
This uses Python's json module to write the metadata dictionary to a file with readable formatting (indent=2).
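Reading the metadata back is just as simple with json.load; a short sketch for when you want to inspect or compare versions programmatically:
import json

with open("artifacts/legal_bpe_v2.meta.json") as f:
    loaded_meta = json.load(f)

print(loaded_meta["domain"], loaded_meta["vocab_size"])  # legal 900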
Purpose and Importance
This versioning approach is crucial for:
- Reproducibility: Anyone can recreate the exact tokenizer with the same parameters
- Documentation: The metadata provides context about how the tokenizer was created
- Tracking: The "v2" suffix in the filename marks this as the second version, so earlier versions can be kept and compared side by side
- Provenance: The notes field explains that this tokenizer was specially augmented with user vocabulary
This practice follows the guidance at the top of this section: "Always version your artifacts so experiments are reproducible."
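As a final sanity check, it is worth confirming that the reloaded tokenizer reproduces the in-memory one exactly. A minimal sketch, using a hypothetical sample sentence:
from tokenizers import Tokenizer

sample = "The lessee shall indemnify the lessor against all claims."

# Tokenize with the live tokenizer and with the one reloaded from disk
original_tokens = bpe_tok2.encode(sample).tokens
reloaded_tokens = Tokenizer.from_file("artifacts/legal_bpe_v2.json").encode(sample).tokens

assert original_tokens == reloaded_tokens, "Reloaded tokenizer diverges from the original"
print(reloaded_tokens)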