Under the Hood of Large Language Models

Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)

7. Save, Load, and Version

Always version your artifacts so experiments are reproducible. This means:

  • Assigning unique version numbers or identifiers to each tokenizer you create
  • Documenting the training parameters, corpus statistics, and preprocessing steps used
  • Storing both the tokenizer model and its metadata in a consistent location
  • Using version control systems (like Git) to track changes to your tokenizer code and configuration
  • Creating comprehensive changelogs that explain modifications between versions

This practice ensures that you can recreate exact experimental conditions later, compare results fairly across different tokenizer versions, and share your work in a form that others can reliably build upon.

import json
import os

os.makedirs("artifacts", exist_ok=True)  # make sure the output directory exists

# Save the trained BPE tokenizer (bpe_tok2 from the previous step) to a JSON file
bpe_tok2.save("artifacts/legal_bpe_v2.json")

# Record the training configuration alongside the tokenizer
meta = {
    "vocab_size": 900,
    "normalization": "NFKC",
    "pretokenizer": "Whitespace",
    "domain": "legal",
    "notes": "Augmented with user vocab to protect key terms."
}
with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)

This step demonstrates how to save a trained BPE tokenizer and its metadata for future use. Here's a detailed breakdown:

Saving the Tokenizer Model

The save call writes the trained BPE tokenizer (bpe_tok2) to a JSON file:

bpe_tok2.save("artifacts/legal_bpe_v2.json")

This creates a file containing all the necessary information to reconstruct the tokenizer, including vocabulary, merges, and configuration.
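
Because the file is a self-contained JSON serialization, it can be reloaded later without retraining. Here is a minimal sketch, assuming the Hugging Face tokenizers library (which the .save()/JSON format used in this project suggests):

from tokenizers import Tokenizer

# Reload the serialized tokenizer from disk
loaded = Tokenizer.from_file("artifacts/legal_bpe_v2.json")

# Sanity check: the reloaded tokenizer should tokenize exactly like bpe_tok2
print(loaded.encode("The lessee shall indemnify the lessor.").tokens)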

Creating Metadata

The code then creates a Python dictionary containing important metadata about the tokenizer:

  • vocab_size: 900 - The size of the vocabulary used during training
  • normalization: "NFKC" - The Unicode normalization method applied
  • pretokenizer: "Whitespace" - The pre-tokenization strategy used
  • domain: "legal" - The specific domain this tokenizer was trained for
  • notes: "Augmented with user vocab to protect key terms" - Additional information about how the tokenizer was trained

Saving the Metadata

Finally, the code saves this metadata to a separate JSON file:

with open("artifacts/legal_bpe_v2.meta.json", "w") as f:
    json.dump(meta, f, indent=2)

This uses Python's json module to write the metadata dictionary to a file with human-readable formatting (indent=2).
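
When you later need this context (for example, to compare tokenizer versions), the metadata can be read back with json.load. A minimal sketch:

import json

# Load the metadata that was written alongside the tokenizer
with open("artifacts/legal_bpe_v2.meta.json") as f:
    meta = json.load(f)

print(meta["vocab_size"], meta["domain"])  # -> 900 legal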

Purpose and Importance

This versioning approach is crucial for:

  • Reproducibility: Anyone can recreate the exact tokenizer with the same parameters
  • Documentation: The metadata provides context about how the tokenizer was created
  • Tracking: The "v2" in the filename suggests this is the second version, enabling proper version control
  • Provenance: The notes field explains that this tokenizer was specially augmented with user vocabulary

This practice follows the guidance at the top of this section: always version your artifacts so experiments are reproducible.
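
To make the habit concrete, the save-plus-metadata pattern can be wrapped in a small helper that keys both files to a single version tag. This is a sketch rather than part of the project code; the helper name save_versioned_tokenizer is hypothetical, and it assumes the Hugging Face tokenizers library:

import json
import os
from tokenizers import Tokenizer

def save_versioned_tokenizer(tok: Tokenizer, name: str, version: str,
                             meta: dict, out_dir: str = "artifacts") -> str:
    """Save a tokenizer and its metadata side by side under one version tag."""
    os.makedirs(out_dir, exist_ok=True)
    base = os.path.join(out_dir, f"{name}_{version}")
    tok.save(base + ".json")
    with open(base + ".meta.json", "w") as f:
        json.dump({**meta, "version": version}, f, indent=2)
    return base

# Usage (bpe_tok2 and meta come from the code above):
# save_versioned_tokenizer(bpe_tok2, "legal_bpe", "v2", meta)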
