Under the Hood of Large Language Models

Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)

8. Plug Into a Small Model (sanity run)

Use your tokenizer with a tiny model (e.g., DistilGPT-2) to ensure round-trip encoding/decoding works properly. This step verifies that your tokenizer functions correctly in a real model context. The round-trip test encodes text into tokens, passes those tokens through the model pipeline, and decodes them back to text, confirming that information flows correctly through the tokenization step.

For true training from scratch with a custom tokenizer, you would need to align the model's embedding layer dimensions to match your tokenizer's vocabulary size. This means initializing the model with an embedding matrix of shape [vocab_size, embedding_dimension], where vocab_size matches the number of tokens in your custom tokenizer. Without this alignment, the model would expect a different vocabulary size than what your tokenizer provides, resulting in index errors or undefined behavior during training or inference.

from transformers import AutoModelForCausalLM, AutoTokenizer

# Using our fast BPE in a transformers-friendly wrapper:
from transformers import PreTrainedTokenizerFast
tok = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe_v2.json")
tok.pad_token = "[PAD]"; tok.bos_token = "[BOS]"; tok.eos_token = "[EOS]"; tok.unk_token = "[UNK]"

# Quick encode/decode round trip
ex = tok("WHEREAS, the Parties amend the MSA.", return_tensors="pt")
print(ex["input_ids"])
print(tok.decode(ex["input_ids"][0]))

Here's a breakdown of this step, which demonstrates how to use a custom BPE tokenizer with the Hugging Face transformers library:

The code begins by importing necessary libraries from the transformers package:

from transformers import AutoModelForCausalLM, AutoTokenizer

Although AutoModelForCausalLM and AutoTokenizer are imported, neither is used in this snippet itself. AutoModelForCausalLM is what you would use to load a small language model to pair with the tokenizer, as in the sanity-run sketch at the end of this step.

The code then imports the PreTrainedTokenizerFast class, which serves as a wrapper to make custom tokenizers compatible with the transformers library:

from transformers import PreTrainedTokenizerFast

Next, it loads the previously saved custom legal BPE tokenizer by specifying the path to the saved tokenizer file:

tok = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe_v2.json")

The code then sets special tokens for the tokenizer, which are essential for proper functioning with transformer models:

tok.pad_token = "[PAD]"; tok.bos_token = "[BOS]"; tok.eos_token = "[EOS]"; tok.unk_token = "[UNK]"

These special tokens serve specific purposes:

  • [PAD]: Used for padding sequences to a uniform length
  • [BOS]: Marks the beginning of a sequence
  • [EOS]: Marks the end of a sequence
  • [UNK]: Represents unknown tokens not in the vocabulary
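
Note that assigning tok.pad_token = "[PAD]" and friends only works cleanly if those strings already exist in the trained vocabulary; otherwise the attributes refer to strings the tokenizer cannot actually map to IDs. A quick check, sketched here under the assumption that legal_bpe_v2.json was trained with these four strings registered as special tokens, makes that explicit:

# Sanity-check the special tokens: each should map to an ID in the vocab.
for name in ["[PAD]", "[BOS]", "[EOS]", "[UNK]"]:
    print(name, "->", tok.convert_tokens_to_ids(name))

# If any are missing, register them explicitly. add_special_tokens returns
# how many strings were newly added; a non-zero result means the model's
# embedding matrix must later be resized to len(tok).
added = tok.add_special_tokens({
    "pad_token": "[PAD]", "bos_token": "[BOS]",
    "eos_token": "[EOS]", "unk_token": "[UNK]",
})
print("newly added special tokens:", added)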

Finally, the code performs a round-trip test of the tokenizer with a legal text sample:

ex = tok("WHEREAS, the Parties amend the MSA.", return_tensors="pt")
print(ex["input_ids"])
print(tok.decode(ex["input_ids"][0]))

This test:

  • Encodes the legal text "WHEREAS, the Parties amend the MSA." into token IDs, returning PyTorch tensors (return_tensors="pt")
  • Prints the resulting input_ids (the numerical representation of tokens)
  • Decodes the first sequence of input_ids back to text to verify the round-trip works correctly

This "sanity check" ensures the tokenizer correctly processes domain-specific legal terminology (like "MSA") and can be integrated with transformer models for further fine-tuning or inference tasks.

When training from scratch, size the embedding matrix to len(tok) and make sure the special token IDs line up. When fine-tuning, you usually stick with the base model's tokenizer unless your domain truly demands a custom one. Both paths are sketched below.
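
To make the "plug into a small model" part concrete, here is a minimal sketch of the sanity run with DistilGPT-2. It assumes the imports and the tok and ex objects from above, follows the fine-tuning-style path (load the pretrained model, then resize its embedding matrix to len(tok)), and only checks shapes and indexing; because the pretrained embedding rows were learned against GPT-2's own vocabulary, the outputs are not meaningful text.

import torch

# Load a tiny pretrained causal LM and resize its embedding matrix so that
# every ID the custom tokenizer can produce has a corresponding row.
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
model.resize_token_embeddings(len(tok))
model.config.pad_token_id = tok.pad_token_id

# Forward pass on the round-trip example: a tokenizer/model vocabulary
# mismatch would surface here as an index error.
with torch.no_grad():
    out = model(input_ids=ex["input_ids"], attention_mask=ex["attention_mask"])
print("logits shape:", out.logits.shape)  # (batch, seq_len, len(tok))

# For true from-scratch training you would skip the pretrained weights and
# build a config whose vocab_size already matches the tokenizer. The
# hyperparameters below are illustrative only, not tuned values.
from transformers import GPT2Config, GPT2LMHeadModel
scratch_model = GPT2LMHeadModel(GPT2Config(
    vocab_size=len(tok), n_embd=256, n_layer=4, n_head=4,
    bos_token_id=tok.bos_token_id, eos_token_id=tok.eos_token_id,
))
print("scratch embedding shape:", scratch_model.transformer.wte.weight.shape)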
