Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
4. Wrap Your Tokenizer for Transformers
This step integrates your custom tokenizer with Hugging Face's transformer-based models by wrapping it in a standardized interface. The PreTrainedTokenizerFast class provides the consistent API that transformer models expect, handling encoding, decoding, padding, and special-token management.
This compatibility layer means you can plug your domain-specific tokenizer into the standard Transformers training and inference pipelines without writing custom glue code. It also gives the tokenizer the batching, padding, and truncation support needed for efficient training and inference (a batched example follows the walkthrough below). Note that a model pre-trained on a different vocabulary cannot reuse your token IDs as-is: its embedding layer must be resized and retrained to match the new vocabulary.
from transformers import PreTrainedTokenizerFast

# Wrap the trained BPE tokenizer file in the standard Transformers interface
fast_bpe = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe.json")

# Register the special tokens; these strings are assumed to already be in the
# vocabulary because they were passed as special_tokens during BPE training
fast_bpe.pad_token = "[PAD]"
fast_bpe.bos_token = "[BOS]"
fast_bpe.eos_token = "[EOS]"
fast_bpe.unk_token = "[UNK]"

# Encode a sample sentence as PyTorch tensors
sample_ids = fast_bpe("This agreement remains in full force.", return_tensors="pt")
print(sample_ids["input_ids"], sample_ids["attention_mask"])
This step demonstrates how to wrap a custom BPE tokenizer for use with Hugging Face Transformers. The process involves:
- Importing the necessary class: the code imports PreTrainedTokenizerFast from the transformers library, which provides a standardized interface for tokenizers.
- Loading the custom tokenizer: it initializes a PreTrainedTokenizerFast instance from a previously saved tokenizer file ("artifacts/legal_bpe.json"). This file contains the vocabulary and merge rules learned during BPE training.
- Setting special tokens: the code assigns the tokens used for padding, beginning-of-sequence, end-of-sequence, and unknown inputs. Transformer models need these special tokens to handle sequences properly.
- Testing the tokenizer on a sample text: it tokenizes the phrase "This agreement remains in full force." The return_tensors="pt" parameter returns the output as PyTorch tensors, the format transformer models expect. The result includes both input_ids (the token IDs) and attention_mask (which marks real tokens versus padding).
- Printing the results: the final line prints the input_ids and attention_mask tensors, a quick visual check that the tokenizer works correctly.
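To exercise the batching, padding, and truncation support mentioned earlier, here is a minimal sketch; the two sample sentences are illustrative, and it assumes the [PAD] token registered above exists in the trained vocabulary:

batch = fast_bpe(
    ["The lessee shall remit payment within thirty days.",
     "Force majeure applies."],
    padding=True,        # pad the shorter sequence to the longest in the batch
    truncation=True,     # cut off anything longer than max_length
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # torch.Size([2, longest_sequence_length])
print(batch["attention_mask"])    # trailing 0s mark the padded positions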
For SentencePiece models you have two options: package the model as a tokenizer.json and load it with AutoTokenizer.from_pretrained, or pass the .model file directly to a SentencePiece-backed tokenizer class such as T5Tokenizer or XLNetTokenizer. For quick use:
from transformers import T5Tokenizer

# T5Tokenizer is backed by SentencePiece and can load the .model file directly
# (requires the sentencepiece package to be installed)
sp_tok = T5Tokenizer(vocab_file="artifacts/legal_sp.model")
print(sp_tok("Pursuant to Section 2.3"))
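If you prefer the AutoTokenizer route mentioned above, one option is to save the wrapped tokenizer and reload it through the generic factory; this is a sketch, and the output directory name is illustrative:

from transformers import AutoTokenizer

# Writes tokenizer.json plus the config files AutoTokenizer needs
fast_bpe.save_pretrained("artifacts/legal_tokenizer")

# Reload without naming a specific tokenizer class
auto_tok = AutoTokenizer.from_pretrained("artifacts/legal_tokenizer")
print(auto_tok("Pursuant to Section 2.3"))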