Project 2: Train a Custom Domain-Specific Tokenizer (e.g., for legal or medical texts)
4. Wrap Your Tokenizer for Transformers
This step integrates your custom tokenizer with Hugging Face's transformer-based models by wrapping it in a standardized interface. The PreTrainedTokenizerFast class provides the consistent API that transformer models expect, handling encoding, decoding, padding, and special-token management.
This compatibility layer means you can plug your domain-specific tokenizer into the standard Transformers training and inference pipelines without writing custom glue code. It also gives the tokenizer the batching, padding, and truncation support needed for efficient training and inference (a batched example follows the walkthrough below). Note that a model pre-trained on a different vocabulary cannot reuse your token IDs as-is: its embedding layer must be resized and retrained to match the new vocabulary.
from transformers import PreTrainedTokenizerFast

# Wrap the trained BPE tokenizer file in the standard Transformers interface
fast_bpe = PreTrainedTokenizerFast(tokenizer_file="artifacts/legal_bpe.json")

# Register the special tokens; these strings are assumed to already be in the
# vocabulary because they were passed as special_tokens during BPE training
fast_bpe.pad_token = "[PAD]"
fast_bpe.bos_token = "[BOS]"
fast_bpe.eos_token = "[EOS]"
fast_bpe.unk_token = "[UNK]"

# Encode a sample sentence as PyTorch tensors
sample_ids = fast_bpe("This agreement remains in full force.", return_tensors="pt")
print(sample_ids["input_ids"], sample_ids["attention_mask"])
This step demonstrates how to wrap a custom BPE tokenizer for use with Hugging Face Transformers. The process involves:
- Importing the necessary class: the code imports PreTrainedTokenizerFast from the transformers library, which provides a standardized interface for tokenizers.
- Loading the custom tokenizer: it initializes a PreTrainedTokenizerFast instance from a previously saved tokenizer file ("artifacts/legal_bpe.json"). This file contains the vocabulary and merge rules learned during BPE training.
- Setting special tokens: the code assigns the tokens used for padding, beginning-of-sequence, end-of-sequence, and unknown inputs. Transformer models need these special tokens to handle sequences properly.
- Testing the tokenizer on a sample text: it tokenizes the phrase "This agreement remains in full force." The return_tensors="pt" parameter returns the output as PyTorch tensors, the format transformer models expect. The result includes both input_ids (the token IDs) and attention_mask (which marks real tokens versus padding).
- Printing the results: the final line prints the input_ids and attention_mask tensors, a quick visual check that the tokenizer works correctly.
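To exercise the batching, padding, and truncation support mentioned earlier, here is a minimal sketch; the two sample sentences are illustrative, and it assumes the [PAD] token registered above exists in the trained vocabulary:

batch = fast_bpe(
    ["The lessee shall remit payment within thirty days.",
     "Force majeure applies."],
    padding=True,        # pad the shorter sequence to the longest in the batch
    truncation=True,     # cut off anything longer than max_length
    max_length=32,
    return_tensors="pt",
)
print(batch["input_ids"].shape)   # torch.Size([2, longest_sequence_length])
print(batch["attention_mask"])    # trailing 0s mark the padded positions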
For SentencePiece models you have two options: package the model as a tokenizer.json and load it with AutoTokenizer.from_pretrained, or pass the .model file directly to a SentencePiece-backed tokenizer class such as T5Tokenizer or XLNetTokenizer. For quick use:
from transformers import T5Tokenizer

# T5Tokenizer is backed by SentencePiece and can load the .model file directly
# (requires the sentencepiece package to be installed)
sp_tok = T5Tokenizer(vocab_file="artifacts/legal_sp.model")
print(sp_tok("Pursuant to Section 2.3"))
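If you prefer the AutoTokenizer route mentioned above, one option is to save the wrapped tokenizer and reload it through the generic factory; this is a sketch, and the output directory name is illustrative:

from transformers import AutoTokenizer

# Writes tokenizer.json plus the config files AutoTokenizer needs
fast_bpe.save_pretrained("artifacts/legal_tokenizer")

# Reload without naming a specific tokenizer class
auto_tok = AutoTokenizer.from_pretrained("artifacts/legal_tokenizer")
print(auto_tok("Pursuant to Section 2.3"))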