Chapter 2: Tokenization and Embeddings
Practical Exercises – Chapter 2
The following exercises are designed to help you solidify your understanding of tokenization and embeddings.
Exercise 1 – Explore BPE Tokenization
Task:
Train a simple Byte Pair Encoding (BPE) tokenizer on the small corpus:
["low", "lowest", "lower", "newest"]
Encode the word “lowering” and check the resulting tokens.
Solution:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
# Initialize BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=50, min_frequency=2)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# Train on corpus
corpus = ["low", "lowest", "lower", "newest"]
tokenizer.train_from_iterator(corpus, trainer)
# Encode a word
encoded = tokenizer.encode("lowering")
print(encoded.tokens)
# The exact tokens depend on the learned merges (the output will typically start with 'low');
# characters unseen in training, such as 'i' and 'g', are dropped because no unk_token is set
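As an optional follow-up (not part of the original exercise), you can inspect the vocabulary the trainer produced to see which merges were actually learned on this tiny corpus. The sketch below uses only the standard get_vocab() method of the tokenizers library:

# Inspect the learned vocabulary: single characters plus any merged subwords
vocab = tokenizer.get_vocab()
for token, idx in sorted(vocab.items(), key=lambda kv: kv[1]):
    print(idx, token)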
Exercise 2 – Compare WordPiece Tokenization
Task:
Use BERT’s WordPiece tokenizer to tokenize the word “unhappiness”.
What tokens do you get?
Solution:
from transformers import BertTokenizer
# Load pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokens = tokenizer.tokenize("unhappiness")
print(tokens)
# The word is split into 'un' plus one or more '##' continuation pieces;
# the exact split depends on BERT's WordPiece vocabulary
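If you also want to see how these pieces map to vocabulary ids, and how the full encoding frames them for the model, a small optional sketch using standard BertTokenizer methods is:

# Map the WordPiece tokens to their vocabulary ids
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)
# A full encoding additionally wraps the pieces in the special [CLS] and [SEP] tokens
encoded = tokenizer("unhappiness")
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))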
Exercise 3 – Train a SentencePiece Tokenizer
Task:
Train a SentencePiece tokenizer on a toy dataset containing:
I am a student
I am learning AI
Then tokenize the sentence “I am a student”.
Solution:
import sentencepiece as spm
# Save toy corpus
with open("toy.txt", "w") as f:
f.write("I am a student\n")
f.write("I am learning AI\n")
# Train a SentencePiece model (for a corpus this small, lower vocab_size if training reports the value is too high)
spm.SentencePieceTrainer.train(input="toy.txt", model_prefix="mymodel", vocab_size=50)
# Load trained tokenizer
sp = spm.SentencePieceProcessor(model_file="mymodel.model")
print(sp.encode("I am a student", out_type=str))
Exercise 4 – Train a Domain-Specific Tokenizer
Task:
Create a BPE tokenizer for a legal domain corpus and check how it tokenizes the sentence “The plaintiff filed a motion.”
Solution:
from tokenizers import Tokenizer, models, trainers, pre_tokenizers
corpus = [
"The plaintiff hereby files a motion to dismiss.",
"The defendant shall pay damages as determined by the court."
]
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=100, min_frequency=1)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
tokenizer.train_from_iterator(corpus, trainer)
encoded = tokenizer.encode("The plaintiff filed a motion.")
print(encoded.tokens)
# Expected: 'plaintiff' and 'motion' appear as whole tokens (the exact segmentation depends on the
# learned merges); 'filed', which never occurs in the training corpus, is split into smaller pieces
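To see why a domain-specific tokenizer matters, you can compare it against a general-purpose one. As an illustrative sketch (assuming the transformers library and the bert-base-uncased checkpoint are available), a generic WordPiece vocabulary may split rarer legal terms into several pieces:

from transformers import AutoTokenizer

# General-purpose tokenizer trained on broad, non-legal text
generic = AutoTokenizer.from_pretrained("bert-base-uncased")
print(generic.tokenize("The plaintiff filed a motion."))
# Compare the number and granularity of pieces with the domain-trained tokenizer above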
Exercise 5 – Subword Embeddings with BERT
Task:
Use BERT to extract contextual embeddings for the word “playground”. Observe how the tokenizer segments the input (the word may stay whole or be split into subwords, and [CLS] and [SEP] are added) and how each resulting token receives its own vector.
Solution:
from transformers import AutoTokenizer, AutoModel
import torch
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
text = "playground"
inputs = tokenizer(text, return_tensors="pt")
outputs = model(**inputs)
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
embeddings = outputs.last_hidden_state
for token, vector in zip(tokens, embeddings[0]):
    print(f"{token}: {vector[:5].detach().numpy()} ...")
Exercise 6 – Character-Level Embeddings
Task:
Create a character-level embedding layer in PyTorch and use it to embed the word “play”.
Solution:
import torch
import torch.nn as nn
chars = list("abcdefghijklmnopqrstuvwxyz")
char2idx = {c: i for i, c in enumerate(chars)}
embedding_dim = 8
embedding = nn.Embedding(len(chars), embedding_dim)
word = "play"
indices = torch.tensor([char2idx[c] for c in word])
vectors = embedding(indices)
print(vectors) # 4 vectors, one per character
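To turn the per-character vectors into a single word representation, one minimal option (a sketch; real models usually run a CNN or LSTM over the characters instead) is mean pooling:

# Average the character vectors to obtain one fixed-size vector for the word
word_vec = vectors.mean(dim=0)
print(word_vec.shape)  # torch.Size([8])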
Exercise 7 – Multimodal Embeddings with CLIP
Task:
Use CLIP to compare an image of a dog with two candidate captions:
- “a photo of a dog”
- “a photo of a cat”
Which one gets the higher probability?
Solution:
from transformers import CLIPProcessor, CLIPModel
from PIL import Image
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("dog.jpg") # Replace with a real image file
inputs = processor(text=["a photo of a dog", "a photo of a cat"], images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
print(probs) # Higher score should correspond to "a photo of a dog"
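Beyond the ready-made probabilities, CLIP also exposes the underlying embeddings. The following optional sketch uses CLIPModel's standard get_image_features and get_text_features methods to compute the image-caption cosine similarities yourself:

import torch

# Extract the raw image and text embeddings from the same processed inputs
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

# Normalize, then compute cosine similarity between the image and each caption
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # The higher value corresponds to the closer caption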
Summary of Learning Goals
By completing these exercises, you have:
- Practiced BPE, WordPiece, and SentencePiece tokenization.
- Built a custom tokenizer for domain-specific data.
- Extracted subword embeddings from BERT.
- Implemented character-level embeddings.
- Explored multimodal embeddings with CLIP.