Under the Hood of Large Language Models

Chapter 2: Tokenization and Embeddings

Practical Exercises – Chapter 2

The following exercises are designed to help you solidify your understanding of tokenization and embeddings.

Exercise 1 – Explore BPE Tokenization

Task:

Train a simple Byte Pair Encoding (BPE) tokenizer on the small corpus:

["low", "lowest", "lower", "newest"]

Encode the word “lowering” and check the resulting tokens.

Solution:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

# Initialize BPE tokenizer
tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=50, min_frequency=2)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on corpus
corpus = ["low", "lowest", "lower", "newest"]
tokenizer.train_from_iterator(corpus, trainer)

# Encode a word
encoded = tokenizer.encode("lowering")
print(encoded.tokens)  # e.g. ['lowe', 'r', 'n']; the exact pieces depend on the learned merges.
# Note: 'i' and 'g' never appear in the training corpus, so they are dropped because no unk_token is set.
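
As an optional follow-up, you can inspect the vocabulary the trainer learned and save the tokenizer for reuse. This is a minimal sketch; the file name bpe_tokenizer.json is arbitrary.

# Inspect the learned vocabulary (token -> id mapping) and its size
vocab = tokenizer.get_vocab()
print(len(vocab), sorted(vocab, key=vocab.get)[:10])

# Persist the tokenizer to a single JSON file and reload it later
tokenizer.save("bpe_tokenizer.json")
reloaded = Tokenizer.from_file("bpe_tokenizer.json")
print(reloaded.encode("lowering").tokens)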

Exercise 2 – Compare WordPiece Tokenization

Task:

Use BERT’s WordPiece tokenizer to tokenize the word “unhappiness”.

What tokens do you get?

Solution:

from transformers import BertTokenizer

# Load pre-trained tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

tokens = tokenizer.tokenize("unhappiness")
print(tokens)  # e.g. ['un', '##happiness']; the exact split depends on the WordPiece vocabulary
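
For contrast, a word that is already in BERT's vocabulary should stay whole, while rarer words break into more pieces. A small optional check; the exact outputs depend on the bert-base-uncased vocabulary.

# A common word usually stays intact; a rarer word is split into '##' continuation pieces
print(tokenizer.tokenize("happiness"))
print(tokenizer.tokenize("unbelievably"))

# Map the subword tokens back to their vocabulary ids
ids = tokenizer.convert_tokens_to_ids(tokens)
print(list(zip(tokens, ids)))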

Exercise 3 – Train a SentencePiece Tokenizer

Task:

Train a SentencePiece tokenizer on a toy dataset containing:

I am a student
I am learning AI

Then tokenize the sentence “I am a student”.

Solution:

import sentencepiece as spm

# Save toy corpus
with open("toy.txt", "w") as f:
    f.write("I am a student\n")
    f.write("I am learning AI\n")

# Train SentencePiece model. On a corpus this tiny, vocab_size=50 may exceed the number
# of derivable pieces; lower it if SentencePiece complains that the vocabulary size is too high.
spm.SentencePieceTrainer.train(input="toy.txt", model_prefix="mymodel", vocab_size=50)

# Load trained tokenizer
sp = spm.SentencePieceProcessor(model_file="mymodel.model")
print(sp.encode("I am a student", out_type=str))
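
You can also round-trip the sentence through integer piece ids to confirm that SentencePiece encoding is reversible. A small optional check using the model trained above.

# Encode to piece ids instead of strings, then decode back to text
ids = sp.encode("I am a student", out_type=int)
print(ids)
print(sp.decode(ids))  # should reconstruct "I am a student"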

Exercise 4 – Train a Domain-Specific Tokenizer

Task:

Create a BPE tokenizer for a legal domain corpus and check how it tokenizes the sentence “The plaintiff filed a motion.”

Solution:

from tokenizers import Tokenizer, models, trainers, pre_tokenizers

corpus = [
    "The plaintiff hereby files a motion to dismiss.",
    "The defendant shall pay damages as determined by the court."
]

tokenizer = Tokenizer(models.BPE())
trainer = trainers.BpeTrainer(vocab_size=100, min_frequency=1)
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

tokenizer.train_from_iterator(corpus, trainer)

encoded = tokenizer.encode("The plaintiff filed a motion.")
print(encoded.tokens)  # Expected: 'plaintiff' and 'motion' as whole tokens
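
To see why the domain-specific vocabulary matters, compare the same sentence under a general-purpose tokenizer. An optional comparison; how aggressively bert-base-uncased fragments legal terms depends on its vocabulary.

from transformers import AutoTokenizer

# A general-purpose tokenizer may break rarer legal terms into more pieces
general = AutoTokenizer.from_pretrained("bert-base-uncased")
print(general.tokenize("The plaintiff filed a motion."))
print(encoded.tokens)  # the domain-trained tokenizer from above, for comparison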

Exercise 5 – Subword Embeddings with BERT

Task:

Use BERT to extract embeddings for the word “playground”. Observe whether the tokenizer splits it into subwords and how each resulting token receives its own contextual vector.

Solution:

from transformers import AutoTokenizer, AutoModel
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

text = "playground"
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # inference only; no gradients needed
    outputs = model(**inputs)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])  # includes [CLS] and [SEP]
embeddings = outputs.last_hidden_state

for token, vector in zip(tokens, embeddings[0]):
    print(f"{token}: {vector[:5].detach().numpy()} ...")

Exercise 6 – Character-Level Embeddings

Task:

Create a character-level embedding system in PyTorch and embed the word “play”.

Solution:

import torch
import torch.nn as nn

chars = list("abcdefghijklmnopqrstuvwxyz")
char2idx = {c: i for i, c in enumerate(chars)}

embedding_dim = 8
embedding = nn.Embedding(len(chars), embedding_dim)

word = "play"
indices = torch.tensor([char2idx[c] for c in word])
vectors = embedding(indices)

print(vectors)  # 4 vectors, one per character
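
In practice the per-character vectors are combined into a single word representation, most simply by averaging (an RNN or CNN over the characters is another common choice). A minimal sketch of the pooling step.

# Average the character vectors into one vector for the word
word_vector = vectors.mean(dim=0)
print(word_vector.shape)  # torch.Size([8]): one 8-dimensional vector for "play"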

Exercise 7 – Multimodal Embeddings with CLIP

Task:

Use CLIP to compare an image of a dog with two candidate captions:

  • “a photo of a dog”
  • “a photo of a cat”

Which one gets the higher probability?

Solution:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog.jpg")  # Replace with a real image file
inputs = processor(text=["a photo of a dog", "a photo of a cat"], images=image, return_tensors="pt", padding=True)

outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)

print(probs)  # Higher score should correspond to "a photo of a dog"
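
CLIP also exposes the image and text embeddings directly, so you can compute the similarities yourself instead of reading them off logits_per_image. An optional sketch reusing the inputs prepared above.

import torch

# Extract the embeddings and compare them with cosine similarity
with torch.no_grad():
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])

image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
print(image_emb @ text_emb.T)  # higher similarity should go to "a photo of a dog"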

Summary of Learning Goals

By completing these exercises, you have:

  • Practiced BPE, WordPiece, and SentencePiece tokenization.
  • Built a custom tokenizer for domain-specific data.
  • Extracted subword embeddings from BERT.
  • Implemented character-level embeddings.
  • Explored multimodal embeddings with CLIP.
