Chapter 4: The Transformer Architecture
Practical Exercises for Chapter 4
These practical exercises are designed to reinforce your understanding of the core concepts discussed in Chapter 4, including the foundational principles of the Transformer architecture, its components, and comparisons with traditional architectures. Each exercise includes a worked solution with code so you can gain hands-on experience.
Exercise 1: Understanding Positional Encoding
Task: Write a Python function to generate positional encodings for a sequence of length n and embedding dimension d_model. Visualize the positional encoding values for a sequence length of 10 and an embedding dimension of 16.
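Recall the sinusoidal positional encoding introduced in the original Transformer paper, which the solution below implements: even embedding dimensions receive a sine term and odd dimensions a cosine term.

PE(pos, 2i)   = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))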
Solution:
import numpy as np
import matplotlib.pyplot as plt
def positional_encoding(sequence_length, d_model):
    """
    Generate positional encoding for a sequence.
    sequence_length: Length of the sequence
    d_model: Dimensionality of embeddings
    """
    pos = np.arange(sequence_length)[:, np.newaxis]  # Positions
    i = np.arange(d_model)[np.newaxis, :]  # Embedding dimensions
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / d_model)
    angle_rads = pos * angle_rates
    # Apply sine to even indices, cosine to odd indices
    pos_encoding = np.zeros_like(angle_rads)
    pos_encoding[:, 0::2] = np.sin(angle_rads[:, 0::2])
    pos_encoding[:, 1::2] = np.cos(angle_rads[:, 1::2])
    return pos_encoding
# Generate positional encoding
sequence_length = 10
d_model = 16
pos_encoding = positional_encoding(sequence_length, d_model)
# Visualize the positional encoding
plt.figure(figsize=(10, 6))
plt.imshow(pos_encoding, cmap='viridis')
plt.colorbar(label='Encoding Value')
plt.title('Positional Encoding Visualization')
plt.xlabel('Embedding Dimension')
plt.ylabel('Token Position')
plt.show()
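As a brief follow-up (not part of the original exercise), the sketch below shows how these encodings are typically used: they are added element-wise to the token embeddings before the first Transformer layer. The random embedding matrix here is purely illustrative.

# Illustrative only: random values stand in for a learned token-embedding matrix.
rng = np.random.default_rng(0)
token_embeddings = rng.normal(size=(sequence_length, d_model))
model_input = token_embeddings + pos_encoding  # Element-wise sum, shape (10, 16)
print("Combined input shape:", model_input.shape)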
Exercise 2: Scaled Dot-Product Attention
Task: Implement a function for scaled dot-product attention and apply it to a small dataset. Print the attention weights and output.
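The solution follows the standard formulation of scaled dot-product attention,

Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V

where d_k is the dimensionality of the keys; dividing by sqrt(d_k) keeps the dot products from growing too large before the softmax.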
Solution:
import numpy as np
def scaled_dot_product_attention(Q, K, V):
    """
    Compute scaled dot-product attention.
    Q: Queries
    K: Keys
    V: Values
    """
    d_k = Q.shape[-1]  # Dimension of keys
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Scaled dot product
    scores = scores - np.max(scores, axis=-1, keepdims=True)  # Stabilize the softmax (does not change the result)
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)  # Softmax
    output = np.dot(weights, V)  # Weighted sum of values
    return output, weights
# Example inputs
Q = np.array([[1, 0, 1]])
K = np.array([[1, 0, 1], [0, 1, 0], [1, 1, 0]])
V = np.array([[0.5, 1.0], [0.2, 0.8], [0.9, 0.3]])
output, weights = scaled_dot_product_attention(Q, K, V)
print("Attention Weights:\n", weights)
print("Attention Output:\n", output)
Expected Output (values are approximate):
Attention Weights:
 [[0.5329 0.1679 0.2992]]
Attention Output:
 [[0.5693 0.7570]]
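To see where these numbers come from: Q K^T = [2, 0, 1], and dividing by sqrt(d_k) = sqrt(3) ≈ 1.732 gives scores ≈ [1.155, 0, 0.577]. The softmax of these scores is ≈ [0.533, 0.168, 0.299], and the output is this weighted combination of the rows of V.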
Exercise 3: Comparing RNN and Transformer Outputs
Task: Create a simple RNN and a Transformer model. Use both models to process the same input sequence and compare their outputs. For simplicity, use PyTorch.
Solution:
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer
# Define a simple RNN
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        return self.fc(out[:, -1, :])
# RNN parameters
input_size = 10
hidden_size = 20
output_size = 10
sequence_length = 5
batch_size = 1
# Initialize and process input with RNN
rnn_model = SimpleRNN(input_size, hidden_size, output_size)
rnn_input = torch.randn(batch_size, sequence_length, input_size)
rnn_output = rnn_model(rnn_input)
print("RNN Output Shape:", rnn_output.shape)
# Transformer: Use pre-trained BERT
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert_model = BertModel.from_pretrained("bert-base-uncased")
# Input for Transformer
text = "The cat sat on the mat."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():  # Inference only; no gradients needed
    bert_output = bert_model(**inputs)
print("Transformer Output Shape:", bert_output.last_hidden_state.shape)
Exercise 4: Encoder-Decoder Interaction
Task: Simulate an encoder-decoder interaction by implementing simple encoder and decoder components. Pass data through both and print the final output.
Solution:
import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim):
        super(Encoder, self).__init__()
        self.fc = nn.Linear(input_dim, hidden_dim)

    def forward(self, x):
        return torch.relu(self.fc(x))

class Decoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim):
        super(Decoder, self).__init__()
        # Project the decoder input to the hidden dimension so it can be combined
        # with the encoder output (the two have different sizes otherwise).
        self.input_proj = nn.Linear(input_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x, encoder_output):
        combined = self.input_proj(x) + encoder_output  # Simple additive interaction
        return torch.sigmoid(self.fc(combined))
# Encoder-Decoder parameters
input_dim = 10
hidden_dim = 20
output_dim = 5
sequence_length = 6
# Initialize models
encoder = Encoder(input_dim, hidden_dim)
decoder = Decoder(input_dim, hidden_dim, output_dim)
# Dummy input
x = torch.randn(sequence_length, input_dim)
encoder_output = encoder(x)
decoder_output = decoder(x, encoder_output)
print("Encoder Output Shape:", encoder_output.shape)
print("Decoder Output Shape:", decoder_output.shape)
These exercises provide hands-on experience with the concepts covered in Chapter 4, such as positional encoding, attention mechanisms, and encoder-decoder interaction. By completing these tasks, you’ll gain a deeper understanding of the Transformer architecture and its advantages over traditional sequence models.