Chapter 3: Attention and the Rise of Transformers
Practical Exercises for Chapter 3
The following practical exercises reinforce the key concepts covered in Chapter 3, including the challenges with earlier architectures, self-attention, multi-head attention, and sparse attention. Each exercise is accompanied by a detailed solution and code examples to deepen your understanding.
Exercise 1: Simulating Challenges with RNNs
Task: Create a simple RNN using PyTorch to demonstrate the difficulty of handling long-range dependencies.
Steps:
- Implement an RNN for sequence processing.
- Generate a synthetic dataset with long sequences.
- Observe how the RNN struggles to capture long-term dependencies.
Solution:
import torch
import torch.nn as nn
# Define a simple RNN model
class SimpleRNN(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(SimpleRNN, self).__init__()
        self.rnn = nn.RNN(input_size, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        out, _ = self.rnn(x)
        out = self.fc(out[:, -1, :])  # Use the last timestep
        return out
# Parameters
input_size = 10   # Number of input features per timestep
hidden_size = 20
output_size = 1
sequence_length = 100
batch_size = 32
# Generate synthetic dataset
X = torch.randn(batch_size, sequence_length, input_size)
y = torch.randint(0, 2, (batch_size, 1), dtype=torch.float32) # Binary labels
# Initialize and train the model
model = SimpleRNN(input_size, hidden_size, output_size)
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Training loop
for epoch in range(10):
    optimizer.zero_grad()
    outputs = model(X)
    loss = criterion(outputs, y)
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")
Exercise 2: Implementing Self-Attention
Task: Write a Python function to compute self-attention for a sequence of tokens using NumPy.
Solution:
import numpy as np
def self_attention(X, W_Q, W_K, W_V):
    """
    Compute self-attention for a sequence.
    X: Input sequence (n_tokens, d_model)
    W_Q, W_K, W_V: Weight matrices for Query, Key, Value
    """
    Q = np.dot(X, W_Q)  # Compute Queries
    K = np.dot(X, W_K)  # Compute Keys
    V = np.dot(X, W_V)  # Compute Values
    # Calculate scaled dot-product attention
    scores = np.dot(Q, K.T) / np.sqrt(K.shape[1])
    weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
    output = np.dot(weights, V)
    return output, weights
# Example inputs
X = np.array([[1, 0], [0, 1], [1, 1]]) # Input sequence
W_Q = np.array([[0.1, 0.3], [0.5, 0.7]]) # Query weights
W_K = np.array([[0.2, 0.4], [0.6, 0.8]]) # Key weights
W_V = np.array([[0.1, 0.5], [0.3, 0.7]]) # Value weights
output, weights = self_attention(X, W_Q, W_K, W_V)
print("Self-Attention Weights:\n", weights)
print("Self-Attention Output:\n", output)
Exercise 3: Multi-Head Attention
Task: Implement a simplified multi-head attention mechanism using NumPy.
Solution:
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads):
    """
    Compute multi-head attention.
    X: Input sequence (n_tokens, d_model)
    W_Q, W_K, W_V: Weight matrices for Query, Key, Value
    W_O: Output projection matrix
    n_heads: Number of attention heads
    """
    head_dim = W_Q.shape[1] // n_heads
    outputs = []
    for i in range(n_heads):
        Q = np.dot(X, W_Q[:, i*head_dim:(i+1)*head_dim])
        K = np.dot(X, W_K[:, i*head_dim:(i+1)*head_dim])
        V = np.dot(X, W_V[:, i*head_dim:(i+1)*head_dim])
        scores = np.dot(Q, K.T) / np.sqrt(head_dim)
        weights = np.exp(scores) / np.sum(np.exp(scores), axis=-1, keepdims=True)
        output = np.dot(weights, V)
        outputs.append(output)
    concatenated = np.concatenate(outputs, axis=-1)
    final_output = np.dot(concatenated, W_O)
    return final_output
# Example parameters
n_heads = 2
X = np.array([[1, 0], [0, 1], [1, 1]]) # Input sequence
W_Q = np.random.rand(2, 4)  # Query weights (d_model = 2, n_heads * head_dim = 4)
W_K = np.random.rand(2, 4) # Key weights
W_V = np.random.rand(2, 4) # Value weights
W_O = np.random.rand(4, 2) # Output projection weights
# Compute multi-head attention
output = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads)
print("Multi-Head Attention Output:\n", output)
Exercise 4: Sparse Attention
Task: Implement a sparse attention mechanism using a custom mask to limit token interactions.
Solution:
def sparse_attention(Q, K, V, sparsity_mask):
    """
    Compute sparse attention.
    Q: Queries
    K: Keys
    V: Values
    sparsity_mask: Binary mask defining allowable token interactions
    """
    d_k = Q.shape[-1]  # Dimension of keys
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Compute scaled dot-product
    # Apply the sparsity mask: disallowed positions get a large negative score so they
    # receive (near-)zero weight after the softmax. Simply multiplying the scores by the
    # mask would leave masked positions at 0, which still attracts weight exp(0) = 1.
    sparse_scores = np.where(sparsity_mask == 1, scores, -1e9)
    weights = np.exp(sparse_scores) / np.sum(np.exp(sparse_scores), axis=-1, keepdims=True)  # Softmax
    output = np.dot(weights, V)  # Weighted sum of values
    return output, weights
# Example inputs
Q = np.array([[1, 0], [0, 1], [1, 1]]) # Query
K = np.array([[1, 0], [0, 1], [1, 1]]) # Keys
V = np.array([[0.5, 1.0], [0.2, 0.8], [0.9, 0.3]]) # Values
# Sparsity mask (local attention pattern)
sparsity_mask = np.array([
[1, 1, 0], # Token 1 attends to Token 1, 2
[1, 1, 1], # Token 2 attends to all
[0, 1, 1] # Token 3 attends to Token 2, 3
])
output, weights = sparse_attention(Q, K, V, sparsity_mask)
print("Sparse Attention Weights:\n", weights)
print("Sparse Attention Output:\n", output)
These exercises guide you through the practical implementation of key concepts like self-attention, multi-head attention, and sparse attention. Completing them will deepen your understanding of how attention mechanisms address the challenges of earlier architectures and enable the scalability and efficiency of Transformer models.