Chapter 3: Attention and the Rise of Transformers
3.3 Self-Attention and Multi-Head Attention
Building on the foundation of attention mechanisms, self-attention emerged as a groundbreaking innovation in Natural Language Processing (NLP). This revolutionary approach transformed how models process input sequences by introducing a mechanism where each element in a sequence can directly interact with every other element. This direct interaction enables models to process input sequences with unprecedented efficiency and context-awareness, eliminating the traditional bottlenecks of sequential processing.
Self-attention achieves this by allowing each token to query and attend to all other tokens in the sequence simultaneously. For example, when processing the sentence "The cat sat on the mat," each word can directly assess its relationship with every other word, helping the model understand both local relationships (like "the cat") and longer-range dependencies (connecting "cat" at the start of the sentence with "mat" at the end).
When combined with multi-head attention, this capability becomes even more powerful. Multi-head attention allows the model to maintain multiple different attention patterns simultaneously, each focusing on different aspects of the relationships between tokens. This multi-faceted approach serves as the cornerstone of Transformer models, empowering them to capture complex relationships between tokens in a sequence from multiple perspectives simultaneously.
In this section, we'll explore self-attention and its extension to multi-head attention, examining how these mechanisms work and why they are pivotal in Transformer architectures. We'll dive deep into their internal workings, from the mathematical foundations to practical implementations, and demonstrate their effectiveness through concrete examples and code implementations. This detailed exploration will clarify not just their implementation but also their practical utility in modern NLP applications.
3.3.1 What Is Self-Attention?
In self-attention, every token in an input sequence attends to all other tokens (including itself) to compute a new representation. This revolutionary mechanism works by creating dynamic connections between all elements in a sequence. For instance, when processing a sentence, each word maintains awareness of every other word through attention weights that determine how much influence each word should have on the current word's representation. These weights are learned during training and adapt based on the context and task at hand.
To illustrate this concept more concretely, consider the sentence "The cat chased the mouse." In this example, when processing the word "chased," the self-attention mechanism simultaneously considers all words in the sentence:
- It strongly attends to "cat" as the subject performing the action
- It maintains strong attention to "mouse" as the object receiving the action
- It may give less attention to articles like "the" which contribute less to the semantic meaning
This parallel processing allows the model to construct a rich, contextual understanding of each word's role in the sentence.
Unlike traditional attention mechanisms, which typically work with two separate sequences (like in machine translation where a word in English attends to words in French), self-attention operates entirely within a single sequence. This internal focus represents a significant advancement in natural language processing. When translating between languages, traditional attention might help align words across languages, but self-attention helps understand the intricate relationships within each language first.
The power of this internal focus becomes particularly evident when dealing with complex linguistic phenomena:
- Long-distance dependencies (e.g., "The cat, which had a brown collar, chased the mouse")
- Coreference resolution (understanding that "it" refers to "the cat")
- Semantic role labeling (identifying who did what to whom)
- Syntactic structure understanding (grasping the grammatical relationships between words)
This architectural design makes self-attention particularly effective for tasks that require deep understanding of language structure and meaning, such as parsing, sentiment analysis, and question answering. By allowing each element to directly interact with every other element, the model can build sophisticated representations that capture both local and global contexts within the input sequence.
How It Works:
- Input Representation: Each token (word or subword) in the sequence is first converted into a numerical vector through an embedding process. These vectors typically have hundreds of dimensions and capture semantic relationships between words. For example, similar words like "cat" and "kitten" will have vectors that are close to each other in this high-dimensional space.
- Query, Key, and Value Creation: The model transforms each token's initial vector into three distinct vectors through learned linear transformations:
- Query vector (Q): Acts like a search query, representing what information the current token is looking for in the sequence
- Key vector (K): Functions like a label or index, helping other tokens find this token when relevant
- Value vector (V): Contains the actual meaningful information that will be used in the final representation
- Attention Score Computation: The model computes attention scores by taking the dot product between each query and all keys. This creates a matrix of scores where each entry (i,j) represents how relevant token j is to token i. The scores are then scaled by dividing by the square root of the key dimension to prevent the dot products from growing too large, which helps maintain stable gradients during training.
- Weight Normalization: The attention scores are converted into probabilities using the softmax function. This ensures all weights for a given token sum to 1 and creates a proper probability distribution. When processing a word like "ate" in "The hungry cat ate fish", the model might assign higher weights to relevant context words like "cat" (0.6) and "fish" (0.3), and lower weights to less important words like "the" (0.02).
- Output Computation: The final representation for each token is computed as a weighted sum of all value vectors, using the normalized attention weights. This process allows each token to gather information from all other tokens in the sequence, weighted by relevance. The resulting representations are context-aware and can capture both local grammatical structure and long-range dependencies, enabling the model to understand relationships between words even when they're far apart in the text.
3.3.2 Mathematics of Self-Attention
For an input sequence of token embeddings X = [x_1, x_2, \dots, x_n]:
- Compute Q = XW_Q, K = XW_K, and V = XW_V, where W_Q, W_K, W_V are learnable weight matrices.
- Calculate attention scores:
\text{Scores} = \frac{Q K^\top}{\sqrt{d_k}}
Here, d_k is the dimension of the key vectors.
- Normalize scores with softmax:
\text{Weights} = \text{softmax}\left(\text{Scores}\right)
- Compute the output:
\text{Output} = \text{Weights} \cdot V
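To make these steps concrete, here is a tiny worked example with made-up numbers (two tokens, d_k = 2). The Q, K, and V values are purely illustrative, chosen only to show how the scores, weights, and outputs relate:
import numpy as np

# Two tokens with illustrative query/key/value vectors (d_k = 2, d_v = 2)
Q = np.array([[1.0, 0.0],
              [0.0, 1.0]])
K = np.array([[1.0, 0.0],
              [1.0, 1.0]])
V = np.array([[1.0, 2.0],
              [3.0, 4.0]])

scores = Q @ K.T / np.sqrt(2)                                           # shape (2, 2)
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)   # row-wise softmax
output = weights @ V                                                    # weighted sum of values

print(scores)   # [[0.707 0.707] [0.    0.707]]
print(weights)  # rows sum to 1: [[0.5  0.5 ] [0.33 0.67]]
print(output)   # [[2.   3.  ] [2.34 3.34]]
Each row of weights sums to 1, and each output row is a mixture of the value vectors, weighted by how strongly that token attends to each position.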
Example: Implementing Self-Attention
Let’s implement self-attention for a simple sequence.
Code Example: Self-Attention in NumPy
import numpy as np
def self_attention(X, W_Q, W_K, W_V, mask=None):
"""
Compute self-attention for a sequence with optional masking.
Parameters:
-----------
X: np.ndarray
Input sequence of shape (n_tokens, d_model)
W_Q, W_K, W_V: np.ndarray
Weight matrices for Query, Key, Value transformations
mask: np.ndarray, optional
Attention mask of shape (n_tokens, n_tokens)
Returns:
--------
output: np.ndarray
        Attended sequence of shape (n_tokens, d_v)
weights: np.ndarray
Attention weights of shape (n_tokens, n_tokens)
"""
# Linear transformations
Q = np.dot(X, W_Q) # Shape: (n_tokens, d_k)
K = np.dot(X, W_K) # Shape: (n_tokens, d_k)
V = np.dot(X, W_V) # Shape: (n_tokens, d_v)
# Calculate scaled dot-product attention
d_k = K.shape[1]
scores = np.dot(Q, K.T) / np.sqrt(d_k) # Shape: (n_tokens, n_tokens)
# Apply mask if provided
if mask is not None:
scores = scores * mask + -1e9 * (1 - mask)
# Softmax normalization
weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
weights /= np.sum(weights, axis=-1, keepdims=True)
# Compute weighted sum
output = np.dot(weights, V) # Shape: (n_tokens, d_v)
return output, weights
# Example usage with a more complex sequence
def create_example():
# Create sample sequence
X = np.array([
[1, 0, 0], # First token
[0, 1, 0], # Second token
[0, 0, 1], # Third token
[1, 1, 0] # Fourth token
])
# Create weight matrices
d_model = 3 # Input dimension
d_k = 2 # Key/Query dimension
d_v = 4 # Value dimension
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_v) * 0.1
# Create attention mask (optional)
mask = np.array([
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 0], # Last token masked for third position
[1, 1, 1, 1]
])
return X, W_Q, W_K, W_V, mask
# Run example
X, W_Q, W_K, W_V, mask = create_example()
output, weights = self_attention(X, W_Q, W_K, W_V, mask)
print("Input Shape:", X.shape)
print("\nQuery Matrix Shape:", W_Q.shape)
print("Key Matrix Shape:", W_K.shape)
print("Value Matrix Shape:", W_V.shape)
print("\nAttention Weights:\n", weights)
print("\nOutput Shape:", output.shape)
print("Output:\n", output)
Code Breakdown Explanation:
- Function Definition and Parameters:
- The function takes input sequence X and three weight matrices (W_Q, W_K, W_V)
- Added optional masking parameter for more control over attention
- Includes comprehensive docstring with parameter descriptions
- Linear Transformations:
- Converts input tokens into Query (Q), Key (K), and Value (V) representations
- Uses matrix multiplication (np.dot) for efficient computation
- Maintains proper shape transformations throughout
- Attention Score Computation:
- Implements scaled dot-product attention with proper scaling factor
- Includes masking functionality for selective attention
- Uses numerically stable softmax implementation
- Example Implementation:
- Creates a realistic example with 4 tokens and 3 features
- Demonstrates proper initialization of weight matrices
- Shows how to use optional masking
- Shape Information:
- Clearly documents tensor shapes throughout the process
- Helps understand the dimensional transformations
- Makes debugging easier
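For readers who prefer to verify the same computation with a deep learning framework, the short sketch below does roughly what the NumPy function above does, using PyTorch's built-in scaled dot-product attention. It assumes PyTorch 2.0 or later (for torch.nn.functional.scaled_dot_product_attention) and reuses the same toy shapes as the example above:
import torch
import torch.nn.functional as F

# Same toy shapes as the NumPy example: 4 tokens, d_model=3, d_k=2, d_v=4
X = torch.randn(4, 3)
W_Q = torch.randn(3, 2) * 0.1
W_K = torch.randn(3, 2) * 0.1
W_V = torch.randn(3, 4) * 0.1

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Scaling by sqrt(d_k) and the softmax are handled inside the call;
# unsqueeze adds the batch dimension expected by the function
output = F.scaled_dot_product_attention(
    Q.unsqueeze(0), K.unsqueeze(0), V.unsqueeze(0)
).squeeze(0)

print(output.shape)  # torch.Size([4, 4])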
3.3.3 What Is Multi-Head Attention?
Multi-head attention represents a sophisticated enhancement to the self-attention mechanism by running multiple parallel attention computations, called "heads." Each head operates independently and learns to focus on different aspects of the relationships between tokens in the sequence. For example, one head might learn to focus on syntactic relationships (like subject-verb agreement), while another might capture semantic relationships (like topic relevance), and yet another might detect long-range dependencies (like coreference resolution).
This parallel processing architecture provides several key advantages. First, it allows the model to simultaneously analyze the input sequence from multiple perspectives, much like how humans process language by considering multiple aspects at once. Second, by having multiple specialized attention mechanisms, the model can capture both fine-grained and broad patterns in the data. Finally, the diverse representations learned by different heads combine to create a richer, more nuanced understanding of the input sequence.
The outputs from all heads are ultimately combined through a concatenation operation followed by a linear transformation, allowing the model to synthesize these different perspectives into a cohesive representation. This multi-faceted approach significantly enhances the model's capacity to understand and process complex linguistic patterns, making it particularly effective for tasks requiring sophisticated language understanding.
Steps in Multi-Head Attention
- Split the input into multiple heads:
- Divide the input sequence into separate subspaces
- Each head receives a portion of the input's dimensionality
- This splitting allows parallel processing of different feature aspects
- Apply self-attention independently to each head:
- Each head computes its own Query (Q), Key (K), and Value (V) matrices
- Calculates attention scores using scaled dot-product attention
- Processes information focusing on different aspects of the input
- Concatenate the outputs of all heads:
- Combine the results from each attention head
- Preserves the unique patterns and relationships learned by each head
- Creates a comprehensive representation of the input sequence
- Apply a final linear transformation:
- Project the concatenated outputs to the desired dimension
- Integrates information from all heads into a cohesive representation
- Allows the model to weight the importance of different heads' outputs
Benefits of Multi-Head Attention
- Diverse Representations: Each attention head specializes in capturing different types of relationships within the data. For example, one head might focus on syntactic dependencies (like subject-verb agreement), while another might detect semantic relationships (like topic relevance), and yet another might identify long-range dependencies (like coreference resolution). This diversity allows the model to build a rich, multi-faceted understanding of the input.
- Improved Expressiveness: The model can focus on multiple aspects of the input simultaneously, similar to how humans process language. This parallel processing enables the model to:
- Capture both local and global context
- Process different semantic levels (word-level, phrase-level, sentence-level)
- Learn hierarchical relationships between tokens
- Combine different perspectives into a more comprehensive understanding
- Enhanced Learning Capacity: Multiple heads allow the model to distribute attention across different subspaces, effectively increasing its representational power without significantly increasing computational complexity (a quick numeric check of this point follows this list).
- Robust Feature Detection: By maintaining multiple independent attention mechanisms, the model becomes more robust as it doesn't rely on a single attention pattern, reducing the impact of noise or misleading patterns in the data.
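A quick way to check the "Enhanced Learning Capacity" point above is to count parameters: splitting the projection matrices across more heads does not add parameters, because each head simply works with a smaller slice of the same W_Q, W_K, and W_V. The snippet below is a standalone sanity check using a typical d_model of 512; it is not tied to any particular model:
d_model = 512

for n_heads in (1, 8, 16):
    head_dim = d_model // n_heads
    # Each head uses a (d_model, head_dim) slice of W_Q, W_K, and W_V
    params_per_projection = n_heads * d_model * head_dim  # always d_model * d_model
    print(f"{n_heads} heads -> head_dim={head_dim}, "
          f"parameters per projection matrix: {params_per_projection}")

# Every line reports 262,144 parameters per projection matrix, regardless of head count.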
Example: Multi-Head Attention
Let’s implement a simplified version of multi-head attention.
Code Example: Multi-Head Attention in NumPy
import numpy as np
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
"""
Compute multi-head attention with optional masking.
Parameters:
-----------
X: np.ndarray
Input sequence of shape (n_tokens, d_model)
W_Q, W_K, W_V: np.ndarray
Weight matrices for Query, Key, Value transformations
W_O: np.ndarray
Output projection matrix
n_heads: int
Number of attention heads
mask: np.ndarray, optional
Attention mask of shape (n_tokens, n_tokens)
Returns:
--------
final_output: np.ndarray
Transformed sequence of shape (n_tokens, d_model)
attention_weights: list
List of attention weights for each head
"""
d_model = X.shape[1]
head_dim = W_Q.shape[1] // n_heads
outputs = []
attention_weights = []
# Process each attention head
for i in range(n_heads):
# Split weights for current head
Q = np.dot(X, W_Q[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
K = np.dot(X, W_K[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
V = np.dot(X, W_V[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
# Compute attention scores
scores = np.dot(Q, K.T) / np.sqrt(head_dim) # (n_tokens, n_tokens)
# Apply mask if provided
if mask is not None:
scores = scores * mask + -1e9 * (1 - mask)
# Apply softmax
weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
weights = weights / np.sum(weights, axis=-1, keepdims=True)
# Compute weighted sum
output = np.dot(weights, V) # (n_tokens, head_dim)
outputs.append(output)
attention_weights.append(weights)
# Concatenate all heads
concatenated = np.concatenate(outputs, axis=-1) # (n_tokens, d_model)
# Final linear transformation
final_output = np.dot(concatenated, W_O) # (n_tokens, d_model)
return final_output, attention_weights
# Example usage with a more realistic sequence
def create_example_inputs(n_tokens=4, d_model=8, n_heads=2):
"""Create example inputs for multi-head attention."""
# Input sequence
X = np.random.randn(n_tokens, d_model)
# Weight matrices
head_dim = d_model // n_heads
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1
# Optional mask (causal attention)
mask = np.tril(np.ones((n_tokens, n_tokens)))
return X, W_Q, W_K, W_V, W_O, mask
# Run example
X, W_Q, W_K, W_V, W_O, mask = create_example_inputs()
output, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=2, mask=mask)
print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("\nAttention weights for first head:\n", weights[0])
print("\nAttention weights for second head:\n", weights[1])
Code Breakdown:
- Function Architecture
- Implements multi-head attention with comprehensive documentation
- Includes optional masking for causal attention
- Returns both outputs and attention weights for analysis
- Key Components
- Head Dimension Calculation: Splits input dimension across heads
- Per-Head Processing: Computes separate attention for each head
- Attention Mechanism: Implements scaled dot-product attention
- Output Aggregation: Concatenates and projects head outputs
- Enhanced Features
- Numerical Stability: Uses stable softmax implementation
- Masking Support: Allows for masked attention patterns
- Proper Scaling: Includes attention scaling factor
- Helper Functions
- create_example_inputs: Generates realistic test data
- Includes shape information and initialization logic
- Demonstrates proper usage patterns
- Output Analysis
- Prints shapes for verification
- Shows attention weights for interpretation
- Demonstrates the multi-head nature of attention
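The loop-based NumPy version is written for clarity; in practice, PyTorch provides the same operation as nn.MultiheadAttention. The sketch below shows roughly equivalent usage with the same toy shapes as the example above (4 tokens, d_model = 8, 2 heads) and a causal mask; it is a minimal illustration, not part of a larger model:
import torch
import torch.nn as nn

n_tokens, d_model, n_heads = 4, 8, 2
X = torch.randn(1, n_tokens, d_model)  # (batch, n_tokens, d_model)

mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

# Boolean causal mask: True marks positions a query may NOT attend to
causal_mask = torch.triu(torch.ones(n_tokens, n_tokens, dtype=torch.bool), diagonal=1)

# Self-attention: query, key, and value are all the same sequence
output, weights = mha(X, X, X, attn_mask=causal_mask, need_weights=True)

print(output.shape)   # torch.Size([1, 4, 8])
print(weights.shape)  # torch.Size([1, 4, 4]) -- averaged over heads by default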
3.3.4 Applications of Self-Attention and Multi-Head Attention
Text Summarization
Models leverage attention mechanisms in sophisticated ways to identify and prioritize the most important parts of a document. The attention mechanism works by assigning different weights to different parts of the input text, essentially creating a hierarchy of importance. These weights are learned during training and are dynamically adjusted based on the specific content being processed.
The attention weights serve as a sophisticated filtering mechanism that helps determine which sentences carry the most critical information. This process involves analyzing various linguistic features, including semantic relevance, syntactic structure, and contextual relationships between different parts of the text. The model can then create concise and meaningful summaries while preserving the core message and maintaining coherence.
For example, in news article summarization, the model employs a multi-layered approach to attention. It might attend strongly to key events (such as main actions or developments), significant quotes from relevant figures, and important statistical data that supports the main narrative. Meanwhile, it assigns lower attention weights to supplementary details, background information, or redundant content. This selective attention process mirrors human summarization behavior, where we naturally focus on crucial information while skimming over less important details.
Code Example: Text Summarization with Self-Attention
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttentionSummarizer(nn.Module):
def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, max_length=512):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.position_encoding = nn.Parameter(
torch.zeros(max_length, embed_dim)
)
self.multihead_attention = nn.MultiheadAttention(
embed_dim, num_heads, batch_first=True
)
self.layer_norm1 = nn.LayerNorm(embed_dim)
self.feed_forward = nn.Sequential(
nn.Linear(embed_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, embed_dim)
)
self.layer_norm2 = nn.LayerNorm(embed_dim)
self.output_layer = nn.Linear(embed_dim, vocab_size)
def forward(self, x, src_mask=None):
# Add positional encoding to embeddings
seq_length = x.size(1)
x = self.embedding(x) + self.position_encoding[:seq_length]
# Self-attention block
attention_output, attention_weights = self.multihead_attention(
x, x, x,
key_padding_mask=src_mask,
need_weights=True
)
x = self.layer_norm1(x + attention_output)
# Feed-forward block
ff_output = self.feed_forward(x)
x = self.layer_norm2(x + ff_output)
# Generate output probabilities
output = self.output_layer(x)
return output, attention_weights
def generate_summary(model, input_ids, tokenizer, max_length=150):
model.eval()
with torch.no_grad():
output, attention_weights = model(input_ids)
# Get most attended words for summary
attention_scores = attention_weights.mean(dim=1)
        top_scores = torch.topk(attention_scores.squeeze(), k=min(max_length, attention_scores.size(-1)))  # cap k at the sequence length
# Extract and arrange summary tokens
summary_indices = top_scores.indices.sort().values
summary_tokens = input_ids[0, summary_indices]
# Convert to text
summary = tokenizer.decode(summary_tokens)
return summary, attention_weights
# Example usage
def summarize_text(text, model, tokenizer):
# Tokenize input text
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Generate summary
summary, attention = generate_summary(
model,
inputs["input_ids"],
tokenizer
)
return summary, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based summarizer with multi-head self-attention
- Includes positional encoding for sequence awareness
- Uses layer normalization and residual connections for stable training
- Key Components
- Embedding Layer: Converts tokens to dense vectors
- Multi-Head Attention: Processes text from multiple perspectives
- Feed-Forward Network: Adds non-linearity and transforms representations
- Output Layer: Generates final token predictions
- Summarization Process
- Analyzes attention weights to identify important tokens
- Selects top-attended tokens for summary generation
- Maintains original order of selected tokens for coherence
- Advanced Features
- Supports variable length inputs with masking
- Implements efficient batch processing
- Returns attention weights for analysis and visualization
Usage Example:
# Example setup and usage
vocab_size = 30000
embed_dim = 512
num_heads = 8
hidden_dim = 2048
model = SelfAttentionSummarizer(
vocab_size=vocab_size,
embed_dim=embed_dim,
num_heads=num_heads,
hidden_dim=hidden_dim
)
# Example text
text = """
Climate change poses significant challenges to global ecosystems.
Rising temperatures affect wildlife habitats and agricultural productivity.
Scientists warn that immediate action is necessary to prevent irreversible damage.
"""
# Generate summary (assuming tokenizer is initialized)
summary, attention = summarize_text(text, model, tokenizer)
print("Summary:", summary)
Machine Translation
Attention mechanisms revolutionize machine translation by creating sophisticated dynamic alignments between words and phrases across languages. This process works by establishing weighted connections between elements in the source and target languages, allowing the model to understand complex linguistic relationships. For example, when translating from English to Japanese, the attention mechanism can handle the significant differences in sentence structure, where English follows Subject-Verb-Object order while Japanese typically uses Subject-Object-Verb order.
The mechanism is particularly powerful in handling three key translation challenges:
First, it manages complex word order variations between languages. For instance, when translating between English and German, where the verb position can vary significantly, the attention mechanism can maintain proper semantic relationships despite syntactic differences.
Second, it handles many-to-one and one-to-many word mappings effectively. For example, when translating the German compound word "Schadenfreude" to English, the mechanism can map it to the phrase "pleasure derived from another's misfortune," maintaining accurate meaning despite the structural difference.
Third, the model maintains contextual awareness across extended sentences through its ability to reference and weight the importance of different parts of the input sequence. This ensures that long sentences retain their meaning and coherence in translation, preventing common issues like losing track of subject-verb relationships or mishandling dependent clauses.
The attention mechanism achieves this by continuously updating its focus based on the current word being translated and its relationship to all other words in the sentence, ensuring that the final translation preserves both meaning and natural language flow.
Code Example: Neural Machine Translation with Self-Attention
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
class TranslationTransformer(nn.Module):
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048):
super().__init__()
# Embedding layers
self.src_embedding = nn.Embedding(src_vocab_size, d_model)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model)
# Transformer layers
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            batch_first=True  # inputs/outputs are (batch, seq, d_model)
        )
# Output projection
self.output_layer = nn.Linear(d_model, tgt_vocab_size)
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
# Create source embedding
src_embedded = self.positional_encoding(self.src_embedding(src))
# Create target embedding
tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt))
        # Generate a causal mask for the target if not provided; the encoder
        # should see the full source, so no causal mask is applied to src
        if tgt_mask is None:
            tgt_mask = self.generate_square_subsequent_mask(tgt.size(1))
# Pass through transformer
output = self.transformer(
src_embedded, tgt_embedded,
src_mask=src_mask,
tgt_mask=tgt_mask
)
# Project to vocabulary
return self.output_layer(output)
@staticmethod
def generate_square_subsequent_mask(sz):
mask = torch.triu(torch.ones(sz, sz), diagonal=1)
mask = mask.masked_fill(mask==1, float('-inf'))
return mask
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
(-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
return x + self.pe[:, :x.size(1)]
# Training function
def train_translation_model(model, train_loader, optimizer, criterion, num_epochs=10):
model.train()
for epoch in range(num_epochs):
total_loss = 0
for batch_idx, (src, tgt) in enumerate(train_loader):
optimizer.zero_grad()
# Forward pass
            output = model(src, tgt[:, :-1])  # exclude last target token (teacher forcing)
# Calculate loss
            loss = criterion(
                output.reshape(-1, output.size(-1)),
                tgt[:, 1:].reshape(-1)  # exclude first target token (BOS)
            )
# Backward pass
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f'Epoch: {epoch+1}, Average Loss: {avg_loss:.4f}')
Code Breakdown:
- Model Architecture
- Implements a complete Transformer-based translation model
- Uses both encoder and decoder with multi-head attention
- Includes positional encoding for sequence order awareness
- Key Components
- Source and Target Embeddings: Convert tokens to vectors
- Positional Encoding: Adds position information to embeddings
- Transformer Block: Processes sequences using self-attention
- Output Projection: Maps to target vocabulary
- Training Process
- Implements teacher forcing during training
- Uses masked attention for autoregressive generation
- Includes loss calculation and optimization steps
- Advanced Features
- Supports variable length sequences
- Implements efficient batch processing
- Includes mask generation for causal attention
Usage Example:
# Initialize model and training components
model = TranslationTransformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
d_model=512,
nhead=8
)
# Setup optimizer and criterion
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Example translation
def translate(model, src_sentence, src_tokenizer, tgt_tokenizer, max_len=50):
model.eval()
with torch.no_grad():
# Tokenize source sentence
src_tokens = src_tokenizer.encode(src_sentence)
src_tensor = torch.LongTensor(src_tokens).unsqueeze(0)
# Initialize target with BOS token
tgt_tokens = [tgt_tokenizer.bos_token_id]
# Generate translation
for _ in range(max_len):
tgt_tensor = torch.LongTensor(tgt_tokens).unsqueeze(0)
output = model(src_tensor, tgt_tensor)
next_token = output[0, -1].argmax().item()
if next_token == tgt_tokenizer.eos_token_id:
break
tgt_tokens.append(next_token)
# Convert tokens to text
translation = tgt_tokenizer.decode(tgt_tokens)
return translation
Question Answering
When processing questions, attention mechanisms employ a sophisticated approach to information processing. These mechanisms help models identify and focus on the specific parts of a passage that contain relevant information through a multi-step process:
First, the model analyzes the question to understand what type of information it needs to look for. Then, it creates attention weights for each word in the passage, giving higher weights to words and phrases that are more likely to contain the answer. This selective focus enables the model to efficiently extract answers while ignoring irrelevant content.
For instance, when answering "When did the event occur?", the model would primarily attend to temporal expressions (such as dates, times, and temporal phrases like "yesterday" or "last week") and their surrounding context in the passage. The attention weights would be highest for these temporal indicators and their immediate context, allowing the model to zero in on the most relevant information. This process is similar to how humans might scan a text for time-related words when looking for when something happened.
Code Example: Question Answering with Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class QATransformer(nn.Module):
def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
super().__init__()
# Embedding layers
self.embedding = nn.Embedding(vocab_size, d_model)
self.pos_encoder = PositionalEncoding(d_model)
# Multi-head attention layers
        self.question_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        # Cross-attention layer (batch_first to match the encoder outputs)
        self.cross_attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
# Output layers for start and end position prediction
self.start_predictor = nn.Linear(d_model, 1)
self.end_predictor = nn.Linear(d_model, 1)
def forward(self, question, context):
# Embed inputs
q_embed = self.pos_encoder(self.embedding(question))
c_embed = self.pos_encoder(self.embedding(context))
# Encode question and context
q_encoded = self.question_encoder(q_embed)
c_encoded = self.context_encoder(c_embed)
        # Cross-attention: context tokens (query) attend to the question (key/value),
        # so the resulting representations -- and the span logits below -- are
        # aligned with context positions, where the answer actually lives
        attn_output, attention_weights = self.cross_attention(
            c_encoded, q_encoded, q_encoded
        )
# Predict answer span
start_logits = self.start_predictor(attn_output).squeeze(-1)
end_logits = self.end_predictor(attn_output).squeeze(-1)
return start_logits, end_logits, attention_weights
def train_qa_model(model, train_loader, optimizer, num_epochs=10):
model.train()
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
for batch in train_loader:
question, context, start_pos, end_pos = batch
# Forward pass
start_logits, end_logits, _ = model(question, context)
# Calculate loss
start_loss = criterion(start_logits, start_pos)
end_loss = criterion(end_logits, end_pos)
loss = start_loss + end_loss
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
def predict_answer(model, tokenizer, question, context):
model.eval()
with torch.no_grad():
# Tokenize inputs
q_tokens = tokenizer.encode(question)
c_tokens = tokenizer.encode(context)
# Convert to tensors
q_tensor = torch.tensor(q_tokens).unsqueeze(0)
c_tensor = torch.tensor(c_tokens).unsqueeze(0)
# Get predictions
start_logits, end_logits, attention = model(q_tensor, c_tensor)
        # Find most likely answer span (drop the batch dimension first)
        start_logits = start_logits.squeeze(0)
        end_logits = end_logits.squeeze(0)
        start_idx = torch.argmax(start_logits).item()
        end_idx = torch.argmax(end_logits[start_idx:]).item() + start_idx
# Extract answer tokens
answer_tokens = c_tokens[start_idx:end_idx+1]
# Convert back to text
answer = tokenizer.decode(answer_tokens)
return answer, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based QA model with separate encoders for questions and context
- Uses multi-head self-attention for both question and context processing
- Includes cross-attention mechanism to relate questions to context
- Features span prediction for answer extraction
- Key Components
- Embedding Layer: Converts text tokens to dense vectors
- Positional Encoding: Adds position information to embeddings
- Question/Context Encoders: Process inputs using self-attention
- Cross-Attention: Relates question to context for answer finding
- Span Predictors: Locate answer boundaries in context
- Processing Flow
- Embeds and encodes question and context separately
- Applies cross-attention to find relevant context regions
- Predicts start and end positions of answer span
- Returns answer text and attention weights for analysis
Usage Example:
# Initialize model and components
model = QATransformer(
vocab_size=30000,
d_model=512,
nhead=8,
num_layers=6
)
# Example usage
question = "When was the first computer invented?"
context = "The first general-purpose electronic computer, ENIAC, was completed in 1945."
# Get answer
answer, attention_weights = predict_answer(
model, tokenizer, question, context
)
print(f"Question: {question}")
print(f"Answer: {answer}")
3.3.5 Key Takeaways
- Self-attention enables models to compute context-aware representations by attending to all tokens in a sequence. This means each word in a sentence can directly interact with every other word, allowing the model to understand complex relationships and dependencies. For example, in the sentence "The cat that chased the mouse was black", self-attention helps the model connect "was black" back to "cat" even though they're separated by several words.
- Multi-head attention enhances self-attention by capturing diverse relationships simultaneously. While a single attention head might focus on syntactic relationships, another might capture semantic similarities, and yet another might track temporal relationships. This multi-faceted approach allows the model to process information through multiple different "perspectives" at once, leading to richer and more nuanced understanding of the input.
- Together, these mechanisms are the foundation of Transformer architectures, allowing for parallelism and long-range dependency modeling. Unlike traditional sequential models that process words one at a time, Transformers can process all words simultaneously, dramatically improving computational efficiency. Additionally, because every token can attend to every other token directly, Transformers excel at capturing relationships between words that are far apart in the text, solving the long-standing challenge of modeling long-range dependencies in natural language processing.
3.3 Self-Attention and Multi-Head Attention
Building on the foundation of attention mechanisms, self-attention emerged as a groundbreaking innovation in Natural Language Processing (NLP). This revolutionary approach transformed how models process input sequences by introducing a mechanism where each element in a sequence can directly interact with every other element. This direct interaction enables models to process input sequences with unprecedented efficiency and context-awareness, eliminating the traditional bottlenecks of sequential processing.
Self-attention achieves this by allowing each token to query and attend to all other tokens in the sequence simultaneously. For example, when processing the sentence "The cat sat on the mat," each word can directly assess its relationship with every other word, helping the model understand both local relationships (like "the cat") and long-distance dependencies (connecting "cat" with "sat").
When combined with multi-head attention, this capability becomes even more powerful. Multi-head attention allows the model to maintain multiple different attention patterns simultaneously, each focusing on different aspects of the relationships between tokens. This multi-faceted approach serves as the cornerstone of Transformer models, empowering them to capture complex relationships between tokens in a sequence from multiple perspectives simultaneously.
In this section, we'll explore self-attention and its extension to multi-head attention, examining how these mechanisms work and why they are pivotal in Transformer architectures. We'll dive deep into their internal workings, from the mathematical foundations to practical implementations, and demonstrate their effectiveness through concrete examples and code implementations. This detailed exploration will clarify not just their implementation but also their practical utility in modern NLP applications.
3.3.1 What Is Self-Attention?
In self-attention, every token in an input sequence attends to all other tokens (including itself) to compute a new representation. This revolutionary mechanism works by creating dynamic connections between all elements in a sequence. For instance, when processing a sentence, each word maintains awareness of every other word through attention weights that determine how much influence each word should have on the current word's representation. These weights are learned during training and adapt based on the context and task at hand.
To illustrate this concept more concretely, consider the sentence "The cat chased the mouse." In this example, when processing the word "chased," the self-attention mechanism simultaneously considers all words in the sentence:
- It strongly attends to "cat" as the subject performing the action
- It maintains strong attention to "mouse" as the object receiving the action
- It may give less attention to articles like "the" which contribute less to the semantic meaning
This parallel processing allows the model to construct a rich, contextual understanding of each word's role in the sentence.
Unlike traditional attention mechanisms, which typically work with two separate sequences (like in machine translation where a word in English attends to words in French), self-attention operates entirely within a single sequence. This internal focus represents a significant advancement in natural language processing. When translating between languages, traditional attention might help align words across languages, but self-attention helps understand the intricate relationships within each language first.
The power of this internal focus becomes particularly evident when dealing with complex linguistic phenomena:
- Long-distance dependencies (e.g., "The cat, which had a brown collar, chased the mouse")
- Coreference resolution (understanding that "it" refers to "the cat")
- Semantic role labeling (identifying who did what to whom)
- Syntactic structure understanding (grasping the grammatical relationships between words)
This architectural design makes self-attention particularly effective for tasks that require deep understanding of language structure and meaning, such as parsing, sentiment analysis, and question answering. By allowing each element to directly interact with every other element, the model can build sophisticated representations that capture both local and global contexts within the input sequence.
How It Works:
- Input Representation: Each token (word or subword) in the sequence is first converted into a numerical vector through an embedding process. These vectors typically have hundreds of dimensions and capture semantic relationships between words. For example, similar words like "cat" and "kitten" will have vectors that are close to each other in this high-dimensional space.
- Query, Key, and Value Creation: The model transforms each token's initial vector into three distinct vectors through learned linear transformations:
- Query vector (Q): Acts like a search query, representing what information the current token is looking for in the sequence
- Key vector (K): Functions like a label or index, helping other tokens find this token when relevant
- Value vector (V): Contains the actual meaningful information that will be used in the final representation
- Attention Score Computation: The model computes attention scores by taking the dot product between each query and all keys. This creates a matrix of scores where each entry (i,j) represents how relevant token j is to token i. The scores are then scaled by dividing by the square root of the key dimension to prevent the dot products from growing too large, which helps maintain stable gradients during training.
- Weight Normalization: The attention scores are converted into probabilities using the softmax function. This ensures all weights for a given token sum to 1 and creates a proper probability distribution. When processing a word like "ate" in "The hungry cat ate fish", the model might assign higher weights to relevant context words like "cat" (0.6) and "fish" (0.3), and lower weights to less important words like "the" (0.02).
- Output Computation: The final representation for each token is computed as a weighted sum of all value vectors, using the normalized attention weights. This process allows each token to gather information from all other tokens in the sequence, weighted by relevance. The resulting representations are context-aware and can capture both local grammatical structure and long-range dependencies, enabling the model to understand relationships between words even when they're far apart in the text.
3.3.2 Mathematics of Self-Attention
For an input sequence of tokens X = [x_1, x_2, \dots, x_n]:
- Compute Q = XW_Q, K = XW_K, and V = XW_V, where W_Q, W_K, W_V are learnable weight matrices.
- Calculate attention scores:
{Scores} = \frac{Q \cdot K^\top}{\sqrt{d_k}}
Here, d_k is the dimension of the key vectors.
- Normalize scores with softmax:
{Weights} = \text{softmax}\left(\text{Scores}\right)
- Compute the output:
{Output} = \text{Weights} \cdot V
Example: Implementing Self-Attention
Let’s implement self-attention for a simple sequence.
Code Example: Self-Attention in NumPy
import numpy as np
def self_attention(X, W_Q, W_K, W_V, mask=None):
"""
Compute self-attention for a sequence with optional masking.
Parameters:
-----------
X: np.ndarray
Input sequence of shape (n_tokens, d_model)
W_Q, W_K, W_V: np.ndarray
Weight matrices for Query, Key, Value transformations
mask: np.ndarray, optional
Attention mask of shape (n_tokens, n_tokens)
Returns:
--------
output: np.ndarray
Attended sequence of shape (n_tokens, d_model)
weights: np.ndarray
Attention weights of shape (n_tokens, n_tokens)
"""
# Linear transformations
Q = np.dot(X, W_Q) # Shape: (n_tokens, d_k)
K = np.dot(X, W_K) # Shape: (n_tokens, d_k)
V = np.dot(X, W_V) # Shape: (n_tokens, d_v)
# Calculate scaled dot-product attention
d_k = K.shape[1]
scores = np.dot(Q, K.T) / np.sqrt(d_k) # Shape: (n_tokens, n_tokens)
# Apply mask if provided
if mask is not None:
scores = scores * mask + -1e9 * (1 - mask)
# Softmax normalization
weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
weights /= np.sum(weights, axis=-1, keepdims=True)
# Compute weighted sum
output = np.dot(weights, V) # Shape: (n_tokens, d_v)
return output, weights
# Example usage with a more complex sequence
def create_example():
# Create sample sequence
X = np.array([
[1, 0, 0], # First token
[0, 1, 0], # Second token
[0, 0, 1], # Third token
[1, 1, 0] # Fourth token
])
# Create weight matrices
d_model = 3 # Input dimension
d_k = 2 # Key/Query dimension
d_v = 4 # Value dimension
W_Q = np.random.randn(d_model, d_k) * 0.1
W_K = np.random.randn(d_model, d_k) * 0.1
W_V = np.random.randn(d_model, d_v) * 0.1
# Create attention mask (optional)
mask = np.array([
[1, 1, 1, 1],
[1, 1, 1, 1],
[1, 1, 1, 0], # Last token masked for third position
[1, 1, 1, 1]
])
return X, W_Q, W_K, W_V, mask
# Run example
X, W_Q, W_K, W_V, mask = create_example()
output, weights = self_attention(X, W_Q, W_K, W_V, mask)
print("Input Shape:", X.shape)
print("\nQuery Matrix Shape:", W_Q.shape)
print("Key Matrix Shape:", W_K.shape)
print("Value Matrix Shape:", W_V.shape)
print("\nAttention Weights:\n", weights)
print("\nOutput Shape:", output.shape)
print("Output:\n", output)
Code Breakdown Explanation:
- Function Definition and Parameters:
- The function takes input sequence X and three weight matrices (W_Q, W_K, W_V)
- Added optional masking parameter for more control over attention
- Includes comprehensive docstring with parameter descriptions
- Linear Transformations:
- Converts input tokens into Query (Q), Key (K), and Value (V) representations
- Uses matrix multiplication (np.dot) for efficient computation
- Maintains proper shape transformations throughout
- Attention Score Computation:
- Implements scaled dot-product attention with proper scaling factor
- Includes masking functionality for selective attention
- Uses numerically stable softmax implementation
- Example Implementation:
- Creates a realistic example with 4 tokens and 3 features
- Demonstrates proper initialization of weight matrices
- Shows how to use optional masking
- Shape Information:
- Clearly documents tensor shapes throughout the process
- Helps understand the dimensional transformations
- Makes debugging easier
3.3.3 What Is Multi-Head Attention?
Multi-head attention represents a sophisticated enhancement to the self-attention mechanism by running multiple parallel attention computations, called "heads." Each head operates independently and learns to focus on different aspects of the relationships between tokens in the sequence. For example, one head might learn to focus on syntactic relationships (like subject-verb agreement), while another might capture semantic relationships (like topic relevance), and yet another might detect long-range dependencies (like coreference resolution).
This parallel processing architecture provides several key advantages. First, it allows the model to simultaneously analyze the input sequence from multiple perspectives, much like how humans process language by considering multiple aspects at once. Second, by having multiple specialized attention mechanisms, the model can capture both fine-grained and broad patterns in the data. Finally, the diverse representations learned by different heads combine to create a richer, more nuanced understanding of the input sequence.
The outputs from all heads are ultimately combined through a concatenation operation followed by a linear transformation, allowing the model to synthesize these different perspectives into a cohesive representation. This multi-faceted approach significantly enhances the model's capacity to understand and process complex linguistic patterns, making it particularly effective for tasks requiring sophisticated language understanding.
Steps in Multi-Head Attention
- Split the input into multiple heads:
- Divide the input sequence into separate subspaces
- Each head receives a portion of the input's dimensionality
- This splitting allows parallel processing of different feature aspects
- Apply self-attention independently to each head:
- Each head computes its own Query (Q), Key (K), and Value (V) matrices
- Calculates attention scores using scaled dot-product attention
- Processes information focusing on different aspects of the input
- Concatenate the outputs of all heads:
- Combine the results from each attention head
- Preserves the unique patterns and relationships learned by each head
- Creates a comprehensive representation of the input sequence
- Apply a final linear transformation:
- Project the concatenated outputs to the desired dimension
- Integrates information from all heads into a cohesive representation
- Allows the model to weight the importance of different heads' outputs
Benefits of Multi-Head Attention
- Diverse Representations: Each attention head specializes in capturing different types of relationships within the data. For example, one head might focus on syntactic dependencies (like subject-verb agreement), while another might detect semantic relationships (like topic relevance), and yet another might identify long-range dependencies (like coreference resolution). This diversity allows the model to build a rich, multi-faceted understanding of the input.
- Improved Expressiveness: The model can focus on multiple aspects of the input simultaneously, similar to how humans process language. This parallel processing enables the model to:
- Capture both local and global context
- Process different semantic levels (word-level, phrase-level, sentence-level)
- Learn hierarchical relationships between tokens
- Combine different perspectives into a more comprehensive understanding
- Enhanced Learning Capacity: Multiple heads allow the model to distribute attention across different subspaces, effectively increasing its representational power without significantly increasing computational complexity.
- Robust Feature Detection: By maintaining multiple independent attention mechanisms, the model becomes more robust as it doesn't rely on a single attention pattern, reducing the impact of noise or misleading patterns in the data.
Example: Multi-Head Attention
Let’s implement a simplified version of multi-head attention.
Code Example: Multi-Head Attention in NumPy
import numpy as np
def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
"""
Compute multi-head attention with optional masking.
Parameters:
-----------
X: np.ndarray
Input sequence of shape (n_tokens, d_model)
W_Q, W_K, W_V: np.ndarray
Weight matrices for Query, Key, Value transformations
W_O: np.ndarray
Output projection matrix
n_heads: int
Number of attention heads
mask: np.ndarray, optional
Attention mask of shape (n_tokens, n_tokens)
Returns:
--------
final_output: np.ndarray
Transformed sequence of shape (n_tokens, d_model)
attention_weights: list
List of attention weights for each head
"""
d_model = X.shape[1]
head_dim = W_Q.shape[1] // n_heads
outputs = []
attention_weights = []
# Process each attention head
for i in range(n_heads):
# Split weights for current head
Q = np.dot(X, W_Q[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
K = np.dot(X, W_K[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
V = np.dot(X, W_V[:, i*head_dim:(i+1)*head_dim]) # (n_tokens, head_dim)
# Compute attention scores
scores = np.dot(Q, K.T) / np.sqrt(head_dim) # (n_tokens, n_tokens)
# Apply mask if provided
if mask is not None:
scores = scores * mask + -1e9 * (1 - mask)
# Apply softmax
weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
weights = weights / np.sum(weights, axis=-1, keepdims=True)
# Compute weighted sum
output = np.dot(weights, V) # (n_tokens, head_dim)
outputs.append(output)
attention_weights.append(weights)
# Concatenate all heads
concatenated = np.concatenate(outputs, axis=-1) # (n_tokens, d_model)
# Final linear transformation
final_output = np.dot(concatenated, W_O) # (n_tokens, d_model)
return final_output, attention_weights
# Example usage with a more realistic sequence
def create_example_inputs(n_tokens=4, d_model=8, n_heads=2):
"""Create example inputs for multi-head attention."""
# Input sequence
X = np.random.randn(n_tokens, d_model)
# Weight matrices
head_dim = d_model // n_heads
W_Q = np.random.randn(d_model, d_model) * 0.1
W_K = np.random.randn(d_model, d_model) * 0.1
W_V = np.random.randn(d_model, d_model) * 0.1
W_O = np.random.randn(d_model, d_model) * 0.1
# Optional mask (causal attention)
mask = np.tril(np.ones((n_tokens, n_tokens)))
return X, W_Q, W_K, W_V, W_O, mask
# Run example
X, W_Q, W_K, W_V, W_O, mask = create_example_inputs()
output, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=2, mask=mask)
print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("\nAttention weights for first head:\n", weights[0])
print("\nAttention weights for second head:\n", weights[1])
Code Breakdown:
- Function Architecture
- Implements multi-head attention with comprehensive documentation
- Includes optional masking for causal attention
- Returns both outputs and attention weights for analysis
- Key Components
- Head Dimension Calculation: Splits input dimension across heads
- Per-Head Processing: Computes separate attention for each head
- Attention Mechanism: Implements scaled dot-product attention
- Output Aggregation: Concatenates and projects head outputs
- Enhanced Features
- Numerical Stability: Uses stable softmax implementation
- Masking Support: Allows for masked attention patterns
- Proper Scaling: Includes attention scaling factor
- Helper Functions
- create_example_inputs: Generates realistic test data
- Includes shape information and initialization logic
- Demonstrates proper usage patterns
- Output Analysis
- Prints shapes for verification
- Shows attention weights for interpretation
- Demonstrates the multi-head nature of attention
3.3.4 Applications of Self-Attention and Multi-Head Attention
Text Summarization
Models leverage attention mechanisms in sophisticated ways to identify and prioritize the most important parts of a document. The attention mechanism works by assigning different weights to different parts of the input text, essentially creating a hierarchy of importance. These weights are learned during training and are dynamically adjusted based on the specific content being processed.
The attention weights serve as a sophisticated filtering mechanism that helps determine which sentences carry the most critical information. This process involves analyzing various linguistic features, including semantic relevance, syntactic structure, and contextual relationships between different parts of the text. The model can then create concise and meaningful summaries while preserving the core message and maintaining coherence.
For example, in news article summarization, the model employs a multi-layered approach to attention. It might attend strongly to key events (such as main actions or developments), significant quotes from relevant figures, and important statistical data that supports the main narrative. Meanwhile, it assigns lower attention weights to supplementary details, background information, or redundant content. This selective attention process mirrors human summarization behavior, where we naturally focus on crucial information while skimming over less important details.
Code Example: Text Summarization with Self-Attention
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttentionSummarizer(nn.Module):
def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, max_length=512):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.position_encoding = nn.Parameter(
torch.zeros(max_length, embed_dim)
)
self.multihead_attention = nn.MultiheadAttention(
embed_dim, num_heads, batch_first=True
)
self.layer_norm1 = nn.LayerNorm(embed_dim)
self.feed_forward = nn.Sequential(
nn.Linear(embed_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, embed_dim)
)
self.layer_norm2 = nn.LayerNorm(embed_dim)
self.output_layer = nn.Linear(embed_dim, vocab_size)
def forward(self, x, src_mask=None):
# Add positional encoding to embeddings
seq_length = x.size(1)
x = self.embedding(x) + self.position_encoding[:seq_length]
# Self-attention block
attention_output, attention_weights = self.multihead_attention(
x, x, x,
key_padding_mask=src_mask,
need_weights=True
)
x = self.layer_norm1(x + attention_output)
# Feed-forward block
ff_output = self.feed_forward(x)
x = self.layer_norm2(x + ff_output)
# Generate output probabilities
output = self.output_layer(x)
return output, attention_weights
def generate_summary(model, input_ids, tokenizer, max_length=150):
    model.eval()
    with torch.no_grad():
        output, attention_weights = model(input_ids)
        # Average attention received by each token (weights are already averaged over heads)
        attention_scores = attention_weights.mean(dim=1)
        # Select the most-attended tokens; cap k at the actual sequence length
        k = min(max_length, input_ids.size(1))
        top_scores = torch.topk(attention_scores.squeeze(0), k=k)
        # Extract and arrange summary tokens in their original order
        summary_indices = top_scores.indices.sort().values
        summary_tokens = input_ids[0, summary_indices]
        # Convert to text
        summary = tokenizer.decode(summary_tokens)
    return summary, attention_weights
# Example usage
def summarize_text(text, model, tokenizer):
# Tokenize input text
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Generate summary
summary, attention = generate_summary(
model,
inputs["input_ids"],
tokenizer
)
return summary, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based summarizer with multi-head self-attention
- Includes positional encoding for sequence awareness
- Uses layer normalization and residual connections for stable training
- Key Components
- Embedding Layer: Converts tokens to dense vectors
- Multi-Head Attention: Processes text from multiple perspectives
- Feed-Forward Network: Adds non-linearity and transforms representations
- Output Layer: Generates final token predictions
- Summarization Process
- Analyzes attention weights to identify important tokens
- Selects top-attended tokens for summary generation
- Maintains original order of selected tokens for coherence
- Advanced Features
- Supports variable length inputs with masking
- Implements efficient batch processing
- Returns attention weights for analysis and visualization
Usage Example:
# Example setup and usage
vocab_size = 30000
embed_dim = 512
num_heads = 8
hidden_dim = 2048
model = SelfAttentionSummarizer(
vocab_size=vocab_size,
embed_dim=embed_dim,
num_heads=num_heads,
hidden_dim=hidden_dim
)
# Example text
text = """
Climate change poses significant challenges to global ecosystems.
Rising temperatures affect wildlife habitats and agricultural productivity.
Scientists warn that immediate action is necessary to prevent irreversible damage.
"""
# Generate summary (assuming tokenizer is initialized)
summary, attention = summarize_text(text, model, tokenizer)
print("Summary:", summary)
Machine Translation
Attention mechanisms revolutionize machine translation by creating sophisticated dynamic alignments between words and phrases across languages. This process works by establishing weighted connections between elements in the source and target languages, allowing the model to understand complex linguistic relationships. For example, when translating from English to Japanese, the attention mechanism can handle the significant differences in sentence structure, where English follows Subject-Verb-Object order while Japanese typically uses Subject-Object-Verb order.
The mechanism is particularly powerful in handling three key translation challenges:
First, it manages complex word order variations between languages. For instance, when translating between English and German, where the verb position can vary significantly, the attention mechanism can maintain proper semantic relationships despite syntactic differences.
Second, it handles many-to-one and one-to-many word mappings effectively. For example, when translating the German compound word "Schadenfreude" to English, the mechanism can map it to the phrase "pleasure derived from another's misfortune," maintaining accurate meaning despite the structural difference.
Third, the model maintains contextual awareness across extended sentences through its ability to reference and weight the importance of different parts of the input sequence. This ensures that long sentences retain their meaning and coherence in translation, preventing common issues like losing track of subject-verb relationships or mishandling dependent clauses.
The attention mechanism achieves this by continuously updating its focus based on the current word being translated and its relationship to all other words in the sentence, ensuring that the final translation preserves both meaning and natural language flow.
Code Example: Neural Machine Translation with Self-Attention
import math  # required by PositionalEncoding below
import torch
import torch.nn as nn
import torch.nn.functional as F
class TranslationTransformer(nn.Module):
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048):
super().__init__()
# Embedding layers
self.src_embedding = nn.Embedding(src_vocab_size, d_model)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model)
# Transformer layers
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            batch_first=True  # inputs are (batch, seq, d_model), matching the usage below
        )
# Output projection
self.output_layer = nn.Linear(d_model, tgt_vocab_size)
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
# Create source embedding
src_embedded = self.positional_encoding(self.src_embedding(src))
# Create target embedding
tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt))
        # Generate a causal mask for the decoder if not provided
        # (the encoder does not need a causal mask, so src_mask is left as given)
        if tgt_mask is None:
            tgt_mask = self.generate_square_subsequent_mask(tgt.size(1))
# Pass through transformer
output = self.transformer(
src_embedded, tgt_embedded,
src_mask=src_mask,
tgt_mask=tgt_mask
)
# Project to vocabulary
return self.output_layer(output)
@staticmethod
def generate_square_subsequent_mask(sz):
mask = torch.triu(torch.ones(sz, sz), diagonal=1)
mask = mask.masked_fill(mask==1, float('-inf'))
return mask
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
(-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
return x + self.pe[:, :x.size(1)]
# Training function
def train_translation_model(model, train_loader, optimizer, criterion, num_epochs=10):
model.train()
for epoch in range(num_epochs):
total_loss = 0
for batch_idx, (src, tgt) in enumerate(train_loader):
optimizer.zero_grad()
            # Forward pass (teacher forcing: feed the target shifted right)
            output = model(src, tgt[:, :-1])  # exclude last target token
            # Calculate loss against the target shifted left
            loss = criterion(
                output.reshape(-1, output.size(-1)),
                tgt[:, 1:].reshape(-1)  # exclude first target token (BOS)
            )
# Backward pass
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f'Epoch: {epoch+1}, Average Loss: {avg_loss:.4f}')
Code Breakdown:
- Model Architecture
- Implements a complete Transformer-based translation model
- Uses both encoder and decoder with multi-head attention
- Includes positional encoding for sequence order awareness
- Key Components
- Source and Target Embeddings: Convert tokens to vectors
- Positional Encoding: Adds position information to embeddings
- Transformer Block: Processes sequences using self-attention
- Output Projection: Maps to target vocabulary
- Training Process
- Implements teacher forcing during training
- Uses masked attention for autoregressive generation
- Includes loss calculation and optimization steps
- Advanced Features
- Supports variable length sequences
- Implements efficient batch processing
- Includes mask generation for causal attention
Usage Example:
# Initialize model and training components
model = TranslationTransformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
d_model=512,
nhead=8
)
# Setup optimizer and criterion
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
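# pad_idx is assumed to be the padding token id used by the target tokenizer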
# Example translation
def translate(model, src_sentence, src_tokenizer, tgt_tokenizer, max_len=50):
model.eval()
with torch.no_grad():
# Tokenize source sentence
src_tokens = src_tokenizer.encode(src_sentence)
src_tensor = torch.LongTensor(src_tokens).unsqueeze(0)
# Initialize target with BOS token
tgt_tokens = [tgt_tokenizer.bos_token_id]
# Generate translation
for _ in range(max_len):
tgt_tensor = torch.LongTensor(tgt_tokens).unsqueeze(0)
output = model(src_tensor, tgt_tensor)
next_token = output[0, -1].argmax().item()
if next_token == tgt_tokenizer.eos_token_id:
break
tgt_tokens.append(next_token)
# Convert tokens to text
translation = tgt_tokenizer.decode(tgt_tokens)
return translation
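Assuming src_tokenizer and tgt_tokenizer objects exposing encode, decode, bos_token_id, and eos_token_id are available (they are not defined above), the loop can be invoked like this:
# Hypothetical tokenizers; the untrained model will produce arbitrary output,
# but the call pattern is the same after training
translation = translate(model, "The weather is nice today.", src_tokenizer, tgt_tokenizer)
print("Translation:", translation)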
Question Answering
When processing questions, attention mechanisms employ a sophisticated approach to information processing. These mechanisms help models identify and focus on the specific parts of a passage that contain relevant information through a multi-step process:
First, the model analyzes the question to understand what type of information it needs to look for. Then, it creates attention weights for each word in the passage, giving higher weights to words and phrases that are more likely to contain the answer. This selective focus enables the model to efficiently extract answers while ignoring irrelevant content.
For instance, when answering "When did the event occur?", the model would primarily attend to temporal expressions (such as dates, times, and temporal phrases like "yesterday" or "last week") and their surrounding context in the passage. The attention weights would be highest for these temporal indicators and their immediate context, allowing the model to zero in on the most relevant information. This process is similar to how humans might scan a text for time-related words when looking for when something happened.
Code Example: Question Answering with Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class QATransformer(nn.Module):
def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
super().__init__()
# Embedding layers
self.embedding = nn.Embedding(vocab_size, d_model)
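        # PositionalEncoding reuses the class defined in the translation example above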
self.pos_encoder = PositionalEncoding(d_model)
        # Encoders for question and context (batch_first so inputs are (batch, seq, d_model))
        self.question_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        # Cross-attention layer
        self.cross_attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)
# Output layers for start and end position prediction
self.start_predictor = nn.Linear(d_model, 1)
self.end_predictor = nn.Linear(d_model, 1)
def forward(self, question, context):
# Embed inputs
q_embed = self.pos_encoder(self.embedding(question))
c_embed = self.pos_encoder(self.embedding(context))
# Encode question and context
q_encoded = self.question_encoder(q_embed)
c_encoded = self.context_encoder(c_embed)
        # Cross-attention: the context attends to the question, so the output has
        # one vector per context token (required for answer-span prediction)
        attn_output, attention_weights = self.cross_attention(
            c_encoded, q_encoded, q_encoded
        )
# Predict answer span
start_logits = self.start_predictor(attn_output).squeeze(-1)
end_logits = self.end_predictor(attn_output).squeeze(-1)
return start_logits, end_logits, attention_weights
def train_qa_model(model, train_loader, optimizer, num_epochs=10):
model.train()
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
for batch in train_loader:
question, context, start_pos, end_pos = batch
# Forward pass
start_logits, end_logits, _ = model(question, context)
# Calculate loss
start_loss = criterion(start_logits, start_pos)
end_loss = criterion(end_logits, end_pos)
loss = start_loss + end_loss
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
def predict_answer(model, tokenizer, question, context):
model.eval()
with torch.no_grad():
# Tokenize inputs
q_tokens = tokenizer.encode(question)
c_tokens = tokenizer.encode(context)
# Convert to tensors
q_tensor = torch.tensor(q_tokens).unsqueeze(0)
c_tensor = torch.tensor(c_tokens).unsqueeze(0)
        # Get predictions and drop the batch dimension
        start_logits, end_logits, attention = model(q_tensor, c_tensor)
        start_logits = start_logits.squeeze(0)
        end_logits = end_logits.squeeze(0)
        # Find most likely answer span (the end position may not precede the start)
        start_idx = torch.argmax(start_logits).item()
        end_idx = torch.argmax(end_logits[start_idx:]).item() + start_idx
        # Extract answer tokens from the context
        answer_tokens = c_tokens[start_idx:end_idx + 1]
# Convert back to text
answer = tokenizer.decode(answer_tokens)
return answer, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based QA model with separate encoders for questions and context
- Uses multi-head self-attention for both question and context processing
- Includes cross-attention mechanism to relate questions to context
- Features span prediction for answer extraction
- Key Components
- Embedding Layer: Converts text tokens to dense vectors
- Positional Encoding: Adds position information to embeddings
- Question/Context Encoders: Process inputs using self-attention
- Cross-Attention: Relates question to context for answer finding
- Span Predictors: Locate answer boundaries in context
- Processing Flow
- Embeds and encodes question and context separately
- Applies cross-attention to find relevant context regions
- Predicts start and end positions of answer span
- Returns answer text and attention weights for analysis
Usage Example:
# Initialize model and components
model = QATransformer(
vocab_size=30000,
d_model=512,
nhead=8,
num_layers=6
)
# Example usage
question = "When was the first computer invented?"
context = "The first general-purpose electronic computer, ENIAC, was completed in 1945."
# Get answer
answer, attention_weights = predict_answer(
model, tokenizer, question, context
)
print(f"Question: {question}")
print(f"Answer: {answer}")
3.3.5 Key Takeaways
- Self-attention enables models to compute context-aware representations by attending to all tokens in a sequence. This means each word in a sentence can directly interact with every other word, allowing the model to understand complex relationships and dependencies. For example, in the sentence "The cat that chased the mouse was black", self-attention helps the model connect "was black" back to "cat" even though they're separated by several words.
- Multi-head attention enhances self-attention by capturing diverse relationships simultaneously. While a single attention head might focus on syntactic relationships, another might capture semantic similarities, and yet another might track temporal relationships. This multi-faceted approach allows the model to process information through multiple different "perspectives" at once, leading to richer and more nuanced understanding of the input.
- Together, these mechanisms are the foundation of Transformer architectures, allowing for parallelism and long-range dependency modeling. Unlike traditional sequential models that process words one at a time, Transformers can process all words simultaneously, dramatically improving computational efficiency. Additionally, because every token can attend to every other token directly, Transformers excel at capturing relationships between words that are far apart in the text, solving the long-standing challenge of modeling long-range dependencies in natural language processing.
3.3.4 Applications of Self-Attention and Multi-Head Attention
Text Summarization
Models leverage attention mechanisms in sophisticated ways to identify and prioritize the most important parts of a document. The attention mechanism works by assigning different weights to different parts of the input text, essentially creating a hierarchy of importance. These weights are learned during training and are dynamically adjusted based on the specific content being processed.
The attention weights serve as a sophisticated filtering mechanism that helps determine which sentences carry the most critical information. This process involves analyzing various linguistic features, including semantic relevance, syntactic structure, and contextual relationships between different parts of the text. The model can then create concise and meaningful summaries while preserving the core message and maintaining coherence.
For example, in news article summarization, the model employs a multi-layered approach to attention. It might attend strongly to key events (such as main actions or developments), significant quotes from relevant figures, and important statistical data that supports the main narrative. Meanwhile, it assigns lower attention weights to supplementary details, background information, or redundant content. This selective attention process mirrors human summarization behavior, where we naturally focus on crucial information while skimming over less important details.
Code Example: Text Summarization with Self-Attention
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
class SelfAttentionSummarizer(nn.Module):
def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, max_length=512):
super().__init__()
self.embedding = nn.Embedding(vocab_size, embed_dim)
self.position_encoding = nn.Parameter(
torch.zeros(max_length, embed_dim)
)
self.multihead_attention = nn.MultiheadAttention(
embed_dim, num_heads, batch_first=True
)
self.layer_norm1 = nn.LayerNorm(embed_dim)
self.feed_forward = nn.Sequential(
nn.Linear(embed_dim, hidden_dim),
nn.ReLU(),
nn.Linear(hidden_dim, embed_dim)
)
self.layer_norm2 = nn.LayerNorm(embed_dim)
self.output_layer = nn.Linear(embed_dim, vocab_size)
def forward(self, x, src_mask=None):
# Add positional encoding to embeddings
seq_length = x.size(1)
x = self.embedding(x) + self.position_encoding[:seq_length]
# Self-attention block
attention_output, attention_weights = self.multihead_attention(
x, x, x,
key_padding_mask=src_mask,
need_weights=True
)
x = self.layer_norm1(x + attention_output)
# Feed-forward block
ff_output = self.feed_forward(x)
x = self.layer_norm2(x + ff_output)
# Generate output probabilities
output = self.output_layer(x)
return output, attention_weights
def generate_summary(model, input_ids, tokenizer, max_length=150):
model.eval()
with torch.no_grad():
output, attention_weights = model(input_ids)
# Get most attended words for summary
attention_scores = attention_weights.mean(dim=1)
top_scores = torch.topk(attention_scores.squeeze(), k=max_length)
# Extract and arrange summary tokens
summary_indices = top_scores.indices.sort().values
summary_tokens = input_ids[0, summary_indices]
# Convert to text
summary = tokenizer.decode(summary_tokens)
return summary, attention_weights
# Example usage
def summarize_text(text, model, tokenizer):
# Tokenize input text
inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)
# Generate summary
summary, attention = generate_summary(
model,
inputs["input_ids"],
tokenizer
)
return summary, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based summarizer with multi-head self-attention
- Includes positional encoding for sequence awareness
- Uses layer normalization and residual connections for stable training
- Key Components
- Embedding Layer: Converts tokens to dense vectors
- Multi-Head Attention: Processes text from multiple perspectives
- Feed-Forward Network: Adds non-linearity and transforms representations
- Output Layer: Generates final token predictions
- Summarization Process
- Analyzes attention weights to identify important tokens
- Selects top-attended tokens for summary generation
- Maintains original order of selected tokens for coherence
- Advanced Features
- Supports variable length inputs with masking
- Implements efficient batch processing
- Returns attention weights for analysis and visualization
Usage Example:
# Example setup and usage
vocab_size = 30000
embed_dim = 512
num_heads = 8
hidden_dim = 2048
model = SelfAttentionSummarizer(
vocab_size=vocab_size,
embed_dim=embed_dim,
num_heads=num_heads,
hidden_dim=hidden_dim
)
# Example text
text = """
Climate change poses significant challenges to global ecosystems.
Rising temperatures affect wildlife habitats and agricultural productivity.
Scientists warn that immediate action is necessary to prevent irreversible damage.
"""
# Generate summary (assuming tokenizer is initialized)
summary, attention = summarize_text(text, model, tokenizer)
print("Summary:", summary)
Machine Translation
Attention mechanisms revolutionize machine translation by creating sophisticated dynamic alignments between words and phrases across languages. This process works by establishing weighted connections between elements in the source and target languages, allowing the model to understand complex linguistic relationships. For example, when translating from English to Japanese, the attention mechanism can handle the significant differences in sentence structure, where English follows Subject-Verb-Object order while Japanese typically uses Subject-Object-Verb order.
The mechanism is particularly powerful in handling three key translation challenges:
First, it manages complex word order variations between languages. For instance, when translating between English and German, where the verb position can vary significantly, the attention mechanism can maintain proper semantic relationships despite syntactic differences.
Second, it handles many-to-one and one-to-many word mappings effectively. For example, when translating the German compound word "Schadenfreude" to English, the mechanism can map it to the phrase "pleasure derived from another's misfortune," maintaining accurate meaning despite the structural difference.
Third, the model maintains contextual awareness across extended sentences through its ability to reference and weight the importance of different parts of the input sequence. This ensures that long sentences retain their meaning and coherence in translation, preventing common issues like losing track of subject-verb relationships or mishandling dependent clauses.
The attention mechanism achieves this by continuously updating its focus based on the current word being translated and its relationship to all other words in the sentence, ensuring that the final translation preserves both meaning and natural language flow.
Code Example: Neural Machine Translation with Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class TranslationTransformer(nn.Module):
def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048):
super().__init__()
# Embedding layers
self.src_embedding = nn.Embedding(src_vocab_size, d_model)
self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
self.positional_encoding = PositionalEncoding(d_model)
# Transformer layers
self.transformer = nn.Transformer(
d_model=d_model,
nhead=nhead,
num_encoder_layers=num_encoder_layers,
num_decoder_layers=num_decoder_layers,
dim_feedforward=dim_feedforward
)
# Output projection
self.output_layer = nn.Linear(d_model, tgt_vocab_size)
def forward(self, src, tgt, src_mask=None, tgt_mask=None):
# Create source embedding
src_embedded = self.positional_encoding(self.src_embedding(src))
# Create target embedding
tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt))
# Generate masks if not provided
if src_mask is None:
src_mask = self.generate_square_subsequent_mask(src.size(1))
if tgt_mask is None:
tgt_mask = self.generate_square_subsequent_mask(tgt.size(1))
# Pass through transformer
output = self.transformer(
src_embedded, tgt_embedded,
src_mask=src_mask,
tgt_mask=tgt_mask
)
# Project to vocabulary
return self.output_layer(output)
@staticmethod
def generate_square_subsequent_mask(sz):
mask = torch.triu(torch.ones(sz, sz), diagonal=1)
mask = mask.masked_fill(mask==1, float('-inf'))
return mask
class PositionalEncoding(nn.Module):
def __init__(self, d_model, max_len=5000):
super().__init__()
pe = torch.zeros(max_len, d_model)
position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
div_term = torch.exp(torch.arange(0, d_model, 2).float() *
(-math.log(10000.0) / d_model))
pe[:, 0::2] = torch.sin(position * div_term)
pe[:, 1::2] = torch.cos(position * div_term)
pe = pe.unsqueeze(0)
self.register_buffer('pe', pe)
def forward(self, x):
return x + self.pe[:, :x.size(1)]
# Training function
def train_translation_model(model, train_loader, optimizer, criterion, num_epochs=10):
model.train()
for epoch in range(num_epochs):
total_loss = 0
for batch_idx, (src, tgt) in enumerate(train_loader):
optimizer.zero_grad()
# Forward pass
output = model(src, tgt[:-1]) # exclude last target token
# Calculate loss
loss = criterion(
output.view(-1, output.size(-1)),
tgt[1:].reshape(-1) # exclude first target token (BOS)
)
# Backward pass
loss.backward()
optimizer.step()
total_loss += loss.item()
avg_loss = total_loss / len(train_loader)
print(f'Epoch: {epoch+1}, Average Loss: {avg_loss:.4f}')
Code Breakdown:
- Model Architecture
- Implements a complete Transformer-based translation model
- Uses both encoder and decoder with multi-head attention
- Includes positional encoding for sequence order awareness
- Key Components
- Source and Target Embeddings: Convert tokens to vectors
- Positional Encoding: Adds position information to embeddings
- Transformer Block: Processes sequences using self-attention
- Output Projection: Maps to target vocabulary
- Training Process
- Implements teacher forcing during training
- Uses masked attention for autoregressive generation
- Includes loss calculation and optimization steps
- Advanced Features
- Supports variable length sequences
- Implements efficient batch processing
- Includes mask generation for causal attention
Usage Example:
# Initialize model and training components
model = TranslationTransformer(
src_vocab_size=10000,
tgt_vocab_size=10000,
d_model=512,
nhead=8
)
# Setup optimizer and criterion
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)
# Example translation
def translate(model, src_sentence, src_tokenizer, tgt_tokenizer, max_len=50):
model.eval()
with torch.no_grad():
# Tokenize source sentence
src_tokens = src_tokenizer.encode(src_sentence)
src_tensor = torch.LongTensor(src_tokens).unsqueeze(0)
# Initialize target with BOS token
tgt_tokens = [tgt_tokenizer.bos_token_id]
# Generate translation
for _ in range(max_len):
tgt_tensor = torch.LongTensor(tgt_tokens).unsqueeze(0)
output = model(src_tensor, tgt_tensor)
next_token = output[0, -1].argmax().item()
if next_token == tgt_tokenizer.eos_token_id:
break
tgt_tokens.append(next_token)
# Convert tokens to text
translation = tgt_tokenizer.decode(tgt_tokens)
return translation
Question Answering
When processing questions, attention mechanisms employ a sophisticated approach to information processing. These mechanisms help models identify and focus on the specific parts of a passage that contain relevant information through a multi-step process:
First, the model analyzes the question to understand what type of information it needs to look for. Then, it creates attention weights for each word in the passage, giving higher weights to words and phrases that are more likely to contain the answer. This selective focus enables the model to efficiently extract answers while ignoring irrelevant content.
For instance, when answering "When did the event occur?", the model would primarily attend to temporal expressions (such as dates, times, and temporal phrases like "yesterday" or "last week") and their surrounding context in the passage. The attention weights would be highest for these temporal indicators and their immediate context, allowing the model to zero in on the most relevant information. This process is similar to how humans might scan a text for time-related words when looking for when something happened.
Code Example: Question Answering with Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F
class QATransformer(nn.Module):
def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
super().__init__()
# Embedding layers
self.embedding = nn.Embedding(vocab_size, d_model)
self.pos_encoder = PositionalEncoding(d_model)
# Multi-head attention layers
self.question_encoder = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model, nhead),
num_layers
)
self.context_encoder = nn.TransformerEncoder(
nn.TransformerEncoderLayer(d_model, nhead),
num_layers
)
# Cross-attention layer
self.cross_attention = nn.MultiheadAttention(d_model, nhead)
# Output layers for start and end position prediction
self.start_predictor = nn.Linear(d_model, 1)
self.end_predictor = nn.Linear(d_model, 1)
def forward(self, question, context):
# Embed inputs
q_embed = self.pos_encoder(self.embedding(question))
c_embed = self.pos_encoder(self.embedding(context))
# Encode question and context
q_encoded = self.question_encoder(q_embed)
c_encoded = self.context_encoder(c_embed)
# Cross-attention between question and context
attn_output, attention_weights = self.cross_attention(
q_encoded, c_encoded, c_encoded
)
# Predict answer span
start_logits = self.start_predictor(attn_output).squeeze(-1)
end_logits = self.end_predictor(attn_output).squeeze(-1)
return start_logits, end_logits, attention_weights
def train_qa_model(model, train_loader, optimizer, num_epochs=10):
model.train()
criterion = nn.CrossEntropyLoss()
for epoch in range(num_epochs):
for batch in train_loader:
question, context, start_pos, end_pos = batch
# Forward pass
start_logits, end_logits, _ = model(question, context)
# Calculate loss
start_loss = criterion(start_logits, start_pos)
end_loss = criterion(end_logits, end_pos)
loss = start_loss + end_loss
# Backward pass
optimizer.zero_grad()
loss.backward()
optimizer.step()
def predict_answer(model, tokenizer, question, context):
model.eval()
with torch.no_grad():
# Tokenize inputs
q_tokens = tokenizer.encode(question)
c_tokens = tokenizer.encode(context)
# Convert to tensors
q_tensor = torch.tensor(q_tokens).unsqueeze(0)
c_tensor = torch.tensor(c_tokens).unsqueeze(0)
# Get predictions
start_logits, end_logits, attention = model(q_tensor, c_tensor)
# Find most likely answer span
start_idx = torch.argmax(start_logits)
end_idx = torch.argmax(end_logits[start_idx:]) + start_idx
# Extract answer tokens
answer_tokens = c_tokens[start_idx:end_idx+1]
# Convert back to text
answer = tokenizer.decode(answer_tokens)
return answer, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based QA model with separate encoders for questions and context
- Uses multi-head self-attention for both question and context processing
- Includes cross-attention mechanism to relate questions to context
- Features span prediction for answer extraction
- Key Components
- Embedding Layer: Converts text tokens to dense vectors
- Positional Encoding: Adds position information to embeddings
- Question/Context Encoders: Process inputs using self-attention
- Cross-Attention: Relates question to context for answer finding
- Span Predictors: Locate answer boundaries in context
- Processing Flow
- Embeds and encodes question and context separately
- Applies cross-attention to find relevant context regions
- Predicts start and end positions of answer span
- Returns answer text and attention weights for analysis
Usage Example:
# Initialize model and components
model = QATransformer(
vocab_size=30000,
d_model=512,
nhead=8,
num_layers=6
)
# Example usage
question = "When was the first computer invented?"
context = "The first general-purpose electronic computer, ENIAC, was completed in 1945."
# Get answer
answer, attention_weights = predict_answer(
model, tokenizer, question, context
)
print(f"Question: {question}")
print(f"Answer: {answer}")
3.3.5 Key Takeaways
- Self-attention enables models to compute context-aware representations by attending to all tokens in a sequence. This means each word in a sentence can directly interact with every other word, allowing the model to understand complex relationships and dependencies. For example, in the sentence "The cat that chased the mouse was black", self-attention helps the model connect "was black" back to "cat" even though they're separated by several words.
- Multi-head attention enhances self-attention by capturing diverse relationships simultaneously. While a single attention head might focus on syntactic relationships, another might capture semantic similarities, and yet another might track temporal relationships. This multi-faceted approach allows the model to process information through multiple different "perspectives" at once, leading to richer and more nuanced understanding of the input.
- Together, these mechanisms are the foundation of Transformer architectures, allowing for parallelism and long-range dependency modeling. Unlike traditional sequential models that process words one at a time, Transformers can process all words simultaneously, dramatically improving computational efficiency. Additionally, because every token can attend to every other token directly, Transformers excel at capturing relationships between words that are far apart in the text, solving the long-standing challenge of modeling long-range dependencies in natural language processing.
3.3 Self-Attention and Multi-Head Attention
Building on the foundation of attention mechanisms, self-attention emerged as a groundbreaking innovation in Natural Language Processing (NLP). This revolutionary approach transformed how models process input sequences by introducing a mechanism where each element in a sequence can directly interact with every other element. This direct interaction enables models to process input sequences with unprecedented efficiency and context-awareness, eliminating the traditional bottlenecks of sequential processing.
Self-attention achieves this by allowing each token to query and attend to all other tokens in the sequence simultaneously. For example, when processing the sentence "The cat sat on the mat," each word can directly assess its relationship with every other word, helping the model understand both local relationships (like "the cat") and long-distance dependencies (connecting "cat" with "sat").
When combined with multi-head attention, this capability becomes even more powerful. Multi-head attention allows the model to maintain multiple different attention patterns simultaneously, each focusing on different aspects of the relationships between tokens. This multi-faceted approach serves as the cornerstone of Transformer models, empowering them to capture complex relationships between tokens in a sequence from multiple perspectives simultaneously.
In this section, we'll explore self-attention and its extension to multi-head attention, examining how these mechanisms work and why they are pivotal in Transformer architectures. We'll dive deep into their internal workings, from the mathematical foundations to practical implementations, and demonstrate their effectiveness through concrete examples and code implementations. This detailed exploration will clarify not just their implementation but also their practical utility in modern NLP applications.
3.3.1 What Is Self-Attention?
In self-attention, every token in an input sequence attends to all other tokens (including itself) to compute a new representation. This revolutionary mechanism works by creating dynamic connections between all elements in a sequence. For instance, when processing a sentence, each word maintains awareness of every other word through attention weights that determine how much influence each word should have on the current word's representation. These weights are learned during training and adapt based on the context and task at hand.
To illustrate this concept more concretely, consider the sentence "The cat chased the mouse." In this example, when processing the word "chased," the self-attention mechanism simultaneously considers all words in the sentence:
- It strongly attends to "cat" as the subject performing the action
- It maintains strong attention to "mouse" as the object receiving the action
- It may give less attention to articles like "the" which contribute less to the semantic meaning
This parallel processing allows the model to construct a rich, contextual understanding of each word's role in the sentence.
Unlike traditional attention mechanisms, which typically work with two separate sequences (like in machine translation where a word in English attends to words in French), self-attention operates entirely within a single sequence. This internal focus represents a significant advancement in natural language processing. When translating between languages, traditional attention might help align words across languages, but self-attention helps understand the intricate relationships within each language first.
The power of this internal focus becomes particularly evident when dealing with complex linguistic phenomena:
- Long-distance dependencies (e.g., "The cat, which had a brown collar, chased the mouse")
- Coreference resolution (understanding that "it" refers to "the cat")
- Semantic role labeling (identifying who did what to whom)
- Syntactic structure understanding (grasping the grammatical relationships between words)
This architectural design makes self-attention particularly effective for tasks that require deep understanding of language structure and meaning, such as parsing, sentiment analysis, and question answering. By allowing each element to directly interact with every other element, the model can build sophisticated representations that capture both local and global contexts within the input sequence.
How It Works:
- Input Representation: Each token (word or subword) in the sequence is first converted into a numerical vector through an embedding process. These vectors typically have hundreds of dimensions and capture semantic relationships between words. For example, similar words like "cat" and "kitten" will have vectors that are close to each other in this high-dimensional space.
- Query, Key, and Value Creation: The model transforms each token's initial vector into three distinct vectors through learned linear transformations:
- Query vector (Q): Acts like a search query, representing what information the current token is looking for in the sequence
- Key vector (K): Functions like a label or index, helping other tokens find this token when relevant
- Value vector (V): Contains the actual meaningful information that will be used in the final representation
- Attention Score Computation: The model computes attention scores by taking the dot product between each query and all keys. This creates a matrix of scores where each entry (i,j) represents how relevant token j is to token i. The scores are then scaled by dividing by the square root of the key dimension to prevent the dot products from growing too large, which helps maintain stable gradients during training.
- Weight Normalization: The attention scores are converted into probabilities using the softmax function. This ensures all weights for a given token sum to 1 and creates a proper probability distribution. When processing a word like "ate" in "The hungry cat ate fish", the model might assign higher weights to relevant context words like "cat" (0.6) and "fish" (0.3), and lower weights to less important words like "the" (0.02).
- Output Computation: The final representation for each token is computed as a weighted sum of all value vectors, using the normalized attention weights. This process allows each token to gather information from all other tokens in the sequence, weighted by relevance. The resulting representations are context-aware and can capture both local grammatical structure and long-range dependencies, enabling the model to understand relationships between words even when they're far apart in the text.
3.3.2 Mathematics of Self-Attention
For an input sequence of tokens X = [x_1, x_2, \dots, x_n]:
- Compute Q = XW_Q, K = XW_K, and V = XW_V, where W_Q, W_K, W_V are learnable weight matrices.
- Calculate attention scores:
\text{Scores} = \frac{QK^\top}{\sqrt{d_k}}
Here, d_k is the dimension of the key vectors.
- Normalize scores with softmax:
\text{Weights} = \text{softmax}\left(\text{Scores}\right)
- Compute the output:
\text{Output} = \text{Weights} \cdot V
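To make these formulas concrete, here is a tiny worked example with two tokens and d_k = 2 (the numbers are chosen purely for illustration). Suppose the projections give:
Q = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad K = \begin{bmatrix} 1 & 0 \\ 1 & 1 \end{bmatrix}, \quad V = \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}
Then:
\text{Scores} = \frac{QK^\top}{\sqrt{2}} \approx \begin{bmatrix} 0.71 & 0.71 \\ 0 & 0.71 \end{bmatrix}, \quad \text{Weights} = \text{softmax}(\text{Scores}) \approx \begin{bmatrix} 0.5 & 0.5 \\ 0.33 & 0.67 \end{bmatrix}
\text{Output} = \text{Weights} \cdot V \approx \begin{bmatrix} 2.0 & 3.0 \\ 2.34 & 3.34 \end{bmatrix}
Each row of the weights sums to 1: the first token mixes the two value vectors equally, while the second token draws about two-thirds of its new representation from the second token's value vector.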
Example: Implementing Self-Attention
Let’s implement self-attention for a simple sequence.
Code Example: Self-Attention in NumPy
import numpy as np

def self_attention(X, W_Q, W_K, W_V, mask=None):
    """
    Compute self-attention for a sequence with optional masking.

    Parameters:
    -----------
    X: np.ndarray
        Input sequence of shape (n_tokens, d_model)
    W_Q, W_K, W_V: np.ndarray
        Weight matrices for Query, Key, Value transformations
    mask: np.ndarray, optional
        Attention mask of shape (n_tokens, n_tokens)

    Returns:
    --------
    output: np.ndarray
        Attended sequence of shape (n_tokens, d_v)
    weights: np.ndarray
        Attention weights of shape (n_tokens, n_tokens)
    """
    # Linear transformations
    Q = np.dot(X, W_Q)  # Shape: (n_tokens, d_k)
    K = np.dot(X, W_K)  # Shape: (n_tokens, d_k)
    V = np.dot(X, W_V)  # Shape: (n_tokens, d_v)

    # Calculate scaled dot-product attention
    d_k = K.shape[1]
    scores = np.dot(Q, K.T) / np.sqrt(d_k)  # Shape: (n_tokens, n_tokens)

    # Apply mask if provided (masked positions receive a large negative score)
    if mask is not None:
        scores = scores * mask - 1e9 * (1 - mask)

    # Numerically stable softmax normalization
    weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
    weights /= np.sum(weights, axis=-1, keepdims=True)

    # Compute weighted sum of value vectors
    output = np.dot(weights, V)  # Shape: (n_tokens, d_v)

    return output, weights

# Example usage with a more complex sequence
def create_example():
    # Create sample sequence
    X = np.array([
        [1, 0, 0],  # First token
        [0, 1, 0],  # Second token
        [0, 0, 1],  # Third token
        [1, 1, 0]   # Fourth token
    ])

    # Create weight matrices
    d_model = 3  # Input dimension
    d_k = 2      # Key/Query dimension
    d_v = 4      # Value dimension

    W_Q = np.random.randn(d_model, d_k) * 0.1
    W_K = np.random.randn(d_model, d_k) * 0.1
    W_V = np.random.randn(d_model, d_v) * 0.1

    # Create attention mask (optional)
    mask = np.array([
        [1, 1, 1, 1],
        [1, 1, 1, 1],
        [1, 1, 1, 0],  # Third position cannot attend to the fourth token
        [1, 1, 1, 1]
    ])

    return X, W_Q, W_K, W_V, mask

# Run example
X, W_Q, W_K, W_V, mask = create_example()
output, weights = self_attention(X, W_Q, W_K, W_V, mask)

print("Input Shape:", X.shape)
print("\nQuery Matrix Shape:", W_Q.shape)
print("Key Matrix Shape:", W_K.shape)
print("Value Matrix Shape:", W_V.shape)
print("\nAttention Weights:\n", weights)
print("\nOutput Shape:", output.shape)
print("Output:\n", output)
Code Breakdown Explanation:
- Function Definition and Parameters:
- The function takes input sequence X and three weight matrices (W_Q, W_K, W_V)
- Added optional masking parameter for more control over attention
- Includes comprehensive docstring with parameter descriptions
- Linear Transformations:
- Converts input tokens into Query (Q), Key (K), and Value (V) representations
- Uses matrix multiplication (np.dot) for efficient computation
- Maintains proper shape transformations throughout
- Attention Score Computation:
- Implements scaled dot-product attention with proper scaling factor
- Includes masking functionality for selective attention
- Uses numerically stable softmax implementation
- Example Implementation:
- Creates a realistic example with 4 tokens and 3 features
- Demonstrates proper initialization of weight matrices
- Shows how to use optional masking
- Shape Information:
- Clearly documents tensor shapes throughout the process
- Helps understand the dimensional transformations
- Makes debugging easier
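One point worth making explicit from the breakdown above is how the mask parameter enables causal (left-to-right) attention, where each position may attend only to itself and earlier positions. Below is a minimal sketch that reuses the self_attention and create_example functions defined above; the lower-triangular mask is our own illustrative choice, not something the original example requires.

import numpy as np

# Reuse the toy inputs, but replace the mask with a causal (lower-triangular) one
X, W_Q, W_K, W_V, _ = create_example()
causal_mask = np.tril(np.ones((X.shape[0], X.shape[0])))

output, weights = self_attention(X, W_Q, W_K, W_V, mask=causal_mask)

# Positions above the diagonal receive a score of -1e9 before softmax,
# so their weights collapse to (almost) zero while every row still sums to 1
print(np.round(weights, 3))
print("Row sums:", weights.sum(axis=-1))

This is exactly the pattern decoder-style models use during training to prevent a token from "seeing" future tokens.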
3.3.3 What Is Multi-Head Attention?
Multi-head attention represents a sophisticated enhancement to the self-attention mechanism by running multiple parallel attention computations, called "heads." Each head operates independently and learns to focus on different aspects of the relationships between tokens in the sequence. For example, one head might learn to focus on syntactic relationships (like subject-verb agreement), while another might capture semantic relationships (like topic relevance), and yet another might detect long-range dependencies (like coreference resolution).
This parallel processing architecture provides several key advantages. First, it allows the model to simultaneously analyze the input sequence from multiple perspectives, much like how humans process language by considering multiple aspects at once. Second, by having multiple specialized attention mechanisms, the model can capture both fine-grained and broad patterns in the data. Finally, the diverse representations learned by different heads combine to create a richer, more nuanced understanding of the input sequence.
The outputs from all heads are ultimately combined through a concatenation operation followed by a linear transformation, allowing the model to synthesize these different perspectives into a cohesive representation. This multi-faceted approach significantly enhances the model's capacity to understand and process complex linguistic patterns, making it particularly effective for tasks requiring sophisticated language understanding.
Steps in Multi-Head Attention
- Split the input into multiple heads:
- Divide the input sequence into separate subspaces
- Each head receives a portion of the input's dimensionality
- This splitting allows parallel processing of different feature aspects
- Apply self-attention independently to each head:
- Each head computes its own Query (Q), Key (K), and Value (V) matrices
- Calculates attention scores using scaled dot-product attention
- Processes information focusing on different aspects of the input
- Concatenate the outputs of all heads:
- Combine the results from each attention head
- Preserves the unique patterns and relationships learned by each head
- Creates a comprehensive representation of the input sequence
- Apply a final linear transformation:
- Project the concatenated outputs to the desired dimension
- Integrates information from all heads into a cohesive representation
- Allows the model to weight the importance of different heads' outputs
Benefits of Multi-Head Attention
- Diverse Representations: Each attention head specializes in capturing different types of relationships within the data. For example, one head might focus on syntactic dependencies (like subject-verb agreement), while another might detect semantic relationships (like topic relevance), and yet another might identify long-range dependencies (like coreference resolution). This diversity allows the model to build a rich, multi-faceted understanding of the input.
- Improved Expressiveness: The model can focus on multiple aspects of the input simultaneously, similar to how humans process language. This parallel processing enables the model to:
- Capture both local and global context
- Process different semantic levels (word-level, phrase-level, sentence-level)
- Learn hierarchical relationships between tokens
- Combine different perspectives into a more comprehensive understanding
- Enhanced Learning Capacity: Multiple heads allow the model to distribute attention across different subspaces, effectively increasing its representational power without significantly increasing computational complexity.
- Robust Feature Detection: By maintaining multiple independent attention mechanisms, the model becomes more robust as it doesn't rely on a single attention pattern, reducing the impact of noise or misleading patterns in the data.
Example: Multi-Head Attention
Let’s implement a simplified version of multi-head attention.
Code Example: Multi-Head Attention in NumPy
import numpy as np

def multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads, mask=None):
    """
    Compute multi-head attention with optional masking.

    Parameters:
    -----------
    X: np.ndarray
        Input sequence of shape (n_tokens, d_model)
    W_Q, W_K, W_V: np.ndarray
        Weight matrices for Query, Key, Value transformations
    W_O: np.ndarray
        Output projection matrix
    n_heads: int
        Number of attention heads
    mask: np.ndarray, optional
        Attention mask of shape (n_tokens, n_tokens)

    Returns:
    --------
    final_output: np.ndarray
        Transformed sequence of shape (n_tokens, d_model)
    attention_weights: list
        List of attention weights for each head
    """
    d_model = X.shape[1]
    head_dim = W_Q.shape[1] // n_heads

    outputs = []
    attention_weights = []

    # Process each attention head
    for i in range(n_heads):
        # Slice the weight matrices for the current head
        Q = np.dot(X, W_Q[:, i*head_dim:(i+1)*head_dim])  # (n_tokens, head_dim)
        K = np.dot(X, W_K[:, i*head_dim:(i+1)*head_dim])  # (n_tokens, head_dim)
        V = np.dot(X, W_V[:, i*head_dim:(i+1)*head_dim])  # (n_tokens, head_dim)

        # Compute scaled attention scores
        scores = np.dot(Q, K.T) / np.sqrt(head_dim)  # (n_tokens, n_tokens)

        # Apply mask if provided
        if mask is not None:
            scores = scores * mask - 1e9 * (1 - mask)

        # Numerically stable softmax
        weights = np.exp(scores - np.max(scores, axis=-1, keepdims=True))
        weights = weights / np.sum(weights, axis=-1, keepdims=True)

        # Compute weighted sum of values
        output = np.dot(weights, V)  # (n_tokens, head_dim)

        outputs.append(output)
        attention_weights.append(weights)

    # Concatenate all heads
    concatenated = np.concatenate(outputs, axis=-1)  # (n_tokens, d_model)

    # Final linear transformation
    final_output = np.dot(concatenated, W_O)  # (n_tokens, d_model)

    return final_output, attention_weights

# Example usage with a more realistic sequence
def create_example_inputs(n_tokens=4, d_model=8, n_heads=2):
    """Create example inputs for multi-head attention."""
    # Input sequence
    X = np.random.randn(n_tokens, d_model)

    # Weight matrices (each head uses a d_model // n_heads slice of the projections)
    W_Q = np.random.randn(d_model, d_model) * 0.1
    W_K = np.random.randn(d_model, d_model) * 0.1
    W_V = np.random.randn(d_model, d_model) * 0.1
    W_O = np.random.randn(d_model, d_model) * 0.1

    # Optional mask (causal attention)
    mask = np.tril(np.ones((n_tokens, n_tokens)))

    return X, W_Q, W_K, W_V, W_O, mask

# Run example
X, W_Q, W_K, W_V, W_O, mask = create_example_inputs()
output, weights = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=2, mask=mask)

print("Input shape:", X.shape)
print("Output shape:", output.shape)
print("\nAttention weights for first head:\n", weights[0])
print("\nAttention weights for second head:\n", weights[1])
Code Breakdown:
- Function Architecture
- Implements multi-head attention with comprehensive documentation
- Includes optional masking for causal attention
- Returns both outputs and attention weights for analysis
- Key Components
- Head Dimension Calculation: Splits input dimension across heads
- Per-Head Processing: Computes separate attention for each head
- Attention Mechanism: Implements scaled dot-product attention
- Output Aggregation: Concatenates and projects head outputs
- Enhanced Features
- Numerical Stability: Uses stable softmax implementation
- Masking Support: Allows for masked attention patterns
- Proper Scaling: Includes attention scaling factor
- Helper Functions
- create_example_inputs: Generates realistic test data
- Includes shape information and initialization logic
- Demonstrates proper usage patterns
- Output Analysis
- Prints shapes for verification
- Shows attention weights for interpretation
- Demonstrates the multi-head nature of attention
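The per-head Python loop above is easy to follow, but most practical implementations (including PyTorch's nn.MultiheadAttention used in the next examples) compute all heads at once by reshaping. The sketch below is one possible vectorized equivalent of the function above, with the optional mask omitted for brevity; the name multi_head_attention_vectorized is our own and not part of any library.

import numpy as np

def multi_head_attention_vectorized(X, W_Q, W_K, W_V, W_O, n_heads):
    """Same computation as the loop-based version, without the per-head loop."""
    n_tokens, d_model = X.shape
    head_dim = d_model // n_heads

    # Project once, then split the last dimension into heads:
    # (n_tokens, d_model) -> (n_heads, n_tokens, head_dim)
    def split_heads(M):
        return M.reshape(n_tokens, n_heads, head_dim).transpose(1, 0, 2)

    Q = split_heads(np.dot(X, W_Q))
    K = split_heads(np.dot(X, W_K))
    V = split_heads(np.dot(X, W_V))

    # Scaled dot-product attention for all heads at once
    scores = np.matmul(Q, K.transpose(0, 2, 1)) / np.sqrt(head_dim)  # (n_heads, n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    # Weighted sum, then merge heads back: (n_heads, n, head_dim) -> (n, d_model)
    heads_out = np.matmul(weights, V)
    concatenated = heads_out.transpose(1, 0, 2).reshape(n_tokens, d_model)

    return np.dot(concatenated, W_O), weights

# Sanity check against the loop-based version defined earlier (no mask in either call)
X, W_Q, W_K, W_V, W_O, _ = create_example_inputs()
out_loop, _ = multi_head_attention(X, W_Q, W_K, W_V, W_O, n_heads=2)
out_vec, _ = multi_head_attention_vectorized(X, W_Q, W_K, W_V, W_O, n_heads=2)
print("Outputs match:", np.allclose(out_loop, out_vec))

Both versions produce the same result; the vectorized form simply replaces the Python loop with batched matrix multiplications, which is what makes multi-head attention inexpensive on modern hardware.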
3.3.4 Applications of Self-Attention and Multi-Head Attention
Text Summarization
Models leverage attention mechanisms in sophisticated ways to identify and prioritize the most important parts of a document. The attention mechanism works by assigning different weights to different parts of the input text, essentially creating a hierarchy of importance. These weights are learned during training and are dynamically adjusted based on the specific content being processed.
The attention weights serve as a sophisticated filtering mechanism that helps determine which sentences carry the most critical information. This process involves analyzing various linguistic features, including semantic relevance, syntactic structure, and contextual relationships between different parts of the text. The model can then create concise and meaningful summaries while preserving the core message and maintaining coherence.
For example, in news article summarization, the model employs a multi-layered approach to attention. It might attend strongly to key events (such as main actions or developments), significant quotes from relevant figures, and important statistical data that supports the main narrative. Meanwhile, it assigns lower attention weights to supplementary details, background information, or redundant content. This selective attention process mirrors human summarization behavior, where we naturally focus on crucial information while skimming over less important details.
Code Example: Text Summarization with Self-Attention
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttentionSummarizer(nn.Module):
    def __init__(self, vocab_size, embed_dim, num_heads, hidden_dim, max_length=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.position_encoding = nn.Parameter(
            torch.zeros(max_length, embed_dim)
        )
        self.multihead_attention = nn.MultiheadAttention(
            embed_dim, num_heads, batch_first=True
        )
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.feed_forward = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim)
        )
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.output_layer = nn.Linear(embed_dim, vocab_size)

    def forward(self, x, src_mask=None):
        # Add (learned) positional encoding to embeddings
        seq_length = x.size(1)
        x = self.embedding(x) + self.position_encoding[:seq_length]

        # Self-attention block with residual connection
        attention_output, attention_weights = self.multihead_attention(
            x, x, x,
            key_padding_mask=src_mask,
            need_weights=True
        )
        x = self.layer_norm1(x + attention_output)

        # Feed-forward block with residual connection
        ff_output = self.feed_forward(x)
        x = self.layer_norm2(x + ff_output)

        # Generate output logits over the vocabulary
        output = self.output_layer(x)
        return output, attention_weights

def generate_summary(model, input_ids, tokenizer, max_length=150):
    model.eval()
    with torch.no_grad():
        output, attention_weights = model(input_ids)

        # Average attention each token receives across all query positions
        attention_scores = attention_weights.mean(dim=1)

        # Select the most attended tokens (never more than the sequence length)
        k = min(max_length, attention_scores.size(-1))
        top_scores = torch.topk(attention_scores.squeeze(0), k=k)

        # Extract summary tokens and restore their original order for coherence
        summary_indices = top_scores.indices.sort().values
        summary_tokens = input_ids[0, summary_indices]

        # Convert to text
        summary = tokenizer.decode(summary_tokens)

    return summary, attention_weights

# Example usage
def summarize_text(text, model, tokenizer):
    # Tokenize input text
    inputs = tokenizer(text, return_tensors="pt", max_length=512, truncation=True)

    # Generate summary
    summary, attention = generate_summary(
        model,
        inputs["input_ids"],
        tokenizer
    )

    return summary, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based summarizer with multi-head self-attention
- Includes positional encoding for sequence awareness
- Uses layer normalization and residual connections for stable training
- Key Components
- Embedding Layer: Converts tokens to dense vectors
- Multi-Head Attention: Processes text from multiple perspectives
- Feed-Forward Network: Adds non-linearity and transforms representations
- Output Layer: Generates final token predictions
- Summarization Process
- Analyzes attention weights to identify important tokens
- Selects top-attended tokens for summary generation
- Maintains original order of selected tokens for coherence
- Advanced Features
- Supports variable length inputs with masking
- Implements efficient batch processing
- Returns attention weights for analysis and visualization
Usage Example:
# Example setup and usage
vocab_size = 30000
embed_dim = 512
num_heads = 8
hidden_dim = 2048

model = SelfAttentionSummarizer(
    vocab_size=vocab_size,
    embed_dim=embed_dim,
    num_heads=num_heads,
    hidden_dim=hidden_dim
)

# Example text
text = """
Climate change poses significant challenges to global ecosystems.
Rising temperatures affect wildlife habitats and agricultural productivity.
Scientists warn that immediate action is necessary to prevent irreversible damage.
"""

# Generate summary (assuming tokenizer is initialized)
summary, attention = summarize_text(text, model, tokenizer)
print("Summary:", summary)
Machine Translation
Attention mechanisms revolutionize machine translation by creating sophisticated dynamic alignments between words and phrases across languages. This process works by establishing weighted connections between elements in the source and target languages, allowing the model to understand complex linguistic relationships. For example, when translating from English to Japanese, the attention mechanism can handle the significant differences in sentence structure, where English follows Subject-Verb-Object order while Japanese typically uses Subject-Object-Verb order.
The mechanism is particularly powerful in handling three key translation challenges:
First, it manages complex word order variations between languages. For instance, when translating between English and German, where the verb position can vary significantly, the attention mechanism can maintain proper semantic relationships despite syntactic differences.
Second, it handles many-to-one and one-to-many word mappings effectively. For example, when translating the German compound word "Schadenfreude" to English, the mechanism can map it to the phrase "pleasure derived from another's misfortune," maintaining accurate meaning despite the structural difference.
Third, the model maintains contextual awareness across extended sentences through its ability to reference and weight the importance of different parts of the input sequence. This ensures that long sentences retain their meaning and coherence in translation, preventing common issues like losing track of subject-verb relationships or mishandling dependent clauses.
The attention mechanism achieves this by continuously updating its focus based on the current word being translated and its relationship to all other words in the sentence, ensuring that the final translation preserves both meaning and natural language flow.
Code Example: Neural Machine Translation with Self-Attention
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

class TranslationTransformer(nn.Module):
    def __init__(self, src_vocab_size, tgt_vocab_size, d_model=512, nhead=8,
                 num_encoder_layers=6, num_decoder_layers=6, dim_feedforward=2048):
        super().__init__()

        # Embedding layers
        self.src_embedding = nn.Embedding(src_vocab_size, d_model)
        self.tgt_embedding = nn.Embedding(tgt_vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model)

        # Transformer layers (batch_first: tensors are (batch, seq, d_model))
        self.transformer = nn.Transformer(
            d_model=d_model,
            nhead=nhead,
            num_encoder_layers=num_encoder_layers,
            num_decoder_layers=num_decoder_layers,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )

        # Output projection
        self.output_layer = nn.Linear(d_model, tgt_vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Create source and target embeddings with positional information
        src_embedded = self.positional_encoding(self.src_embedding(src))
        tgt_embedded = self.positional_encoding(self.tgt_embedding(tgt))

        # Only the target needs a causal mask; the encoder sees the full source
        if tgt_mask is None:
            tgt_mask = self.generate_square_subsequent_mask(tgt.size(1)).to(tgt.device)

        # Pass through transformer
        output = self.transformer(
            src_embedded, tgt_embedded,
            src_mask=src_mask,
            tgt_mask=tgt_mask
        )

        # Project to vocabulary
        return self.output_layer(output)

    @staticmethod
    def generate_square_subsequent_mask(sz):
        # -inf above the diagonal blocks attention to future positions
        mask = torch.triu(torch.ones(sz, sz), diagonal=1)
        mask = mask.masked_fill(mask == 1, float('-inf'))
        return mask

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        return x + self.pe[:, :x.size(1)]

# Training function (src and tgt batches have shape (batch, seq_len))
def train_translation_model(model, train_loader, optimizer, criterion, num_epochs=10):
    model.train()
    for epoch in range(num_epochs):
        total_loss = 0
        for batch_idx, (src, tgt) in enumerate(train_loader):
            optimizer.zero_grad()

            # Forward pass with teacher forcing: feed the target shifted right
            output = model(src, tgt[:, :-1])  # exclude last target token

            # Calculate loss against the target shifted left (exclude BOS token)
            loss = criterion(
                output.reshape(-1, output.size(-1)),
                tgt[:, 1:].reshape(-1)
            )

            # Backward pass
            loss.backward()
            optimizer.step()

            total_loss += loss.item()

        avg_loss = total_loss / len(train_loader)
        print(f'Epoch: {epoch+1}, Average Loss: {avg_loss:.4f}')
Code Breakdown:
- Model Architecture
- Implements a complete Transformer-based translation model
- Uses both encoder and decoder with multi-head attention
- Includes positional encoding for sequence order awareness
- Key Components
- Source and Target Embeddings: Convert tokens to vectors
- Positional Encoding: Adds position information to embeddings
- Transformer Block: Processes sequences using self-attention
- Output Projection: Maps to target vocabulary
- Training Process
- Implements teacher forcing during training
- Uses masked attention for autoregressive generation
- Includes loss calculation and optimization steps
- Advanced Features
- Supports variable length sequences
- Implements efficient batch processing
- Includes mask generation for causal attention
Usage Example:
# Initialize model and training components
model = TranslationTransformer(
    src_vocab_size=10000,
    tgt_vocab_size=10000,
    d_model=512,
    nhead=8
)

# Setup optimizer and criterion (pad_idx is the padding token id from the target tokenizer)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss(ignore_index=pad_idx)

# Example translation
def translate(model, src_sentence, src_tokenizer, tgt_tokenizer, max_len=50):
    model.eval()
    with torch.no_grad():
        # Tokenize source sentence
        src_tokens = src_tokenizer.encode(src_sentence)
        src_tensor = torch.LongTensor(src_tokens).unsqueeze(0)

        # Initialize target with BOS token
        tgt_tokens = [tgt_tokenizer.bos_token_id]

        # Generate translation greedily, one token at a time
        for _ in range(max_len):
            tgt_tensor = torch.LongTensor(tgt_tokens).unsqueeze(0)
            output = model(src_tensor, tgt_tensor)
            next_token = output[0, -1].argmax().item()

            if next_token == tgt_tokenizer.eos_token_id:
                break

            tgt_tokens.append(next_token)

        # Convert tokens to text
        translation = tgt_tokenizer.decode(tgt_tokens)

    return translation
Question Answering
When processing questions, attention mechanisms employ a sophisticated approach to information processing. These mechanisms help models identify and focus on the specific parts of a passage that contain relevant information through a multi-step process:
First, the model analyzes the question to understand what type of information it needs to look for. Then, it creates attention weights for each word in the passage, giving higher weights to words and phrases that are more likely to contain the answer. This selective focus enables the model to efficiently extract answers while ignoring irrelevant content.
For instance, when answering "When did the event occur?", the model would primarily attend to temporal expressions (such as dates, times, and temporal phrases like "yesterday" or "last week") and their surrounding context in the passage. The attention weights would be highest for these temporal indicators and their immediate context, allowing the model to zero in on the most relevant information. This process is similar to how humans might scan a text for time-related words when looking for when something happened.
Code Example: Question Answering with Self-Attention
import torch
import torch.nn as nn
import torch.nn.functional as F

class QATransformer(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=6):
        super().__init__()

        # Embedding layers (PositionalEncoding as defined in the translation example)
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model)

        # Multi-head attention encoders (batch_first: inputs are (batch, seq, d_model))
        self.question_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )
        self.context_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True),
            num_layers
        )

        # Cross-attention layer
        self.cross_attention = nn.MultiheadAttention(d_model, nhead, batch_first=True)

        # Output layers for start and end position prediction
        self.start_predictor = nn.Linear(d_model, 1)
        self.end_predictor = nn.Linear(d_model, 1)

    def forward(self, question, context):
        # Embed inputs
        q_embed = self.pos_encoder(self.embedding(question))
        c_embed = self.pos_encoder(self.embedding(context))

        # Encode question and context separately
        q_encoded = self.question_encoder(q_embed)
        c_encoded = self.context_encoder(c_embed)

        # Cross-attention: each context position attends to the question,
        # so the output has one vector per context token
        attn_output, attention_weights = self.cross_attention(
            c_encoded, q_encoded, q_encoded
        )

        # Predict answer span over context positions
        start_logits = self.start_predictor(attn_output).squeeze(-1)
        end_logits = self.end_predictor(attn_output).squeeze(-1)

        return start_logits, end_logits, attention_weights

def train_qa_model(model, train_loader, optimizer, num_epochs=10):
    model.train()
    criterion = nn.CrossEntropyLoss()

    for epoch in range(num_epochs):
        for batch in train_loader:
            question, context, start_pos, end_pos = batch

            # Forward pass
            start_logits, end_logits, _ = model(question, context)

            # Calculate loss on both span boundaries
            start_loss = criterion(start_logits, start_pos)
            end_loss = criterion(end_logits, end_pos)
            loss = start_loss + end_loss

            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def predict_answer(model, tokenizer, question, context):
    model.eval()
    with torch.no_grad():
        # Tokenize inputs
        q_tokens = tokenizer.encode(question)
        c_tokens = tokenizer.encode(context)

        # Convert to tensors of shape (1, seq_len)
        q_tensor = torch.tensor(q_tokens).unsqueeze(0)
        c_tensor = torch.tensor(c_tokens).unsqueeze(0)

        # Get predictions and drop the batch dimension
        start_logits, end_logits, attention = model(q_tensor, c_tensor)
        start_logits = start_logits.squeeze(0)
        end_logits = end_logits.squeeze(0)

        # Find the most likely answer span (end must not precede start)
        start_idx = torch.argmax(start_logits).item()
        end_idx = torch.argmax(end_logits[start_idx:]).item() + start_idx

        # Extract answer tokens and convert back to text
        answer_tokens = c_tokens[start_idx:end_idx + 1]
        answer = tokenizer.decode(answer_tokens)

    return answer, attention
Code Breakdown:
- Model Architecture
- Implements a Transformer-based QA model with separate encoders for questions and context
- Uses multi-head self-attention for both question and context processing
- Includes cross-attention mechanism to relate questions to context
- Features span prediction for answer extraction
- Key Components
- Embedding Layer: Converts text tokens to dense vectors
- Positional Encoding: Adds position information to embeddings
- Question/Context Encoders: Process inputs using self-attention
- Cross-Attention: Relates question to context for answer finding
- Span Predictors: Locate answer boundaries in context
- Processing Flow
- Embeds and encodes question and context separately
- Applies cross-attention to find relevant context regions
- Predicts start and end positions of answer span
- Returns answer text and attention weights for analysis
Usage Example:
# Initialize model and components
model = QATransformer(
    vocab_size=30000,
    d_model=512,
    nhead=8,
    num_layers=6
)

# Example usage
question = "When was the first computer invented?"
context = "The first general-purpose electronic computer, ENIAC, was completed in 1945."

# Get answer (assuming a compatible tokenizer is initialized)
answer, attention_weights = predict_answer(
    model, tokenizer, question, context
)

print(f"Question: {question}")
print(f"Answer: {answer}")
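Since predict_answer also returns the cross-attention weights, you can inspect which question tokens each context position attended to when the span was selected. Below is a minimal visualization sketch, assuming matplotlib is installed; with the batch_first layers above, the weights have shape (1, context_length, question_length).

import matplotlib.pyplot as plt

# attention_weights: (1, context_length, question_length)
weights = attention_weights.squeeze(0).numpy()

plt.imshow(weights, aspect="auto", cmap="viridis")
plt.xlabel("Question token position")
plt.ylabel("Context token position")
plt.title("Cross-attention weights")
plt.colorbar()
plt.show()

Rows with concentrated weight indicate context tokens that aligned strongly with part of the question, which is often where the predicted answer span lies.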
3.3.5 Key Takeaways
- Self-attention enables models to compute context-aware representations by attending to all tokens in a sequence. This means each word in a sentence can directly interact with every other word, allowing the model to understand complex relationships and dependencies. For example, in the sentence "The cat that chased the mouse was black", self-attention helps the model connect "was black" back to "cat" even though they're separated by several words.
- Multi-head attention enhances self-attention by capturing diverse relationships simultaneously. While a single attention head might focus on syntactic relationships, another might capture semantic similarities, and yet another might track temporal relationships. This multi-faceted approach allows the model to process information through multiple different "perspectives" at once, leading to richer and more nuanced understanding of the input.
- Together, these mechanisms are the foundation of Transformer architectures, allowing for parallelism and long-range dependency modeling. Unlike traditional sequential models that process words one at a time, Transformers can process all words simultaneously, dramatically improving computational efficiency. Additionally, because every token can attend to every other token directly, Transformers excel at capturing relationships between words that are far apart in the text, solving the long-standing challenge of modeling long-range dependencies in natural language processing.