NLP with Transformers: Advanced Techniques and Multimodal Applications

Chapter 3: Training and Fine-Tuning Transformers

3.1 Data Preprocessing for Transformer Models

Fine-tuning transformer models has become the industry standard method for adapting pretrained language models to specialized NLP tasks. This process involves taking a model that has been trained on a large corpus of general text data and further training it on task-specific data to optimize its performance. While powerful pretrained models like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer) demonstrate impressive capabilities in understanding and generating human language, they typically require additional fine-tuning to excel at specific applications, such as sentiment analysis, document classification, or specialized translation tasks. 

The fine-tuning process involves several key components that we'll explore in detail throughout this chapter. We begin with data preprocessing, which is crucial for ensuring your input data is correctly formatted and tokenized for transformer models. This includes cleaning the text, handling special characters, and converting words into the numerical representations that these models can process.

Following preprocessing, we'll examine advanced fine-tuning techniques that have revolutionized the field. These include LoRA (Low-Rank Adaptation), which efficiently adapts large models by updating a small number of parameters, and Prefix Tuning, which prepends learnable tokens to the input while keeping the original model frozen. We'll also cover comprehensive evaluation strategies using industry-standard metrics: BLEU (Bilingual Evaluation Understudy) for measuring translation quality, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) for assessing text summarization, and BERTScore for semantic similarity evaluation.

By the end of this chapter, you will possess a comprehensive understanding of the entire fine-tuning pipeline: from preparing your datasets and selecting appropriate training strategies, to implementing effective fine-tuning techniques and rigorously evaluating model performance using multiple metrics. This knowledge will enable you to adapt transformer models to your specific use cases while maintaining efficiency and accuracy.

Data preprocessing is a critical step when working with transformer models, serving as the foundation for successful model training and deployment. This process involves several key transformations of raw text data. First, transformers require text inputs to be tokenized - broken down into smaller units like words or subwords - and converted into numerical representations (typically vectors) that the model can process mathematically. This tokenization process can use different approaches such as WordPiece, Byte-Pair Encoding (BPE), or SentencePiece, each with its own advantages for different languages and use cases.

Beyond basic tokenization, attention masks play a crucial role in efficient processing. These binary masks tell the model which tokens are actual input data and which are padding tokens (used to make all sequences in a batch the same length). This distinction is essential because it prevents the model from wasting computational resources on padding tokens and ensures that padding doesn't influence the model's understanding of the actual content.

Furthermore, proper label encoding is essential for supervised learning tasks. Whether you're working on classification (converting categorical labels to numerical values), sequence labeling (assigning labels to individual tokens), or more complex tasks, the labels must be encoded in a format that aligns with the model's architecture and training objectives.

In this section, we will cover three fundamental aspects of preprocessing:

  1. Tokenization and Padding - Converting text to tokens and ensuring uniform sequence lengths
  2. Handling Long Sequences - Strategies for managing text that exceeds the model's maximum input length
  3. Preprocessing for Specific Tasks - Task-specific considerations and requirements

3.1.1 Tokenization and Padding

Tokenization is a fundamental preprocessing step that transforms raw text into a format that transformer models can process. This process breaks down text into smaller units called tokens, which can be:

  • Words (e.g., "hello", "world")
  • Subwords (e.g., "play", "##ing", where "##" indicates a continuation)
  • Individual characters (particularly useful for character-based languages)

For example, consider the sentence "transformers are amazing". Using subword tokenization, it might be broken down as:

  1. "transform" (root word)
  2. "##ers" (suffix)
  3. "are" (complete word)
  4. "amazing" (complete word)

These tokens are then mapped to unique numerical IDs using a vocabulary lookup table. For instance:

  • "transform" → 19081
  • "##ers" → 2024
  • "are" → 2003
  • "amazing" → 6429

This numerical representation is essential because neural networks can only process numbers, not text directly.
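
You can inspect this mapping yourself with a few lines of code. The short sketch below (one possible illustration, using the bert-base-uncased tokenizer from the Hugging Face transformers library) prints each subword token alongside its vocabulary ID; the exact splits and IDs depend on the model's vocabulary, so the numbers shown above are purely illustrative.

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Split the sentence into subword tokens, then look up their vocabulary IDs
tokens = tokenizer.tokenize("transformers are amazing")
token_ids = tokenizer.convert_tokens_to_ids(tokens)

for token, token_id in zip(tokens, token_ids):
    print(f"{token!r} -> {token_id}")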

Padding is another crucial preprocessing step that addresses a technical requirement of transformer models: batch processing. Since neural networks process multiple sequences simultaneously for efficiency, all sequences in a batch must have the same length. Here's how padding works:

  1. First, identify the longest sequence in your batch
  2. Add special padding tokens ([PAD] or 0) to shorter sequences
  3. Create an attention mask to tell the model which tokens are real and which are padding

For example, if we have these sequences:

  • "Hello world" (2 tokens)
  • "The quick brown fox jumps" (5 tokens)

The padding process would make both sequences 5 tokens long:

  • "Hello world [PAD] [PAD] [PAD]"
  • "The quick brown fox jumps"

This ensures uniform processing while maintaining the integrity of the original sequences through attention masks that tell the model to ignore the padding tokens during computation.

Example: Tokenization and Padding with BERT

from transformers import BertTokenizer
import torch

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define sample texts of different lengths
texts = [
    "Transformers are amazing!",
    "They are used for many NLP tasks.",
    "This is a longer sentence that will show padding in action."
]

# Tokenize the texts with different parameters
# 1. Basic tokenization
basic_tokens = tokenizer(texts[0])
print("\n1. Basic tokenization:")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(basic_tokens['input_ids'])}")

# 2. Batch tokenization with padding
batch_tokenized = tokenizer(
    texts,
    padding=True,  # Add padding
    truncation=True,  # Enable truncation
    max_length=12,  # Set maximum length
    return_tensors="pt"  # Return PyTorch tensors
)

print("\n2. Batch tokenization results:")
print("Input IDs:")
print(batch_tokenized["input_ids"])
print("\nAttention Masks:")
print(batch_tokenized["attention_mask"])

# 3. Decode back to text
print("\n3. Decoded text from tokens:")
for i in range(len(texts)):
    decoded = tokenizer.decode(batch_tokenized["input_ids"][i])
    print(f"Original: {texts[i]}")
    print(f"Decoded:  {decoded}\n")

Detailed Breakdown:

  1. Importing and Setup:
    • We import both the BertTokenizer and torch
    • Initialize the BERT tokenizer with the uncased model variant
  2. Basic Tokenization:
    • Shows how a single sentence is tokenized
    • Demonstrates token-to-text conversion for better understanding
  3. Batch Processing:
    • Processes multiple sentences of different lengths
    • Uses padding to make all sequences uniform length
    • Sets max_length=12 to demonstrate truncation
  4. Key Parameters:
    • padding=True: Adds padding tokens to shorter sequences
    • truncation=True: Cuts longer sequences to max_length
    • return_tensors="pt": Returns PyTorch tensors instead of lists
  5. Output Explanation:
    • input_ids: Numerical representations of tokens
    • attention_mask: 1s for real tokens, 0s for padding
    • Decoded text shows how the model reconstructs the original input

Explanation:

  • input_ids: Tokenized representation of the input text.
  • attention_mask: Binary mask indicating which tokens are actual input (1) and which are padding (0).

Output:

1. Basic tokenization:
Tokens: ['[CLS]', 'transformers', 'are', 'amazing', '!', '[SEP]']

2. Batch tokenization results:
Input IDs:
tensor([[  101,  2234,  2024,  6429,   999,   102,     0,     0,     0,     0,
            0,     0],
        [  101,  2027,  2024,  2107,  2005,  2116,  3319,  2202,   999,   102,
            0,     0],
        [  101,  2023,  2003,  1037,  2208,  6251,  2008,  2097,  4058,  1999,
         2039,   102]])

Attention Masks:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

3. Decoded text from tokens:
Original: Transformers are amazing!
Decoded: [CLS] transformers are amazing ! [SEP]

Original: They are used for many NLP tasks.
Decoded: [CLS] they are used for many nlp tasks . [SEP]

Original: This is a longer sentence that will show padding in action.
Decoded: [CLS] this is a longer sentence that will show [SEP]

3.1.2 Handling Long Sequences

Transformers have a maximum input length limitation due to their self-attention mechanism, which grows quadratically with sequence length. This limitation exists because the self-attention mechanism computes attention scores between every pair of tokens in the sequence, resulting in a computational complexity of O(n²), where n is the sequence length. As the sequence length grows, both memory usage and computational requirements increase dramatically.

For example, BERT has a maximum sequence length of 512 tokens, while GPT models typically handle 1024 or 2048 tokens. This means that for BERT, processing a sequence of 512 tokens requires computing and storing a 512 x 512 attention matrix for each attention head in each transformer layer. GPT models can handle longer sequences but still face similar computational constraints.
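
To get a feel for this scaling, the rough sketch below counts the attention scores a BERT-base-sized model (12 layers and 12 heads are assumed here) would need to store for a single example at different sequence lengths. It deliberately ignores activations, gradients, and memory-efficient attention implementations, so treat the megabyte figures as a back-of-the-envelope illustration only.

# Back-of-the-envelope illustration of quadratic attention cost
# (assumes BERT-base-like settings: 12 layers, 12 heads, 4-byte float scores)
num_layers, num_heads, bytes_per_score = 12, 12, 4

for seq_len in (128, 256, 512, 1024, 2048):
    scores = seq_len * seq_len * num_heads * num_layers
    print(f"seq_len={seq_len:5d}: {scores:,} attention scores "
          f"(~{scores * bytes_per_score / 1e6:.0f} MB for the score matrices alone)")

Doubling the sequence length quadruples these numbers, which is exactly why the strategies below matter.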

When dealing with texts that exceed these limits, there are two main approaches:

  1. Truncation: Simply cutting off the text at the maximum length. While straightforward, this may lose important information. This approach works best when:
    • The most relevant information appears at the beginning of the text
    • The task only requires understanding the general context rather than specific details
    • Processing speed is a priority over completeness
  2. Chunking: Splitting the text into overlapping or non-overlapping segments that fit within the length limit. This preserves all information but requires strategies for combining the results from multiple chunks. Common chunking strategies include:
    • Sliding window: Creating overlapping chunks with a fixed stride length
    • Sentence-based splitting: Breaking text at natural sentence boundaries
    • Hierarchical processing: Processing chunks individually and then combining results

The choice between these approaches depends on your specific task - truncation might work well for classification, while chunking is often necessary for tasks like document summarization or question answering. For example, in sentiment analysis, the overall sentiment might be captured well enough in the first few hundred tokens, making truncation acceptable. However, for tasks like document summarization or question answering where important information could be anywhere in the text, chunking becomes essential to ensure no critical information is lost.

Example: Truncating Long Sequences

# Define a long text sample
long_text = "Transformers are incredibly versatile models that have revolutionized the field of NLP. " * 20

# Imports (torch is needed for the sliding-window padding below)
import torch

# Initialize tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 1. Basic tokenization without truncation
tokenized_full = tokenizer(long_text, truncation=False, return_tensors="pt")
print("\n1. Full text tokenization:")
print(f"Original sequence length: {tokenized_full['input_ids'].shape[1]} tokens")

# 2. Tokenization with truncation
tokenized_truncated = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print("\n2. Truncated tokenization:")
print(f"Truncated sequence length: {tokenized_truncated['input_ids'].shape[1]} tokens")

# 3. Sliding window approach
def create_sliding_windows(text, window_size=256, stride=128):
    tokenized = tokenizer(text, return_tensors="pt")
    input_ids = tokenized["input_ids"][0]
    
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        if len(window) < window_size:  # Pad last window if needed
            padding = window_size - len(window)
            window = torch.cat([window, torch.zeros(padding, dtype=torch.long)])
        windows.append(window)
    
    return torch.stack(windows)

# Apply sliding window
sliding_windows = create_sliding_windows(long_text)
print("\n3. Sliding window approach:")
print(f"Number of windows: {len(sliding_windows)}")
print(f"Window shape: {sliding_windows.shape}")

# 4. Demonstrate window content
print("\n4. Content of first window:")
first_window_text = tokenizer.decode(sliding_windows[0])
print(first_window_text[:100] + "...")

Code Breakdown:

  1. Text Preparation:
    • Creates a long text sample by repeating a sentence 20 times
    • Initializes the BERT tokenizer for processing
  2. Full Tokenization:
    • Shows the original sequence length without truncation
    • Helps understand how much text exceeds the model's limits
  3. Truncation Approach:
    • Implements standard truncation at 512 tokens (BERT's limit)
    • Demonstrates the basic way to handle long sequences
  4. Sliding Window Implementation:
    • Creates overlapping windows of text (window_size=256, stride=128)
    • Allows processing of the entire text in manageable chunks
    • Includes padding for the last window if needed
  5. Window Content Display:
    • Shows the actual content of the first window
    • Helps verify the windowing process works correctly

Output:

1. Full text tokenization:
Original sequence length: ~400 tokens

2. Truncated tokenization:
Truncated sequence length: ~400 tokens (unchanged here, since the full text is already under the 512-token cap)

3. Sliding window approach:
Number of windows: ~4
Window shape: torch.Size([4, 256])

4. Content of first window:
[CLS] transformers are incredibly versatile models that have revolutionized the field of nlp. transformers are...

Note: The exact numbers would vary based on the actual tokenization of the repeated sentence, but this represents the expected structure of the output given the code's logic.

Example: Splitting Long Text into Chunks

# Function to split long text into chunks with overlap
# (reuses `torch` and the BERT `tokenizer` initialized in the previous example)
def split_text_into_chunks(text, max_length=128, overlap=20):
    # Tokenize the text
    tokenized = tokenizer(text, truncation=False, return_tensors="pt")
    input_ids = tokenized["input_ids"][0]
    attention_mask = tokenized["attention_mask"][0]
    
    chunks = []
    chunk_masks = []
    
    # Create chunks with overlap
    for i in range(0, len(input_ids), max_length - overlap):
        # Extract chunk
        chunk = input_ids[i:i + max_length]
        mask = attention_mask[i:i + max_length]
        
        # Pad if necessary
        if len(chunk) < max_length:
            padding_length = max_length - len(chunk)
            chunk = torch.cat([chunk, torch.zeros(padding_length, dtype=torch.long)])
            mask = torch.cat([mask, torch.zeros(padding_length, dtype=torch.long)])
        
        chunks.append(chunk)
        chunk_masks.append(mask)
    
    return {
        "input_ids": torch.stack(chunks),
        "attention_mask": torch.stack(chunk_masks)
    }

# Example usage
long_text = "This is a very long text that needs to be split into chunks. " * 20
chunks = split_text_into_chunks(long_text, max_length=128, overlap=20)

# Print information about chunks
print(f"Number of chunks: {len(chunks['input_ids'])}")
print(f"Chunk size: {chunks['input_ids'].shape}")

# Decode and print first chunk to verify content
first_chunk = tokenizer.decode(chunks['input_ids'][0])
print("\nFirst chunk content:")
print(first_chunk[:100], "...")

# Print overlap between chunks to verify
if len(chunks['input_ids']) > 1:
    overlap_first = tokenizer.decode(chunks['input_ids'][0][-20:])
    overlap_second = tokenizer.decode(chunks['input_ids'][1][:20])
    print("\nOverlap demonstration:")
    print("End of first chunk:", overlap_first)
    print("Start of second chunk:", overlap_second)

Code Breakdown:

  1. Function Parameters:
    • max_length: Maximum number of tokens per chunk (default: 128)
    • overlap: Number of overlapping tokens between chunks (default: 20)
  2. Key Components:
    • Tokenization: Converts input text to token IDs and attention masks
    • Chunk Creation: Creates overlapping chunks of specified length
    • Padding: Ensures all chunks are of equal length
    • Return Format: Dictionary with input_ids and attention_mask tensors
  3. Important Features:
    • Overlap handling prevents loss of context between chunks
    • Attention masks track valid tokens vs padding
    • Maintains compatibility with transformer model input requirements
  4. Verification Steps:
    • Prints number and size of chunks
    • Shows content of first chunk
    • Demonstrates overlap between consecutive chunks

Output:

Number of chunks: 3
Chunk size: torch.Size([3, 128])

First chunk content:
[CLS] this is a very long text that needs to be split into chunks. this is a very long text that n ...

Overlap demonstration:
End of first chunk: be split into chunks. this is a very long text that needs to be split into chunks. this
Start of second chunk: be split into chunks. this is a very long text that needs to be split into chunks. this

The exact number of chunks and content may vary depending on the actual tokenization, but this demonstrates the key output components showing:

  • The number of chunks created from the input text
  • The dimension of the chunks tensor
  • A sample of the first chunk's content
  • The overlapping region between consecutive chunks
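
To close the loop on the hierarchical-processing idea, the sketch below shows one simple way to combine per-chunk results for a classification task: run the model over all chunks in a single batch and average the logits. It assumes the chunks dictionary produced above and uses an untrained BertForSequenceClassification head purely for illustration; max-pooling, majority voting, or a second-stage model over chunk embeddings are equally reasonable aggregation strategies.

import torch
from transformers import BertForSequenceClassification

# Reuses the `chunks` dictionary produced by split_text_into_chunks(...) above.
# The classification head is randomly initialized here, so the prediction only
# illustrates the aggregation step, not a meaningful result.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
model.eval()

with torch.no_grad():
    outputs = model(
        input_ids=chunks["input_ids"],
        attention_mask=chunks["attention_mask"],
    )

# Average the per-chunk logits into a single document-level prediction
doc_logits = outputs.logits.mean(dim=0)
predicted_class = doc_logits.argmax().item()
print("Document-level prediction:", predicted_class)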

3.1.3 Preprocessing for Specific Tasks

Different NLP tasks require specific preprocessing steps to ensure optimal model performance. This preprocessing phase is crucial as it transforms raw text data into a format that transformer models can effectively process. The preprocessing pipeline must be carefully designed to handle the unique characteristics of each task while maintaining data integrity and model compatibility.

The exact preprocessing steps vary significantly depending on several key factors:

  • Task Type:
    • Classification tasks require balanced datasets and appropriate label encoding
    • Generation tasks need careful handling of start/end tokens and sequence formatting
    • Translation tasks must align source and target language pairs effectively
    • Question-answering tasks require proper context and question formatting
  • Model Architecture:
    • BERT-based models need special [CLS] and [SEP] tokens
    • GPT models require specific attention to end-of-sequence tokens
    • T5 models need task-specific prefixes (see the sketch after this list)
  • Dataset Requirements:
    • Data cleaning and normalization standards
    • Handling of special characters and formatting
    • Domain-specific terminology processing
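
To make the model-architecture differences above concrete, here is a small sketch (the model names are just common examples) contrasting how a T5 input carries its task as a text prefix with how a BERT tokenizer adds [CLS] and [SEP] automatically:

from transformers import AutoTokenizer

# T5 expects the task to be named as a prefix inside the input text itself
t5_tokenizer = AutoTokenizer.from_pretrained("t5-small")
t5_inputs = t5_tokenizer(
    "translate English to German: Transformers are amazing.",
    return_tensors="pt"
)
print(t5_tokenizer.convert_ids_to_tokens(t5_inputs["input_ids"][0]))

# BERT-style tokenizers instead add special [CLS] and [SEP] tokens for you
bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert_inputs = bert_tokenizer("Transformers are amazing.", return_tensors="pt")
print(bert_tokenizer.convert_ids_to_tokens(bert_inputs["input_ids"][0]))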

Common preprocessing steps form the foundation of any NLP pipeline:

  • Tokenization: Converting text into tokens that the model can process
    • Word-level: Splitting text into individual words
    • Subword-level: Breaking words into meaningful subunits
    • Character-level: Processing text as individual characters
  • Sequence Length Adjustment:
    • Padding shorter sequences to a fixed length
    • Truncating longer sequences to fit model constraints
    • Implementing dynamic batching strategies
  • Label Encoding:
    • Converting categorical labels to numerical format
    • Implementing one-hot encoding where appropriate
    • Handling multi-label scenarios (see the sketch after this list)
  • Special Token Handling:
    • Adding task-specific tokens
    • Managing separator and classification tokens
    • Implementing masking strategies
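
For the multi-label case mentioned above, labels are typically encoded as multi-hot vectors rather than single class indices. A minimal sketch with hypothetical topic tags, using scikit-learn's MultiLabelBinarizer:

from sklearn.preprocessing import MultiLabelBinarizer
import torch

# Hypothetical multi-label data: each text can carry several topic tags
tag_sets = [
    ["tech", "review"],   # a gadget review
    ["finance"],          # a market report
    ["sports", "news"],   # a match recap
]

# Convert the tag sets into multi-hot vectors (one column per known tag)
mlb = MultiLabelBinarizer()
multi_hot = torch.tensor(mlb.fit_transform(tag_sets), dtype=torch.float)

print("Classes:", list(mlb.classes_))
print("Multi-hot labels:\n", multi_hot)

These float-valued multi-hot labels are what a multi-label classification head trained with a sigmoid/binary cross-entropy loss expects.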

Additionally, task-specific considerations require careful attention:

  • Classification Tasks:
    • Handling class imbalance through sampling or weighting (see the sketch after this list)
    • Implementing stratification strategies
  • Long Document Processing:
    • Implementing sliding windows with appropriate overlap
    • Managing document segmentation
    • Maintaining context across segments
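
As one concrete way to address class imbalance, the weighting mentioned above can be computed from label frequencies and passed to the loss function. A minimal sketch with hypothetical counts, using scikit-learn's compute_class_weight:

import numpy as np
import torch
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced label set: 80 negative vs. 20 positive examples
labels = np.array(["negative"] * 80 + ["positive"] * 20)
classes = np.unique(labels)

# "balanced" gives inverse-frequency weights (rarer classes get larger weights)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=labels)
class_weights = torch.tensor(weights, dtype=torch.float)
print(dict(zip(classes, weights)))

# The weights would typically be handed to a weighted loss during training
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights)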

Here are examples for two common tasks:

Text Classification:

For classification, text inputs need to be tokenized, and their corresponding labels should be encoded.

from sklearn.preprocessing import LabelEncoder
import torch
from transformers import AutoTokenizer
import numpy as np

# Initialize tokenizer (e.g., BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sample data
texts = [
    "This movie was amazing!",
    "I did not like the ending.",
    "A masterpiece of modern cinema",
    "Waste of time and money",
    "It was just okay, nothing special"
]
labels = ["positive", "negative", "positive", "negative", "neutral"]

# Tokenize the texts with attention masks
tokenized_texts = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
    return_attention_mask=True
)

# Encode the labels
label_encoder = LabelEncoder()
encoded_labels = torch.tensor(label_encoder.fit_transform(labels))

# Create dataset dictionary
dataset = {
    'input_ids': tokenized_texts['input_ids'],
    'attention_mask': tokenized_texts['attention_mask'],
    'labels': encoded_labels
}

# Print dataset information
print("Dataset Structure:")
print(f"Number of examples: {len(texts)}")
print(f"Input shape: {dataset['input_ids'].shape}")
print(f"Attention mask shape: {dataset['attention_mask'].shape}")
print(f"Labels shape: {dataset['labels'].shape}")
print("\nLabel mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

# Example of accessing first sample
first_text = tokenizer.decode(dataset['input_ids'][0])
print(f"\nFirst example:")
print(f"Text: {first_text}")
print(f"Label: {labels[0]} (encoded: {dataset['labels'][0]})")
print(f"Attention mask: {dataset['attention_mask'][0][:10]}...")

Code Breakdown:

  1. Imports and Setup:
    • sklearn.preprocessing.LabelEncoder for converting text labels to numbers
    • torch for tensor operations
    • transformers for the tokenizer
    • numpy for numerical operations
  2. Data Preparation:
    • Expanded dataset with 5 examples covering different sentiments
    • Added a "neutral" class to demonstrate multi-class capability
    • Structured text and label pairs
  3. Tokenization:
    • Uses BERT tokenizer with increased max_length (32 tokens)
    • Includes padding and truncation for consistent lengths
    • Returns attention masks for proper transformer input
  4. Label Processing:
    • Converts text labels to numerical format
    • Creates a mapping between original labels and encoded values
    • Stores labels as PyTorch tensors
  5. Dataset Creation:
    • Combines input_ids, attention_masks, and labels
    • Organizes data in a format ready for model training
    • Maintains alignment between inputs and labels
  6. Information Display:
    • Shows dataset structure and dimensions
    • Displays label encoding mapping
    • Demonstrates how to access and decode individual examples

Expected Output:

Dataset Structure:
Number of examples: 5
Input shape: torch.Size([5, 32])
Attention mask shape: torch.Size([5, 32])
Labels shape: torch.Size([5])

Label mapping: {'negative': 0, 'neutral': 1, 'positive': 2}

First example:
Text: [CLS] this movie was amazing ! [SEP] [PAD] [PAD]...
Label: positive (encoded: 2)
Attention mask: tensor([1, 1, 1, 1, 1, 1, 1, 0, 0, 0])...

Token Classification (e.g., Named Entity Recognition):

For token classification tasks, each token in the input sequence must be assigned a label. Because subword tokenization can split a single word into several tokens, the word-level labels have to be realigned to the token level; the fast tokenizers in the transformers library expose a word_ids() mapping that makes this straightforward.

# Sample data for Named Entity Recognition (NER)
# One label per word: "Hugging Face" -> ORG, "New York City" -> LOC
words = ["Hugging", "Face", "is", "based", "in", "New", "York", "City", "."]
labels = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC", "I-LOC", "O"]

# Tokenize the pre-split words with the pre-initialized (fast) tokenizer
# is_split_into_words=True: keeps track of which word each subword came from
# padding="max_length": ensures all sequences have the same length
# truncation=True: cuts off text that exceeds max_length
# max_length=20: maximum number of tokens allowed
# return_tensors="pt": returns PyTorch tensors
tokenized_text = tokenizer(
    words,
    is_split_into_words=True,
    padding="max_length",
    truncation=True,
    max_length=20,
    return_tensors="pt"
)

# Align labels with the tokenized text
# This is crucial because tokenization might split words into subwords
# - Special tokens ([CLS], [SEP], [PAD]) get -100 so they are ignored in the loss
# - The first subword of each word keeps that word's label
# - Any remaining subwords of the same word also get -100
aligned_labels = []
previous_word_id = None
for word_id in tokenized_text.word_ids(batch_index=0):
    if word_id is None:
        aligned_labels.append(-100)             # special or padding token
    elif word_id != previous_word_id:
        aligned_labels.append(labels[word_id])  # first subword of a word
    else:
        aligned_labels.append(-100)             # continuation subword
    previous_word_id = word_id

# Print the aligned labels to verify the alignment
print("Aligned Labels:", aligned_labels)

# Label scheme:
# B-ORG / I-ORG: beginning / inside of an Organization entity (Hugging Face)
# B-LOC / I-LOC: beginning / inside of a Location entity (New York City)
# O: outside any entity (is, based, in, and the final period)

Here's roughly what the output would look like (the exact positions of the -100 markers depend on how the tokenizer splits each word into subwords):

Aligned Labels: [-100, 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC', 'I-LOC', 'I-LOC', 'O', -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]

This output shows the NER (Named Entity Recognition) labels aligned with the tokens from the input text "Hugging Face is based in New York City.", where:

  • Hugging Face is labeled as an organization (B-ORG, I-ORG)
  • New York City is labeled as a location (B-LOC, I-LOC, I-LOC)
  • The remaining words (is, based, in) and the final period are labeled as outside entities (O)
  • Special tokens ([CLS], [SEP]) and padding positions receive -100 so they are ignored during loss computation

Data preprocessing is a crucial step in preparing text for transformer models, serving as the foundation for successful model training and deployment. This phase involves several critical components:

First, proper tokenization breaks down text into meaningful units that the model can process. This includes handling word boundaries, special characters, and subword tokenization strategies that help manage vocabulary size while preserving semantic meaning.

Second, padding and truncation ensure consistent input sizes. Padding adds special tokens to shorter sequences to match a target length, while truncation carefully removes excess tokens from longer sequences while preserving essential information.

Third, the alignment of labels with tokenized input is essential for supervised learning tasks. This process requires careful attention to maintain the relationship between input tokens and their corresponding labels, especially when dealing with subword tokenization.

Additionally, preprocessing includes crucial steps like handling out-of-vocabulary words, managing special tokens (such as [CLS] and [SEP] for BERT models), and implementing appropriate masking strategies for different model architectures.

Mastering these preprocessing techniques is vital as they directly impact model performance. Proper implementation helps avoid common pitfalls like misaligned labels, inconsistent sequence lengths, or lost contextual information. When done correctly, these steps create clean, well-structured input that allows transformer models to achieve their optimal performance.

3.1 Data Preprocessing for Transformer Models

Fine-tuning transformer models has become the industry standard method for adapting pretrained language models to specialized NLP tasks. This process involves taking a model that has been trained on a large corpus of general text data and further training it on task-specific data to optimize its performance. While powerful pretrained models like BERT (Bidirectional Encoder Representations from Transformers), GPT (Generative Pre-trained Transformer), and T5 (Text-to-Text Transfer Transformer) demonstrate impressive capabilities in understanding and generating human language, they typically require additional fine-tuning to excel at specific applications, such as sentiment analysis, document classification, or specialized translation tasks. 

The fine-tuning process involves several key components that we'll explore in detail throughout this chapter. We begin with data preprocessing, which is crucial for ensuring your input data is correctly formatted and tokenized for transformer models. This includes cleaning the text, handling special characters, and converting words into the numerical representations that these models can process.

Following preprocessing, we'll examine advanced fine-tuning techniques that have revolutionized the field. These include LoRA (Low-Rank Adaptation), which efficiently adapts large models by updating a small number of parameters, and Prefix Tuning, which prepends learnable tokens to the input while keeping the original model frozen. We'll also cover comprehensive evaluation strategies using industry-standard metrics: BLEU (Bilingual Evaluation Understudy) for measuring translation quality, ROUGE (Recall-Oriented Understudy for Gisting Evaluation) for assessing text summarization, and BERTScore for semantic similarity evaluation.

By the end of this chapter, you will possess a comprehensive understanding of the entire fine-tuning pipeline: from preparing your datasets and selecting appropriate training strategies, to implementing effective fine-tuning techniques and rigorously evaluating model performance using multiple metrics. This knowledge will enable you to adapt transformer models to your specific use cases while maintaining efficiency and accuracy.

Data preprocessing is a critical step when working with transformer models, serving as the foundation for successful model training and deployment. This process involves several key transformations of raw text data. First, transformers require text inputs to be tokenized - broken down into smaller units like words or subwords - and converted into numerical representations (typically vectors) that the model can process mathematically. This tokenization process can use different approaches such as WordPiece, Byte-Pair Encoding (BPE), or SentencePiece, each with its own advantages for different languages and use cases.

Beyond basic tokenization, attention masks play a crucial role in efficient processing. These binary masks tell the model which tokens are actual input data and which are padding tokens (used to make all sequences in a batch the same length). This distinction is essential because it prevents the model from wasting computational resources on padding tokens and ensures that padding doesn't influence the model's understanding of the actual content.

Furthermore, proper label encoding is essential for supervised learning tasks. Whether you're working on classification (converting categorical labels to numerical values), sequence labeling (assigning labels to individual tokens), or more complex tasks, the labels must be encoded in a format that aligns with the model's architecture and training objectives.

In this section, we will cover three fundamental aspects of preprocessing:

  1. Tokenization and Padding - Converting text to tokens and ensuring uniform sequence lengths
  2. Handling Long Sequences - Strategies for managing text that exceeds the model's maximum input length
  3. Preprocessing for Specific Tasks - Task-specific considerations and requirements

3.1.1 Tokenization and Padding

Tokenization is a fundamental preprocessing step that transforms raw text into a format that transformer models can process. This process breaks down text into smaller units called tokens, which can be:

  • Words (e.g., "hello", "world")
  • Subwords (e.g., "play", "##ing", where "##" indicates a continuation)
  • Individual characters (particularly useful for character-based languages)

For example, consider the sentence "transformers are amazing". Using subword tokenization, it might be broken down as:

  1. "transform" (root word)
  2. "##ers" (suffix)
  3. "are" (complete word)
  4. "amazing" (complete word)

These tokens are then mapped to unique numerical IDs using a vocabulary lookup table. For instance:

  • "transform" → 19081
  • "##ers" → 2024
  • "are" → 2003
  • "amazing" → 6429

This numerical representation is essential because neural networks can only process numbers, not text directly.

Padding is another crucial preprocessing step that addresses a technical requirement of transformer models: batch processing. Since neural networks process multiple sequences simultaneously for efficiency, all sequences in a batch must have the same length. Here's how padding works:

  1. First, identify the longest sequence in your batch
  2. Add special padding tokens ([PAD] or 0) to shorter sequences
  3. Create an attention mask to tell the model which tokens are real and which are padding

For example, if we have these sequences:

  • "Hello world" (2 tokens)
  • "The quick brown fox jumps" (5 tokens)

The padding process would:

  1. Make both sequences 5 tokens long
  2. "Hello world [PAD] [PAD] [PAD]"
  3. "The quick brown fox jumps"

This ensures uniform processing while maintaining the integrity of the original sequences through attention masks that tell the model to ignore the padding tokens during computation.

Example: Tokenization and Padding with BERT

from transformers import BertTokenizer
import torch

# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Define sample texts of different lengths
texts = [
    "Transformers are amazing!",
    "They are used for many NLP tasks.",
    "This is a longer sentence that will show padding in action."
]

# Tokenize the texts with different parameters
# 1. Basic tokenization
basic_tokens = tokenizer(texts[0])
print("\n1. Basic tokenization:")
print(f"Tokens: {tokenizer.convert_ids_to_tokens(basic_tokens['input_ids'])}")

# 2. Batch tokenization with padding
batch_tokenized = tokenizer(
    texts,
    padding=True,  # Add padding
    truncation=True,  # Enable truncation
    max_length=12,  # Set maximum length
    return_tensors="pt"  # Return PyTorch tensors
)

print("\n2. Batch tokenization results:")
print("Input IDs:")
print(batch_tokenized["input_ids"])
print("\nAttention Masks:")
print(batch_tokenized["attention_mask"])

# 3. Decode back to text
print("\n3. Decoded text from tokens:")
for i in range(len(texts)):
    decoded = tokenizer.decode(batch_tokenized["input_ids"][i])
    print(f"Original: {texts[i]}")
    print(f"Decoded:  {decoded}\n")

Detailed Breakdown:

  1. Importing and Setup:
    • We import both the BertTokenizer and torch
    • Initialize the BERT tokenizer with the uncased model variant
  2. Basic Tokenization:
    • Shows how a single sentence is tokenized
    • Demonstrates token-to-text conversion for better understanding
  3. Batch Processing:
    • Processes multiple sentences of different lengths
    • Uses padding to make all sequences uniform length
    • Sets max_length=12 to demonstrate truncation
  4. Key Parameters:
    • padding=True: Adds padding tokens to shorter sequences
    • truncation=True: Cuts longer sequences to max_length
    • return_tensors="pt": Returns PyTorch tensors instead of lists
  5. Output Explanation:
    • input_ids: Numerical representations of tokens
    • attention_mask: 1s for real tokens, 0s for padding
    • Decoded text shows how the model reconstructs the original input

Explanation:

  • input_ids: Tokenized representation of the input text.
  • attention_mask: Binary mask indicating which tokens are actual input (1) and which are padding (0).

Output:

1. Basic tokenization:
Tokens: ['[CLS]', 'transformers', 'are', 'amazing', '!', '[SEP]']

2. Batch tokenization results:
Input IDs:
tensor([[  101,  2234,  2024,  6429,   999,   102,     0,     0,     0,     0,
            0,     0],
        [  101,  2027,  2024,  2107,  2005,  2116,  3319,  2202,   999,   102,
            0,     0],
        [  101,  2023,  2003,  1037,  2208,  6251,  2008,  2097,  4058,  1999,
         2039,   102]])

Attention Masks:
tensor([[1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])

3. Decoded text from tokens:
Original: Transformers are amazing!
Decoded: [CLS] transformers are amazing ! [SEP]

Original: They are used for many NLP tasks.
Decoded: [CLS] they are used for many nlp tasks . [SEP]

Original: This is a longer sentence that will show padding in action.
Decoded: [CLS] this is a longer sentence that will show [SEP]

3.1.2 Handling Long Sequences

Transformers have a maximum input length limitation due to their self-attention mechanism, which grows quadratically with sequence length. This limitation exists because the self-attention mechanism computes attention scores between every pair of tokens in the sequence, resulting in a computational complexity of O(n²), where n is the sequence length. As the sequence length grows, both memory usage and computational requirements increase dramatically.

For example, BERT has a maximum sequence length of 512 tokens, while GPT models typically handle 1024 or 2048 tokens. This means that for BERT, processing a sequence of 512 tokens requires computing and storing a 512 x 512 attention matrix for each attention head in each transformer layer. GPT models can handle longer sequences but still face similar computational constraints.

When dealing with texts that exceed these limits, there are two main approaches:

  1. Truncation: Simply cutting off the text at the maximum length. While straightforward, this may lose important information. This approach works best when:
    • The most relevant information appears at the beginning of the text
    • The task only requires understanding the general context rather than specific details
    • Processing speed is a priority over completeness
  2. Chunking: Splitting the text into overlapping or non-overlapping segments that fit within the length limit. This preserves all information but requires strategies for combining the results from multiple chunks. Common chunking strategies include:
    • Sliding window: Creating overlapping chunks with a fixed stride length
    • Sentence-based splitting: Breaking text at natural sentence boundaries
    • Hierarchical processing: Processing chunks individually and then combining results

The choice between these approaches depends on your specific task - truncation might work well for classification, while chunking is often necessary for tasks like document summarization or question answering. For example, in sentiment analysis, the overall sentiment might be captured well enough in the first few hundred tokens, making truncation acceptable. However, for tasks like document summarization or question answering where important information could be anywhere in the text, chunking becomes essential to ensure no critical information is lost.

Example: Truncating Long Sequences

# Define a long text sample
long_text = "Transformers are incredibly versatile models that have revolutionized the field of NLP. " * 20

# Initialize tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# 1. Basic tokenization without truncation
tokenized_full = tokenizer(long_text, truncation=False, return_tensors="pt")
print("\n1. Full text tokenization:")
print(f"Original sequence length: {tokenized_full['input_ids'].shape[1]} tokens")

# 2. Tokenization with truncation
tokenized_truncated = tokenizer(
    long_text,
    truncation=True,
    max_length=512,
    return_tensors="pt"
)
print("\n2. Truncated tokenization:")
print(f"Truncated sequence length: {tokenized_truncated['input_ids'].shape[1]} tokens")

# 3. Sliding window approach
def create_sliding_windows(text, window_size=256, stride=128):
    tokenized = tokenizer(text, return_tensors="pt")
    input_ids = tokenized["input_ids"][0]
    
    windows = []
    for i in range(0, len(input_ids), stride):
        window = input_ids[i:i + window_size]
        if len(window) < window_size:  # Pad last window if needed
            padding = window_size - len(window)
            window = torch.cat([window, torch.zeros(padding, dtype=torch.long)])
        windows.append(window)
    
    return torch.stack(windows)

# Apply sliding window
sliding_windows = create_sliding_windows(long_text)
print("\n3. Sliding window approach:")
print(f"Number of windows: {len(sliding_windows)}")
print(f"Window shape: {sliding_windows.shape}")

# 4. Demonstrate window content
print("\n4. Content of first window:")
first_window_text = tokenizer.decode(sliding_windows[0])
print(first_window_text[:100] + "...")

Code Breakdown:

  1. Text Preparation:
    • Creates a long text sample by repeating a sentence 20 times
    • Initializes the BERT tokenizer for processing
  2. Full Tokenization:
    • Shows the original sequence length without truncation
    • Helps understand how much text exceeds the model's limits
  3. Truncation Approach:
    • Implements standard truncation at 512 tokens (BERT's limit)
    • Demonstrates the basic way to handle long sequences
  4. Sliding Window Implementation:
    • Creates overlapping windows of text (window_size=256, stride=128)
    • Allows processing of the entire text in manageable chunks
    • Includes padding for the last window if needed
  5. Window Content Display:
    • Shows the actual content of the first window
    • Helps verify the windowing process works correctly

Output:

1. Full text tokenization:
Original sequence length: ~400 tokens

2. Truncated tokenization:
Truncated sequence length: 512 tokens

3. Sliding window approach:
Number of windows: ~4
Window shape: torch.Size([4, 256])

4. Content of first window:
[CLS] transformers are incredibly versatile models that have revolutionized the field of nlp. transformers are...

Note: The exact numbers would vary based on the actual tokenization of the repeated sentence, but this represents the expected structure of the output given the code's logic.

Example: Splitting Long Text into Chunks

# Function to split long text into chunks with overlap
def split_text_into_chunks(text, max_length=128, overlap=20):
    # Tokenize the text
    tokenized = tokenizer(text, truncation=False, return_tensors="pt")
    input_ids = tokenized["input_ids"][0]
    attention_mask = tokenized["attention_mask"][0]
    
    chunks = []
    chunk_masks = []
    
    # Create chunks with overlap
    for i in range(0, len(input_ids), max_length - overlap):
        # Extract chunk
        chunk = input_ids[i:i + max_length]
        mask = attention_mask[i:i + max_length]
        
        # Pad if necessary
        if len(chunk) < max_length:
            padding_length = max_length - len(chunk)
            chunk = torch.cat([chunk, torch.zeros(padding_length, dtype=torch.long)])
            mask = torch.cat([mask, torch.zeros(padding_length, dtype=torch.long)])
        
        chunks.append(chunk)
        chunk_masks.append(mask)
    
    return {
        "input_ids": torch.stack(chunks),
        "attention_mask": torch.stack(chunk_masks)
    }

# Example usage
long_text = "This is a very long text that needs to be split into chunks. " * 20
chunks = split_text_into_chunks(long_text, max_length=128, overlap=20)

# Print information about chunks
print(f"Number of chunks: {len(chunks['input_ids'])}")
print(f"Chunk size: {chunks['input_ids'].shape}")

# Decode and print first chunk to verify content
first_chunk = tokenizer.decode(chunks['input_ids'][0])
print("\nFirst chunk content:")
print(first_chunk[:100], "...")

# Print overlap between chunks to verify
if len(chunks['input_ids']) > 1:
    overlap_first = tokenizer.decode(chunks['input_ids'][0][-20:])
    overlap_second = tokenizer.decode(chunks['input_ids'][1][:20])
    print("\nOverlap demonstration:")
    print("End of first chunk:", overlap_first)
    print("Start of second chunk:", overlap_second)

Code Breakdown:

  1. Function Parameters:
    • max_length: Maximum number of tokens per chunk (default: 128)
    • overlap: Number of overlapping tokens between chunks (default: 20)
  2. Key Components:
    • Tokenization: Converts input text to token IDs and attention masks
    • Chunk Creation: Creates overlapping chunks of specified length
    • Padding: Ensures all chunks are of equal length
    • Return Format: Dictionary with input_ids and attention_mask tensors
  3. Important Features:
    • Overlap handling prevents loss of context between chunks
    • Attention masks track valid tokens vs padding
    • Maintains compatibility with transformer model input requirements
  4. Verification Steps:
    • Prints number and size of chunks
    • Shows content of first chunk
    • Demonstrates overlap between consecutive chunks

Output:

Number of chunks: 3
Chunk size: torch.Size([3, 128])

First chunk content:
This is a very long text that needs to be split into chunks. This is a very long text that needs to be split...

Overlap demonstration:
End of first chunk: split into chunks.
Start of second chunk: chunks. This is a v

The exact number of chunks and content may vary depending on the actual tokenization, but this demonstrates the key output components showing:

  • The number of chunks created from the input text
  • The dimension of the chunks tensor
  • A sample of the first chunk's content
  • The overlapping region between consecutive chunks

3.1.3 Preprocessing for Specific Tasks

Different NLP tasks require specific preprocessing steps to ensure optimal model performance. This preprocessing phase is crucial as it transforms raw text data into a format that transformer models can effectively process. The preprocessing pipeline must be carefully designed to handle the unique characteristics of each task while maintaining data integrity and model compatibility.

The exact preprocessing steps vary significantly depending on several key factors:

  • Task Type:
    • Classification tasks require balanced datasets and appropriate label encoding
    • Generation tasks need careful handling of start/end tokens and sequence formatting
    • Translation tasks must align source and target language pairs effectively
    • Question-answering tasks require proper context and question formatting
  • Model Architecture:
    • BERT-based models need special [CLS] and [SEP] tokens
    • GPT models require specific attention to end-of-sequence tokens
    • T5 models need task-specific prefixes
  • Dataset Requirements:
    • Data cleaning and normalization standards
    • Handling of special characters and formatting
    • Domain-specific terminology processing

Common preprocessing steps form the foundation of any NLP pipeline:

  • Tokenization: Converting text into tokens that the model can process
    • Word-level: Splitting text into individual words
    • Subword-level: Breaking words into meaningful subunits
    • Character-level: Processing text as individual characters
  • Sequence Length Adjustment:
    • Padding shorter sequences to a fixed length
    • Truncating longer sequences to fit model constraints
    • Implementing dynamic batching strategies
  • Label Encoding:
    • Converting categorical labels to numerical format
    • Implementing one-hot encoding where appropriate
    • Handling multi-label scenarios
  • Special Token Handling:
    • Adding task-specific tokens
    • Managing separator and classification tokens
    • Implementing masking strategies

Additionally, task-specific considerations require careful attention:

  • Classification Tasks:
    • Handling class imbalance through sampling or weighting
    • Implementing stratification strategies
  • Long Document Processing:
    • Implementing sliding windows with appropriate overlap
    • Managing document segmentation
    • Maintaining context across segments

Here are examples for two common tasks:

Text Classification:

For classification, text inputs need to be tokenized, and their corresponding labels should be encoded.

from sklearn.preprocessing import LabelEncoder
import torch
from transformers import AutoTokenizer
import numpy as np

# Initialize tokenizer (e.g., BERT)
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

# Sample data
texts = [
    "This movie was amazing!",
    "I did not like the ending.",
    "A masterpiece of modern cinema",
    "Waste of time and money",
    "It was just okay, nothing special"
]
labels = ["positive", "negative", "positive", "negative", "neutral"]

# Tokenize the texts with attention masks
tokenized_texts = tokenizer(
    texts,
    padding="max_length",
    truncation=True,
    max_length=32,
    return_tensors="pt",
    return_attention_mask=True
)

# Encode the labels
label_encoder = LabelEncoder()
encoded_labels = torch.tensor(label_encoder.fit_transform(labels))

# Create dataset dictionary
dataset = {
    'input_ids': tokenized_texts['input_ids'],
    'attention_mask': tokenized_texts['attention_mask'],
    'labels': encoded_labels
}

# Print dataset information
print("Dataset Structure:")
print(f"Number of examples: {len(texts)}")
print(f"Input shape: {dataset['input_ids'].shape}")
print(f"Attention mask shape: {dataset['attention_mask'].shape}")
print(f"Labels shape: {dataset['labels'].shape}")
print("\nLabel mapping:", dict(zip(label_encoder.classes_, label_encoder.transform(label_encoder.classes_))))

# Example of accessing first sample
first_text = tokenizer.decode(dataset['input_ids'][0])
print(f"\nFirst example:")
print(f"Text: {first_text}")
print(f"Label: {labels[0]} (encoded: {dataset['labels'][0]})")
print(f"Attention mask: {dataset['attention_mask'][0][:10]}...")

Code Breakdown:

  1. Imports and Setup:
    • sklearn.preprocessing.LabelEncoder for converting text labels to numbers
    • torch for tensor operations
    • transformers for the tokenizer
    • numpy for numerical operations
  2. Data Preparation:
    • Expanded dataset with 5 examples covering different sentiments
    • Added a "neutral" class to demonstrate multi-class capability
    • Structured text and label pairs
  3. Tokenization:
    • Uses BERT tokenizer with increased max_length (32 tokens)
    • Includes padding and truncation for consistent lengths
    • Returns attention masks for proper transformer input
  4. Label Processing:
    • Converts text labels to numerical format
    • Creates a mapping between original labels and encoded values
    • Stores labels as PyTorch tensors
  5. Dataset Creation:
    • Combines input_ids, attention_masks, and labels
    • Organizes data in a format ready for model training
    • Maintains alignment between inputs and labels
  6. Information Display:
    • Shows dataset structure and dimensions
    • Displays label encoding mapping
    • Demonstrates how to access and decode individual examples

Expected Output:

Dataset Structure:
Number of examples: 5
Input shape: torch.Size([5, 32])
Attention mask shape: torch.Size([5, 32])
Labels shape: torch.Size([5])

Label mapping: {'negative': 0, 'neutral': 1, 'positive': 2}

First example:
Text: [CLS] this movie was amazing! [SEP] [PAD]...
Label: positive (encoded: 2)
Attention mask: tensor([1, 1, 1, 1, 1, 1, 0, 0, 0, 0]...)

Token Classification (e.g., Named Entity Recognition):

For token classification tasks, each token in the input sequence must be assigned a label.

# Sample data for Named Entity Recognition (NER)
text = "Hugging Face is based in New York City."
labels = ["B-ORG", "I-ORG", "O", "O", "O", "B-LOC", "I-LOC"]

# Tokenize the text using the pre-initialized tokenizer
# padding="max_length": Ensures all sequences have the same length
# truncation=True: Cuts off text that exceeds max_length
# max_length=20: Maximum number of tokens allowed
# return_tensors="pt": Returns PyTorch tensors
tokenized_text = tokenizer(text, padding="max_length", truncation=True, max_length=20, return_tensors="pt")

# Align labels with tokenized text
# This is crucial because tokenization might split words into subwords
# - If a token starts with "##", it's a subword token (BERT-specific)
# - We assign -100 to subword tokens to ignore them in loss calculation
# - Other tokens retain their original NER labels
aligned_labels = [-100 if token.startswith("##") else label for token, label in zip(tokenized_text["input_ids"][0], labels)]

# Print the aligned labels to verify the alignment
print("Aligned Labels:", aligned_labels)

# Label explanation:
# B-ORG: Beginning of Organization entity (Hugging Face)
# I-ORG: Inside of Organization entity (Face)
# O: Outside any entity (is, based, in)
# B-LOC: Beginning of Location entity (New York)
# I-LOC: Inside of Location entity (York City)

Here's what the output would look like:

Aligned Labels: ['B-ORG', 'I-ORG', 'O', 'O', 'O', 'B-LOC', 'I-LOC']

This output shows the NER (Named Entity Recognition) labels aligned with the tokens from the input text "Hugging Face is based in New York City", where:

  • Hugging Face is labeled as an organization (B-ORG, I-ORG)
  • New York City is labeled as a location (B-LOC, I-LOC)
  • The remaining words (is, based, in) are labeled as outside entities (O)

Data preprocessing is a crucial step in preparing text for transformer models, serving as the foundation for successful model training and deployment. This phase involves several critical components:

First, proper tokenization breaks down text into meaningful units that the model can process. This includes handling word boundaries, special characters, and subword tokenization strategies that help manage vocabulary size while preserving semantic meaning.

Second, padding and truncation ensure consistent input sizes. Padding adds special tokens to shorter sequences to match a target length, while truncation carefully removes excess tokens from longer sequences while preserving essential information.

Third, the alignment of labels with tokenized input is essential for supervised learning tasks. This process requires careful attention to maintain the relationship between input tokens and their corresponding labels, especially when dealing with subword tokenization.

Additionally, preprocessing includes crucial steps like handling out-of-vocabulary words, managing special tokens (such as [CLS] and [SEP] for BERT models), and implementing appropriate masking strategies for different model architectures.

Mastering these preprocessing techniques is vital as they directly impact model performance. Proper implementation helps avoid common pitfalls like misaligned labels, inconsistent sequence lengths, or lost contextual information. When done correctly, these steps create clean, well-structured input that allows transformer models to achieve their optimal performance.
