Project 1: Build a Toy Transformer from Scratch in PyTorch
4. Training Loop (Causal LM)
The training loop is a crucial component of any language model implementation. In this section, we set up the process for training our TinyGPT model on causal language modeling: predicting the next token given the previous tokens.
Key Components of the Training Loop:
- Loss Function: Uses cross-entropy loss, which measures the difference between predicted token probabilities and actual next tokens
- Model Initialization: Creates our TinyGPT model and moves it to the specified device (CPU/GPU)
- Optimizer: Uses AdamW with learning rate 3e-4 and weight decay 0.01 for regularization
- Training Parameters: Uses a context window (block_size) of 64 tokens and batch size of 32
The Training Process:
- Get a batch of training data (x = input tokens, y = target tokens)
- Forward pass through the model to get logits (raw prediction scores)
- Calculate loss by comparing predictions to actual next tokens
- Zero the gradients, run backpropagation, and clip the gradients to keep them from exploding
- Update model parameters using the optimizer
- Periodically evaluate on validation data to monitor progress
Gradient clipping (set to 1.0) prevents unstable updates by limiting gradient magnitudes, which is especially important in transformers that can suffer from training instability.
This basic training loop can be extended with learning rate scheduling, more sophisticated evaluation metrics, or early stopping based on validation performance; a scheduler sketch appears at the end of this section.
import torch
import torch.nn.functional as F  # (these may already be imported from earlier sections)

def loss_fn(logits, targets):
    # Flatten time+batch for cross-entropy
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
block_size = 64

for step in range(500):  # small demo; increase for better quality
    model.train()
    x, y = get_batch("train", block_size=block_size, batch_size=32)
    logits = model(x)
    loss = loss_fn(logits, y)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 50 == 0:
        model.eval()
        with torch.no_grad():
            vx, vy = get_batch("val", block_size=block_size, batch_size=32)
            vloss = loss_fn(model(vx), vy).item()
        print(f"step {step:04d} | train loss {loss.item():.3f} | val loss {vloss:.3f}")
Here's a comprehensive breakdown of the training loop code for the TinyGPT model:
Loss Function Definition
The code begins by defining a loss function that calculates cross-entropy loss between the model's predictions (logits) and the target tokens:
def loss_fn(logits, targets):
    # Flatten time+batch for cross-entropy
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
This function:
- Reshapes the logits from [batch_size, sequence_length, vocab_size] to [batch_size*sequence_length, vocab_size]
- Reshapes targets from [batch_size, sequence_length] to [batch_size*sequence_length]
- Computes cross-entropy loss, which measures how well the model predicts the next token in the sequence
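To make the reshaping concrete, here is a small self-contained shape check; the batch size, sequence length, and vocabulary size below are arbitrary illustration values, not the training settings used above:

import torch
import torch.nn.functional as F

B, T, V = 4, 8, 50                      # illustrative batch, sequence, vocab sizes
logits = torch.randn(B, T, V)           # [4, 8, 50] raw scores, one row per position
targets = torch.randint(0, V, (B, T))   # [4, 8] token ids the model should predict

flat_logits = logits.view(-1, V)        # [32, 50]
flat_targets = targets.view(-1)         # [32]
loss = F.cross_entropy(flat_logits, flat_targets)  # mean loss over all 32 positions
print(flat_logits.shape, flat_targets.shape, loss.item())

Flattening works because cross-entropy treats every (batch, position) pair as an independent classification over the vocabulary.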
Model and Optimizer Setup
Next, the code initializes the TinyGPT model and moves it to the specified device (CPU/GPU), then sets up the AdamW optimizer:
model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
The optimizer uses:
- Learning rate of 3e-4 (0.0003), which is a common default for transformer models
- Weight decay of 0.01 for regularization to prevent overfitting
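One optional refinement seen in many GPT training scripts is to exclude biases and LayerNorm parameters from weight decay. This is not part of the loop above; a minimal sketch of the idea, reusing the model defined earlier:

decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},     # weight matrices and embeddings
     {"params": no_decay, "weight_decay": 0.0}],  # biases and LayerNorm gains
    lr=3e-4,
)

The dimension check is a common heuristic: parameters with two or more dimensions are weight matrices and embeddings, while 1-D parameters are biases and normalization gains.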
Training Configuration
The code sets up training parameters:
block_size = 64
The block_size (64) determines the context window - how many previous tokens the model can see when predicting the next token.
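get_batch itself was defined in an earlier part of the project. For reference, a typical implementation of this pattern looks roughly like the sketch below, where train_data and val_data are assumed to be 1-D tensors of token ids; the important detail is that y is the same window as x shifted forward by one token.

def get_batch(split, block_size, batch_size):
    # Sketch only; the project's actual get_batch may differ in details.
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))        # random window starts
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # next-token targets
    return x.to(device), y.to(device)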
Training Loop
The main training loop runs for 500 iterations (though the comment suggests this is just a small demo):
for step in range(500): # small demo; increase for better quality
Each training iteration follows these steps:
1. Prepare for training
model.train()
x, y = get_batch("train", block_size=block_size, batch_size=32)
- Sets the model to training mode (enables dropout and other train-time-only behavior)
- Gets a batch of training data with context length of 64 tokens and batch size of 32
- x contains the input tokens; y contains the targets, i.e. the same sequence shifted one position so that each position's target is its next token
2. Forward pass
logits = model(x)
loss = loss_fn(logits, y)
- Passes input tokens through the model to get prediction logits
- Calculates loss by comparing predictions to actual next tokens
3. Backward pass and optimization
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
- Clears the previous gradients with optimizer.zero_grad(); set_to_none=True releases them rather than zero-filling, which is slightly more efficient
- Computes gradients with loss.backward()
- Rescales the gradients whenever their global norm exceeds 1.0, guarding against exploding gradients
- Updates model parameters with optimizer.step()
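A handy detail: clip_grad_norm_ returns the total gradient norm computed before clipping, so it costs nothing extra to log it and see how often clipping actually fires. A variant of the same optimization step with that logging added:

optimizer.zero_grad(set_to_none=True)
loss.backward()
# clip_grad_norm_ returns the global norm of all gradients before rescaling
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0).item()
optimizer.step()
if grad_norm > 1.0:
    print(f"step {step}: gradient norm {grad_norm:.2f} was clipped to 1.0")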
4. Evaluation
if step % 50 == 0:
model.eval()
with torch.no_grad():
vx, vy = get_batch("val", block_size=block_size, batch_size=32)
vloss = loss_fn(model(vx), vy).item()
print(f"step {step:04d} | train loss {loss.item():.3f} | val loss {vloss:.3f}")
- Every 50 steps, evaluates model performance on validation data
- Sets model to evaluation mode (disables dropout, etc.)
- Uses torch.no_grad() to disable gradient calculation during evaluation (saves memory)
- Gets a validation batch and computes validation loss
- Prints training progress with step number, training loss, and validation loss
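A single validation batch gives a noisy estimate, so a common refinement is to average the loss over several batches for both splits. A minimal sketch of such a helper (the batch count of 20 is an arbitrary choice):

@torch.no_grad()
def estimate_loss(num_batches=20):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(num_batches)
        for k in range(num_batches):
            bx, by = get_batch(split, block_size=block_size, batch_size=32)
            losses[k] = loss_fn(model(bx), by)
        out[split] = losses.mean().item()
    model.train()
    return out

Calling estimate_loss() every 50 steps in place of the single-batch check gives smoother, more trustworthy curves.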
This training loop implements core deep learning practices including train/validation splits, gradient clipping, and periodic evaluation to monitor for overfitting.
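As noted in the overview, learning rate scheduling is one of the easiest extensions. A minimal sketch using PyTorch's built-in cosine schedule; the schedule choice and the 500-step horizon are illustrative, not part of the original loop:

max_steps = 500
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)

for step in range(max_steps):
    model.train()
    x, y = get_batch("train", block_size=block_size, batch_size=32)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()  # decay the learning rate once per optimizer step

Early stopping can be layered on top by tracking the best validation loss (for example from the estimate_loss helper above) and breaking out of the loop when it stops improving for a set number of evaluations.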

