Project 1: Build a Toy Transformer from Scratch in PyTorch
4. Training Loop (Causal LM)
The training loop is a crucial component of any language model implementation. In this section, we set up the process for training our TinyGPT model on causal language modeling: predicting the next token given the previous tokens.
Key Components of the Training Loop:
- Loss Function: Uses cross-entropy loss, which measures the difference between predicted token probabilities and actual next tokens
- Model Initialization: Creates our TinyGPT model and moves it to the specified device (CPU/GPU)
- Optimizer: Uses AdamW with learning rate 3e-4 and weight decay 0.01 for regularization
- Training Parameters: Uses a context window (block_size) of 64 tokens and batch size of 32
The Training Process:
- Get a batch of training data (x = input tokens, y = target tokens)
- Forward pass through the model to get logits (raw prediction scores)
- Calculate loss by comparing predictions to actual next tokens
- Zero the gradients, run backpropagation, and clip the gradients to keep them from exploding
- Update model parameters using the optimizer
- Periodically evaluate on validation data to monitor progress
Gradient clipping (set to 1.0) prevents unstable updates by limiting gradient magnitudes, which is especially important in transformers that can suffer from training instability.
This basic training loop can be extended with learning rate scheduling, more sophisticated evaluation metrics, or early stopping based on validation performance; a scheduler sketch appears at the end of this section.
import torch
import torch.nn.functional as F  # (these may already be imported from earlier sections)

def loss_fn(logits, targets):
    # Flatten time+batch for cross-entropy
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
block_size = 64

for step in range(500):  # small demo; increase for better quality
    model.train()
    x, y = get_batch("train", block_size=block_size, batch_size=32)
    logits = model(x)
    loss = loss_fn(logits, y)

    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()

    if step % 50 == 0:
        model.eval()
        with torch.no_grad():
            vx, vy = get_batch("val", block_size=block_size, batch_size=32)
            vloss = loss_fn(model(vx), vy).item()
        print(f"step {step:04d} | train loss {loss.item():.3f} | val loss {vloss:.3f}")
Here's a comprehensive breakdown of the training loop code for the TinyGPT model:
Loss Function Definition
The code begins by defining a loss function that calculates cross-entropy loss between the model's predictions (logits) and the target tokens:
def loss_fn(logits, targets):
    # Flatten time+batch for cross-entropy
    return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
This function:
- Reshapes the logits from [batch_size, sequence_length, vocab_size] to [batch_size*sequence_length, vocab_size]
- Reshapes targets from [batch_size, sequence_length] to [batch_size*sequence_length]
- Computes cross-entropy loss, which measures how well the model predicts the next token in the sequence
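To make the reshaping concrete, here is a small self-contained shape check; the batch size, sequence length, and vocabulary size below are arbitrary illustration values, not the training settings used above:

import torch
import torch.nn.functional as F

B, T, V = 4, 8, 50                      # illustrative batch, sequence, vocab sizes
logits = torch.randn(B, T, V)           # [4, 8, 50] raw scores, one row per position
targets = torch.randint(0, V, (B, T))   # [4, 8] token ids the model should predict

flat_logits = logits.view(-1, V)        # [32, 50]
flat_targets = targets.view(-1)         # [32]
loss = F.cross_entropy(flat_logits, flat_targets)  # mean loss over all 32 positions
print(flat_logits.shape, flat_targets.shape, loss.item())

Flattening works because cross-entropy treats every (batch, position) pair as an independent classification over the vocabulary.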
Model and Optimizer Setup
Next, the code initializes the TinyGPT model and moves it to the specified device (CPU/GPU), then sets up the AdamW optimizer:
model = TinyGPT(vocab_size).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.01)
The optimizer uses:
- Learning rate of 3e-4 (0.0003), which is a common default for transformer models
- Weight decay of 0.01 for regularization to prevent overfitting
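One optional refinement seen in many GPT training scripts is to exclude biases and LayerNorm parameters from weight decay. This is not part of the loop above; a minimal sketch of the idea, reusing the model defined earlier:

decay = [p for p in model.parameters() if p.requires_grad and p.dim() >= 2]
no_decay = [p for p in model.parameters() if p.requires_grad and p.dim() < 2]
optimizer = torch.optim.AdamW(
    [{"params": decay, "weight_decay": 0.01},     # weight matrices and embeddings
     {"params": no_decay, "weight_decay": 0.0}],  # biases and LayerNorm gains
    lr=3e-4,
)

The dimension check is a common heuristic: parameters with two or more dimensions are weight matrices and embeddings, while 1-D parameters are biases and normalization gains.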
Training Configuration
The code sets up training parameters:
block_size = 64
The block_size (64) determines the context window - how many previous tokens the model can see when predicting the next token.
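get_batch itself was defined in an earlier part of the project. For reference, a typical implementation of this pattern looks roughly like the sketch below, where train_data and val_data are assumed to be 1-D tensors of token ids; the important detail is that y is the same window as x shifted forward by one token.

def get_batch(split, block_size, batch_size):
    # Sketch only; the project's actual get_batch may differ in details.
    data = train_data if split == "train" else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))        # random window starts
    x = torch.stack([data[i : i + block_size] for i in ix])          # inputs
    y = torch.stack([data[i + 1 : i + block_size + 1] for i in ix])  # next-token targets
    return x.to(device), y.to(device)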
Training Loop
The main training loop runs for 500 iterations (though the comment suggests this is just a small demo):
for step in range(500): # small demo; increase for better quality
Each training iteration follows these steps:
1. Prepare for training
model.train()
x, y = get_batch("train", block_size=block_size, batch_size=32)
- Sets the model to training mode (enables dropout and other train-time-only behavior)
- Gets a batch of training data with context length of 64 tokens and batch size of 32
- x contains the input tokens; y contains the targets, i.e. the same sequence shifted one position so that each position's target is its next token
2. Forward pass
logits = model(x)
loss = loss_fn(logits, y)
- Passes input tokens through the model to get prediction logits
- Calculates loss by comparing predictions to actual next tokens
3. Backward pass and optimization
optimizer.zero_grad(set_to_none=True)
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
- Clears the previous gradients with optimizer.zero_grad(); set_to_none=True releases them rather than zero-filling, which is slightly more efficient
- Computes gradients with loss.backward()
- Rescales the gradients whenever their global norm exceeds 1.0, guarding against exploding gradients
- Updates model parameters with optimizer.step()
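A handy detail: clip_grad_norm_ returns the total gradient norm computed before clipping, so it costs nothing extra to log it and see how often clipping actually fires. A variant of the same optimization step with that logging added:

optimizer.zero_grad(set_to_none=True)
loss.backward()
# clip_grad_norm_ returns the global norm of all gradients before rescaling
grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0).item()
optimizer.step()
if grad_norm > 1.0:
    print(f"step {step}: gradient norm {grad_norm:.2f} was clipped to 1.0")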
4. Evaluation
if step % 50 == 0:
model.eval()
with torch.no_grad():
vx, vy = get_batch("val", block_size=block_size, batch_size=32)
vloss = loss_fn(model(vx), vy).item()
print(f"step {step:04d} | train loss {loss.item():.3f} | val loss {vloss:.3f}")
- Every 50 steps, evaluates model performance on validation data
- Sets model to evaluation mode (disables dropout, etc.)
- Uses torch.no_grad() to disable gradient calculation during evaluation (saves memory)
- Gets a validation batch and computes validation loss
- Prints training progress with step number, training loss, and validation loss
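A single validation batch gives a noisy estimate, so a common refinement is to average the loss over several batches for both splits. A minimal sketch of such a helper (the batch count of 20 is an arbitrary choice):

@torch.no_grad()
def estimate_loss(num_batches=20):
    model.eval()
    out = {}
    for split in ("train", "val"):
        losses = torch.zeros(num_batches)
        for k in range(num_batches):
            bx, by = get_batch(split, block_size=block_size, batch_size=32)
            losses[k] = loss_fn(model(bx), by)
        out[split] = losses.mean().item()
    model.train()
    return out

Calling estimate_loss() every 50 steps in place of the single-batch check gives smoother, more trustworthy curves.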
This training loop implements core deep learning practices including train/validation splits, gradient clipping, and periodic evaluation to monitor for overfitting.
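As noted in the overview, learning rate scheduling is one of the easiest extensions. A minimal sketch using PyTorch's built-in cosine schedule; the schedule choice and the 500-step horizon are illustrative, not part of the original loop:

max_steps = 500
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)

for step in range(max_steps):
    model.train()
    x, y = get_batch("train", block_size=block_size, batch_size=32)
    loss = loss_fn(model(x), y)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    scheduler.step()  # decay the learning rate once per optimizer step

Early stopping can be layered on top by tracking the best validation loss (for example from the estimate_loss helper above) and breaking out of the loop when it stops improving for a set number of evaluations.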

