Under the Hood of Large Language Models

Project 1: Build a Toy Transformer from Scratch in PyTorch

0. Setup

What you'll build

A compact decoder-only Transformer (think GPT-style) that performs language modeling tasks. Unlike encoder-only models such as BERT or full encoder-decoder Transformers, this follows the GPT architecture, which uses only the decoder stack:

Tokenizes text (we'll start with a simple character-level tokenizer to keep focus on the model). This converts raw text into numerical tokens that the model can process, with each character mapped to a unique ID.
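
As a rough sketch of what that character-level tokenizer amounts to (the names text, stoi, and itos here are illustrative choices, not the project's final code):

text = "hello world"
chars = sorted(set(text))                      # unique characters form the vocabulary
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> character

def encode(s):
    return [stoi[c] for c in s]                # text -> list of token IDs

def decode(ids):
    return "".join(itos[i] for i in ids)       # list of token IDs -> text

print(decode(encode("hello")))                 # "hello"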

Embeds tokens and adds positional information. Token embeddings convert IDs into dense vectors, while positional encodings tell the model where each token appears in the sequence, which is critical since attention has no inherent notion of order.
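
To make that concrete, here is a minimal sketch of combining learned token and positional embeddings (the sizes vocab_size=65, d_model=32, and max_len=128 are illustrative placeholders, not the project's configuration):

import torch, torch.nn as nn

vocab_size, d_model, max_len = 65, 32, 128     # illustrative sizes
tok_emb = nn.Embedding(vocab_size, d_model)    # token ID -> dense vector
pos_emb = nn.Embedding(max_len, d_model)       # position index -> dense vector

ids = torch.randint(0, vocab_size, (1, 10))          # one sequence of 10 token IDs
positions = torch.arange(ids.size(1)).unsqueeze(0)   # [[0, 1, ..., 9]]
x = tok_emb(ids) + pos_emb(positions)                # (1, 10, d_model) input to the blocks
print(x.shape)                                       # torch.Size([1, 10, 32])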

Uses multi-head self-attention + feedforward (SwiGLU optional) inside a TransformerBlock. Self-attention allows tokens to attend to all previous tokens in the sequence, while multiple heads let the model focus on different aspects of the input. The feedforward network processes each position independently.
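
As a rough illustration of the attention computation itself, here is a single causally masked head on random tensors (the TransformerBlock in this project wraps several such heads, operating on slices of the model dimension, plus the feedforward layer):

import math, torch, torch.nn.functional as F

B, T, d = 1, 5, 16                                     # batch, sequence length, head dimension (illustrative)
q, k, v = torch.randn(B, T, d), torch.randn(B, T, d), torch.randn(B, T, d)

scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (B, T, T) similarity scores
mask = torch.tril(torch.ones(T, T, dtype=torch.bool))  # lower-triangular causal mask
scores = scores.masked_fill(~mask, float("-inf"))      # hide future positions
weights = F.softmax(scores, dim=-1)                    # each row sums to 1
out = weights @ v                                      # (B, T, d) weighted mix of value vectors

Multi-head attention simply runs several of these computations in parallel on smaller slices of the model dimension and concatenates the results.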

Trains with causal language modeling (predict next token). This means the model can only see previous tokens when predicting the next one, maintaining the autoregressive property needed for text generation.
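
Concretely, the target sequence is just the input shifted one position to the left, and the loss is cross-entropy over the vocabulary at every position. A sketch with placeholder logits (the real loss uses the model's actual outputs):

import torch, torch.nn.functional as F

vocab_size, T = 65, 8                                   # illustrative sizes
tokens = torch.randint(0, vocab_size, (1, T + 1))       # a chunk of T + 1 token IDs
inputs, targets = tokens[:, :-1], tokens[:, 1:]         # targets[t] is the token after inputs[t]

logits = torch.randn(1, T, vocab_size)                  # placeholder for the model's predictions
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.reshape(-1))
print(loss.item())                                      # average negative log-likelihood per token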

Generates text with temperature/top-k sampling. Temperature controls randomness (higher values = more diverse outputs), while top-k sampling restricts the model to choosing from only the k most likely next tokens.
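
A sketch of how those two knobs are typically applied to the next-token logits at each generation step (the logits below are random placeholders; temperature=0.8 and top_k=10 are arbitrary example settings):

import torch, torch.nn.functional as F

logits = torch.randn(65)                 # placeholder logits over a 65-symbol vocabulary
temperature, top_k = 0.8, 10             # example settings

logits = logits / temperature            # < 1.0 sharpens, > 1.0 flattens the distribution
vals, idx = torch.topk(logits, top_k)    # keep only the k highest-scoring tokens
filtered = torch.full_like(logits, float("-inf"))
filtered[idx] = vals
probs = F.softmax(filtered, dim=-1)      # probabilities are zero outside the top k
next_id = torch.multinomial(probs, num_samples=1)   # sample one token ID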

You can run this on CPU or a single GPU. The implementation is hardware-flexible and doesn't require specialized equipment. The code is organized for readability, so you can swap parts later (e.g., try RoPE for better handling of longer sequences, swap LayerNorm for RMSNorm for simpler, slightly faster normalization, or plug in a subword tokenizer like BPE for more efficient vocabulary usage).

# pip install torch --upgrade
import math, torch, torch.nn as nn, torch.nn.functional as F
device = "cuda" if torch.cuda.is_available() else "cpu"
torch.manual_seed(42)

Breakdown of the code:

Line 1: Package Installation Comment

# pip install torch --upgrade

This is a comment indicating how to install or upgrade PyTorch using pip. It's not executed code but serves as a reminder for setting up the environment.

Line 2: Imports

import math, torch, torch.nn as nn, torch.nn.functional as F

This line imports several necessary libraries:

  • math: Python's standard math library for mathematical operations
  • torch: The main PyTorch library for deep learning
  • torch.nn as nn: Neural network modules from PyTorch
  • torch.nn.functional as F: Functional interface for neural network operations

Line 3: Device Configuration

device = "cuda" if torch.cuda.is_available() else "cpu"

This line determines whether to use GPU (CUDA) or CPU for computations:

  • It checks if CUDA is available using torch.cuda.is_available()
  • If a GPU with CUDA support is available, it sets device = "cuda"
  • Otherwise, it defaults to device = "cpu"
  • This lets the same code run on whatever hardware is available, using the GPU when possible, as the short snippet below illustrates
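
For example (a throwaway snippet, assuming the setup block above has been run), the device string is later passed to .to() so tensors and modules end up on the same hardware:

x = torch.randn(4, 8).to(device)         # move a tensor to the selected device
layer = nn.Linear(8, 8).to(device)       # move a module's parameters as well
y = layer(x)                             # the computation runs on CPU or GPU accordingly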

Line 4: Setting Random Seed

torch.manual_seed(42)

This line sets a fixed random seed (42) for PyTorch's random number generators:

  • Setting a seed ensures reproducibility of results
  • Every time this code runs, the random operations (like weight initialization) will produce the same values
  • This is crucial for debugging and for running consistent experiments in machine learning, as the quick check below demonstrates
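
A quick way to see the effect (assuming the setup block above has been run): re-seeding before a random draw resets the generator, so both draws match:

torch.manual_seed(42)
a = torch.randn(3)          # some pseudo-random values
torch.manual_seed(42)
b = torch.randn(3)          # the generator was reset, so the same values come out
print(torch.equal(a, b))    # True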

Overall, this is a standard setup block for a PyTorch deep learning project: it prepares the environment for building and training neural networks, in this case the toy Transformer developed in the rest of this project.
