Under the Hood of Large Language Models

Project 1: Build a Toy Transformer from Scratch in PyTorch

3. The Tiny GPT-Style Model

Now we arrive at the final assembly of our TinyGPT model - a compact decoder-only transformer that combines all the components we've built so far. This class ties together the token embedding, positional encoding, transformer blocks, and output projection layer into a complete language model.

The TinyGPT class represents a minimal but functional GPT-style architecture with these key features:

  • Modular design: Combines embedding, positional encoding, transformer blocks, and output projection
  • Configurable architecture: Customizable parameters for model dimensions, layers, heads, etc.
  • Weight tying: Shares weights between input embedding and output projection for parameter efficiency
  • Decoder-only approach: Uses only the decoder part of the transformer architecture (GPT-style)

The forward method shows the complete data flow through the model:

  1. Convert token IDs to embeddings
  2. Add positional information
  3. Process through a series of transformer blocks
  4. Apply final layer normalization
  5. Project to vocabulary logits for next-token prediction

This architecture follows the same principles as much larger models like GPT-2/3/4, but at a more manageable scale for educational purposes.

import torch.nn as nn

# SinusoidalPositionalEncoding and TransformerBlock are the modules built in the
# previous sections of this project.

class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)          # token IDs -> dense vectors
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)   # adds positional information
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)                           # final layer norm
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # projection to vocab logits

        # Weight tying helps a bit on tiny setups
        self.tok_embed.weight = self.lm_head.weight

    def forward(self, idx):
        x = self.tok_embed(idx)               # [B,T,C]
        x = self.pos(x)                       # add positional encoding
        for blk in self.blocks:
            x = blk(x)                        # run through each transformer block
        x = self.ln_f(x)
        logits = self.lm_head(x)              # [B,T,V]
        return logits

Here's a comprehensive breakdown of the TinyGPT class:

Class Definition:

TinyGPT is a PyTorch neural network module that implements a compact decoder-only transformer architecture (similar to GPT-style models). It inherits from PyTorch's nn.Module base class, which provides the foundation for all neural network modules in PyTorch.

Constructor Parameters:

  • vocab_size: Size of the vocabulary (number of unique tokens)
  • d_model: Dimension of the embedding vectors (default: 256)
  • n_layers: Number of transformer blocks (default: 4)
  • n_heads: Number of attention heads in each transformer block (default: 4)
  • d_ff: Dimension of the feed-forward network within transformer blocks (default: 1024)
  • max_len: Maximum sequence length supported (default: 512)
  • dropout: Dropout probability for regularization (default: 0.1)
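To make these defaults concrete, here is a small sketch that instantiates the model and counts its trainable parameters. The vocabulary size of 5,000 is a placeholder; in the project you would pass in the size of the tokenizer's vocabulary built earlier.

vocab_size = 5000  # placeholder; use your tokenizer's actual vocabulary size

model = TinyGPT(vocab_size)  # defaults: d_model=256, n_layers=4, n_heads=4, d_ff=1024

# Count trainable parameters; the tied embedding/output matrix is counted once
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"TinyGPT has {n_params:,} trainable parameters")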

Component Initialization:

  • tok_embed: An embedding layer that converts token IDs to dense vectors of size d_model
  • pos: A SinusoidalPositionalEncoding layer that adds positional information to the embeddings
  • blocks: A ModuleList containing n_layers TransformerBlock instances, each with the specified parameters
  • ln_f: A final LayerNorm applied after all transformer blocks
  • lm_head: A linear layer that projects from d_model dimensions to vocab_size, producing logits for next-token prediction

Weight Tying:

The code ties the weights of the token embedding (tok_embed) and the output projection (lm_head) with the line: self.tok_embed.weight = self.lm_head.weight. This parameter sharing technique reduces the total number of parameters and has been shown to improve performance in language models.
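A quick way to see that this is genuine sharing (one tensor, two names) rather than a copy is to check object identity. The shapes line up because nn.Linear stores its weight as [out_features, in_features], i.e. [vocab_size, d_model], exactly matching the embedding matrix. The snippet below is an illustrative sketch with a placeholder vocabulary size, not part of the model itself.

model = TinyGPT(vocab_size=5000)

# Both attributes refer to the very same Parameter object
assert model.tok_embed.weight is model.lm_head.weight

# Untied, the output projection would add its own [vocab_size, d_model] matrix:
# 5000 * 256 = 1,280,000 extra parameters in this configuration
saved = model.lm_head.weight.numel()
print(f"Weight tying saves {saved:,} parameters")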

Forward Method:

The forward method defines the data flow through the model:

  1. Takes token indices (idx) as input
  2. Converts them to embeddings using tok_embed - resulting shape is [Batch, Time, Channels]
  3. Adds positional information using the pos encoder
  4. Sequentially processes the embeddings through each transformer block
  5. Applies the final layer normalization (ln_f)
  6. Projects to vocabulary logits using lm_head - resulting shape is [Batch, Time, Vocabulary]
  7. Returns the logits for further processing (typically computing loss or generating predictions)
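To make the shapes and the final step concrete, here is a sketch of one forward pass followed by the standard next-token cross-entropy loss, where inputs and targets are the same sequence shifted by one position. The batch size, sequence length, and vocabulary size are made-up values for illustration.

import torch
import torch.nn.functional as F

B, T, vocab_size = 8, 64, 5000                     # made-up sizes for illustration
model = TinyGPT(vocab_size)

tokens = torch.randint(0, vocab_size, (B, T + 1))  # dummy batch of token IDs
idx, targets = tokens[:, :-1], tokens[:, 1:]       # predict token t+1 from tokens up to t

logits = model(idx)                                # [B, T, V]
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(logits.shape, loss.item())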

Architecture Significance:

This TinyGPT implementation represents a scaled-down version of modern decoder-only transformer architectures like GPT-2/3/4. Despite its simplicity, it contains all the essential components: token embeddings, positional encodings, self-attention mechanisms (via the TransformerBlock), and the final projection layer for next-token prediction.

The architecture follows a decoder-only approach, which means it's designed for autoregressive tasks like text generation where each token prediction depends only on previous tokens, not future ones.
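As a usage sketch of that autoregressive loop, the snippet below greedily decodes new tokens one at a time, feeding each prediction back in as input. It assumes the TransformerBlock built earlier applies a causal mask internally; greedy_generate is a hypothetical helper for illustration, not part of the TinyGPT class.

import torch

@torch.no_grad()
def greedy_generate(model, idx, max_new_tokens=20, max_len=512):
    # idx: [B, T] tensor of prompt token IDs
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_len:]             # stay within the positional-encoding limit
        logits = model(idx_cond)                 # [B, T, V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
        idx = torch.cat([idx, next_id], dim=1)   # append and feed back in
    return idx

model = TinyGPT(vocab_size=5000)
prompt = torch.randint(0, 5000, (1, 4))          # dummy prompt of 4 token IDs
print(greedy_generate(model, prompt).shape)      # torch.Size([1, 24])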
