Project 1: Build a Toy Transformer from Scratch in PyTorch
3. The Tiny GPT-Style Model
Now we arrive at the final assembly of our TinyGPT model - a compact decoder-only transformer that combines all the components we've built so far. This class ties together the token embedding, positional encoding, transformer blocks, and output projection layer into a complete language model.
The TinyGPT class represents a minimal but functional GPT-style architecture with these key features:
- Modular design: Combines embedding, positional encoding, transformer blocks, and output projection
- Configurable architecture: Customizable parameters for model dimensions, layers, heads, etc.
- Weight tying: Shares weights between input embedding and output projection for parameter efficiency
- Decoder-only approach: Uses only the decoder part of the transformer architecture (GPT-style)
The forward method shows the complete data flow through the model:
- Convert token IDs to embeddings
- Add positional information
- Process through a series of transformer blocks
- Apply final layer normalization
- Project to vocabulary logits for next-token prediction
This architecture follows the same principles as much larger models like GPT-2/3/4, but at a more manageable scale for educational purposes.
class TinyGPT(nn.Module):
    def __init__(self, vocab_size, d_model=256, n_layers=4, n_heads=4, d_ff=1024, max_len=512, dropout=0.1):
        super().__init__()
        self.tok_embed = nn.Embedding(vocab_size, d_model)          # token IDs -> dense vectors
        self.pos = SinusoidalPositionalEncoding(d_model, max_len)   # adds positional information
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, n_heads, d_ff, dropout) for _ in range(n_layers)
        ])
        self.ln_f = nn.LayerNorm(d_model)                           # final layer norm
        self.lm_head = nn.Linear(d_model, vocab_size, bias=False)   # projection to vocabulary logits
        # Weight tying helps a bit on tiny setups
        self.tok_embed.weight = self.lm_head.weight

    def forward(self, idx):
        x = self.tok_embed(idx)      # [B, T, C]
        x = self.pos(x)
        for blk in self.blocks:
            x = blk(x)
        x = self.ln_f(x)
        logits = self.lm_head(x)     # [B, T, V]
        return logits
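Before moving on, it helps to confirm the wiring end to end. The snippet below is a minimal smoke test, assuming the SinusoidalPositionalEncoding and TransformerBlock classes from the earlier sections are in scope; the sizes are illustrative, not prescriptive.

import torch

vocab_size = 1000
model = TinyGPT(vocab_size, d_model=128, n_layers=2, n_heads=4, d_ff=512)

B, T = 8, 32                                # batch size, sequence length
idx = torch.randint(0, vocab_size, (B, T))  # a batch of random token IDs
logits = model(idx)
print(logits.shape)                         # expected: torch.Size([8, 32, 1000])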
Here's a comprehensive breakdown of the TinyGPT class:
Class Definition:
TinyGPT is a PyTorch neural network module that implements a compact decoder-only transformer architecture (similar to GPT-style models). It inherits from PyTorch's nn.Module base class, which provides the foundation for all neural network modules in PyTorch.
Constructor Parameters:
- vocab_size: Size of the vocabulary (number of unique tokens)
- d_model: Dimension of the embedding vectors (default: 256)
- n_layers: Number of transformer blocks (default: 4)
- n_heads: Number of attention heads in each transformer block (default: 4)
- d_ff: Dimension of the feed-forward network within transformer blocks (default: 1024)
- max_len: Maximum sequence length supported (default: 512)
- dropout: Dropout probability for regularization (default: 0.1)
Component Initialization:
- tok_embed: An embedding layer that converts token IDs to dense vectors of size d_model
- pos: A SinusoidalPositionalEncoding layer that adds positional information to the embeddings
- blocks: A ModuleList containing n_layers TransformerBlock instances, each with the specified parameters
- ln_f: A final LayerNorm applied after all transformer blocks
- lm_head: A linear layer that projects from d_model dimensions to vocab_size, producing logits for next-token prediction
Weight Tying:
The code ties the weights of the token embedding (tok_embed) and the output projection (lm_head) with the line: self.tok_embed.weight = self.lm_head.weight. This parameter sharing technique reduces the total number of parameters and has been shown to improve performance in language models.
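A quick way to see the effect in code (a sketch, using the default hyperparameters above): after tying, both attributes reference the same Parameter object, and PyTorch counts shared parameters only once, so the saving shows up directly in the total.

model = TinyGPT(vocab_size=1000)  # defaults: d_model=256

# Both attributes now reference the same Parameter object.
assert model.tok_embed.weight is model.lm_head.weight

# .parameters() counts a shared Parameter once; an untied lm_head
# would add an extra vocab_size * d_model weights.
n_params = sum(p.numel() for p in model.parameters())
print(f"total parameters: {n_params:,} (tying saves {1000 * 256:,})")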
Forward Method:
The forward method defines the data flow through the model:
- Takes token indices (idx) as input
- Converts them to embeddings using tok_embed - resulting shape is [Batch, Time, Channels]
- Adds positional information using the pos encoder
- Sequentially processes the embeddings through each transformer block
- Applies the final layer normalization (ln_f)
- Projects to vocabulary logits using lm_head - resulting shape is [Batch, Time, Vocabulary]
- Returns the logits for further processing - typically computing a next-token loss or generating predictions (a minimal loss sketch follows below)
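For training, the logits are usually compared against the input sequence shifted by one position. Here is a minimal sketch of that next-token loss; the exact batching and target layout will depend on your data pipeline, and model and idx are reused from the smoke test above.

import torch.nn.functional as F

# Predict token t+1 from tokens <= t: feed all but the last token,
# and use the input shifted left by one position as the target.
logits = model(idx[:, :-1])                  # [B, T-1, V]
targets = idx[:, 1:]                         # [B, T-1]

loss = F.cross_entropy(
    logits.reshape(-1, logits.size(-1)),     # [B*(T-1), V]
    targets.reshape(-1),                     # [B*(T-1)]
)
loss.backward()                              # a standard optimizer step would follow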
Architecture Significance:
This TinyGPT implementation represents a scaled-down version of modern decoder-only transformer architectures like GPT-2/3/4. Despite its simplicity, it contains all the essential components: token embeddings, positional encodings, self-attention mechanisms (via the TransformerBlock), and the final projection layer for next-token prediction.
The architecture follows a decoder-only approach, which means it's designed for autoregressive tasks like text generation where each token prediction depends only on previous tokens, not future ones.
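A minimal greedy decoding loop makes the autoregressive usage concrete. This is a sketch, not the project's final sampler: it assumes the causal masking that hides future tokens lives inside the TransformerBlock's attention, and it recomputes the full forward pass at every step rather than caching past activations.

import torch

@torch.no_grad()
def generate(model, idx, max_new_tokens, max_len=512):
    """Greedy decoding: append the argmax token one step at a time."""
    model.eval()
    for _ in range(max_new_tokens):
        idx_cond = idx[:, -max_len:]                              # stay within the positional-encoding range
        logits = model(idx_cond)                                  # [B, T, V]
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)   # pick the most likely next token, [B, 1]
        idx = torch.cat([idx, next_id], dim=1)
    return idx

# Example: continue a 1-token prompt by 20 tokens.
out = generate(model, torch.zeros((1, 1), dtype=torch.long), max_new_tokens=20)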