Under the Hood of Large Language Models

Project 1: Build a Toy Transformer from Scratch in PyTorch

1. Tiny Dataset & Character Tokenizer

For pedagogy, a char-level tokenizer keeps the model logic front and center. Character-level tokenization assigns a unique token to each character in the text, creating a simple and transparent vocabulary.

While this approach is less efficient than subword tokenizers (like BPE or WordPiece) for practical applications, it offers significant educational benefits: it eliminates the complexity of sophisticated tokenization algorithms, allows students to focus on the core transformer architecture, and creates a direct mapping between raw text and model inputs that's easy to visualize and debug.

This straightforward approach ensures that learners can concentrate on understanding attention mechanisms and model training without getting distracted by tokenization details.

import torch

# Run on a GPU if one is available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

corpus = (
    "In the beginning, there were tokens. "
    "A small transformer can still learn patterns."
)

# Build vocab
chars = sorted(list(set(corpus)))
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
vocab_size = len(chars)

def encode(s): return torch.tensor([stoi[c] for c in s], dtype=torch.long)
def decode(ids): return "".join([itos[int(i)] for i in ids])

data = encode(corpus).to(device)

# Train/val split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

Here's the breakdown of the code:

1. Corpus Definition

corpus = (
    "In the beginning, there were tokens. "
    "A small transformer can still learn patterns."
)

This defines the training text (corpus) for the model: a simple two-sentence string that will be used to train the toy transformer. In Python, adjacent string literals placed next to each other inside parentheses are automatically concatenated into a single string.
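As a quick illustration of that concatenation rule (purely a sanity-check sketch, not part of the project code), you can inspect the resulting string:

print(type(corpus))   # <class 'str'>: the two literals became one string
print(len(corpus))    # total number of characters across both sentences
print(corpus[:20])    # the opening characters of the combined string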

2. Vocabulary Creation

# Build vocab
chars = sorted(list(set(corpus)))
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
vocab_size = len(chars)

This section creates a character-level vocabulary:

  • chars: A sorted list of the unique characters in the corpus, built by converting the string to a set and sorting it
  • stoi (string-to-index): A dictionary mapping each character to a unique integer ID
  • itos (index-to-string): The inverse mapping from IDs back to characters
  • vocab_size: The total number of unique characters in the corpus
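To make the vocabulary concrete, it helps to print a few of these mappings after running the code above (an illustrative sketch; the exact IDs depend on the sort order of the characters):

print(vocab_size)            # number of distinct characters in the corpus
print(chars[:10])            # the first few characters in sorted order
print(stoi["A"], stoi["t"])  # integer IDs assigned to 'A' and 't'
print(itos[0])               # the character that maps back from ID 0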

3. Encoding and Decoding Functions

def encode(s): return torch.tensor([stoi[c] for c in s], dtype=torch.long)
def decode(ids): return "".join([itos[int(i)] for i in ids])

These are utility functions for converting between text and token IDs:

  • encode(): Converts a string into a tensor of token IDs using the stoi mapping
  • decode(): Converts a tensor of token IDs back into a string using the itos mapping
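A quick round trip through both helpers shows that they are inverses of each other. One caveat worth noting: encode only handles characters that appear in the corpus; anything outside the vocabulary raises a KeyError.

ids = encode("tokens")       # a word taken from the corpus, so every character is in stoi
print(ids)                   # 1-D tensor of integer IDs, one per character
print(decode(ids))           # "tokens" again
assert decode(ids) == "tokens"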

4. Data Preparation

data = encode(corpus).to(device)

This line encodes the entire corpus into a tensor of token IDs and moves it to the appropriate device (CPU or GPU).

5. Train/Validation Split

# Train/val split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

This performs a simple train/validation split:

  • Calculates the split point (n) at 90% of the data length
  • Assigns the first 90% to train_data and the remaining 10% to val_data
  • This split allows for evaluating the model on unseen data during training
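Given how tiny the corpus is, it is worth checking exactly how much data each split receives (a short inspection sketch, assuming the code above has been run):

print(data.shape, data.dtype)            # 1-D tensor of token IDs, dtype torch.long
print(len(train_data), len(val_data))    # roughly 90% / 10% of the characters

Note how few characters end up in val_data; this matters in the next step, where the sampling window (block_size) must be smaller than the split it is drawn from.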

We’ll create training samples as sliding windows.

def get_batch(split, block_size=64, batch_size=32):
    # len(src) must be at least block_size + 2 for torch.randint to have a
    # non-empty range; with this tiny corpus, use a smaller block_size for "val".
    src = train_data if split=="train" else val_data
    ix = torch.randint(0, len(src) - block_size - 1, (batch_size,))
    x = torch.stack([src[i:i+block_size] for i in ix])
    y = torch.stack([src[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

Here's a comprehensive breakdown of the get_batch function:

The get_batch function creates training or validation batches for the transformer model. It generates input-output pairs where each input is a sequence of tokens, and the corresponding output is the same sequence shifted by one position (for next-token prediction).

Function signature and parameters:

  • def get_batch(split, block_size=64, batch_size=32): This function takes three parameters: 
    • split: A string indicating whether to use training data ("train") or validation data ("val")
    • block_size: The sequence length of each example (default: 64 tokens)
    • batch_size: The number of sequences in each batch (default: 32)

Function body and logic:

  • src = train_data if split=="train" else val_data: Selects the appropriate dataset based on the split parameter
  • ix = torch.randint(0, len(src) - block_size - 1, (batch_size,)): Generates batch_size random starting indices within the source data 
    • The upper bound len(src) - block_size - 1 ensures there's enough space for both input (x) and target (y) sequences
    • This creates a tensor of shape [batch_size] containing random indices
  • x = torch.stack([src[i:i+block_size] for i in ix]): Creates input sequences 
    • For each random index i, extracts a sequence of length block_size
    • The list comprehension creates batch_size sequences, which are stacked into a tensor
    • The resulting tensor has shape [batch_size, block_size]
  • y = torch.stack([src[i+1:i+block_size+1] for i in ix]): Creates target sequences 
    • Similar to the previous line, but shifts each sequence by one position
    • The target for position j is the token at position j+1 in the source data
    • This implements causal language modeling: predict the next token given previous tokens
  • return x.to(device), y.to(device): Returns both tensors moved to the appropriate device (CPU or GPU)
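Putting it together, here is a minimal sanity check of get_batch (a sketch under the setup above; block_size is reduced to 8 because the corpus is only a couple of sentences long):

xb, yb = get_batch("train", block_size=8, batch_size=4)
print(xb.shape, yb.shape)                  # both torch.Size([4, 8])
print(decode(xb[0]))                       # an 8-character slice of the corpus
print(decode(yb[0]))                       # the same slice shifted right by one character
assert torch.equal(xb[0, 1:], yb[0, :-1])  # targets are the inputs shifted by one position

Sampling from the "val" split on this corpus needs an even smaller block_size, since the validation set holds only a handful of characters and the random start index requires len(src) to exceed block_size + 1.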
