Project 1: Build a Toy Transformer from Scratch in PyTorch
1. Tiny Dataset & Character Tokenizer
For pedagogy, a char-level tokenizer keeps the model logic front and center. Character-level tokenization assigns a unique token to each character in the text, creating a simple and transparent vocabulary.
While this approach is less efficient than subword tokenizers (like BPE or WordPiece) for practical applications, it offers significant educational benefits: it eliminates the complexity of sophisticated tokenization algorithms, allows students to focus on the core transformer architecture, and creates a direct mapping between raw text and model inputs that's easy to visualize and debug.
This straightforward approach ensures that learners can concentrate on understanding attention mechanisms and model training without getting distracted by tokenization details.
import torch

# NOTE: the rest of this project assumes a `device` variable; if it has not been
# defined earlier, a standard choice is the GPU when available, otherwise the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

corpus = (
    "In the beginning, there were tokens. "
    "A small transformer can still learn patterns."
)

# Build vocab
chars = sorted(list(set(corpus)))
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
vocab_size = len(chars)

def encode(s): return torch.tensor([stoi[c] for c in s], dtype=torch.long)
def decode(ids): return "".join([itos[int(i)] for i in ids])

data = encode(corpus).to(device)

# Train/val split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]
Here's the breakdown of the code:
1. Corpus Definition
corpus = (
    "In the beginning, there were tokens. "
    "A small transformer can still learn patterns."
)

This defines the training text (corpus) for the model: a simple two-sentence string that will be used to train the toy transformer. In Python, adjacent string literals are automatically concatenated.
2. Vocabulary Creation
# Build vocab
chars = sorted(list(set(corpus)))
stoi = {ch:i for i,ch in enumerate(chars)}
itos = {i:ch for ch,i in stoi.items()}
vocab_size = len(chars)

This section creates a character-level vocabulary:
- chars: a sorted list of the unique characters in the corpus, obtained by converting the text to a set and sorting it
- stoi (string-to-index): a dictionary mapping each character to a unique integer ID
- itos (index-to-string): the inverse mapping from IDs back to characters
- vocab_size: the total number of unique characters in the corpus
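To make the mapping concrete, you can print a few entries of the vocabulary. This is a quick sanity check, assuming the code above has already run (the exact output depends on the corpus):

# Inspect the character-level vocabulary
print(vocab_size)                        # number of unique characters in the corpus
print(chars[:12])                        # first few characters, sorted (space and punctuation included)
print({ch: stoi[ch] for ch in "token"})  # sample character-to-ID mappings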
3. Encoding and Decoding Functions
def encode(s): return torch.tensor([stoi[c] for c in s], dtype=torch.long)
def decode(ids): return "".join([itos[int(i)] for i in ids])

These are utility functions for converting between text and token IDs:
- encode(): converts a string into a tensor of token IDs using the stoi mapping
- decode(): converts a tensor of token IDs back into a string using the itos mapping
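A quick round trip confirms that the two functions are inverses of each other; a minimal check, assuming encode and decode are defined as above:

# Encode a snippet that appears in the corpus, then decode it back
ids = encode("tokens")
print(ids)                      # 1-D tensor of integer IDs, one per character
print(decode(ids))              # 'tokens'
assert decode(ids) == "tokens"  # the round trip should be lossless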
4. Data Preparation
data = encode(corpus).to(device)

This line encodes the entire corpus into a tensor of token IDs and moves it to the appropriate device (CPU or GPU).
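A small check of the resulting tensor, assuming the corpus above (shape and values depend on the text):

print(data.shape)   # (corpus_length,): one ID per character
print(data.dtype)   # torch.int64, the index dtype expected by nn.Embedding
print(data[:10])    # IDs of the first ten characters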
5. Train/Validation Split
# Train/val split
n = int(0.9 * len(data))
train_data, val_data = data[:n], data[n:]

This performs a simple train/validation split:
- Calculates the split point (n) at 90% of the data length
- Assigns the first 90% to train_data and the remaining 10% to val_data
- This split allows the model to be evaluated on unseen data during training
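Because the corpus is so short (about 82 characters), it is worth printing the split sizes; a quick check:

print(len(data), len(train_data), len(val_data))   # roughly 82, 73 and 9 for this corpus
# The validation split holds only a handful of characters, so keep block_size
# well below len(val_data) when sampling validation batches below.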
We’ll create training samples as sliding windows.
def get_batch(split, block_size=64, batch_size=32):
    # Pick the source split; len(src) must exceed block_size + 1, so the tiny
    # "val" split of this corpus needs a much smaller block_size than the default.
    src = train_data if split=="train" else val_data
    # Random starting positions for each sequence in the batch
    ix = torch.randint(0, len(src) - block_size - 1, (batch_size,))
    x = torch.stack([src[i:i+block_size] for i in ix])
    y = torch.stack([src[i+1:i+block_size+1] for i in ix])
    return x.to(device), y.to(device)

Here's a comprehensive breakdown of the get_batch function:
The get_batch function creates training or validation batches for the transformer model. It generates input-output pairs where each input is a sequence of tokens, and the corresponding output is the same sequence shifted by one position (for next-token prediction).
Function signature and parameters:
def get_batch(split, block_size=64, batch_size=32):

This function takes three parameters:
- split: a string indicating whether to use training data ("train") or validation data ("val")
- block_size: the sequence length of each example (default: 64 tokens)
- batch_size: the number of sequences in each batch (default: 32)
Function body and logic:
- src = train_data if split=="train" else val_data: selects the appropriate dataset based on the split parameter
- ix = torch.randint(0, len(src) - block_size - 1, (batch_size,)): generates batch_size random starting indices within the source data
  - The upper bound len(src) - block_size - 1 ensures there's enough space for both input (x) and target (y) sequences
  - This creates a tensor of shape [batch_size] containing random indices
- x = torch.stack([src[i:i+block_size] for i in ix]): creates input sequences
  - For each random index i, extracts a sequence of length block_size
  - The list comprehension creates batch_size sequences, which are stacked into a tensor
  - The resulting tensor has shape [batch_size, block_size]
- y = torch.stack([src[i+1:i+block_size+1] for i in ix]): creates target sequences
  - Similar to the previous line, but shifts each sequence by one position
  - The target for position j is the token at position j+1 in the source data
  - This implements causal language modeling: predict the next token given the previous tokens
- return x.to(device), y.to(device): returns both tensors moved to the appropriate device (CPU or GPU)
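To see the shapes and the one-position shift in action, draw a single batch. This is a minimal usage sketch; note the smaller block_size, chosen so the call also works on the tiny validation split of this corpus:

# Sample a small training batch and inspect it
xb, yb = get_batch("train", block_size=8, batch_size=4)
print(xb.shape, yb.shape)   # torch.Size([4, 8]) torch.Size([4, 8])
print(decode(xb[0]))        # an 8-character window from the corpus
print(decode(yb[0]))        # the same window shifted one character to the right

# The validation split is only a few characters long, so use an even smaller block_size
xv, yv = get_batch("val", block_size=4, batch_size=2)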