Project 1: Build a Toy Transformer from Scratch in PyTorch
6. Where to go next (your experiments)
- Tokenizer: swap char-level for BPE/SentencePiece. Character-level tokenization assigns one token per character, which is simple but inefficient. BPE (Byte Pair Encoding) or SentencePiece tokenizers build subword units that keep the vocabulary size manageable while capturing common word patterns, which shortens sequences and can noticeably improve model quality (see the tokenizer sketch after this list).
- Positions: replace sinusoidal embeddings with RoPE for long-context robustness. Rotary Position Embedding (RoPE) rotates queries and keys inside the attention computation instead of adding position vectors to the token embeddings. This helps the model handle longer sequences and generalize better to lengths not seen during training (RoPE sketch below).
- Norm: try RMSNorm (used in LLaMA-family models). Root Mean Square Normalization simplifies LayerNorm by dropping the mean-centering step, making it slightly cheaper while keeping training just as stable; LLaMA and other modern architectures use it for exactly this reason (RMSNorm sketch below).
- FFN: replace GELU with SwiGLU. SwiGLU (Swish-Gated Linear Unit) combines a gating mechanism with the Swish/SiLU activation. The gate lets the network selectively pass information through the feed-forward layer, which typically improves quality, especially at larger model sizes (SwiGLU sketch below).
- Regularization: dropout schedules and weight decay tuning. Instead of a fixed dropout rate, try a schedule that reduces dropout as training progresses, and tune weight decay to balance regularization against model capacity. Both improve generalization by balancing underfitting against overfitting (regularization sketch below).
- Scaling: increase d_model and n_layers, and watch overfitting and training time. Larger models (wider hidden states, more layers) can capture more complex patterns, but they need more data to avoid overfitting and more compute to train. Monitor validation loss carefully as you scale up, and consider gradient checkpointing to manage memory (it appears in the last sketch below).
- Data: feed a bigger corpus; save/load checkpoints. Language models benefit enormously from more, and more diverse, training data. Implement checkpoint saving and loading so you can resume long runs and try curriculum strategies that gradually increase data complexity or model size throughout training (checkpointing sketch below).
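Here is a minimal sketch of the tokenizer swap using the sentencepiece package (pip install sentencepiece). The corpus filename, model prefix, and vocabulary size are placeholders, not values from this project.

```python
import sentencepiece as spm

# Train a small BPE model on your text; this writes toy_bpe.model / toy_bpe.vocab.
# "corpus.txt" and vocab_size=4000 are placeholder choices.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="toy_bpe",
    vocab_size=4000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="toy_bpe.model")
ids = sp.encode("hello transformer", out_type=int)  # list of token ids
print(ids, sp.decode(ids))                          # round-trips back to text
```

The token ids then replace the character ids in your dataset; the embedding table grows to vocab_size rows, but sequences get substantially shorter.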
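As a concrete reference for the positions experiment, here is one common RoPE formulation (the split-in-half channel pairing used in GPT-NeoX-style code). It assumes queries and keys have shape (batch, heads, seq_len, head_dim) with an even head_dim; adapt the shapes to your attention module.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotate pairs of channels by a position-dependent angle.
    x: (batch, heads, seq_len, head_dim), head_dim must be even."""
    *_, seq_len, head_dim = x.shape
    half = head_dim // 2
    # One frequency per channel pair, as in the sinusoidal scheme.
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(seq_len, dtype=x.dtype, device=x.device)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# In the attention layer, rotate queries and keys before the dot product:
# q, k = apply_rope(q), apply_rope(k)
```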
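A minimal RMSNorm module that could stand in for nn.LayerNorm(d_model), assuming features live in the last dimension:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale by the root-mean-square of the features; no mean subtraction, no bias."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(d_model))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        inv_rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * inv_rms * self.weight
```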
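One way to write the SwiGLU feed-forward block. The hidden width is a free choice here; LLaMA-style models use roughly 8/3 × d_model so the parameter count stays close to a GELU FFN with a 4 × d_model hidden layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """silu(x @ W1) gates (x @ W3), then W2 projects back to d_model."""
    def __init__(self, d_model: int, d_hidden: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)  # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)  # value projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```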
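A sketch of both regularization ideas: a helper that rewrites every nn.Dropout rate so you can decay it between epochs, and AdamW parameter groups that apply weight decay only to weight matrices (a common convention). The schedule and the decay value are placeholders to tune.

```python
import torch
import torch.nn as nn

def set_dropout(model: nn.Module, p: float) -> None:
    """Set the rate of every nn.Dropout in the model (call between epochs)."""
    for module in model.modules():
        if isinstance(module, nn.Dropout):
            module.p = p

def param_groups(model: nn.Module, weight_decay: float):
    """Decay 2D+ weights (matrices, embeddings); skip biases and norm gains."""
    decay = [p for p in model.parameters() if p.requires_grad and p.ndim >= 2]
    no_decay = [p for p in model.parameters() if p.requires_grad and p.ndim < 2]
    return [{"params": decay, "weight_decay": weight_decay},
            {"params": no_decay, "weight_decay": 0.0}]

# optimizer = torch.optim.AdamW(param_groups(model, weight_decay=0.1), lr=3e-4)
# for epoch in range(num_epochs):                        # placeholder schedule:
#     set_dropout(model, 0.2 * (1 - epoch / num_epochs)) # dropout 0.2 -> 0 linearly
```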
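For the scaling and data experiments, here is a sketch of checkpoint save/load plus PyTorch's activation (gradient) checkpointing. The file path and the assumption that the model exposes a model.blocks list are illustrative, not from the project code.

```python
import torch
from torch.utils.checkpoint import checkpoint

def save_checkpoint(model, optimizer, step, path="ckpt.pt"):
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, path)

def load_checkpoint(model, optimizer, path="ckpt.pt"):
    state = torch.load(path, map_location="cpu")
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    return state["step"]  # resume training from this step

# Activation checkpointing trades compute for memory: each block's activations
# are recomputed in the backward pass instead of being stored.
# for block in model.blocks:                        # assumes blocks are exposed
#     x = checkpoint(block, x, use_reentrant=False)
```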

