Under the Hood of Large Language Models

Chapter 1: What Are LLMs? From Transformers to Titans

Practical Exercises – Chapter 1

The following exercises will help you apply what you learned about LLM families, architectures, and scaling laws.

Exercise 1 – Exploring Decoder-Only Models with Hugging Face

Task:

Use the Hugging Face Transformers library to load a decoder-only model (like GPT-2) and generate text. Write a Python script that prompts the model with:

Artificial intelligence will change the world by

and generates a continuation of at least 30 tokens.

Solution:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load GPT-2 (decoder-only model)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

prompt = "Artificial intelligence will change the world by"
inputs = tokenizer(prompt, return_tensors="pt")

# Generate a continuation of at least 30 new tokens
outputs = model.generate(
    inputs["input_ids"],
    attention_mask=inputs["attention_mask"],
    max_new_tokens=30,    # tokens generated after the prompt
    do_sample=True,
    top_k=50,
    top_p=0.95,
    pad_token_id=tokenizer.eos_token_id  # GPT-2 has no pad token; reuse EOS to silence the warning
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))

This demonstrates the decoder-only left-to-right generation process.
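If you want to see that left-to-right process explicitly, here is a minimal greedy-decoding sketch (reusing the tokenizer and model loaded above): it feeds the growing sequence back into the model one token at a time, so each new token is conditioned only on the tokens to its left. This loop is for illustration; in practice you would call generate() as in the solution.

import torch

ids = tokenizer("Artificial intelligence will change the world by",
                return_tensors="pt")["input_ids"]

with torch.no_grad():
    for _ in range(10):                      # generate 10 tokens greedily
        logits = model(ids).logits           # shape (1, seq_len, vocab_size)
        next_id = logits[0, -1].argmax()     # most likely next token
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(ids[0]))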

Exercise 2 – Summarization with an Encoder-Decoder Model

Task:

Use a T5-small model to summarize the following text:

“The Transformer architecture has revolutionized natural language processing by allowing models to process long sequences in parallel using self-attention.”

Solution:

from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

text = "The Transformer architecture has revolutionized natural language processing by allowing models to process long sequences in parallel using self-attention."
inputs = tokenizer("summarize: " + text, return_tensors="pt")

# Generate summary (length_penalty only takes effect with beam search)
summary_ids = model.generate(
    inputs["input_ids"],
    num_beams=4,
    max_length=25,
    min_length=5,
    length_penalty=2.0
)

print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))

This highlights how encoder-decoder models handle input → output transformations.
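To make the two halves visible, the sketch below (assuming the tokenizer, model, and inputs from the solution are still in scope) runs the T5 encoder once over the input and then lets generate() drive the decoder against those cached encoder states. Whether you call generate() directly or like this, the decoder cross-attends to the encoder output at every decoding step.

# Encode the input once; the decoder will cross-attend to these states.
encoder_outputs = model.encoder(input_ids=inputs["input_ids"])
print(encoder_outputs.last_hidden_state.shape)   # (1, input_length, d_model)

# Decode autoregressively against the cached encoder states.
summary_ids = model.generate(encoder_outputs=encoder_outputs, max_length=25)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))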

Exercise 3 – Simulating a Mixture-of-Experts Layer

Task:

Implement a very simple Mixture-of-Experts (MoE) layer in PyTorch where only the top-2 experts are selected for a given input.

Solution:

import torch
import torch.nn as nn

class Expert(nn.Module):
    def __init__(self, hidden_dim):
        super().__init__()
        self.fc = nn.Linear(hidden_dim, hidden_dim)

    def forward(self, x):
        return torch.relu(self.fc(x))

class MoELayer(nn.Module):
    def __init__(self, num_experts=4, hidden_dim=32, k=2):
        super().__init__()
        self.experts = nn.ModuleList([Expert(hidden_dim) for _ in range(num_experts)])
        self.router = nn.Linear(hidden_dim, num_experts)
        self.k = k

    def forward(self, x):
        # Router assigns a probability to each expert for this input
        scores = torch.softmax(self.router(x), dim=-1)
        topk = torch.topk(scores, self.k, dim=-1)
        # Toy routing: assumes a single input (batch size 1); only the top-k experts run
        outputs = []
        for weight, idx in zip(topk.values[0], topk.indices[0]):
            outputs.append(weight * self.experts[idx](x))
        return sum(outputs)

# Example usage
layer = MoELayer(num_experts=4, hidden_dim=32, k=2)
x = torch.randn(1, 32)
print(layer(x).shape)

This shows how an MoE layer routes tokens through specialist sub-networks.
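To see the sparsity directly, you can inspect the router's decision for a given input: only the top-k experts contribute to the output, while the rest are skipped entirely. This inspection snippet assumes the layer and x defined in the example above.

# Which experts did the router pick, and with what weights?
scores = torch.softmax(layer.router(x), dim=-1)
topk = torch.topk(scores, layer.k, dim=-1)
print(scores)        # routing probabilities over all 4 experts
print(topk.indices)  # the 2 experts that actually ran
print(topk.values)   # their mixing weights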

Exercise 4 – Visualizing Scaling Laws

Task:

Simulate scaling laws with a toy function. Plot how model performance increases with more parameters under Kaplan's scaling laws (which emphasize growing model size) versus the Chinchilla finding (which says parameters and training data should grow together).

Solution:

import numpy as np
import matplotlib.pyplot as plt

# Parameters (model size) from 1M to 10B
params = np.logspace(6, 10, 20)
data = params * 20  # Chinchilla's 20x rule

# Fake "performance" functions
performance_kaplan = 1 - 1 / (np.log(params))
performance_chinchilla = 1 - 1 / (np.log(data))

plt.figure(figsize=(8,5))
plt.plot(params, performance_kaplan, label="Kaplan Scaling")
plt.plot(params, performance_chinchilla, label="Chinchilla Scaling", linestyle="--")
plt.xscale("log")
plt.xlabel("Model Parameters (log scale)")
plt.ylabel("Performance (arbitrary units)")
plt.title("Toy Visualization of Scaling Laws")
plt.legend()
plt.show()

You’ll see how Kaplan favors size, while Chinchilla emphasizes data balance.
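As a quick numerical companion to the plot, the Chinchilla rule of thumb used above (roughly 20 training tokens per parameter) implies token budgets like the ones below; these are back-of-the-envelope illustrations, not published training recipes.

# Rough compute-optimal token budgets under the ~20 tokens/parameter rule of thumb
for n_params in [1e9, 7e9, 70e9]:
    print(f"{n_params / 1e9:.0f}B params -> ~{n_params * 20 / 1e9:.0f}B training tokens")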

Exercise 5 – Model Choice for Your Startup

Task (theoretical):

Imagine you are starting an AI-powered customer support startup. You have a limited budget for compute and need an LLM that balances cost, efficiency, and control. Based on what you learned in this chapter, which model family would you pick (GPT, LLaMA, Claude, Gemini, Mistral, or DeepSeek) and why?

Solution (sample answer):

I would choose Mistral or LLaMA, since both provide open weights and can be run locally with quantization. GPT or Claude might be too expensive for continuous usage, while Gemini is closed-source. DeepSeek is interesting but lacks a mature ecosystem. For cost-effective customization, Mistral strikes a great balance between efficiency and performance.
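If you go the open-weights route, the usual first step is loading a quantized checkpoint so it fits on a single GPU. Below is a minimal sketch, assuming the bitsandbytes library is installed and a CUDA GPU is available; the model id is just one example of an open-weight checkpoint.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-v0.1"   # example open-weight checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_4bit=True),  # 4-bit weights
    device_map="auto",                                          # spread layers across available devices
)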

Summary of Learning Goals

By completing these exercises, you have:

  • Generated text with decoder-only models.
  • Summarized text with an encoder-decoder model.
  • Built a toy Mixture-of-Experts layer.
  • Visualized scaling laws in action.
  • Practiced making real-world trade-off decisions about model families.
