Under the Hood of Large Language Models

Quiz

Questions

Chapter 1: What Are LLMs? From Transformers to Titans

Q1. Which of the following best describes the difference between decoder-only and encoder-decoder transformers?

a) Decoder-only models generate text autoregressively, while encoder-decoder models are designed for sequence-to-sequence tasks.

b) Encoder-decoder models can only handle classification, while decoder-only models are used for all tasks.

c) Decoder-only models require labeled data, while encoder-decoder models do not.

d) Both are identical, but use different tokenizers.

Q2. What do scaling laws (Kaplan, Chinchilla) tell us about the relationship between model size, data, and compute?

Chapter 2: Tokenization and Embeddings

Q3. Byte Pair Encoding (BPE), WordPiece, and SentencePiece are examples of:

a) Attention mechanisms

b) Tokenization methods

c) Activation functions

d) Optimizers

Q4. Why is deduplication important in training corpora for embeddings?

Q5. Suppose a tokenizer splits “hyperparameterization” into [“hyper”, “parameter”, “ization”]. What kind of tokenization strategy is this?
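
If you want to check your reasoning empirically, you can inspect how a real pretrained tokenizer splits the word. This is only a sketch: the "gpt2" checkpoint is an arbitrary example, and its exact split may differ from the one shown in the question.

# Hypothetical check: see how one pretrained tokenizer splits the word.
# The "gpt2" checkpoint is only an example; other tokenizers will split differently.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("hyperparameterization"))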

Chapter 3: Anatomy of an LLM

Q6. What role does multi-head attention play in transformers?

a) It normalizes hidden states.

b) It allows the model to attend to different parts of the input sequence simultaneously.

c) It reduces training cost.

d) It encodes word positions.

Q7. What is the main difference between LayerNorm and RMSNorm?

Q8. In grouped-query attention (GQA), how are queries and key/value projections related?

Chapter 4: Training LLMs from Scratch

Q9. Why is filtering low-quality or toxic data essential before training?

Q10. In PyTorch Distributed Data Parallel (DDP), what happens after each GPU computes its gradients on a mini-batch?
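
For context when answering this one, here is a minimal single-node DDP sketch. It assumes a launch via torchrun with one process per GPU; the model, data, and hyperparameters are placeholders, not a prescribed recipe.

# Minimal single-node DDP sketch (assumes e.g. `torchrun --nproc_per_node=4 train.py`).
# Model, data, and hyperparameters are placeholders for illustration only.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

model = torch.nn.Linear(1024, 1024).cuda(local_rank)   # placeholder model
ddp_model = DDP(model, device_ids=[local_rank])
optimizer = torch.optim.AdamW(ddp_model.parameters(), lr=1e-4)

for step in range(10):
    batch = torch.randn(32, 1024, device=local_rank)   # placeholder mini-batch
    loss = ddp_model(batch).pow(2).mean()
    loss.backward()   # the per-GPU gradient handling Q10 asks about happens here, in DDP's backward hooks
    optimizer.step()
    optimizer.zero_grad()

dist.destroy_process_group()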

Q11. Which of these techniques reduces GPU memory usage by recomputing activations during backpropagation?

a) Mixed precision training

b) Gradient checkpointing

c) Curriculum learning

d) Early stopping

Q12. Estimate the energy used by 4 GPUs running at 350 W each for 10 hours. Show the formula.
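
One way to sanity-check your estimate is to code the formula directly. The sketch below simply applies energy = power × time and assumes each GPU draws its full rated 350 W for the entire run.

# Back-of-the-envelope check: energy = power x time,
# assuming each GPU draws its full rated 350 W for the whole 10-hour run.
num_gpus = 4
power_kw_per_gpu = 350 / 1000        # 350 W = 0.35 kW
hours = 10

energy_kwh = num_gpus * power_kw_per_gpu * hours
print(f"{energy_kwh:.1f} kWh")       # 4 x 0.35 kW x 10 h = 14.0 kWh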

Chapter 5: Beyond Text: Multimodal LLMs

Q13. In models like CLIP, how are text and image representations aligned?

Q14. What is the primary advantage of Whisper over earlier ASR systems?

Q15. Why is video understanding more challenging than image understanding in transformers?

Q16. (Code) What does the following code snippet (using CLIP) compute?

inputs = processor(text=["a photo of a dog", "a photo of a cat"], images=image, return_tensors="pt")
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
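
For reference, the snippet above assumes a setup roughly like the following. The CLIP checkpoint and image path are illustrative choices, not something the question prescribes.

# Illustrative setup only: the checkpoint name and image path are examples.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("example.jpg")    # any RGB image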
