Quiz
Questions
Chapter 1: What Are LLMs? From Transformers to Titans
Q1. Which of the following best describes the difference between decoder-only and encoder-decoder transformers?
a) Decoder-only models generate text autoregressively, while encoder-decoder models are designed for sequence-to-sequence tasks.
b) Encoder-decoder models can only handle classification, while decoder-only models are used for all tasks.
c) Decoder-only models require labeled data, encoder-decoder models do not.
d) Both are identical, but use different tokenizers.
Q2. What do scaling laws (Kaplan, Chinchilla) tell us about the relationship between model size, data, and compute?
Chapter 2: Tokenization and Embeddings
Q3. Byte Pair Encoding (BPE), WordPiece, and SentencePiece are examples of:
a) Attention mechanisms
b) Tokenization methods
c) Activation functions
d) Optimizers
Q4. Why is deduplication important in training corpora for embeddings?
Q5. Suppose a tokenizer splits “hyperparameterization” into [“hyper”, “parameter”, “ization”]. What kind of tokenization strategy is this?
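To experiment with splits like this, a minimal sketch such as the following can be used (the gpt2 checkpoint is an assumption; the exact split depends on the tokenizer's learned vocabulary):

from transformers import AutoTokenizer

# Load a pretrained tokenizer; the checkpoint choice here is illustrative only.
tok = AutoTokenizer.from_pretrained("gpt2")

# Print how the word is split into tokens; the result varies by tokenizer and vocabulary.
print(tok.tokenize("hyperparameterization"))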
Chapter 3: Anatomy of an LLM
Q6. What role does multi-head attention play in transformers?
a) It normalizes hidden states.
b) It allows the model to attend to different parts of the input sequence simultaneously.
c) It reduces training cost.
d) It encodes word positions.
Q7. What is the main difference between LayerNorm and RMSNorm?
Q8. In grouped-query attention (GQA), how are queries and key/value projections related?
Chapter 4: Training LLMs from Scratch
Q9. Why is filtering low-quality or toxic data essential before training?
Q10. In PyTorch Distributed Data Parallel (DDP), what happens after each GPU computes its gradients on a mini-batch?
Q11. Which of these techniques reduces GPU memory usage by recomputing activations during backpropagation?
a) Mixed precision training
b) Gradient checkpointing
c) Curriculum learning
d) Early stopping
Q12. Estimate the energy, in kilowatt-hours, used by 4 GPUs running at 350 W each for 10 hours. Show the formula.
Chapter 5: Beyond Text: Multimodal LLMs
Q13. In models like CLIP, how are text and image representations aligned?
Q14. What is the primary advantage of Whisper over earlier ASR systems?
Q15. Why is video understanding more challenging than image understanding in transformers?
Q16. (Code) What does the following code snippet (using CLIP) compute?
inputs = processor(text=["a photo of a dog", "a photo of a cat"], images=image, return_tensors="pt")
outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=1)
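For reference, the snippet above assumes that a CLIP model, its processor, and an input image have already been loaded, roughly as in the following sketch (the checkpoint name and image path are illustrative placeholders):

from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed setup for the Q16 snippet; checkpoint and file name are placeholders.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
image = Image.open("photo.jpg")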

