Quiz
Answers
Q1. a) Decoder-only models generate text autoregressively, while encoder-decoder models are designed for sequence-to-sequence tasks.
Q2. Scaling laws show predictable trade-offs: Kaplan et al. emphasized that bigger models need more compute, while Chinchilla showed that data scaling matters as much as parameter scaling; the optimal balance avoids undertraining and wasted capacity.
Q3. b) Tokenization methods.
Q4. Deduplication prevents models from memorizing repeated documents, reduces overfitting, and improves generalization.
Q5. Subword tokenization.
Q6. b) It allows the model to attend to different parts of the input simultaneously.
Q7. LayerNorm normalizes each feature vector using its mean and variance; RMSNorm skips mean subtraction and rescales only by the root mean square, making it cheaper and sometimes more stable (a NumPy sketch follows the answers).
Q8. Multiple query heads share a smaller set of key/value projections, reducing memory and compute cost (a shape-level sketch follows the answers).
Q9. To avoid harmful, biased, or nonsensical outputs and ensure model safety and reliability.
Q10. Gradients are averaged across all GPUs (typically via an all-reduce) so every replica applies the same update (a toy illustration follows the answers).
Q11. b) Gradient checkpointing.
Q12. Energy (kWh) = Power (W) × GPUs × Hours ÷ 1000 = 350 × 4 × 10 ÷ 1000 = 14 kWh.
Q13. By projecting both modalities into a shared embedding space and maximizing the similarity of matching pairs while minimizing it for mismatched pairs (see the CLIP-style sketch after the answers).
Q14. Whisper was trained on 680k hours of noisy multilingual data, making it robust across languages and real-world conditions.
Q15. Because video adds a temporal dimension, requiring reasoning about motion and events across frames, not just static objects.
Q16. It computes the probability that the image matches each text description (“dog” vs. “cat”), typically via a softmax over the image-text similarity scores.
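A minimal NumPy sketch of the difference described in Q7, assuming a single feature vector and ignoring the learnable scale/bias parameters that real implementations add:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Subtract the mean and divide by the standard deviation across features.
    mu, var = x.mean(), x.var()
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # No mean subtraction: rescale by the root mean square only.
    rms = np.sqrt(np.mean(x ** 2) + eps)
    return x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
print(layer_norm(x))  # zero-mean, unit-variance output
print(rms_norm(x))    # same direction as x, unit RMS
```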
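A shape-level sketch of the key/value sharing behind Q8; the head counts and dimensions are illustrative, not taken from any particular model:

```python
import numpy as np

d_model, n_q_heads, n_kv_heads, d_head = 64, 8, 2, 8
x = np.random.randn(10, d_model)  # 10 tokens

# 8 query heads, but only 2 key/value heads (grouped-query attention).
W_q = np.random.randn(d_model, n_q_heads * d_head)
W_k = np.random.randn(d_model, n_kv_heads * d_head)
W_v = np.random.randn(d_model, n_kv_heads * d_head)

q = (x @ W_q).reshape(10, n_q_heads, d_head)
k = (x @ W_k).reshape(10, n_kv_heads, d_head)
v = (x @ W_v).reshape(10, n_kv_heads, d_head)

# Each group of 4 query heads attends against the same K/V head,
# so the KV cache stores 2 heads instead of 8.
k = np.repeat(k, n_q_heads // n_kv_heads, axis=1)
v = np.repeat(v, n_q_heads // n_kv_heads, axis=1)
print(q.shape, k.shape, v.shape)  # (10, 8, 8) each after repeating K/V
```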
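A toy illustration of the synchronization in Q10, simulating the all-reduce average with plain NumPy rather than a real multi-GPU setup:

```python
import numpy as np

# Pretend each of 4 GPUs computed a gradient on its own mini-batch shard.
per_gpu_grads = [np.random.randn(3) for _ in range(4)]

# Data-parallel training all-reduces these into one averaged gradient...
avg_grad = np.mean(per_gpu_grads, axis=0)

# ...and every replica applies the same update, keeping weights in sync.
weights = np.zeros(3)
lr = 0.1
weights -= lr * avg_grad
print(avg_grad, weights)
```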
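A minimal sketch of the shared-embedding-space idea from Q13 and the zero-shot probabilities from Q16; the embeddings below are random stand-ins, not outputs of a real image or text encoder:

```python
import numpy as np

def normalize(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Stand-in embeddings; a real model would produce these with its encoders.
image_emb = normalize(np.random.randn(512))
text_embs = normalize(np.random.randn(2, 512))  # ["a photo of a dog", "a photo of a cat"]

# Cosine similarity between the image and each caption, scaled by a temperature.
logits = 100.0 * text_embs @ image_emb
logits -= logits.max()                           # for numerical stability
probs = np.exp(logits) / np.exp(logits).sum()    # softmax over the captions
print(dict(zip(["dog", "cat"], probs.round(3))))
```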
