Under the Hood of Large Language Models

Chapter 5: Beyond Text: Multimodal LLMs

Chapter 5 Summary – Beyond Text: Multimodal LLMs

In this chapter, we expanded our perspective beyond text-only systems and explored how large language models are becoming multimodal, integrating vision, speech, and video into their reasoning. Just as humans use multiple senses to understand the world, LLMs are now learning to connect across modalities, bringing us closer to more natural and useful AI.

We began with text+image models like LLaVA, Flamingo, DeepSeek-VL, and GPT-4o. These models combine visual encoders with text transformers, allowing them to interpret images alongside written prompts. By projecting image embeddings into the same space as word tokens, they can caption photos, answer visual questions, and even ground text in visual scenes. Our CLIP example showed how embeddings align text descriptions with image features, demonstrating the power of vision-language fusion.
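
A minimal sketch of that text-image alignment, using the Hugging Face transformers CLIP classes; the checkpoint name, image path, and candidate captions here are illustrative assumptions, not necessarily the chapter's exact example.

```python
# Score how well each candidate caption matches an image with CLIP.
# "photo.jpg" and the caption list are placeholders for illustration.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")
captions = ["a dog playing in the park", "a plate of pasta", "a city skyline at night"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them
# into probabilities over the candidate captions.
probs = outputs.logits_per_image.softmax(dim=-1)
for caption, p in zip(captions, probs[0].tolist()):
    print(f"{p:.3f}  {caption}")
```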

Next, we explored audio and speech integration. With Whisper, we saw how robust transcription systems can convert speech into text across many languages, even in noisy conditions. Moving further, SpeechLM and SpeechGPT illustrated how speech features can be integrated directly with transformers, enabling systems that listen, understand, and respond in spoken dialogue. Through wav2vec2 embeddings, we glimpsed how raw waveforms become structured inputs for language reasoning.
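
To make that pipeline concrete, here is a small sketch of both steps: transcribing a clip with Whisper and extracting wav2vec2 frame embeddings. The checkpoint names and the local file "speech.wav" are assumptions chosen for illustration.

```python
import torch
import torchaudio
from transformers import pipeline, Wav2Vec2Model, Wav2Vec2Processor

# 1) Speech-to-text with Whisper via the ASR pipeline.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
print(asr("speech.wav")["text"])

# 2) Raw waveform -> frame-level embeddings with wav2vec2.
waveform, sr = torchaudio.load("speech.wav")
waveform = torchaudio.functional.resample(waveform, sr, 16_000).mean(dim=0)

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base-960h")

inputs = processor(waveform.numpy(), sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # (1, num_frames, 768)
print(hidden.shape)
```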

From there, we moved into the temporal domain of video. Unlike images, video introduces the dimension of time, requiring models to reason about sequences of frames and events. We looked at VideoGPT, Gemini, and Kosmos-2, which tackle challenges like frame sampling, temporal embeddings, and cross-modal grounding. Using VideoMAE, we demonstrated how frames can be transformed into embeddings for downstream reasoning. Video understanding is essential for real-world AI tasks like summarizing lectures, analyzing security footage, or supporting robotics.
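
A minimal sketch of turning sampled frames into clip-level embeddings with VideoMAE via transformers; the checkpoint name is an assumption, and random frames stand in for 16 frames sampled from a real video.

```python
import numpy as np
import torch
from transformers import VideoMAEImageProcessor, VideoMAEModel

processor = VideoMAEImageProcessor.from_pretrained("MCG-NJU/videomae-base")
model = VideoMAEModel.from_pretrained("MCG-NJU/videomae-base")

# 16 RGB frames of size 224x224 (H, W, C); placeholders for real sampled frames.
frames = [np.random.randint(0, 255, (224, 224, 3), dtype=np.uint8) for _ in range(16)]

inputs = processor(frames, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One embedding per spatio-temporal patch; average them for a clip-level vector.
clip_embedding = outputs.last_hidden_state.mean(dim=1)
print(clip_embedding.shape)  # (1, 768)
```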

Finally, we discussed cross-modal research directions. Real-life interactions rarely involve just one modality: a person might speak while showing slides, or gesture during conversation. True cross-modal reasoning means fusing these signals into a coherent whole. By combining ASR transcription, image captioning, and summarization in a small pipeline, we showed how even today’s tools can approximate this integration.
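
A sketch of that kind of pipeline, built from off-the-shelf transformers pipelines: transcribe the speech, caption the slide, then summarize the combined text. The model names and the input files "talk.wav" and "slide.png" are illustrative assumptions, not the chapter's fixed choices.

```python
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Fuse the two modalities as plain text, then compress into one summary.
speech_text = asr("talk.wav")["text"]
caption = captioner("slide.png")[0]["generated_text"]

combined = f"Spoken content: {speech_text}\nSlide content: {caption}"
summary = summarizer(combined, max_length=60, min_length=15)[0]["summary_text"]
print(summary)
```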

The central lesson of this chapter is that multimodality is not an add-on — it is the future of AI. As models learn to handle words, images, sounds, and video together, they move from being language predictors to becoming general perception-and-reasoning systems. These advances expand their usefulness in accessibility, education, creative work, and human–AI collaboration.
