Under the Hood of Large Language Models

Chapter 2: Tokenization and Embeddings

Chapter 2 Summary

At the heart of every Large Language Model lies a simple but profound challenge: how to turn human language into a form machines can understand. In this chapter, we explored the essential building blocks that make this possible — tokenization and embeddings.

We began with the idea that words cannot simply be split by spaces. Natural language is too rich, too irregular, and too creative for that. Instead, modern models rely on subword tokenization methods such as Byte Pair Encoding (BPE), WordPiece, and SentencePiece. These techniques break text into smaller, reusable units that balance efficiency with flexibility. We saw how BPE iteratively merges frequent character pairs, how WordPiece selects merges by how much they improve the likelihood of the training data, and how SentencePiece generalizes this idea to multilingual and non-segmented languages by working directly on raw text, without assuming whitespace-separated words. Together, these methods allow models to handle everything from “playground” to “hyperparameterization” without losing meaning.
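To make the BPE merge loop concrete, here is a minimal, self-contained sketch in plain Python. The toy corpus, its word frequencies, and the eight merge steps are arbitrary demonstration choices, not a production tokenizer.

```python
from collections import Counter

def pair_counts(vocab):
    """Count adjacent symbol pairs across the corpus, weighted by word frequency."""
    counts = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for pair in zip(symbols, symbols[1:]):
            counts[pair] += freq
    return counts

def apply_merge(pair, vocab):
    """Merge every adjacent occurrence of `pair` into a single symbol."""
    new_vocab = {}
    for word, freq in vocab.items():
        symbols, merged, i = word.split(), [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        new_vocab[" ".join(merged)] = freq
    return new_vocab

# Toy corpus: each word pre-split into characters, mapped to its frequency.
vocab = {"p l a y": 6, "p l a y g r o u n d": 2, "g r o u n d": 3}

for step in range(8):                   # a handful of merges is enough to see the idea
    counts = pair_counts(vocab)
    if not counts:
        break
    best = max(counts, key=counts.get)  # BPE greedily picks the most frequent pair
    vocab = apply_merge(best, vocab)
    print(f"merge {step + 1}: {best}")

print(vocab)  # frequent chunks such as 'play' and 'ground' end up as single symbols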

Next, we explored why a custom tokenizer is sometimes essential. Pretrained tokenizers are good generalists, but in specialized fields — law, medicine, programming — they often fragment critical terms into inefficient pieces. Training your own tokenizer on a domain-specific corpus lets you capture important terms as whole tokens, shorten sequences, and improve downstream performance. We walked through practical examples of building BPE and SentencePiece tokenizers tailored to legal and technical text, showing how even small adjustments in tokenization can yield meaningful improvements.
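As a rough sketch of what such training can look like, the snippet below uses the Hugging Face tokenizers library to train a small BPE vocabulary on a domain corpus. The file name legal_corpus.txt, the vocabulary size, and the special tokens are placeholders to adapt to your own data.

```python
# Minimal sketch: train a domain-specific BPE tokenizer with Hugging Face `tokenizers`.
# "legal_corpus.txt" is a placeholder path; vocab_size is a tuning choice, not a rule.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

trainer = trainers.BpeTrainer(
    vocab_size=8000,
    special_tokens=["[UNK]", "[PAD]", "[CLS]", "[SEP]"],
)
tokenizer.train(["legal_corpus.txt"], trainer=trainer)
tokenizer.save("legal_bpe.json")

# Domain terms that a general-purpose tokenizer would fragment are now
# more likely to survive as one or two tokens.
print(tokenizer.encode("The indemnification clause survives termination.").tokens)
```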

From there, we turned to embeddings — the vector representations that transform tokens into the numbers neural networks rely on. We examined three approaches:

  • Subword embeddings, which dominate modern LLMs, compactly representing frequent chunks of text (sketched in code after this list).
  • Character-level embeddings, which increase flexibility by representing every character directly, useful for morphologically rich languages and domains like code.
  • Multimodal embeddings, which extend beyond text to align words with images, audio, and more, enabling models like CLIP and Gemini to connect language with other forms of information.
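To make the first item concrete, the sketch below loads a pretrained WordPiece tokenizer and model through the transformers library and looks up the subword embedding vectors for a single word. The checkpoint name bert-base-uncased is just one common choice, and the exact token split depends on that checkpoint's vocabulary.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# "bert-base-uncased" is one common checkpoint; any WordPiece model shows the same idea.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("playground", return_tensors="pt")
# Show the subword pieces (plus the [CLS]/[SEP] markers the tokenizer adds).
print(tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist()))

# The input-embedding layer is just a lookup table: token id -> vector.
embedding_layer = model.get_input_embeddings()
with torch.no_grad():
    vectors = embedding_layer(inputs["input_ids"])
print(vectors.shape)  # (1, sequence_length, 768) for this checkpoint
```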

Through examples, we saw embeddings in action: BERT splitting “playground” into meaningful parts, PyTorch embedding characters directly, and CLIP aligning an image of a dog with the caption “a photo of a dog.” These demonstrations highlight how embeddings allow models not just to process language, but to capture meaning in numerical space.
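Here is a minimal sketch of the character-level case in plain PyTorch; the 16-dimensional embedding size and the toy vocabulary built from a single word are demonstration choices only.

```python
import torch
import torch.nn as nn

text = "playground"

# Build a toy character vocabulary from the text itself (demonstration only;
# a real model would use a fixed character inventory).
chars = sorted(set(text))
char_to_id = {ch: i for i, ch in enumerate(chars)}

# One trainable 16-dimensional vector per character.
embedding = nn.Embedding(num_embeddings=len(chars), embedding_dim=16)

ids = torch.tensor([char_to_id[ch] for ch in text])  # shape: (10,)
vectors = embedding(ids)                             # shape: (10, 16)
print(vectors.shape)
```

The multimodal case follows the same lookup-and-project pattern, except that a separate encoder for images or audio produces vectors in the same shared space, which is how CLIP can align a photo of a dog with its caption.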

Ultimately, this chapter showed that tokenization and embeddings are not minor technical details — they are the foundation of LLM intelligence. Without them, transformers would be unable to bridge the gap between human language and machine learning. In the chapters that follow, we build on this foundation to examine the internal anatomy of LLMs themselves: attention, layers, and the architectures that make these models so powerful.
