Natural Language Processing with Python Updated Edition

Chapter 4: Language Modeling

4.1 N-grams

Language modeling is a fundamental task in Natural Language Processing (NLP) that involves predicting the next word or sequence of words in a sentence. It serves as the backbone for many advanced NLP applications such as speech recognition, machine translation, text generation, and more. A good language model can understand the context and semantics of a given text, enabling it to generate coherent and contextually appropriate sentences.

In this chapter, we will explore different techniques for building language models, starting with simpler methods like N-grams and moving towards more complex approaches such as Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory Networks (LSTMs). By the end of this chapter, you will have a solid understanding of how language models work and how to implement them in Python.

We will begin with N-grams, one of the simplest yet most powerful techniques for language modeling.

An N-gram is a contiguous sequence of N items drawn from a given sample of text or speech. In the context of natural language processing (NLP), these items are typically words, and N can be any positive integer, such as 1, 2, or 3.

The concept of N-grams is fundamental in NLP because it allows the analysis and prediction of sequences of words in a structured manner. N-grams are widely used in various NLP tasks, including language modeling, where they help predict the probability of a sequence of words. 

They are also crucial in text generation, enabling the creation of coherent and contextually appropriate sentences. Additionally, N-grams play a significant role in machine translation, assisting in the accurate translation of text from one language to another by considering the context provided by contiguous word sequences.

Their simplicity and effectiveness make N-grams a powerful tool in the field of NLP, contributing to advancements in understanding and processing human language.

4.1.1 Understanding N-grams

An N-gram is a sequence of N words that appears in a text. This concept is fundamental in the field of Natural Language Processing (NLP) and helps in understanding the structure and meaning of language. Here are some examples of different types of N-grams:

  • Unigram (1-gram): "Natural"
  • Bigram (2-gram): "Natural Language"
  • Trigram (3-gram): "Natural Language Processing"

N-grams capture local word dependencies by considering a fixed window of N words. The choice of N determines the size of the window and the amount of context captured. For instance, unigrams capture individual word frequencies, providing insight into the most common words in a text.

Bigrams, on the other hand, capture relationships between pairs of adjacent words, revealing common word combinations. Trigrams and higher-order N-grams capture even more complex relationships, showing how sequences of three or more words are used together in context.

The use of N-grams is crucial in various applications such as text prediction, machine translation, and speech recognition. By analyzing N-grams, one can better understand the syntactic and semantic properties of text, which is essential for creating more accurate and efficient language models. Therefore, understanding and utilizing N-grams can significantly enhance the performance of NLP tasks.

4.1.2 Generating N-grams in Python

Let's see how to generate N-grams in Python using a sample text.

Example: Generating N-grams

from nltk import ngrams
import nltk
nltk.download('punkt')

# Sample text
text = "Natural Language Processing is a fascinating field of study."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Function to generate N-grams
def generate_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    return [' '.join(grams) for grams in n_grams]

# Generate unigrams, bigrams, and trigrams
unigrams = generate_ngrams(tokens, 1)
bigrams = generate_ngrams(tokens, 2)
trigrams = generate_ngrams(tokens, 3)

print("Unigrams:")
print(unigrams)
print("\\nBigrams:")
print(bigrams)
print("\\nTrigrams:")
print(trigrams)

This example code demonstrates how to generate unigrams, bigrams, and trigrams using the Natural Language Toolkit (nltk) library. Here’s a step-by-step explanation of the code:

  1. Import Libraries:
    from nltk import ngrams
    import nltk
    nltk.download('punkt')
    • The ngrams function from nltk is used to generate N-grams.
    • The nltk.download('punkt') line downloads the Punkt tokenizer models, which nltk.word_tokenize requires to split the text into words.
  2. Sample Text:
    text = "Natural Language Processing is a fascinating field of study."

    A sample sentence is defined to demonstrate the generation of N-grams.

  3. Tokenize the Text:
    tokens = nltk.word_tokenize(text)

    The text is tokenized into individual words using nltk.word_tokenize.

  4. Function to Generate N-grams:
    def generate_ngrams(tokens, n):
        n_grams = ngrams(tokens, n)
        return [' '.join(grams) for grams in n_grams]
    • A function generate_ngrams is defined that takes a list of tokens and an integer n representing the N-gram size.
    • The ngrams function generates N-grams from the list of tokens.
    • The function returns a list of N-grams joined by spaces.
  5. Generate Unigrams, Bigrams, and Trigrams:
    unigrams = generate_ngrams(tokens, 1)
    bigrams = generate_ngrams(tokens, 2)
    trigrams = generate_ngrams(tokens, 3)

    The generate_ngrams function is called with n values of 1, 2, and 3 to generate unigrams, bigrams, and trigrams, respectively.

  6. Print the N-grams:
    print("Unigrams:")
    print(unigrams)
    print("\\\\nBigrams:")
    print(bigrams)
    print("\\\\nTrigrams:")
    print(trigrams)

    The generated unigrams, bigrams, and trigrams are printed to the console.

Example Output:

Unigrams:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']

Bigrams:
['Natural Language', 'Language Processing', 'Processing is', 'is a', 'a fascinating', 'fascinating field', 'field of', 'of study', 'study .']

Trigrams:
['Natural Language Processing', 'Language Processing is', 'Processing is a', 'is a fascinating', 'a fascinating field', 'fascinating field of', 'field of study', 'of study .']

In this example, we use the ngrams function from the nltk library to generate unigrams, bigrams, and trigrams from the sample text. The function takes a list of tokens and the value of N as input and returns a list of N-grams.
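
In practice, once you have generated N-grams you usually also want to know how often each one occurs, since these frequency counts are the raw material for the language models discussed next. The short sketch below is one way to do this with Counter from the collections module; it assumes the generate_ngrams function and the tokens list from the example above are still in scope.

from collections import Counter

# Count how often each bigram occurs in the tokenized text
bigram_frequencies = Counter(generate_ngrams(tokens, 2))

# Show the three most frequent bigrams
# (with a single sample sentence, every bigram occurs exactly once)
print(bigram_frequencies.most_common(3))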

4.1.3 N-gram Language Models

N-gram language models are statistical models used in computational linguistics to predict the next item in a sequence, such as the next word in a sentence, based on the previous N-1 items. The primary goal of an N-gram model is to estimate the probability of a word given the preceding words in the sequence.

Formally, the probability of a word sequence \( w_1, w_2, ..., w_T \) can be decomposed using the chain rule of probability:

\[
P(w_1, w_2, ..., w_T) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_T|w_1, ..., w_{T-1})
\]

An N-gram model approximates each conditional probability using only the previous N-1 words, so this simplifies to:

\[
P(w_1, w_2, ..., w_T) \approx \prod_{i=1}^{T} P(w_i|w_{i-N+1}, ..., w_{i-1})
\]


In essence, the model breaks down the probability of a sequence of words into the product of conditional probabilities of each word given the previous N-1 words. This simplification allows the model to be more manageable and computationally feasible.
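
To make the approximation concrete, consider a bigram model (N=2) applied to the three-word sequence "natural language processing". The product above reduces to:

\[
P(\text{natural language processing}) \approx P(\text{natural}) P(\text{language}|\text{natural}) P(\text{processing}|\text{language})
\]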

Example: Bigram Model

A bigram model (where N=2) considers the probability of a word given the previous word. For example, the probability of the word "processing" given the word "language" in the sequence "natural language processing" can be written as:


\[
P(\text{processing}|\text{language})
\]


To calculate these probabilities, the model needs to be trained on a large corpus of text, where it counts the occurrences of pairs of words (bigrams) and normalizes these counts to obtain probabilities. For instance, if the bigram "language processing" appears 50 times in a corpus and the word "language" appears 200 times, the probability of "processing" given "language" would be:


\[
P(\text{processing}|\text{language}) = \frac{\text{Count}(\text{language processing})}{\text{Count}(\text{language})} = \frac{50}{200} = 0.25
\]


This means that in the given corpus, the word "processing" follows the word "language" 25% of the time.
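
More generally, this count-and-normalize estimate can be written as:

\[
P(w_i|w_{i-1}) = \frac{\text{Count}(w_{i-1}, w_i)}{\text{Count}(w_{i-1})}
\]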

Challenges with N-gram Models

While N-gram models are simple and effective, they have several limitations:

  • Sparsity: As N increases, the number of possible N-grams grows exponentially, leading to data sparsity. This means that many N-grams may not appear in the training corpus, making it difficult to estimate their probabilities accurately. For instance, if you have a trigram model (N=3), the number of possible trigrams can be extremely large, and many of these trigrams might never occur in your training data, making it challenging to provide reliable probability estimates.
  • Context Limitation: N-gram models capture only a fixed window of context (N-1 words), which can be insufficient for capturing long-range dependencies in language. For example, if you are using a bigram model, it only considers the previous word to predict the next one, which might not be enough to understand the full context of a sentence, especially in complex or lengthy texts.
  • Memory Usage: High-order N-gram models require significant memory to store the probabilities of all possible N-grams. The higher the value of N, the more memory is needed to store these probabilities, which can become a substantial computational burden. For example, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be impractical for large vocabularies.

Despite these limitations, N-gram models are useful for many natural language processing (NLP) tasks and serve as a foundation for more advanced language modeling techniques. They are often used in speech recognition, text prediction, and other areas where understanding the probability of word sequences is crucial.

Furthermore, N-gram models have laid the groundwork for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language.

Applications of N-gram Models

N-gram models are widely used in various Natural Language Processing (NLP) tasks due to their ability to capture and utilize the statistical properties of word sequences. These models are employed in several key applications, including:

  • Text Prediction: N-gram models are instrumental in predicting the next word in a sequence. This capability is often leveraged in predictive text input on mobile devices, where the model suggests possible words to complete the user's input based on the context of previous words. This feature enhances typing efficiency and accuracy.
  • Speech Recognition: In the realm of speech-to-text systems, N-gram models significantly improve the accuracy of transcriptions. By predicting the most likely word sequences, these models help in filtering out improbable word combinations, thereby refining the output of speech recognition software and making it more reliable.
  • Machine Translation: When translating text from one language to another, N-gram models play a crucial role by considering the context provided by contiguous word sequences. This contextual understanding helps in producing translations that are not only accurate but also contextually appropriate, ensuring that the meaning of the original text is preserved.
  • Text Generation: N-gram models are also used for generating coherent and contextually appropriate sentences. This is particularly useful in applications like chatbots and automated content creation, where the ability to produce natural-sounding language is essential. By analyzing patterns in large corpora of text, N-gram models can construct sentences that mimic human language usage, thereby enhancing the user experience.

Overall, the versatility and effectiveness of N-gram models make them a fundamental component in the toolkit of NLP technologies.

In conclusion, N-gram language models are a fundamental tool in NLP that help in understanding and predicting word sequences based on the context provided by previous words. While they have certain limitations, their simplicity and effectiveness make them a valuable starting point for more complex language modeling techniques.

4.1.4 Training an N-gram Language Model

To train an N-gram language model, we need to calculate the probabilities of N-grams from a training corpus. This involves counting the occurrences of N-grams and normalizing these counts to obtain probabilities.

Training an N-gram language model involves a series of steps to calculate the probabilities of N-grams from a training corpus. Here's a detailed explanation of the process:

  1. Tokenizing the Corpus:
    The first step is to tokenize the training corpus into individual words or tokens. This involves splitting the text into words, which will be used to form N-grams.
    import nltk
    nltk.download('punkt')

    # Sample text corpus
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    # Tokenize the text into words
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]
  2. Generating N-grams:
    Once the text is tokenized, the next step is to generate N-grams from these tokens. N-grams are contiguous sequences of N items from the text.
    from nltk.util import ngrams

    # Example of generating bigrams (N=2) for each sentence;
    # list() materializes each generator so the bigrams can be inspected and reused
    bigrams = [list(ngrams(sentence, 2)) for sentence in tokenized_corpus]
  3. Counting N-gram Occurrences:
    The core of training an N-gram model is counting the occurrences of each N-gram in the corpus. This involves iterating through the tokenized text and recording how often each N-gram appears.
    from collections import defaultdict

    def count_ngrams(tokenized_corpus, n):
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in tokenized_corpus:
            for ngram in ngrams(sentence, n):
                counts[ngram[:-1]][ngram[-1]] += 1
        return counts

    # Count bigrams
    bigram_counts = count_ngrams(tokenized_corpus, 2)
  4. Calculating Probabilities:
    After counting the N-grams, the next step is to calculate their probabilities. This is done by normalizing the counts, which means dividing the count of each N-gram by the total count of N-grams that share the same prefix (context).
    def calculate_probabilities(counts):
        probabilities = defaultdict(dict)
        for context in counts:
            total_count = float(sum(counts[context].values()))
            for word in counts[context]:
                probabilities[context][word] = counts[context][word] / total_count
        return probabilities

    # Calculate bigram probabilities
    bigram_probabilities = calculate_probabilities(bigram_counts)
  5. Using the Model:
    With the N-gram probabilities calculated, the model can now predict the likelihood of a word following a given context. This is useful for tasks such as text generation and speech recognition (a small text-generation sketch follows the summary below).
    def get_ngram_probability(model, context, word):
        return model[context].get(word, 0)

    # Example: Get probability of "NLP" following "for"
    probability = get_ngram_probability(bigram_probabilities, ('for',), 'NLP')
    print("Bigram Probability (NLP | for):", probability)

In summary, training an N-gram language model involves tokenizing the text, generating N-grams, counting their occurrences, and calculating probabilities. This model can then be used to predict the likelihood of subsequent words in a sequence, aiding various NLP tasks.
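
As a quick illustration of the text-generation use mentioned in step 5, the sketch below samples a short continuation from the bigram probabilities computed above. It assumes bigram_probabilities from the previous steps is in scope; the seed word and length are arbitrary choices for this example, and the output will vary between runs because continuations are sampled at random.

import random

# Sample a continuation of a seed word using the estimated bigram probabilities
def generate_text(probabilities, seed, length=5):
    words = [seed]
    for _ in range(length):
        context = (words[-1],)          # a bigram context is a 1-word tuple
        candidates = probabilities.get(context)
        if not candidates:
            break                       # no known continuation for this context
        next_words = list(candidates.keys())
        weights = list(candidates.values())
        words.append(random.choices(next_words, weights=weights)[0])
    return ' '.join(words)

print(generate_text(bigram_probabilities, 'Language'))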

Example: Training a Bigram Language Model

from collections import defaultdict
import nltk
from nltk.util import ngrams
nltk.download('punkt')

# Sample text corpus
corpus = [
    "Natural Language Processing is a fascinating field of study.",
    "Machine learning and NLP are closely related.",
    "Language models are essential for NLP tasks."
]

# Tokenize the text into words
tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

# Function to calculate bigram probabilities
def train_bigram_model(tokenized_corpus):
    model = defaultdict(lambda: defaultdict(lambda: 0))

    # Count bigrams
    for sentence in tokenized_corpus:
        for w1, w2 in ngrams(sentence, 2):
            model[w1][w2] += 1

    # Calculate probabilities
    for w1 in model:
        total_count = float(sum(model[w1].values()))
        for w2 in model[w1]:
            model[w1][w2] /= total_count

    return model

# Train the bigram model
bigram_model = train_bigram_model(tokenized_corpus)

# Function to get the probability of a bigram
def get_bigram_probability(bigram_model, w1, w2):
    return bigram_model[w1][w2]

print("Bigram Probability (NLP | for):")
print(get_bigram_probability(bigram_model, 'for', 'NLP'))

This example code demonstrates how to build a bigram language model using a sample text corpus. Here is a detailed explanation of each part of the code:

Step-by-Step Explanation

  1. Import Required Libraries:
    from collections import defaultdict
    import nltk
    from nltk.util import ngrams
    nltk.download('punkt')
    • defaultdict from the collections module is used to create a nested dictionary that will store the bigram counts and probabilities.
    • The nltk library is used for natural language processing tasks, including tokenization.
    • ngrams from nltk.util helps generate N-grams from the tokenized text.
  2. Sample Text Corpus:
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    A sample text corpus consisting of three sentences is defined. This corpus will be used to train the bigram model.

  3. Tokenize the Text:
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

    The nltk.word_tokenize function tokenizes each sentence in the corpus into individual words. The result is a list of tokenized sentences.

  4. Function to Train the Bigram Model:
    def train_bigram_model(tokenized_corpus):
        model = defaultdict(lambda: defaultdict(lambda: 0))

        # Count bigrams
        for sentence in tokenized_corpus:
            for w1, w2 in ngrams(sentence, 2):
                model[w1][w2] += 1

        # Calculate probabilities
        for w1 in model:
            total_count = float(sum(model[w1].values()))
            for w2 in model[w1]:
                model[w1][w2] /= total_count

        return model
    • This function trains the bigram model by counting the occurrences of word pairs (bigrams) and then calculating their probabilities.
    • A nested defaultdict is used to store the counts and probabilities of bigrams.
    • The function first counts the occurrences of each bigram in the tokenized corpus.
    • It then calculates the probability of each bigram by normalizing the counts. The probability of a bigram is the count of the bigram divided by the total count of all bigrams starting with the same first word.
  5. Train the Bigram Model:
    bigram_model = train_bigram_model(tokenized_corpus)

    The train_bigram_model function is called with the tokenized corpus to train the bigram model. The resulting model is stored in the bigram_model variable.

  6. Function to Get the Probability of a Bigram:
    def get_bigram_probability(bigram_model, w1, w2):
        return bigram_model[w1][w2]

    This function retrieves the probability of a given bigram from the trained model. It takes the model and the two words forming the bigram as input and returns the probability.

  7. Print the Bigram Probability:
    print("Bigram Probability (NLP | for):")
    print(get_bigram_probability(bigram_model, 'for', 'NLP'))

    The probability of the word "NLP" following the word "for" (the bigram "for NLP") is printed using the get_bigram_probability function.

Example Output

Bigram Probability (NLP | for):
1.0

When you run the code, the output shows the probability of the word "NLP" following the word "for" according to the trained bigram model. In this tiny corpus the word "for" occurs only once and is always followed by "NLP", so the estimated probability is 1.0. With such a small sample the value is not very informative, but it demonstrates the process of calculating bigram probabilities.

This example illustrates the fundamental steps involved in building a bigram language model: tokenizing text, counting bigrams, calculating their probabilities, and retrieving probabilities from the model. Despite the simplicity of this model, it serves as a foundation for understanding more complex language modeling techniques. Bigrams capture local word dependencies and provide a statistical basis for predicting the next word in a sequence based on the current word. 
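
To close the loop on that point, here is a minimal sketch of next-word prediction using the bigram_model trained above: for a given word, it returns the following word with the highest estimated probability. The helper name predict_next_word is our own and is not part of nltk.

# Return the most probable next word for a given word under the trained bigram model
def predict_next_word(bigram_model, word):
    candidates = bigram_model.get(word)
    if not candidates:
        return None   # the word never appeared as a bigram context during training
    return max(candidates, key=candidates.get)

print(predict_next_word(bigram_model, 'for'))        # 'NLP'
print(predict_next_word(bigram_model, 'Language'))   # 'Processing' or 'models' (both 0.5)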

4.1.5 Limitations of N-gram Models

While N-gram models are simple and effective, they come with several significant limitations:

  1. Sparsity Issues:
    • As the value of N increases, the number of possible N-grams grows exponentially: with a vocabulary of V words there are V^3 possible trigrams, and most of them never occur even in a large training corpus. This data sparsity makes it difficult to estimate probabilities accurately, leading to unreliable predictions for unseen sequences (one standard mitigation, add-one smoothing, is sketched after this list).
  2. Context Limitation:
    • N-gram models capture only a fixed window of context, specifically the previous N-1 words. This fixed window can be insufficient for capturing long-range dependencies in the text. For example, in a bigram model, only the immediately preceding word is considered when predicting the next word. This can be a significant limitation in understanding the full context of a sentence, especially in complex or lengthy texts where important information may be spread out over several words or sentences.
  3. Memory Usage:
    • Higher-order N-gram models require significant memory to store the probabilities of all possible N-grams. The larger the value of N, the more memory is needed to store these probabilities. For instance, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be computationally expensive and impractical for large vocabularies.
  4. Lack of Generalization:
    • N-gram models can struggle to generalize well to unseen data. They rely heavily on the specific sequences observed in the training data, meaning they may not perform well on new or slightly different sequences. This lack of generalization can limit their effectiveness in real-world applications where language use is highly variable and context-dependent.
  5. Handling of Out-of-Vocabulary Words:
    • N-gram models have difficulty handling out-of-vocabulary (OOV) words—words that were not seen during training. This can result in poor performance when the model encounters new words or phrases, as it has no prior knowledge or probabilities associated with them.
  6. Inability to Capture Semantic Meaning:
    • N-gram models operate purely on the basis of word sequences and do not capture the underlying semantic meaning of the words. They treat words as independent tokens without understanding their meanings or relationships, which can limit their ability to perform tasks that require a deeper understanding of language.
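
The sparsity and out-of-vocabulary issues above are usually mitigated with smoothing. The sketch below applies add-one (Laplace) smoothing to the bigram_counts and tokenized_corpus from Section 4.1.4; it is only a minimal illustration under those assumptions, and other schemes such as Good-Turing and Kneser-Ney smoothing are common in practice.

# Add-one (Laplace) smoothing: every possible next word receives a pseudo-count of 1,
# so bigrams that never occurred in training get a small non-zero probability.
def laplace_bigram_probability(counts, context, word, vocab_size):
    context_counts = counts.get(context, {})
    total = sum(context_counts.values())
    return (context_counts.get(word, 0) + 1) / (total + vocab_size)

# Vocabulary size estimated from the training corpus
vocab = {w for sentence in tokenized_corpus for w in sentence}

# An unseen bigram such as ("for", "study") now has a small non-zero probability
print(laplace_bigram_probability(bigram_counts, ('for',), 'study', len(vocab)))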

Despite these limitations, N-gram models are still valuable for many NLP tasks and serve as a foundation for more advanced language modeling techniques. They are often used in applications such as text prediction, speech recognition, and machine translation, where understanding the probability of word sequences is crucial.

N-gram models have also paved the way for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language. These advanced models leverage large-scale datasets and powerful computational resources to achieve higher accuracy and better generalization in various NLP tasks.

While N-gram models have their drawbacks, their simplicity and foundational role in language modeling make them an essential starting point for understanding and developing more complex NLP techniques.

4.1 N-grams

Language modeling is a fundamental task in Natural Language Processing (NLP) that involves predicting the next word or sequence of words in a sentence. It serves as the backbone for many advanced NLP applications such as speech recognition, machine translation, text generation, and more. A good language model can understand the context and semantics of a given text, enabling it to generate coherent and contextually appropriate sentences.

In this chapter, we will explore different techniques for building language models, starting with simpler methods like N-grams and moving towards more complex approaches such as Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory Networks (LSTMs). By the end of this chapter, you will have a solid understanding of how language models work and how to implement them in Python.

We will begin with N-grams, one of the simplest yet powerful techniques for language modeling.

N-grams are a contiguous sequence of N items derived from a given sample of text or speech. In the context of natural language processing (NLP), these items are typically words, and N can be any integer, such as 1, 2, 3, or more.

The concept of N-grams is fundamental in NLP because it allows the analysis and prediction of sequences of words in a structured manner. N-grams are widely used in various NLP tasks, including language modeling, where they help predict the probability of a sequence of words. 

They are also crucial in text generation, enabling the creation of coherent and contextually appropriate sentences. Additionally, N-grams play a significant role in machine translation, assisting in the accurate translation of text from one language to another by considering the context provided by contiguous word sequences.

Their simplicity and effectiveness make N-grams a powerful tool in the field of NLP, contributing to advancements in understanding and processing human language.

4.1.1 Understanding N-grams

An N-gram is a sequence of N words that appears in a text. This concept is fundamental in the field of Natural Language Processing (NLP) and helps in understanding the structure and meaning of language. Here are some examples of different types of N-grams:

  • Unigram (1-gram): "Natural"
  • Bigram (2-gram): "Natural Language"
  • Trigram (3-gram): "Natural Language Processing"

N-grams capture local word dependencies by considering a fixed window of N words. The choice of N determines the size of the window and the amount of context captured. For instance, unigrams capture individual word frequencies, providing insight into the most common words in a text.

Bigrams, on the other hand, capture relationships between pairs of adjacent words, revealing common word combinations. Trigrams and higher-order N-grams capture even more complex relationships, showing how sequences of three or more words are used together in context.

The use of N-grams is crucial in various applications such as text prediction, machine translation, and speech recognition. By analyzing N-grams, one can better understand the syntactic and semantic properties of text, which is essential for creating more accurate and efficient language models. Therefore, understanding and utilizing N-grams can significantly enhance the performance of NLP tasks.

4.1.2 Generating N-grams in Python

Let's see how to generate N-grams in Python using a sample text.

Example: Generating N-grams

from nltk import ngrams
from collections import Counter
import nltk
nltk.download('punkt')

# Sample text
text = "Natural Language Processing is a fascinating field of study."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Function to generate N-grams
def generate_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    return [' '.join(grams) for grams in n_grams]

# Generate unigrams, bigrams, and trigrams
unigrams = generate_ngrams(tokens, 1)
bigrams = generate_ngrams(tokens, 2)
trigrams = generate_ngrams(tokens, 3)

print("Unigrams:")
print(unigrams)
print("\\nBigrams:")
print(bigrams)
print("\\nTrigrams:")
print(trigrams)

This example code demonstrates how to generate unigrams, bigrams, and trigrams using the Natural Language Toolkit (nltk) library. Here’s a step-by-step explanation of the code:

  1. Import Libraries:
    from nltk import ngrams
    from collections import Counter
    import nltk
    nltk.download('punkt')
    • The ngrams function from nltk is used to generate N-grams.
    • The Counter from collections is imported but not used in this specific code.
    • The nltk.download('punkt') line ensures that the Punkt tokenizer models are downloaded, which are necessary for tokenizing the text into words.
  2. Sample Text:
    text = "Natural Language Processing is a fascinating field of study."

    A sample sentence is defined to demonstrate the generation of N-grams.

  3. Tokenize the Text:
    tokens = nltk.word_tokenize(text)

    The text is tokenized into individual words using nltk.word_tokenize.

  4. Function to Generate N-grams:
    def generate_ngrams(tokens, n):
        n_grams = ngrams(tokens, n)
        return [' '.join(grams) for grams in n_grams]
    • A function generate_ngrams is defined that takes a list of tokens and an integer n representing the N-gram size.
    • The ngrams function generates N-grams from the list of tokens.
    • The function returns a list of N-grams joined by spaces.
  5. Generate Unigrams, Bigrams, and Trigrams:
    unigrams = generate_ngrams(tokens, 1)
    bigrams = generate_ngrams(tokens, 2)
    trigrams = generate_ngrams(tokens, 3)

    The generate_ngrams function is called with n values of 1, 2, and 3 to generate unigrams, bigrams, and trigrams, respectively.

  6. Print the N-grams:
    print("Unigrams:")
    print(unigrams)
    print("\\\\nBigrams:")
    print(bigrams)
    print("\\\\nTrigrams:")
    print(trigrams)

    The generated unigrams, bigrams, and trigrams are printed to the console.

Example Output:

Unigrams:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']

Bigrams:
['Natural Language', 'Language Processing', 'Processing is', 'is a', 'a fascinating', 'fascinating field', 'field of', 'of study', 'study .']

Trigrams:
['Natural Language Processing', 'Language Processing is', 'Processing is a', 'is a fascinating', 'a fascinating field', 'fascinating field of', 'field of study', 'of study .']

In this example, we use the ngrams function from the nltk library to generate unigrams, bigrams, and trigrams from the sample text. The function takes a list of tokens and the value of N as input and returns a list of N-grams.

4.1.3 N-gram Language Models

N-gram language models are statistical models used in computational linguistics to predict the next item in a sequence, such as the next word in a sentence, based on the previous N-1 items. The primary goal of an N-gram model is to estimate the probability of a word given the preceding words in the sequence.

An N-gram language model estimates the probability of a word given the previous N-1 words. This is useful for predicting the next word in a sequence. The probability of a word sequence ( w_1, w_2, ..., w_T ) is given by:


[
P(w_1, w_2, ..., w_T) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_T|w_1, ..., w_{T-1})
]


For an N-gram model, this can be simplified to:


[
P(w_1, w_2, ..., w_T) \approx \prod_{i=1}^{T} P(w_i|w_{i-N+1}, ..., w_{i-1})
]


In essence, the model breaks down the probability of a sequence of words into the product of conditional probabilities of each word given the previous N-1 words. This simplification allows the model to be more manageable and computationally feasible.

Example: Bigram Model

A bigram model (where N=2) considers the probability of a word given the previous word. For example, the probability of the word "processing" given the word "language" in the sequence "natural language processing" can be written as:


[
P(\text{processing}|\text{language})
]


To calculate these probabilities, the model needs to be trained on a large corpus of text, where it counts the occurrences of pairs of words (bigrams) and normalizes these counts to obtain probabilities. For instance, if the bigram "language processing" appears 50 times in a corpus and the word "language" appears 200 times, the probability of "processing" given "language" would be:


[
P(\text{processing}|\text{language}) = \frac{\text{Count}(\text{language processing})}{\text{Count}(\text{language})} = \frac{50}{200} = 0.25
]


This means that in the given corpus, the word "processing" follows the word "language" 25% of the time.

Challenges with N-gram Models

While N-gram models are simple and effective, they have several limitations:

  • Sparsity: As N increases, the number of possible N-grams grows exponentially, leading to data sparsity. This means that many N-grams may not appear in the training corpus, making it difficult to estimate their probabilities accurately. For instance, if you have a trigram model (N=3), the number of possible trigrams can be extremely large, and many of these trigrams might never occur in your training data, making it challenging to provide reliable probability estimates.
  • Context Limitation: N-gram models capture only a fixed window of context (N-1 words), which can be insufficient for capturing long-range dependencies in language. For example, if you are using a bigram model, it only considers the previous word to predict the next one, which might not be enough to understand the full context of a sentence, especially in complex or lengthy texts.
  • Memory Usage: High-order N-gram models require significant memory to store the probabilities of all possible N-grams. The higher the value of N, the more memory is needed to store these probabilities, which can become a substantial computational burden. For example, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be impractical for large vocabularies.

Despite these limitations, N-gram models are useful for many natural language processing (NLP) tasks and serve as a foundation for more advanced language modeling techniques. They are often used in speech recognition, text prediction, and other areas where understanding the probability of word sequences is crucial.

Furthermore, N-gram models have laid the groundwork for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language.

Applications of N-gram Models

N-gram models are widely used in various Natural Language Processing (NLP) tasks due to their ability to capture and utilize the statistical properties of word sequences. These models are employed in several key applications, including:

  • Text Prediction: N-gram models are instrumental in predicting the next word in a sequence. This capability is often leveraged in predictive text input on mobile devices, where the model suggests possible words to complete the user's input based on the context of previous words. This feature enhances typing efficiency and accuracy.
  • Speech Recognition: In the realm of speech-to-text systems, N-gram models significantly improve the accuracy of transcriptions. By predicting the most likely word sequences, these models help in filtering out improbable word combinations, thereby refining the output of speech recognition software and making it more reliable.
  • Machine Translation: When translating text from one language to another, N-gram models play a crucial role by considering the context provided by contiguous word sequences. This contextual understanding helps in producing translations that are not only accurate but also contextually appropriate, ensuring that the meaning of the original text is preserved.
  • Text Generation: N-gram models are also used for generating coherent and contextually appropriate sentences. This is particularly useful in applications like chatbots and automated content creation, where the ability to produce natural-sounding language is essential. By analyzing patterns in large corpora of text, N-gram models can construct sentences that mimic human language usage, thereby enhancing the user experience.

Overall, the versatility and effectiveness of N-gram models make them a fundamental component in the toolkit of NLP technologies.

In conclusion, N-gram language models are a fundamental tool in NLP that help in understanding and predicting word sequences based on the context provided by previous words. While they have certain limitations, their simplicity and effectiveness make them a valuable starting point for more complex language modeling techniques.

4.1.4 Training an N-gram Language Model

To train an N-gram language model, we need to calculate the probabilities of N-grams from a training corpus. This involves counting the occurrences of N-grams and normalizing these counts to obtain probabilities.

Training an N-gram language model involves a series of steps to calculate the probabilities of N-grams from a training corpus. Here's a detailed explanation of the process:

  1. Tokenizing the Corpus:
    The first step is to tokenize the training corpus into individual words or tokens. This involves splitting the text into words, which will be used to form N-grams.
    import nltk
    nltk.download('punkt')

    # Sample text corpus
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    # Tokenize the text into words
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]
  2. Generating N-grams:
    Once the text is tokenized, the next step is to generate N-grams from these tokens. N-grams are contiguous sequences of N items from the text.
    from nltk.util import ngrams

    # Example of generating bigrams (N=2)
    bigrams = [ngrams(sentence, 2) for sentence in tokenized_corpus]
  3. Counting N-gram Occurrences:
    The core of training an N-gram model is counting the occurrences of each N-gram in the corpus. This involves iterating through the tokenized text and recording how often each N-gram appears.
    from collections import defaultdict

    def count_ngrams(tokenized_corpus, n):
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in tokenized_corpus:
            for ngram in ngrams(sentence, n):
                counts[ngram[:-1]][ngram[-1]] += 1
        return counts

    # Count bigrams
    bigram_counts = count_ngrams(tokenized_corpus, 2)
  4. Calculating Probabilities:
    After counting the N-grams, the next step is to calculate their probabilities. This is done by normalizing the counts, which means dividing the count of each N-gram by the total count of N-grams that share the same prefix (context).
    def calculate_probabilities(counts):
        probabilities = defaultdict(dict)
        for context in counts:
            total_count = float(sum(counts[context].values()))
            for word in counts[context]:
                probabilities[context][word] = counts[context][word] / total_count
        return probabilities

    # Calculate bigram probabilities
    bigram_probabilities = calculate_probabilities(bigram_counts)
  5. Using the Model:
    With the N-gram probabilities calculated, the model can now predict the likelihood of a word following a given context. This is useful for tasks such as text generation and speech recognition.
    def get_ngram_probability(model, context, word):
        return model[context].get(word, 0)

    # Example: Get probability of "NLP" following "for"
    probability = get_ngram_probability(bigram_probabilities, ('for',), 'NLP')
    print("Bigram Probability (NLP | for):", probability)

In summary, training an N-gram language model involves tokenizing the text, generating N-grams, counting their occurrences, and calculating probabilities. This model can then be used to predict the likelihood of subsequent words in a sequence, aiding various NLP tasks.

Example: Training a Bigram Language Model

from collections import defaultdict

# Sample text corpus
corpus = [
    "Natural Language Processing is a fascinating field of study.",
    "Machine learning and NLP are closely related.",
    "Language models are essential for NLP tasks."
]

# Tokenize the text into words
tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

# Function to calculate bigram probabilities
def train_bigram_model(tokenized_corpus):
    model = defaultdict(lambda: defaultdict(lambda: 0))

    # Count bigrams
    for sentence in tokenized_corpus:
        for w1, w2 in ngrams(sentence, 2):
            model[w1][w2] += 1

    # Calculate probabilities
    for w1 in model:
        total_count = float(sum(model[w1].values()))
        for w2 in model[w1]:
            model[w1][w2] /= total_count

    return model

# Train the bigram model
bigram_model = train_bigram_model(tokenized_corpus)

# Function to get the probability of a bigram
def get_bigram_probability(bigram_model, w1, w2):
    return bigram_model[w1][w2]

print("Bigram Probability (NLP | for):")
print(get_bigram_probability(bigram_model, 'for', 'NLP'))

This example code demonstrates how to build a bigram language model using a sample text corpus. Here is a detailed explanation of each part of the code:

Step-by-Step Explanation

  1. Import Required Libraries:
    from collections import defaultdict
    import nltk
    from nltk.util import ngrams
    nltk.download('punkt')
    • defaultdict from the collections module is used to create a nested dictionary that will store the bigram counts and probabilities.
    • The nltk library is used for natural language processing tasks, including tokenization.
    • ngrams from nltk.util helps generate N-grams from the tokenized text.
  2. Sample Text Corpus:
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    A sample text corpus consisting of three sentences is defined. This corpus will be used to train the bigram model.

  3. Tokenize the Text:
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

    The nltk.word_tokenize function tokenizes each sentence in the corpus into individual words. The result is a list of tokenized sentences.

  4. Function to Train the Bigram Model:
    def train_bigram_model(tokenized_corpus):
        model = defaultdict(lambda: defaultdict(lambda: 0))

        # Count bigrams
        for sentence in tokenized_corpus:
            for w1, w2 in ngrams(sentence, 2):
                model[w1][w2] += 1

        # Calculate probabilities
        for w1 in model:
            total_count = float(sum(model[w1].values()))
            for w2 in model[w1]:
                model[w1][w2] /= total_count

        return model
    • This function trains the bigram model by counting the occurrences of word pairs (bigrams) and then calculating their probabilities.
    • A nested defaultdict is used to store the counts and probabilities of bigrams.
    • The function first counts the occurrences of each bigram in the tokenized corpus.
    • It then calculates the probability of each bigram by normalizing the counts. The probability of a bigram is the count of the bigram divided by the total count of all bigrams starting with the same first word.
  5. Train the Bigram Model:
    bigram_model = train_bigram_model(tokenized_corpus)

    The train_bigram_model function is called with the tokenized corpus to train the bigram model. The resulting model is stored in the bigram_model variable.

  6. Function to Get the Probability of a Bigram:
    def get_bigram_probability(bigram_model, w1, w2):
        return bigram_model[w1][w2]

    This function retrieves the probability of a given bigram from the trained model. It takes the model and the two words forming the bigram as input and returns the probability.

  7. Print the Bigram Probability:
    print("Bigram Probability (NLP | for):")
    print(get_bigram_probability(bigram_model, 'for', 'NLP'))

    The probability of the bigram "NLP" following the word "for" is printed using the get_bigram_probability function.

Example Output

Bigram Probability (NLP | for):
0.5

When you run the code, the output will display the probability of the word "NLP" following the word "for" based on the trained bigram model. Given the small sample corpus, the printed probability might not be very informative, but it demonstrates the process of calculating bigram probabilities.

This example illustrates the fundamental steps involved in building a bigram language model: tokenizing text, counting bigrams, calculating their probabilities, and retrieving probabilities from the model. Despite the simplicity of this model, it serves as a foundation for understanding more complex language modeling techniques. Bigrams capture local word dependencies and provide a statistical basis for predicting the next word in a sequence based on the current word. 

4.1.5 Limitations of N-gram Models

While N-gram models are simple and effective, they come with several significant limitations:

  1. Sparsity Issues:
    • As the value of N increases, the number of possible N-grams grows exponentially. This results in data sparsity, where many possible N-grams may not appear in the training corpus. For example, in a trigram model, the number of possible trigrams is extremely large, and many of these trigrams might never occur in the training data. This makes it difficult to estimate their probabilities accurately, leading to unreliable predictions.
  2. Context Limitation:
    • N-gram models capture only a fixed window of context, specifically the previous N-1 words. This fixed window can be insufficient for capturing long-range dependencies in the text. For example, in a bigram model, only the immediately preceding word is considered when predicting the next word. This can be a significant limitation in understanding the full context of a sentence, especially in complex or lengthy texts where important information may be spread out over several words or sentences.
  3. Memory Usage:
    • Higher-order N-gram models require significant memory to store the probabilities of all possible N-grams. The larger the value of N, the more memory is needed to store these probabilities. For instance, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be computationally expensive and impractical for large vocabularies.
  4. Lack of Generalization:
    • N-gram models can struggle to generalize well to unseen data. They rely heavily on the specific sequences observed in the training data, meaning they may not perform well on new or slightly different sequences. This lack of generalization can limit their effectiveness in real-world applications where language use is highly variable and context-dependent.
  5. Handling of Out-of-Vocabulary Words:
    • N-gram models have difficulty handling out-of-vocabulary (OOV) words—words that were not seen during training. This can result in poor performance when the model encounters new words or phrases, as it has no prior knowledge or probabilities associated with them.
  6. Inability to Capture Semantic Meaning:
    • N-gram models operate purely on the basis of word sequences and do not capture the underlying semantic meaning of the words. They treat words as independent tokens without understanding their meanings or relationships, which can limit their ability to perform tasks that require a deeper understanding of language.

Despite these limitations, N-gram models are still valuable for many NLP tasks and serve as a foundation for more advanced language modeling techniques. They are often used in applications such as text prediction, speech recognition, and machine translation, where understanding the probability of word sequences is crucial.

N-gram models have also paved the way for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language. These advanced models leverage large-scale datasets and powerful computational resources to achieve higher accuracy and better generalization in various NLP tasks.

While N-gram models have their drawbacks, their simplicity and foundational role in language modeling make them an essential starting point for understanding and developing more complex NLP techniques.

4.1 N-grams

Language modeling is a fundamental task in Natural Language Processing (NLP) that involves predicting the next word or sequence of words in a sentence. It serves as the backbone for many advanced NLP applications such as speech recognition, machine translation, text generation, and more. A good language model can understand the context and semantics of a given text, enabling it to generate coherent and contextually appropriate sentences.

In this chapter, we will explore different techniques for building language models, starting with simpler methods like N-grams and moving towards more complex approaches such as Hidden Markov Models (HMMs), Recurrent Neural Networks (RNNs), and Long Short-Term Memory Networks (LSTMs). By the end of this chapter, you will have a solid understanding of how language models work and how to implement them in Python.

We will begin with N-grams, one of the simplest yet powerful techniques for language modeling.

N-grams are a contiguous sequence of N items derived from a given sample of text or speech. In the context of natural language processing (NLP), these items are typically words, and N can be any integer, such as 1, 2, 3, or more.

The concept of N-grams is fundamental in NLP because it allows the analysis and prediction of sequences of words in a structured manner. N-grams are widely used in various NLP tasks, including language modeling, where they help predict the probability of a sequence of words. 

They are also crucial in text generation, enabling the creation of coherent and contextually appropriate sentences. Additionally, N-grams play a significant role in machine translation, assisting in the accurate translation of text from one language to another by considering the context provided by contiguous word sequences.

Their simplicity and effectiveness make N-grams a powerful tool in the field of NLP, contributing to advancements in understanding and processing human language.

4.1.1 Understanding N-grams

An N-gram is a sequence of N words that appears in a text. This concept is fundamental in the field of Natural Language Processing (NLP) and helps in understanding the structure and meaning of language. Here are some examples of different types of N-grams:

  • Unigram (1-gram): "Natural"
  • Bigram (2-gram): "Natural Language"
  • Trigram (3-gram): "Natural Language Processing"

N-grams capture local word dependencies by considering a fixed window of N words. The choice of N determines the size of the window and the amount of context captured. For instance, unigrams capture individual word frequencies, providing insight into the most common words in a text.

Bigrams, on the other hand, capture relationships between pairs of adjacent words, revealing common word combinations. Trigrams and higher-order N-grams capture even more complex relationships, showing how sequences of three or more words are used together in context.

The use of N-grams is crucial in various applications such as text prediction, machine translation, and speech recognition. By analyzing N-grams, one can better understand the syntactic and semantic properties of text, which is essential for creating more accurate and efficient language models. Therefore, understanding and utilizing N-grams can significantly enhance the performance of NLP tasks.

4.1.2 Generating N-grams in Python

Let's see how to generate N-grams in Python using a sample text.

Example: Generating N-grams

from nltk import ngrams
from collections import Counter
import nltk
nltk.download('punkt')

# Sample text
text = "Natural Language Processing is a fascinating field of study."

# Tokenize the text into words
tokens = nltk.word_tokenize(text)

# Function to generate N-grams
def generate_ngrams(tokens, n):
    n_grams = ngrams(tokens, n)
    return [' '.join(grams) for grams in n_grams]

# Generate unigrams, bigrams, and trigrams
unigrams = generate_ngrams(tokens, 1)
bigrams = generate_ngrams(tokens, 2)
trigrams = generate_ngrams(tokens, 3)

print("Unigrams:")
print(unigrams)
print("\\nBigrams:")
print(bigrams)
print("\\nTrigrams:")
print(trigrams)

This example code demonstrates how to generate unigrams, bigrams, and trigrams using the Natural Language Toolkit (nltk) library. Here’s a step-by-step explanation of the code:

  1. Import Libraries:
    from nltk import ngrams
    from collections import Counter
    import nltk
    nltk.download('punkt')
    • The ngrams function from nltk is used to generate N-grams.
    • The Counter from collections is imported but not used in this specific code.
    • The nltk.download('punkt') line ensures that the Punkt tokenizer models are downloaded, which are necessary for tokenizing the text into words.
  2. Sample Text:
    text = "Natural Language Processing is a fascinating field of study."

    A sample sentence is defined to demonstrate the generation of N-grams.

  3. Tokenize the Text:
    tokens = nltk.word_tokenize(text)

    The text is tokenized into individual words using nltk.word_tokenize.

  4. Function to Generate N-grams:
    def generate_ngrams(tokens, n):
        n_grams = ngrams(tokens, n)
        return [' '.join(grams) for grams in n_grams]
    • A function generate_ngrams is defined that takes a list of tokens and an integer n representing the N-gram size.
    • The ngrams function generates N-grams from the list of tokens.
    • The function returns a list of N-grams joined by spaces.
  5. Generate Unigrams, Bigrams, and Trigrams:
    unigrams = generate_ngrams(tokens, 1)
    bigrams = generate_ngrams(tokens, 2)
    trigrams = generate_ngrams(tokens, 3)

    The generate_ngrams function is called with n values of 1, 2, and 3 to generate unigrams, bigrams, and trigrams, respectively.

  6. Print the N-grams:
    print("Unigrams:")
    print(unigrams)
    print("\\\\nBigrams:")
    print(bigrams)
    print("\\\\nTrigrams:")
    print(trigrams)

    The generated unigrams, bigrams, and trigrams are printed to the console.

Example Output:

Unigrams:
['Natural', 'Language', 'Processing', 'is', 'a', 'fascinating', 'field', 'of', 'study', '.']

Bigrams:
['Natural Language', 'Language Processing', 'Processing is', 'is a', 'a fascinating', 'fascinating field', 'field of', 'of study', 'study .']

Trigrams:
['Natural Language Processing', 'Language Processing is', 'Processing is a', 'is a fascinating', 'a fascinating field', 'fascinating field of', 'field of study', 'of study .']

In this example, we use the ngrams function from the nltk library to generate unigrams, bigrams, and trigrams from the sample text. The function takes a list of tokens and the value of N as input and returns a list of N-grams.

4.1.3 N-gram Language Models

N-gram language models are statistical models used in computational linguistics to predict the next item in a sequence, such as the next word in a sentence, based on the previous N-1 items. The primary goal of an N-gram model is to estimate the probability of a word given the preceding words in the sequence.

An N-gram language model estimates the probability of a word given the previous N-1 words. This is useful for predicting the next word in a sequence. The probability of a word sequence ( w_1, w_2, ..., w_T ) is given by:


[
P(w_1, w_2, ..., w_T) = P(w_1) P(w_2|w_1) P(w_3|w_1, w_2) ... P(w_T|w_1, ..., w_{T-1})
]


For an N-gram model, this can be simplified to:


[
P(w_1, w_2, ..., w_T) \approx \prod_{i=1}^{T} P(w_i|w_{i-N+1}, ..., w_{i-1})
]


In essence, the model breaks down the probability of a sequence of words into the product of conditional probabilities of each word given the previous N-1 words. This simplification allows the model to be more manageable and computationally feasible.
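
To make the approximation concrete, here is the bigram case (N = 2) written out for the three-word phrase used in the example below; it is simply the general formula with each word conditioned only on the word immediately before it:

[
P(\text{natural, language, processing}) \approx P(\text{natural}) P(\text{language}|\text{natural}) P(\text{processing}|\text{language})
]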

Example: Bigram Model

A bigram model (where N=2) considers the probability of a word given the previous word. For example, the probability of the word "processing" given the word "language" in the sequence "natural language processing" can be written as:


[
P(\text{processing}|\text{language})
]


To calculate these probabilities, the model needs to be trained on a large corpus of text, where it counts the occurrences of pairs of words (bigrams) and normalizes these counts to obtain probabilities. For instance, if the bigram "language processing" appears 50 times in a corpus and the word "language" appears 200 times, the probability of "processing" given "language" would be:


[
P(\text{processing}|\text{language}) = \frac{\text{Count}(\text{language processing})}{\text{Count}(\text{language})} = \frac{50}{200} = 0.25
]


This means that in the given corpus, the word "processing" follows the word "language" 25% of the time.
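
The same estimate is easy to express in code. The counts below are the illustrative numbers from the paragraph above, not values measured from a real corpus:

# Illustrative counts taken from the example above (not from a real corpus).
count_language_processing = 50   # occurrences of the bigram "language processing"
count_language = 200             # occurrences of the word "language"

# Maximum-likelihood estimate of P(processing | language)
probability = count_language_processing / count_language
print(probability)  # 0.25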

Challenges with N-gram Models

While N-gram models are simple and effective, they have several limitations:

  • Sparsity: As N increases, the number of possible N-grams grows exponentially, leading to data sparsity. This means that many N-grams may not appear in the training corpus, making it difficult to estimate their probabilities accurately. For instance, if you have a trigram model (N=3), the number of possible trigrams can be extremely large, and many of these trigrams might never occur in your training data, making it challenging to provide reliable probability estimates (see the short sketch after this list).
  • Context Limitation: N-gram models capture only a fixed window of context (N-1 words), which can be insufficient for capturing long-range dependencies in language. For example, if you are using a bigram model, it only considers the previous word to predict the next one, which might not be enough to understand the full context of a sentence, especially in complex or lengthy texts.
  • Memory Usage: High-order N-gram models require significant memory to store the probabilities of all possible N-grams. The higher the value of N, the more memory is needed to store these probabilities, which can become a substantial computational burden. For example, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be impractical for large vocabularies.
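
The sketch below makes the growth concrete. The vocabulary size of 10,000 word types is an arbitrary illustrative assumption; the point is only that the number of possible N-grams grows exponentially with N, while any real corpus contains just a tiny fraction of them:

# A back-of-the-envelope sketch of the sparsity problem.
vocab_size = 10_000  # hypothetical vocabulary size

for n in (1, 2, 3, 4):
    # The number of *possible* N-grams grows as vocab_size ** n.
    print(f"{n}-grams: up to {vocab_size ** n:,} possible sequences")

# Any N-gram that never occurs in the training data receives a raw
# maximum-likelihood probability of 0, which is the practical symptom
# of sparsity in an unsmoothed model.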

Despite these limitations, N-gram models are useful for many natural language processing (NLP) tasks and serve as a foundation for more advanced language modeling techniques. They are often used in speech recognition, text prediction, and other areas where understanding the probability of word sequences is crucial.

Furthermore, N-gram models have laid the groundwork for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language.

Applications of N-gram Models

N-gram models are widely used in various Natural Language Processing (NLP) tasks due to their ability to capture and utilize the statistical properties of word sequences. These models are employed in several key applications, including:

  • Text Prediction: N-gram models are instrumental in predicting the next word in a sequence. This capability is often leveraged in predictive text input on mobile devices, where the model suggests possible words to complete the user's input based on the context of previous words. This feature enhances typing efficiency and accuracy.
  • Speech Recognition: In the realm of speech-to-text systems, N-gram models significantly improve the accuracy of transcriptions. By predicting the most likely word sequences, these models help in filtering out improbable word combinations, thereby refining the output of speech recognition software and making it more reliable.
  • Machine Translation: When translating text from one language to another, N-gram models play a crucial role by considering the context provided by contiguous word sequences. This contextual understanding helps in producing translations that are not only accurate but also contextually appropriate, ensuring that the meaning of the original text is preserved.
  • Text Generation: N-gram models are also used for generating coherent and contextually appropriate sentences. This is particularly useful in applications like chatbots and automated content creation, where the ability to produce natural-sounding language is essential. By analyzing patterns in large corpora of text, N-gram models can construct sentences that mimic human language usage, thereby enhancing the user experience.

Overall, the versatility and effectiveness of N-gram models make them a fundamental component in the toolkit of NLP technologies.

In conclusion, N-gram language models are a fundamental tool in NLP that help in understanding and predicting word sequences based on the context provided by previous words. While they have certain limitations, their simplicity and effectiveness make them a valuable starting point for more complex language modeling techniques.

4.1.4 Training an N-gram Language Model

To train an N-gram language model, we need to calculate the probabilities of N-grams from a training corpus. This involves counting the occurrences of each N-gram and normalizing these counts to obtain probabilities. Here's a detailed explanation of the process:

  1. Tokenizing the Corpus:
    The first step is to tokenize the training corpus into individual words or tokens. This involves splitting the text into words, which will be used to form N-grams.
    import nltk
    nltk.download('punkt')

    # Sample text corpus
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    # Tokenize the text into words
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]
  2. Generating N-grams:
    Once the text is tokenized, the next step is to generate N-grams from these tokens. N-grams are contiguous sequences of N items from the text.
    from nltk.util import ngrams

    # Example of generating bigrams (N=2); list() materializes each generator
    bigrams = [list(ngrams(sentence, 2)) for sentence in tokenized_corpus]
  3. Counting N-gram Occurrences:
    The core of training an N-gram model is counting the occurrences of each N-gram in the corpus. This involves iterating through the tokenized text and recording how often each N-gram appears.
    from collections import defaultdict

    def count_ngrams(tokenized_corpus, n):
        counts = defaultdict(lambda: defaultdict(int))
        for sentence in tokenized_corpus:
            for ngram in ngrams(sentence, n):
                counts[ngram[:-1]][ngram[-1]] += 1
        return counts

    # Count bigrams
    bigram_counts = count_ngrams(tokenized_corpus, 2)
  4. Calculating Probabilities:
    After counting the N-grams, the next step is to calculate their probabilities. This is done by normalizing the counts, which means dividing the count of each N-gram by the total count of N-grams that share the same prefix (context).
    def calculate_probabilities(counts):
        probabilities = defaultdict(dict)
        for context in counts:
            total_count = float(sum(counts[context].values()))
            for word in counts[context]:
                probabilities[context][word] = counts[context][word] / total_count
        return probabilities

    # Calculate bigram probabilities
    bigram_probabilities = calculate_probabilities(bigram_counts)
  5. Using the Model:
    With the N-gram probabilities calculated, the model can now predict the likelihood of a word following a given context. This is useful for tasks such as text generation and speech recognition.
    def get_ngram_probability(model, context, word):
        return model[context].get(word, 0)

    # Example: Get probability of "NLP" following "for"
    probability = get_ngram_probability(bigram_probabilities, ('for',), 'NLP')
    print("Bigram Probability (NLP | for):", probability)

In summary, training an N-gram language model involves tokenizing the text, generating N-grams, counting their occurrences, and calculating probabilities. This model can then be used to predict the likelihood of subsequent words in a sequence, aiding various NLP tasks.

Example: Training a Bigram Language Model

from collections import defaultdict
import nltk
from nltk.util import ngrams
nltk.download('punkt')

# Sample text corpus
corpus = [
    "Natural Language Processing is a fascinating field of study.",
    "Machine learning and NLP are closely related.",
    "Language models are essential for NLP tasks."
]

# Tokenize the text into words
tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

# Function to calculate bigram probabilities
def train_bigram_model(tokenized_corpus):
    model = defaultdict(lambda: defaultdict(lambda: 0))

    # Count bigrams
    for sentence in tokenized_corpus:
        for w1, w2 in ngrams(sentence, 2):
            model[w1][w2] += 1

    # Calculate probabilities
    for w1 in model:
        total_count = float(sum(model[w1].values()))
        for w2 in model[w1]:
            model[w1][w2] /= total_count

    return model

# Train the bigram model
bigram_model = train_bigram_model(tokenized_corpus)

# Function to get the probability of a bigram
def get_bigram_probability(bigram_model, w1, w2):
    return bigram_model[w1][w2]

print("Bigram Probability (NLP | for):")
print(get_bigram_probability(bigram_model, 'for', 'NLP'))

This example code demonstrates how to build a bigram language model using a sample text corpus. Here is a detailed explanation of each part of the code:

Step-by-Step Explanation

  1. Import Required Libraries:
    from collections import defaultdict
    import nltk
    from nltk.util import ngrams
    nltk.download('punkt')
    • defaultdict from the collections module is used to create a nested dictionary that will store the bigram counts and probabilities.
    • The nltk library is used for natural language processing tasks, including tokenization.
    • ngrams from nltk.util helps generate N-grams from the tokenized text.
  2. Sample Text Corpus:
    corpus = [
        "Natural Language Processing is a fascinating field of study.",
        "Machine learning and NLP are closely related.",
        "Language models are essential for NLP tasks."
    ]

    A sample text corpus consisting of three sentences is defined. This corpus will be used to train the bigram model.

  3. Tokenize the Text:
    tokenized_corpus = [nltk.word_tokenize(sentence) for sentence in corpus]

    The nltk.word_tokenize function tokenizes each sentence in the corpus into individual words. The result is a list of tokenized sentences.

  4. Function to Train the Bigram Model:
    def train_bigram_model(tokenized_corpus):
        model = defaultdict(lambda: defaultdict(lambda: 0))

        # Count bigrams
        for sentence in tokenized_corpus:
            for w1, w2 in ngrams(sentence, 2):
                model[w1][w2] += 1

        # Calculate probabilities
        for w1 in model:
            total_count = float(sum(model[w1].values()))
            for w2 in model[w1]:
                model[w1][w2] /= total_count

        return model
    • This function trains the bigram model by counting the occurrences of word pairs (bigrams) and then calculating their probabilities.
    • A nested defaultdict is used to store the counts and probabilities of bigrams.
    • The function first counts the occurrences of each bigram in the tokenized corpus.
    • It then calculates the probability of each bigram by normalizing the counts. The probability of a bigram is the count of the bigram divided by the total count of all bigrams starting with the same first word.
  5. Train the Bigram Model:
    bigram_model = train_bigram_model(tokenized_corpus)

    The train_bigram_model function is called with the tokenized corpus to train the bigram model. The resulting model is stored in the bigram_model variable.

  6. Function to Get the Probability of a Bigram:
    def get_bigram_probability(bigram_model, w1, w2):
        return bigram_model[w1][w2]

    This function retrieves the probability of a given bigram from the trained model. It takes the model and the two words forming the bigram as input and returns the probability.

  7. Print the Bigram Probability:
    print("Bigram Probability (NLP | for):")
    print(get_bigram_probability(bigram_model, 'for', 'NLP'))

    The probability of the bigram "NLP" following the word "for" is printed using the get_bigram_probability function.

Example Output

Bigram Probability (NLP | for):
1.0

When you run the code, the output displays the probability of the word "NLP" following the word "for" according to the trained bigram model. In this small corpus, "for" occurs only once and is always followed by "NLP", so the estimated probability is 1.0. With such a tiny sample the value is not very informative, but it demonstrates the process of calculating bigram probabilities.

This example illustrates the fundamental steps involved in building a bigram language model: tokenizing text, counting bigrams, calculating their probabilities, and retrieving probabilities from the model. Despite the simplicity of this model, it serves as a foundation for understanding more complex language modeling techniques. Bigrams capture local word dependencies and provide a statistical basis for predicting the next word in a sequence based on the current word. 
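
As a small follow-up, the sketch below shows one way the trained bigram_model could be used for this kind of next-word prediction. The predict_next_word helper is an illustrative addition, not part of the example above, and it assumes bigram_model is still in memory:

# A minimal sketch of next-word prediction with the trained bigram model.
def predict_next_word(bigram_model, word):
    """Return the most probable word to follow `word`, or None if unseen."""
    candidates = bigram_model.get(word)
    if not candidates:
        return None
    # Pick the continuation with the highest estimated probability.
    return max(candidates, key=candidates.get)

print(predict_next_word(bigram_model, 'for'))       # 'NLP'
print(predict_next_word(bigram_model, 'unknown'))   # None (word not seen in training)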

4.1.5 Limitations of N-gram Models

While N-gram models are simple and effective, they come with several significant limitations:

  1. Sparsity Issues:
    • As the value of N increases, the number of possible N-grams grows exponentially. This results in data sparsity, where many possible N-grams may not appear in the training corpus. For example, in a trigram model, the number of possible trigrams is extremely large, and many of these trigrams might never occur in the training data. This makes it difficult to estimate their probabilities accurately, leading to unreliable predictions.
  2. Context Limitation:
    • N-gram models capture only a fixed window of context, specifically the previous N-1 words. This fixed window can be insufficient for capturing long-range dependencies in the text. For example, in a bigram model, only the immediately preceding word is considered when predicting the next word. This can be a significant limitation in understanding the full context of a sentence, especially in complex or lengthy texts where important information may be spread out over several words or sentences.
  3. Memory Usage:
    • Higher-order N-gram models require significant memory to store the probabilities of all possible N-grams. The larger the value of N, the more memory is needed to store these probabilities. For instance, a 4-gram model would need to store the probabilities of all possible sequences of four words, which can be computationally expensive and impractical for large vocabularies.
  4. Lack of Generalization:
    • N-gram models can struggle to generalize well to unseen data. They rely heavily on the specific sequences observed in the training data, meaning they may not perform well on new or slightly different sequences. This lack of generalization can limit their effectiveness in real-world applications where language use is highly variable and context-dependent.
  5. Handling of Out-of-Vocabulary Words:
    • N-gram models have difficulty handling out-of-vocabulary (OOV) words—words that were not seen during training. This can result in poor performance when the model encounters new words or phrases, as it has no prior knowledge or probabilities associated with them. A brief illustration follows this list.
  6. Inability to Capture Semantic Meaning:
    • N-gram models operate purely on the basis of word sequences and do not capture the underlying semantic meaning of the words. They treat words as independent tokens without understanding their meanings or relationships, which can limit their ability to perform tasks that require a deeper understanding of language.
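
The brief sketch below illustrates the out-of-vocabulary problem, assuming the bigram_model trained in the previous section is still in memory; the word chosen here is simply a hypothetical example of something absent from that tiny training corpus:

# The model has no counts for words it never saw during training.
context = 'translation'  # not present in the three training sentences

if context not in bigram_model:
    # Without additional handling (such as a special unknown-word token),
    # the model cannot assign probabilities to continuations of this word.
    print(f"'{context}' is out of vocabulary for this model.")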

Despite these limitations, N-gram models are still valuable for many NLP tasks and serve as a foundation for more advanced language modeling techniques. They are often used in applications such as text prediction, speech recognition, and machine translation, where understanding the probability of word sequences is crucial.

N-gram models have also paved the way for the development of more sophisticated models, such as neural networks and transformers, which address some of the inherent limitations of N-grams by capturing more complex patterns and dependencies in language. These advanced models leverage large-scale datasets and powerful computational resources to achieve higher accuracy and better generalization in various NLP tasks.

While N-gram models have their drawbacks, their simplicity and foundational role in language modeling make them an essential starting point for understanding and developing more complex NLP techniques.
