NLP with Transformers: Fundamentals and Core Applications

Chapter 1: Introduction to NLP and Its Evolution

1.4 Practical Exercises for Chapter 1

Now that we’ve explored the fundamentals of NLP, its historical development, and traditional approaches, let’s solidify your understanding with practical exercises. Each exercise is designed to help you apply the concepts covered in this chapter. Take your time to work through them, and refer to the solutions when needed.
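Before working through the solutions, note that the NLTK-based exercises rely on tokenizer and stopword data that is downloaded separately from the library itself. A minimal setup sketch, assuming NLTK and scikit-learn are already installed (for example with pip install nltk scikit-learn):

import nltk

# One-time downloads of the resources used in the exercises below
nltk.download('punkt')       # models used by word_tokenize
nltk.download('stopwords')   # English stopword list
# On recent NLTK releases, word_tokenize may also need:
# nltk.download('punkt_tab')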

Exercise 1: Tokenization and Stopword Removal

Task:

Write a Python program to tokenize a given sentence into words and remove common stopwords using the NLTK library.

Input Example:

"I enjoy learning about natural language processing."

Steps:

  1. Tokenize the sentence.
  2. Remove English stopwords.

Solution:

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

# Input sentence
sentence = "I enjoy learning about natural language processing."

# Tokenize
tokens = word_tokenize(sentence)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)

Expected Output:

Original Tokens: ['I', 'enjoy', 'learning', 'about', 'natural', 'language', 'processing', '.']
Filtered Tokens: ['enjoy', 'learning', 'natural', 'language', 'processing', '.']
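The period remains because NLTK's English stopword list contains only words, not punctuation. If you also want to drop punctuation tokens, a small optional extension (continuing from the solution above) is:

# Keep only alphabetic tokens that are not stopwords
filtered_tokens = [
    word for word in tokens
    if word.lower() not in stop_words and word.isalpha()
]
print("Filtered Tokens:", filtered_tokens)
# ['enjoy', 'learning', 'natural', 'language', 'processing']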

Exercise 2: Rule-Based Sentiment Analysis

Task:

Create a rule-based sentiment analyzer that classifies a sentence as Positive, Negative, or Neutral based on predefined lists of positive and negative words.

Input Example:

"This movie was excellent and truly inspiring."

Solution:

def rule_based_sentiment(sentence):
    positive_words = ["excellent", "great", "inspiring", "good", "amazing"]
    negative_words = ["bad", "terrible", "poor", "awful", "sad"]

    words = sentence.lower().split()

    # Count positive and negative words
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    # Determine sentiment
    if positive_count > negative_count:
        return "Positive"
    elif negative_count > positive_count:
        return "Negative"
    else:
        return "Neutral"

# Test the analyzer
sentence = "This movie was excellent and truly inspiring."
print("Sentiment:", rule_based_sentiment(sentence))

Expected Output:

Sentiment: Positive
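One caveat of splitting on whitespace: the token "inspiring." keeps its trailing period, so it never matches "inspiring" in the positive list, and the sentence is classified as Positive only because "excellent" matches. A sketch of a slightly more robust variant that strips punctuation before matching (the name rule_based_sentiment_v2 is illustrative, not from the chapter):

import string

def rule_based_sentiment_v2(sentence):
    positive_words = {"excellent", "great", "inspiring", "good", "amazing"}
    negative_words = {"bad", "terrible", "poor", "awful", "sad"}

    # Strip leading/trailing punctuation so "inspiring." matches "inspiring"
    words = [w.strip(string.punctuation) for w in sentence.lower().split()]

    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    if positive_count > negative_count:
        return "Positive"
    elif negative_count > positive_count:
        return "Negative"
    return "Neutral"

print(rule_based_sentiment_v2("This movie was excellent and truly inspiring."))  # Positive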

Exercise 3: Building a Bag-of-Words Model

Task:

Using the CountVectorizer from scikit-learn, build a Bag-of-Words (BoW) representation for the following sentences:

  1. "I love programming in Python."
  2. "Python is an excellent programming language."
  3. "Programming in Python is fun."

Steps:

  1. Tokenize the sentences and create a vocabulary.
  2. Represent each sentence as a vector.

Solution:

from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
documents = [
    "I love programming in Python.",
    "Python is an excellent programming language.",
    "Programming in Python is fun."
]

# Create a BoW representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Display vocabulary and matrix
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Matrix:\n", bow_matrix.toarray())

Expected Output:

Vocabulary: {'love': 6, 'programming': 7, 'in': 3, 'python': 8, 'is': 4, 'an': 0, 'excellent': 1, 'language': 5, 'fun': 2}
BoW Matrix:
 [[0 0 0 1 0 0 1 1 1]
  [1 1 0 0 1 1 0 1 1]
  [0 0 1 1 1 0 0 1 1]]

(Columns follow the alphabetically sorted vocabulary: an, excellent, fun, in, is, language, love, programming, python. CountVectorizer's default tokenizer drops single-character tokens such as "I", and the dictionary's key order may differ on your machine.)
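To make the columns easier to interpret, you can pair each index with its term. A short sketch, continuing from the solution above and assuming scikit-learn 1.0+ (older releases expose get_feature_names() instead of get_feature_names_out()):

# Map each column back to its term for a more readable view
feature_names = vectorizer.get_feature_names_out()
for doc, row in zip(documents, bow_matrix.toarray()):
    counts = {term: int(count) for term, count in zip(feature_names, row) if count > 0}
    print(doc, "->", counts)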

Exercise 4: Generating N-Grams

Task:

Write a Python program to generate bigrams from the given text:

"Natural language processing is fascinating."

Steps:

  1. Tokenize the sentence into words.
  2. Generate bigrams (n=2).

Solution:

from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Input sentence
sentence = "Natural language processing is fascinating."

# Tokenize and generate bigrams
tokens = word_tokenize(sentence)
bigrams = list(ngrams(tokens, 2))

print("Bigrams:", bigrams)

Expected Output:

Bigrams: [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fascinating'), ('fascinating', '.')]
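The same ngrams helper works for any window size. Continuing from the tokens above, a quick sketch for trigrams (n=3):

# Trigrams: sliding windows of three consecutive tokens
trigrams = list(ngrams(tokens, 3))
print("Trigrams:", trigrams)
# [('Natural', 'language', 'processing'), ('language', 'processing', 'is'),
#  ('processing', 'is', 'fascinating'), ('is', 'fascinating', '.')]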

Exercise 5: Calculating TF-IDF

Task:

Use the TfidfVectorizer from scikit-learn to compute the TF-IDF scores for the following sentences:

  1. "I love Python programming."
  2. "Python is a great programming language."
  3. "Programming in Python is fun."

Solution:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences
documents = [
    "I love Python programming.",
    "Python is a great programming language.",
    "Programming in Python is fun."
]

# Calculate TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Display vocabulary and TF-IDF matrix
print("TF-IDF Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

Expected Output:

TF-IDF Vocabulary: {'love': 5, 'python': 7, 'programming': 6, 'is': 3, 'great': 1, 'language': 4, 'in': 2, 'fun': 0}
TF-IDF Matrix (values rounded to three decimals):
 [[0.    0.    0.    0.    0.    0.767 0.453 0.453]
  [0.    0.552 0.    0.420 0.552 0.    0.326 0.326]
  [0.552 0.    0.552 0.420 0.    0.    0.326 0.326]]

(Columns follow the alphabetically sorted vocabulary: fun, great, in, is, language, love, programming, python. Single-character tokens such as "I" and "a" are dropped by the default tokenizer; the actual printout shows full floating-point precision, and the dictionary's key order may differ on your machine.)
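A common follow-up is to compare the TF-IDF vectors with cosine similarity, which scikit-learn provides directly. A short sketch, continuing from the solution above:

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the three TF-IDF vectors
similarity = cosine_similarity(tfidf_matrix)
print(similarity.round(3))

Sentences 2 and 3 should come out as the most similar pair, since they share 'is' in addition to 'programming' and 'python'.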

These exercises are designed to reinforce your understanding of tokenization, rule-based methods, Bag-of-Words, n-grams, and TF-IDF. These foundational techniques are vital building blocks for more advanced NLP methods discussed in later chapters. Keep experimenting with different inputs and datasets to deepen your comprehension!
