Chapter 2: Basic Text Processing
Practical Exercises
Exercise 1: Stop Word Removal
Task: Use the nltk library to remove stop words from the following text: "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
Solution:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Sample text
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
# Tokenize the text
tokens = text.split()
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Original Tokens:")
print(tokens)
print("\\nFiltered Tokens:")
print(filtered_tokens)
Output:
Original Tokens:
['NLP', 'enables', 'computers', 'to', 'understand', 'human', 'language,', 'which', 'is', 'a', 'crucial', 'aspect', 'of', 'artificial', 'intelligence.']
Filtered Tokens:
['NLP', 'enables', 'computers', 'understand', 'human', 'language,', 'crucial', 'aspect', 'artificial', 'intelligence.']
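Note that text.split() leaves punctuation attached to words ('language,', 'intelligence.'). A minimal variant, assuming the punkt tokenizer data is available, uses word_tokenize so punctuation becomes its own token and only alphabetic non-stop-words are kept:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
stop_words = set(stopwords.words('english'))
# word_tokenize separates punctuation, so 'language' and ',' become distinct tokens
tokens = word_tokenize(text)
# Keep alphabetic tokens that are not stop words
filtered_tokens = [word for word in tokens if word.isalpha() and word.lower() not in stop_words]
print(filtered_tokens)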
Exercise 2: Stemming
Task: Use the nltk library to perform stemming on the following text: "Stemming helps in reducing words to their root form, which can be beneficial for text processing."
Solution:
from nltk.stem import PorterStemmer
# Sample text
text = "Stemming helps in reducing words to their root form, which can be beneficial for text processing."
# Tokenize the text
tokens = text.split()
# Initialize the stemmer
stemmer = PorterStemmer()
# Stem the tokens
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Original Tokens:")
print(tokens)
print("\\nStemmed Tokens:")
print(stemmed_tokens)
Output:
Original Tokens:
['Stemming', 'helps', 'in', 'reducing', 'words', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'beneficial', 'for', 'text', 'processing.']
Stemmed Tokens:
['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'benefici', 'for', 'text', 'process.']
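The Porter stemmer used above is the classic choice; NLTK also provides the Snowball ("Porter2") stemmer, which supports several languages and applies slightly different rules. A short side-by-side sketch (exact outputs may differ between the two stemmers):
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
for word in ["reducing", "beneficial", "processing"]:
    # Print the word alongside both stemmed forms for comparison
    print(word, porter.stem(word), snowball.stem(word))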
Exercise 3: Lemmatization
Task: Use the nltk library to perform lemmatization on the following text: "Lemmatization is the process of reducing words to their base or root form."
Solution:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# Sample text
text = "Lemmatization is the process of reducing words to their base or root form."
# Tokenize the text
tokens = text.split()
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize the tokens
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Original Tokens:")
print(tokens)
print("\\nLemmatized Tokens:")
print(lemmatized_tokens)
Output:
Original Tokens:
['Lemmatization', 'is', 'the', 'process', 'of', 'reducing', 'words', 'to', 'their', 'base', 'or', 'root', 'form.']
Lemmatized Tokens:
['Lemmatization', 'is', 'the', 'process', 'of', 'reducing', 'word', 'to', 'their', 'base', 'or', 'root', 'form.']
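Notice that 'is' and 'reducing' come through unchanged: lemmatize() treats every word as a noun unless a part-of-speech tag is supplied. Passing pos='v' resolves verb forms to their base form, as this short sketch shows:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Default POS is noun, so verb forms pass through unchanged
print(lemmatizer.lemmatize("reducing"))           # reducing
# Treat the word as a verb to get its base form
print(lemmatizer.lemmatize("reducing", pos="v"))  # reduce
print(lemmatizer.lemmatize("is", pos="v"))        # be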
Exercise 4: Regular Expressions
Task: Use regular expressions to extract all the dates in the format "YYYY-MM-DD" from the following text: "The project started on 2021-01-15 and ended on 2021-12-31."
Solution:
import re
# Sample text
text = "The project started on 2021-01-15 and ended on 2021-12-31."
# Define a regex pattern to match dates in the format YYYY-MM-DD
pattern = r"\\b\\d{4}-\\d{2}-\\d{2}\\b"
# Use re.findall() to find all matches
dates = re.findall(pattern, text)
print("Extracted Dates:")
print(dates)
Output:
Extracted Dates:
['2021-01-15', '2021-12-31']
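The matches are plain strings; a common next step, shown here as a sketch, is to parse them into datetime.date objects so they can be compared or sorted:
import re
from datetime import datetime
text = "The project started on 2021-01-15 and ended on 2021-12-31."
pattern = r"\b\d{4}-\d{2}-\d{2}\b"
# Convert each matched string into a date object
dates = [datetime.strptime(match, "%Y-%m-%d").date() for match in re.findall(pattern, text)]
print(dates)  # [datetime.date(2021, 1, 15), datetime.date(2021, 12, 31)]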
Exercise 5: Word Tokenization
Task: Use the nltk library to perform word tokenization on the following text: "Tokenization is the first step in text preprocessing."
Solution:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# Sample text
text = "Tokenization is the first step in text preprocessing."
# Perform word tokenization
tokens = word_tokenize(text)
print("Word Tokens:")
print(tokens)
Output:
Word Tokens:
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing', '.']
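If punctuation tokens such as the trailing '.' are unwanted, NLTK's RegexpTokenizer can tokenize on word characters only; a brief sketch:
from nltk.tokenize import RegexpTokenizer
# Keep runs of word characters and drop punctuation entirely
tokenizer = RegexpTokenizer(r"\w+")
text = "Tokenization is the first step in text preprocessing."
print(tokenizer.tokenize(text))
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing']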
Exercise 6: Sentence Tokenization
Task: Use the nltk library to perform sentence tokenization on the following text: "Tokenization is essential. It breaks down text into smaller units."
Solution:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
# Sample text
text = "Tokenization is essential. It breaks down text into smaller units."
# Perform sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:")
print(sentences)
Output:
Sentences:
['Tokenization is essential.', 'It breaks down text into smaller units.']
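The punkt model behind sent_tokenize does more than split on periods; with the pretrained English model it generally recognizes common abbreviations such as "Dr." and does not break the sentence there. A quick check, with the expected result under that assumption:
from nltk.tokenize import sent_tokenize
text = "Dr. Smith teaches NLP at the university. Her lectures cover tokenization."
print(sent_tokenize(text))
# Expected: ['Dr. Smith teaches NLP at the university.', 'Her lectures cover tokenization.']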
Exercise 7: Character Tokenization
Task: Write a Python function that performs character tokenization on the input text: "Character tokenization is useful for certain tasks."
Solution:
def character_tokenization(text):
    # Perform character tokenization
    characters = list(text)
    return characters
# Sample text
text = "Character tokenization is useful for certain tasks."
# Tokenize the text into characters
characters = character_tokenization(text)
print("Character Tokens:")
print(characters)
Output:
Character Tokens:
['C', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'u', 's', 'e', 'f', 'u', 'l', ' ', 'f', 'o', 'r', ' ', 'c', 'e', 'r', 't', 'a', 'i', 'n', ' ', 't', 'a', 's', 'k', 's', '.']
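For some tasks only the non-space characters matter; a small variation of the function, shown as a sketch, drops whitespace while tokenizing:
def character_tokenization_no_spaces(text):
    # Keep every character except whitespace
    return [ch for ch in text if not ch.isspace()]
print(character_tokenization_no_spaces("Character tokenization is useful for certain tasks."))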
These practical exercises provide hands-on experience with different aspects of text preprocessing, including stop word removal, stemming, lemmatization, regular expressions, and tokenization. Each exercise is designed to reinforce the concepts discussed in the chapter and help you become proficient in implementing these techniques using Python.