Chapter 2: Basic Text Processing
Practical Exercises
Exercise 1: Stop Word Removal
Task: Use the nltk library to remove stop words from the following text: "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
Solution:
import nltk
from nltk.corpus import stopwords
nltk.download('stopwords')
# Sample text
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
# Tokenize the text
tokens = text.split()
# Remove stop words
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
print("Original Tokens:")
print(tokens)
print("\\nFiltered Tokens:")
print(filtered_tokens)
Output:
Original Tokens:
['NLP', 'enables', 'computers', 'to', 'understand', 'human', 'language,', 'which', 'is', 'a', 'crucial', 'aspect', 'of', 'artificial', 'intelligence.']
Filtered Tokens:
['NLP', 'enables', 'computers', 'understand', 'human', 'language,', 'crucial', 'aspect', 'artificial', 'intelligence.']
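Note that text.split() leaves punctuation attached to words ('language,', 'intelligence.'). A minimal variant, assuming the punkt tokenizer data is available, uses word_tokenize so punctuation becomes its own token and only alphabetic non-stop-words are kept:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
nltk.download('punkt')
nltk.download('stopwords')
text = "NLP enables computers to understand human language, which is a crucial aspect of artificial intelligence."
stop_words = set(stopwords.words('english'))
# word_tokenize separates punctuation, so 'language' and ',' become distinct tokens
tokens = word_tokenize(text)
# Keep alphabetic tokens that are not stop words
filtered_tokens = [word for word in tokens if word.isalpha() and word.lower() not in stop_words]
print(filtered_tokens)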
Exercise 2: Stemming
Task: Use the nltk library to perform stemming on the following text: "Stemming helps in reducing words to their root form, which can be beneficial for text processing."
Solution:
from nltk.stem import PorterStemmer
# Sample text
text = "Stemming helps in reducing words to their root form, which can be beneficial for text processing."
# Tokenize the text
tokens = text.split()
# Initialize the stemmer
stemmer = PorterStemmer()
# Stem the tokens
stemmed_tokens = [stemmer.stem(word) for word in tokens]
print("Original Tokens:")
print(tokens)
print("\\nStemmed Tokens:")
print(stemmed_tokens)
Output:
Original Tokens:
['Stemming', 'helps', 'in', 'reducing', 'words', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'beneficial', 'for', 'text', 'processing.']
Stemmed Tokens:
['stem', 'help', 'in', 'reduc', 'word', 'to', 'their', 'root', 'form,', 'which', 'can', 'be', 'benefici', 'for', 'text', 'process.']
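The Porter stemmer used above is the classic choice; NLTK also provides the Snowball ("Porter2") stemmer, which supports several languages and applies slightly different rules. A short side-by-side sketch (exact outputs may differ between the two stemmers):
from nltk.stem import PorterStemmer, SnowballStemmer
porter = PorterStemmer()
snowball = SnowballStemmer('english')
for word in ["reducing", "beneficial", "processing"]:
    # Print the word alongside both stemmed forms for comparison
    print(word, porter.stem(word), snowball.stem(word))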
Exercise 3: Lemmatization
Task: Use the nltk library to perform lemmatization on the following text: "Lemmatization is the process of reducing words to their base or root form."
Solution:
import nltk
from nltk.stem import WordNetLemmatizer
nltk.download('wordnet')
# Sample text
text = "Lemmatization is the process of reducing words to their base or root form."
# Tokenize the text
tokens = text.split()
# Initialize the lemmatizer
lemmatizer = WordNetLemmatizer()
# Lemmatize the tokens
lemmatized_tokens = [lemmatizer.lemmatize(word) for word in tokens]
print("Original Tokens:")
print(tokens)
print("\\nLemmatized Tokens:")
print(lemmatized_tokens)
Output:
Original Tokens:
['Lemmatization', 'is', 'the', 'process', 'of', 'reducing', 'words', 'to', 'their', 'base', 'or', 'root', 'form.']
Lemmatized Tokens:
['Lemmatization', 'is', 'the', 'process', 'of', 'reducing', 'word', 'to', 'their', 'base', 'or', 'root', 'form.']
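Notice that 'is' and 'reducing' come through unchanged: lemmatize() treats every word as a noun unless a part-of-speech tag is supplied. Passing pos='v' resolves verb forms to their base form, as this short sketch shows:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
# Default POS is noun, so verb forms pass through unchanged
print(lemmatizer.lemmatize("reducing"))           # reducing
# Treat the word as a verb to get its base form
print(lemmatizer.lemmatize("reducing", pos="v"))  # reduce
print(lemmatizer.lemmatize("is", pos="v"))        # be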
Exercise 4: Regular Expressions
Task: Use regular expressions to extract all the dates in the format "YYYY-MM-DD" from the following text: "The project started on 2021-01-15 and ended on 2021-12-31."
Solution:
import re
# Sample text
text = "The project started on 2021-01-15 and ended on 2021-12-31."
# Define a regex pattern to match dates in the format YYYY-MM-DD
pattern = r"\\b\\d{4}-\\d{2}-\\d{2}\\b"
# Use re.findall() to find all matches
dates = re.findall(pattern, text)
print("Extracted Dates:")
print(dates)
Output:
Extracted Dates:
['2021-01-15', '2021-12-31']
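The matches are plain strings; a common next step, shown here as a sketch, is to parse them into datetime.date objects so they can be compared or sorted:
import re
from datetime import datetime
text = "The project started on 2021-01-15 and ended on 2021-12-31."
pattern = r"\b\d{4}-\d{2}-\d{2}\b"
# Convert each matched string into a date object
dates = [datetime.strptime(match, "%Y-%m-%d").date() for match in re.findall(pattern, text)]
print(dates)  # [datetime.date(2021, 1, 15), datetime.date(2021, 12, 31)]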
Exercise 5: Word Tokenization
Task: Use the nltk library to perform word tokenization on the following text: "Tokenization is the first step in text preprocessing."
Solution:
import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize
# Sample text
text = "Tokenization is the first step in text preprocessing."
# Perform word tokenization
tokens = word_tokenize(text)
print("Word Tokens:")
print(tokens)
Output:
Word Tokens:
['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing', '.']
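If punctuation tokens such as the trailing '.' are unwanted, NLTK's RegexpTokenizer can tokenize on word characters only; a brief sketch:
from nltk.tokenize import RegexpTokenizer
# Keep runs of word characters and drop punctuation entirely
tokenizer = RegexpTokenizer(r"\w+")
text = "Tokenization is the first step in text preprocessing."
print(tokenizer.tokenize(text))
# ['Tokenization', 'is', 'the', 'first', 'step', 'in', 'text', 'preprocessing']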
Exercise 6: Sentence Tokenization
Task: Use the nltk library to perform sentence tokenization on the following text: "Tokenization is essential. It breaks down text into smaller units."
Solution:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize
# Sample text
text = "Tokenization is essential. It breaks down text into smaller units."
# Perform sentence tokenization
sentences = sent_tokenize(text)
print("Sentences:")
print(sentences)
Output:
Sentences:
['Tokenization is essential.', 'It breaks down text into smaller units.']
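The punkt model behind sent_tokenize does more than split on periods; with the pretrained English model it generally recognizes common abbreviations such as "Dr." and does not break the sentence there. A quick check, with the expected result under that assumption:
from nltk.tokenize import sent_tokenize
text = "Dr. Smith teaches NLP at the university. Her lectures cover tokenization."
print(sent_tokenize(text))
# Expected: ['Dr. Smith teaches NLP at the university.', 'Her lectures cover tokenization.']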
Exercise 7: Character Tokenization
Task: Write a Python function that performs character tokenization on the input text: "Character tokenization is useful for certain tasks."
Solution:
def character_tokenization(text):
    # Perform character tokenization
    characters = list(text)
    return characters
# Sample text
text = "Character tokenization is useful for certain tasks."
# Tokenize the text into characters
characters = character_tokenization(text)
print("Character Tokens:")
print(characters)
Output:
Character Tokens:
['C', 'h', 'a', 'r', 'a', 'c', 't', 'e', 'r', ' ', 't', 'o', 'k', 'e', 'n', 'i', 'z', 'a', 't', 'i', 'o', 'n', ' ', 'i', 's', ' ', 'u', 's', 'e', 'f', 'u', 'l', ' ', 'f', 'o', 'r', ' ', 'c', 'e', 'r', 't', 'a', 'i', 'n', ' ', 't', 'a', 's', 'k', 's', '.']
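For some tasks only the non-space characters matter; a small variation of the function, shown as a sketch, drops whitespace while tokenizing:
def character_tokenization_no_spaces(text):
    # Keep every character except whitespace
    return [ch for ch in text if not ch.isspace()]
print(character_tokenization_no_spaces("Character tokenization is useful for certain tasks."))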
These practical exercises provide hands-on experience with different aspects of text preprocessing, including stop word removal, stemming, lemmatization, regular expressions, and tokenization. Each exercise is designed to reinforce the concepts discussed in the chapter and help you become proficient in implementing these techniques using Python.