Chapter 3: Basic Text Processing
3.5. Practical Exercises of Chapter 3: Basic Text Processing
Exercise 1: Text Cleaning
Take a paragraph of text and write a Python program that removes stopwords and performs both stemming and lemmatization. Compare the results of stemming and lemmatization: how do they differ?
Example:
For this exercise, you can use a text of your choice:
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK resources on first run (these calls are idempotent)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('wordnet')

text = "The quick brown fox jumps over the lazy dog."

# Remove stopwords (compare in lowercase, since NLTK's stopword list is lowercase)
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
filtered_text = [w for w in word_tokens if w.lower() not in stop_words]
print("Text after stopword removal:", ' '.join(filtered_text))

# Perform stemming (Porter's rules clip suffixes and may produce non-words)
ps = PorterStemmer()
stemmed_text = [ps.stem(w) for w in filtered_text]
print("Text after stemming:", ' '.join(stemmed_text))

# Perform lemmatization (WordNet lookup; words are treated as nouns by default)
lemmatizer = WordNetLemmatizer()
lemmatized_text = [lemmatizer.lemmatize(w) for w in filtered_text]
print("Text after lemmatization:", ' '.join(lemmatized_text))
Exercise 2: Regular Expressions
Write regular expressions to extract all email addresses, phone numbers, and dates from a piece of text.
Example:
To extract emails, phone numbers, and dates from a piece of text, we can use Python's re module:
import re
text = """
John Doe
123 Main St.
Anytown, USA 12345
john.doe@example.com
(123) 456-7890
Meeting on 2023-05-12
"""
# Extract email addresses ('[A-Za-z]' rather than '[A-Z|a-z]': inside a
# character class, '|' is a literal pipe, not alternation)
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print("Emails:", emails)

# Extract phone numbers (optional parentheses around the area code; digit
# groups separated by '-', '.', or whitespace)
phone_numbers = re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
print("Phone Numbers:", phone_numbers)

# Extract ISO-style YYYY-MM-DD dates
dates = re.findall(r'\b\d{4}-\d{2}-\d{2}\b', text)
print("Dates:", dates)
Exercise 3: Tokenization
Take a paragraph of text and write a Python program to perform word, sentence, and subword tokenization. When might each type of tokenization be beneficial?
Example:
You can use a text of your choice for this exercise:
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer models needed by word_tokenize and sent_tokenize

text = "Natural Language Processing is fascinating. It has many applications."
# Perform word tokenization
word_tokens = word_tokenize(text)
print("Word tokens: ", word_tokens)
# Perform sentence tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence tokens: ", sentence_tokens)
# For subword tokenization, we need a library that supports it, such as bpemb
# (pip install bpemb). The first run downloads the pretrained BPE model, so it
# requires internet access and may take a moment.
from bpemb import BPEmb

bpemb_en = BPEmb(lang="en", vs=5000)  # English model with a 5,000-subword vocabulary
subword_tokens = bpemb_en.encode(text)
print("Subword tokens:", subword_tokens)