Natural Language Processing with Python

Chapter 3: Basic Text Processing

3.5. Practical Exercises of Chapter 3: Basic Text Processing

Exercise 1: Text Cleaning

Take a paragraph of text and write a Python program to remove stopwords, perform stemming, and lemmatization. Compare the results of stemming and lemmatization - how do they differ?

Example:

For this exercise, you can use a text of your choice:

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Download the required NLTK data once, if not already present
nltk.download('stopwords', quiet=True)
nltk.download('punkt', quiet=True)
nltk.download('wordnet', quiet=True)

text = "The quick brown fox jumps over the lazy dog."

# Remove stopwords
stop_words = set(stopwords.words('english'))
word_tokens = word_tokenize(text)
# NLTK's stopword list is lowercase, so compare case-insensitively
filtered_text = [w for w in word_tokens if w.lower() not in stop_words]
print("Text after stopword removal: ", ' '.join(filtered_text))

# Perform stemming
ps = PorterStemmer()
stemmed_text = [ps.stem(w) for w in filtered_text]
print("Text after stemming: ", ' '.join(stemmed_text))

# Perform lemmatization
lemmatizer = WordNetLemmatizer()
lemmatized_text = [lemmatizer.lemmatize(w) for w in filtered_text]
print("Text after lemmatization: ", ' '.join(lemmatized_text))
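A quick way to see the difference the exercise asks about: the stemmer simply chops suffixes by rule, so it can produce stems that are not real words, while the lemmatizer looks words up in WordNet and returns dictionary forms. A small self-contained sketch (stemmer only, since it needs no downloaded data; the word list is just an illustration):

```python
from nltk.stem import PorterStemmer

# Words where the Porter stemmer produces non-dictionary stems;
# a lemmatizer would instead return real words such as "study".
ps = PorterStemmer()
for word in ["studies", "relational", "running"]:
    print(word, "->", ps.stem(word))
```

Running this shows stems like "studi" and "relat", which a lemmatizer would never emit.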

Exercise 2: Regular Expression

Write regular expressions to extract all email addresses, phone numbers, and dates from a piece of text.

Example:

To extract emails, phone numbers, and dates from a text, we can use Python's re module:

import re

text = """
John Doe
123 Main St.
Anytown, USA 12345
john.doe@example.com
(123) 456-7890
Meeting on 2023-05-12
"""

# Extract email addresses
emails = re.findall(r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b', text)
print("Emails: ", emails)

# Extract phone numbers
phone_numbers = re.findall(r'\(?\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}', text)
print("Phone Numbers: ", phone_numbers)

# Extract dates
dates = re.findall(r'\b\d{4}[-]\d{2}[-]\d{2}\b', text)
print("Dates: ", dates)
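The date pattern above only matches ISO-style YYYY-MM-DD dates. If the text may also contain US-style dates, one illustrative option is to combine both formats in an alternation (the pattern and sample below are assumptions, not from the exercise text):

```python
import re

# Illustrative pattern covering ISO (YYYY-MM-DD) and US (M/D/YYYY) dates
date_pattern = r'\b(?:\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{4})\b'

sample = "Kickoff on 2023-05-12, review on 6/1/2023."
print(re.findall(date_pattern, sample))
```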

Exercise 3: Tokenization

Take a paragraph of text and write a Python program to perform word, sentence, and subword tokenization. When might each type of tokenization be beneficial?

Example:

You can use a text of your choice for this exercise:

import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

# Download the 'punkt' tokenizer models once, if not already present
nltk.download('punkt', quiet=True)

text = "Natural Language Processing is fascinating. It has many applications."

# Perform word tokenization
word_tokens = word_tokenize(text)
print("Word tokens: ", word_tokens)

# Perform sentence tokenization
sentence_tokens = sent_tokenize(text)
print("Sentence tokens: ", sentence_tokens)

# For subword tokenization, we need a library that supports it, such as BPEmb
# This will require internet access and may take a while to download the model
from bpemb import BPEmb
bpemb_en = BPEmb(lang="en", vs=5000)
subword_tokens = bpemb_en.encode(text)
print("Subword tokens: ", subword_tokens)
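If downloading NLTK models or BPEmb is not an option, word and sentence tokenization can be roughly approximated with the standard library's re module. This is only a sketch: it ignores abbreviations, contractions, and the other edge cases trained tokenizers handle.

```python
import re

text = "Natural Language Processing is fascinating. It has many applications."

# Rough word tokenization: runs of word characters, plus standalone punctuation
words = re.findall(r"\w+|[^\w\s]", text)

# Rough sentence tokenization: split after sentence-ending punctuation
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

print(words)
print(sentences)
```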
