Chapter 8: Text Summarization
Practical Exercises
Exercise 1: Extractive Summarization with NLTK
Task: Perform extractive summarization on the following text using term frequency:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Preprocess the text
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))
def preprocess_sentence(sentence):
    # Lowercase, tokenize, and keep only alphanumeric non-stop-words
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return words
# Sentence scoring based on term frequency
def score_sentences(sentences):
    sentence_scores = []
    # Count each content word across the whole text
    word_frequencies = FreqDist([word for sentence in sentences for word in preprocess_sentence(sentence)])
    for sentence in sentences:
        words = preprocess_sentence(sentence)
        # A sentence's score is the sum of the frequencies of its words
        sentence_score = sum(word_frequencies[word] for word in words)
        sentence_scores.append((sentence, sentence_score))
    return sentence_scores
# Select top-ranked sentences
def select_sentences(sentence_scores, num_sentences=2):
    sentence_scores.sort(key=lambda x: x[1], reverse=True)
    selected_sentences = [sentence[0] for sentence in sentence_scores[:num_sentences]]
    return selected_sentences
# Generate summary
sentence_scores = score_sentences(sentences)
summary_sentences = select_sentences(sentence_scores)
summary = ' '.join(summary_sentences)
print("Summary:")
print(summary)
Output:
Summary:
Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It involves algorithms and statistical models to perform tasks without explicit instructions.
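The term-frequency score used in this exercise adds up the frequencies of every content word in a sentence, so longer sentences tend to score higher. The sketch below reproduces that scoring with only the standard library and adds an optional length normalization for comparison. The small STOP_WORDS set, the regex tokenizer, and the normalize flag are illustrative assumptions, not part of NLTK, so the numbers may differ from those obtained with NLTK's full stop-word list.

```python
import re
from collections import Counter

# Illustrative stop-word list (an assumption; NLTK's English list is much longer).
STOP_WORDS = {"is", "a", "of", "it", "and", "to", "in", "such", "as", "on"}

def tokenize(sentence):
    """Lowercase, keep alphanumeric tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", sentence.lower())
    return [token for token in tokens if token not in STOP_WORDS]

def score_sentences(sentences, normalize=False):
    """Return (sentence, score) pairs based on summed term frequencies.

    With normalize=True the sum is divided by the number of content words,
    which removes the advantage that long sentences otherwise have.
    """
    frequencies = Counter(word for s in sentences for word in tokenize(s))
    scored = []
    for sentence in sentences:
        words = tokenize(sentence)
        total = sum(frequencies[word] for word in words)
        if normalize and words:
            total /= len(words)
        scored.append((sentence, total))
    return scored

sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "It involves algorithms and statistical models to perform tasks without explicit instructions.",
    "Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving.",
    "It relies on patterns and inference instead of predefined rules.",
]

raw_scores = score_sentences(sentences)
normalized_scores = score_sentences(sentences, normalize=True)

for (sentence, raw), (_, norm) in zip(raw_scores, normalized_scores):
    print(f"raw={raw:>2}  normalized={norm:.2f}  {sentence[:40]}...")
```

With the raw sum, the third sentence ranks first because it contains the most content words. With normalization, the first sentence, whose words are mostly repeated terms, moves to the top.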
Exercise 2: Extractive Summarization with TextRank
Task: Perform extractive summarization on the following text using the TextRank algorithm:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for 'punkt_tab' instead of 'punkt'
nltk.download('stopwords')
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Preprocess the text
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))
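def preprocess_sentence(sentence):
    # Lowercase, tokenize, and keep only alphanumeric non-stop-words
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return words
# Cosine similarity between two tokenized sentences
def sentence_similarity(words1, words2):
    # cosine_distance expects numeric vectors, so convert the token lists to word counts
    vocabulary = sorted(set(words1 + words2))
    vector1 = [words1.count(word) for word in vocabulary]
    vector2 = [words2.count(word) for word in vocabulary]
    if not any(vector1) or not any(vector2):
        return 0.0  # a sentence with no content words is similar to nothing
    return 1 - cosine_distance(vector1, vector2)
# Build sentence similarity matrix
def build_similarity_matrix(sentences):
    similarity_matrix = np.zeros((len(sentences), len(sentences)))
    tokenized = [preprocess_sentence(sentence) for sentence in sentences]
    for i, words1 in enumerate(tokenized):
        for j, words2 in enumerate(tokenized):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(words1, words2)
    return similarity_matrix
# Apply TextRank algorithm
def textrank(sentences, num_sentences=2):
    similarity_matrix = build_similarity_matrix(sentences)
    # Sentences are nodes; similarities are edge weights
    similarity_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(similarity_graph)
    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)), reverse=True)
    selected_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]
    return selected_sentences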
# Generate summary
summary_sentences = textrank(sentences)
summary = ' '.join(summary_sentences)
print("Summary:")
print(summary)
Output:
Summary:
Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. Machine learning is a subset of artificial intelligence.
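Each entry of the similarity matrix in this exercise is a cosine similarity between two sentences represented as bag-of-words count vectors. The following standard-library sketch shows that computation on its own; the function name cosine_similarity and the hand-written token lists are illustrative assumptions rather than NLTK or NetworkX APIs.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    """Cosine similarity of two token lists treated as bag-of-words vectors."""
    counts1, counts2 = Counter(words1), Counter(words2)
    # Only words present in both sentences contribute to the dot product.
    dot = sum(counts1[word] * counts2[word] for word in counts1.keys() & counts2.keys())
    norm1 = math.sqrt(sum(count * count for count in counts1.values()))
    norm2 = math.sqrt(sum(count * count for count in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0  # an empty sentence is similar to nothing
    return dot / (norm1 * norm2)

# Content words of three sentences from the sample text (tokenized by hand).
first = ["machine", "learning", "subset", "artificial", "intelligence"]
second = ["involves", "algorithms", "statistical", "models", "perform",
          "tasks", "without", "explicit", "instructions"]
third = ["machine", "learning", "widely", "used", "various", "applications",
         "image", "recognition", "natural", "language", "processing",
         "autonomous", "driving"]

print(round(cosine_similarity(first, third), 4))   # shares "machine" and "learning"
print(cosine_similarity(first, second))            # no shared content words
```

Only the first and third sentences share content words, so they are the only pair connected by a non-zero edge in the graph, which is why TextRank favors them.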
Exercise 3: Abstractive Summarization with BART
Task: Perform abstractive summarization on the following text using the BART model:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
from transformers import BartForConditionalGeneration, BartTokenizer
# Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Tokenize and encode the text (BART takes the raw text; the "summarize: " prefix is a T5 convention)
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)
# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)
Output (the exact wording may vary with the model checkpoint and library version):
Summary:
Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.
Exercise 4: Abstractive Summarization with T5
Task: Perform abstractive summarization on the following text using the T5 model:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Load the pre-trained T5 model and tokenizer (T5Tokenizer requires the sentencepiece package)
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Tokenize and encode the text
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)
Output (the exact wording may vary with the model checkpoint and library version):
Summary:
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference.
Exercise 5: Evaluating Abstractive Summarization
Task: Compare the summaries generated by BART and T5 for the following text and discuss which one provides a more coherent and informative summary:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
First, generate summaries using the BART and T5 models as shown in Exercises 3 and 4. Then, compare the summaries:
BART Summary:
Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.
T5 Summary:
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference.
Discussion:
Both summaries provide a coherent and informative overview of the original text. The BART summary retains slightly more detail: it keeps the phrase "instead of predefined rules," which makes the contrast with rule-based systems explicit. The T5 summary is slightly more concise but still captures the key points. Which one is preferable depends on the requirements: the BART summary when completeness matters, the T5 summary when brevity matters.
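A qualitative comparison like this one can be complemented with a simple overlap measure. The sketch below computes clipped unigram precision, recall, and F1 between a candidate and a reference text. It is a simplified, ROUGE-1-style calculation written for illustration, not the official ROUGE implementation, and using the source text as the reference is an assumption made here because no human-written reference summary is available.

```python
import re
from collections import Counter

def unigrams(text):
    """Count lowercase alphanumeric tokens in a text."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def unigram_overlap(candidate, reference):
    """Return (precision, recall, f1) of clipped unigram overlap."""
    cand, ref = unigrams(candidate), unigrams(reference)
    overlap = sum((cand & ref).values())  # each word counts at most as often as in both
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

source = (
    "Machine learning is a subset of artificial intelligence. It involves algorithms "
    "and statistical models to perform tasks without explicit instructions. Machine "
    "learning is widely used in various applications such as image recognition, natural "
    "language processing, and autonomous driving. It relies on patterns and inference "
    "instead of predefined rules."
)
bart_summary = (
    "Machine learning, a subset of artificial intelligence, uses algorithms and "
    "statistical models to perform tasks without explicit instructions. It is widely "
    "used in image recognition, natural language processing, and autonomous driving, "
    "relying on patterns and inference instead of predefined rules."
)
t5_summary = (
    "Machine learning is a subset of artificial intelligence that uses algorithms and "
    "statistical models to perform tasks without explicit instructions. It is widely "
    "used in image recognition, natural language processing, and autonomous driving, "
    "relying on patterns and inference."
)

bart_scores = unigram_overlap(bart_summary, source)
t5_scores = unigram_overlap(t5_summary, source)
print("BART precision/recall/F1:", [round(value, 3) for value in bart_scores])
print("T5   precision/recall/F1:", [round(value, 3) for value in t5_scores])
```

Under this measure the BART summary has the higher recall, since it covers the words of the closing phrase that the T5 summary drops. Overlap scores only capture word coverage, so they support the discussion of coherence rather than replace it.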
These exercises provide hands-on experience with extractive and abstractive summarization techniques, reinforcing the concepts covered in this chapter.
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx
nltk.download('punkt')
nltk.download('stopwords')
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Preprocess the text
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))
def preprocess_sentence(sentence):
words = word_tokenize(sentence.lower())
words = [word for word in words if word.isalnum() and word not in stop_words]
return words
# Build sentence similarity matrix
def build_similarity_matrix(sentences):
similarity_matrix = np.zeros((len(sentences), len(sentences)))
for i, sentence1 in enumerate(sentences):
for j, sentence2 in enumerate(sentences):
if i != j:
words1 = preprocess_sentence(sentence1)
words2 = preprocess_sentence(sentence2)
similarity_matrix[i][j] = 1 - cosine_distance(words1, words2)
return similarity_matrix
# Apply TextRank algorithm
def textrank(sentences, num_sentences=2):
similarity_matrix = build_similarity_matrix(sentences)
similarity_graph = nx.from_numpy_array(similarity_matrix)
scores = nx.pagerank(similarity_graph)
ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)), reverse=True)
selected_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]
return selected_sentences
# Generate summary
summary_sentences = textrank(sentences)
summary = ' '.join(summary_sentences)
print("Summary:")
print(summary)
Output:
Summary:
Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. Machine learning is a subset of artificial intelligence.
Exercise 3: Abstractive Summarization with BART
Task: Perform abstractive summarization on the following text using the BART model:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
from transformers import BartForConditionalGeneration, BartTokenizer
# Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Tokenize and encode the text
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)
Output:
Summary:
Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.
Exercise 4: Abstractive Summarization with T5
Task: Perform abstractive summarization on the following text using the T5 model:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
from transformers import T5ForConditionalGeneration, T5Tokenizer
# Load the pre-trained T5 model and tokenizer
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)
# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""
# Tokenize and encode the text
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)
# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)
print("Summary:")
print(summary)
Output:
Summary:
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference.
Exercise 5: Evaluating Abstractive Summarization
Task: Compare the summaries generated by BART and T5 for the following text and discuss which one provides a more coherent and informative summary:
"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."
Solution:
First, generate summaries using the BART and T5 models as shown in Exercises 3 and 4. Then, compare the summaries:
BART Summary:
Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.
T5 Summary:
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous
driving, relying on patterns and inference.
Discussion:
Both summaries generated by BART and T5 provide a coherent and informative overview of the original text. However, the BART summary includes a bit more detail by mentioning "instead of predefined rules," which adds clarity to the explanation. The T5 summary is slightly more concise but still effectively captures the key points. Depending on the specific requirements for conciseness or detail, either summary could be considered superior.
These exercises provide hands-on experience with extractive and abstractive summarization techniques, reinforcing the concepts covered in this chapter.