Practical Exercises

Exercise 1: Extractive Summarization with NLTK

Task: Perform extractive summarization on the following text using term frequency:

"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."

Solution:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.probability import FreqDist

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases
nltk.download('stopwords')

# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""

# Preprocess the text
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

def preprocess_sentence(sentence):
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return words

# Sentence scoring based on term frequency
def score_sentences(sentences):
    sentence_scores = []
    word_frequencies = FreqDist([word for sentence in sentences for word in preprocess_sentence(sentence)])

    for sentence in sentences:
        words = preprocess_sentence(sentence)
        sentence_score = sum(word_frequencies[word] for word in words)
        sentence_scores.append((sentence, sentence_score))

    return sentence_scores

# Select top-ranked sentences
def select_sentences(sentence_scores, num_sentences=2):
    sentence_scores.sort(key=lambda x: x[1], reverse=True)
    selected_sentences = [sentence[0] for sentence in sentence_scores[:num_sentences]]
    return selected_sentences

# Generate summary
sentence_scores = score_sentences(sentences)
summary_sentences = select_sentences(sentence_scores)
summary = ' '.join(summary_sentences)

print("Summary:")
print(summary)

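Output:

Summary:
Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It involves algorithms and statistical models to perform tasks without explicit instructions.

The second sentence of the text outranks the first (a score of 9 versus 7) because summing raw frequencies rewards sentences that contain more content words.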
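Summing raw term frequencies tends to favor long sentences, since every additional content word adds to the total. One common variation is to divide each score by the number of content words and to emit the selected sentences in their original order. The sketch below illustrates that idea using only the standard library; the regex tokenizer, the tiny `STOP_WORDS` set, and the function names are assumptions made for illustration, not part of the NLTK solution above.

```python
import re
from collections import Counter

# Hypothetical stand-in for NLTK's stop word list, kept tiny for illustration.
STOP_WORDS = {"is", "a", "of", "it", "and", "to", "in", "such", "as", "on"}

def tokenize(sentence):
    """Lowercase a sentence and keep alphanumeric tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", sentence.lower()) if w not in STOP_WORDS]

def normalized_scores(sentences):
    """Score each sentence by its average term frequency per content word."""
    tokenized = [tokenize(s) for s in sentences]
    frequencies = Counter(word for words in tokenized for word in words)
    scores = []
    for words in tokenized:
        total = sum(frequencies[w] for w in words)
        scores.append(total / len(words) if words else 0.0)
    return scores

def summarize(sentences, num_sentences=2):
    """Pick the top-scoring sentences and return them in their original order."""
    scores = normalized_scores(sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:num_sentences])
    return [sentences[i] for i in chosen]

sentences = [
    "Machine learning is a subset of artificial intelligence.",
    "It involves algorithms and statistical models to perform tasks without explicit instructions.",
    "Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving.",
    "It relies on patterns and inference instead of predefined rules.",
]
print(" ".join(summarize(sentences)))
```

With this normalization, the two sentences that mention "machine learning" rank highest because they contain the only repeated terms, regardless of their length.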

Exercise 2: Extractive Summarization with TextRank

Task: Perform extractive summarization on the following text using the TextRank algorithm:

"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."

Solution:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.corpus import stopwords
from nltk.cluster.util import cosine_distance
import numpy as np
import networkx as nx

nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data expected by newer NLTK releases
nltk.download('stopwords')

# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""

# Preprocess the text
sentences = sent_tokenize(text)
stop_words = set(stopwords.words('english'))

def preprocess_sentence(sentence):
    words = word_tokenize(sentence.lower())
    words = [word for word in words if word.isalnum() and word not in stop_words]
    return words

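# Cosine similarity between two token lists, computed on word-count vectors
# (cosine_distance expects numeric vectors, not lists of strings)
def sentence_similarity(words1, words2):
    vocabulary = sorted(set(words1 + words2))
    vector1 = [words1.count(word) for word in vocabulary]
    vector2 = [words2.count(word) for word in vocabulary]
    # Guard against empty vectors, which would cause a division by zero
    if not any(vector1) or not any(vector2):
        return 0.0
    return 1 - cosine_distance(vector1, vector2)

# Build sentence similarity matrix
def build_similarity_matrix(sentences):
    tokenized = [preprocess_sentence(sentence) for sentence in sentences]
    similarity_matrix = np.zeros((len(sentences), len(sentences)))

    for i, words1 in enumerate(tokenized):
        for j, words2 in enumerate(tokenized):
            if i != j:
                similarity_matrix[i][j] = sentence_similarity(words1, words2)

    return similarity_matrix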

# Apply TextRank algorithm
def textrank(sentences, num_sentences=2):
    similarity_matrix = build_similarity_matrix(sentences)
    similarity_graph = nx.from_numpy_array(similarity_matrix)
    scores = nx.pagerank(similarity_graph)

    ranked_sentences = sorted(((scores[i], sentence) for i, sentence in enumerate(sentences)), reverse=True)
    selected_sentences = [sentence for score, sentence in ranked_sentences[:num_sentences]]

    return selected_sentences

# Generate summary
summary_sentences = textrank(sentences)
summary = ' '.join(summary_sentences)

print("Summary:")
print(summary)

Output:

Summary:
Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. Machine learning is a subset of artificial intelligence.
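To make the ranking step less of a black box, the sketch below reimplements the two ingredients of TextRank with the standard library only: cosine similarity over bag-of-words counts and a PageRank power iteration over the resulting similarity matrix. It is a simplified illustration, not the networkx implementation; the function names, the fixed iteration count, the 0.85 damping factor, and the pre-tokenized sample sentences are all assumptions.

```python
import math
from collections import Counter

def cosine_similarity(words1, words2):
    """Cosine similarity between two bag-of-words token lists."""
    counts1, counts2 = Counter(words1), Counter(words2)
    dot = sum(counts1[w] * counts2[w] for w in counts1)
    norm1 = math.sqrt(sum(c * c for c in counts1.values()))
    norm2 = math.sqrt(sum(c * c for c in counts2.values()))
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)

def pagerank(matrix, damping=0.85, iterations=100):
    """Power iteration on a non-negative similarity matrix."""
    n = len(matrix)
    ranks = [1.0 / n] * n
    row_sums = [sum(row) for row in matrix]
    for _ in range(iterations):
        # Nodes with no edges spread their rank evenly over all nodes.
        dangling = sum(ranks[i] for i in range(n) if row_sums[i] == 0)
        new_ranks = []
        for j in range(n):
            incoming = sum(
                ranks[i] * matrix[i][j] / row_sums[i]
                for i in range(n) if row_sums[i] > 0
            )
            new_ranks.append((1 - damping) / n + damping * (incoming + dangling / n))
        ranks = new_ranks
    return ranks

# Hypothetical pre-tokenized sentences (stop words already removed).
tokens = [
    ["machine", "learning", "subset", "artificial", "intelligence"],
    ["involves", "algorithms", "statistical", "models"],
    ["machine", "learning", "widely", "used", "applications"],
    ["relies", "patterns", "inference", "rules"],
]
n = len(tokens)
matrix = [
    [cosine_similarity(tokens[i], tokens[j]) if i != j else 0.0 for j in range(n)]
    for i in range(n)
]
ranks = pagerank(matrix)
print([round(r, 3) for r in ranks])
```

In this sample only the first and third sentences share any words, so they are the only connected nodes in the graph and end up with the highest, equal ranks. The other two sentences are isolated and receive only the baseline rank.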

Exercise 3: Abstractive Summarization with BART

Task: Perform abstractive summarization on the following text using the BART model:

"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."

Solution:

from transformers import BartForConditionalGeneration, BartTokenizer

# Load the pre-trained BART model and tokenizer
model_name = "facebook/bart-large-cnn"
model = BartForConditionalGeneration.from_pretrained(model_name)
tokenizer = BartTokenizer.from_pretrained(model_name)

# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""

# Tokenize and encode the text (no task prefix: "summarize: " is a T5 convention, not a BART one)
inputs = tokenizer.encode(text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:")
print(summary)

Output (exact wording may vary with the model and library version):

Summary:
Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.

Exercise 4: Abstractive Summarization with T5

Task: Perform abstractive summarization on the following text using the T5 model:

"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."

Solution:

from transformers import T5ForConditionalGeneration, T5Tokenizer

# Load the pre-trained T5 model and tokenizer (T5Tokenizer requires the sentencepiece package)
model_name = "t5-small"
model = T5ForConditionalGeneration.from_pretrained(model_name)
tokenizer = T5Tokenizer.from_pretrained(model_name)

# Sample text
text = """Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."""

# Tokenize and encode the text
inputs = tokenizer.encode("summarize: " + text, return_tensors="pt", max_length=512, truncation=True)

# Generate the summary
summary_ids = model.generate(inputs, max_length=100, min_length=30, length_penalty=2.0, num_beams=4, early_stopping=True)
summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)

print("Summary:")
print(summary)

Output (exact wording may vary with the model and library version):

Summary:
Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference.

Exercise 5: Evaluating Abstractive Summaries

Task: Compare the summaries generated by BART and T5 for the following text and discuss which one is more coherent and informative:

"Machine learning is a subset of artificial intelligence. It involves algorithms and statistical models to perform tasks without explicit instructions. Machine learning is widely used in various applications such as image recognition, natural language processing, and autonomous driving. It relies on patterns and inference instead of predefined rules."

Solution:
First, generate summaries with the BART and T5 models as shown in Exercises 3 and 4. Then compare the summaries:

BART summary:

Machine learning, a subset of artificial intelligence, uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference instead of predefined rules.

T5 summary:

Machine learning is a subset of artificial intelligence that uses algorithms and statistical models to perform tasks without explicit instructions. It is widely used in image recognition, natural language processing, and autonomous driving, relying on patterns and inference.

Discussion:
Both summaries give a coherent and informative overview of the original text. The BART summary retains slightly more detail by keeping the phrase "instead of predefined rules", which clarifies the explanation. The T5 summary is slightly more concise yet still captures the key points. Which one is better depends on whether the task prioritizes concision or detail.
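A qualitative comparison like this can be complemented with a rough numeric check. The sketch below computes clipped unigram overlap between each summary and the source text, reporting precision, recall, and F1. It is a simplified, ROUGE-1-style measure written from scratch for illustration, not the official ROUGE implementation, and the function names are assumptions. Scoring against the source text rather than a human-written reference summary is also a simplification.

```python
import re
from collections import Counter

def unigrams(text):
    """Lowercased alphanumeric tokens as a multiset."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def overlap_scores(reference, candidate):
    """Return (precision, recall, f1) of clipped unigram overlap."""
    ref, cand = unigrams(reference), unigrams(candidate)
    overlap = sum((ref & cand).values())  # each word counts at most as often as in both texts
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

source = (
    "Machine learning is a subset of artificial intelligence. It involves algorithms and "
    "statistical models to perform tasks without explicit instructions. Machine learning is "
    "widely used in various applications such as image recognition, natural language "
    "processing, and autonomous driving. It relies on patterns and inference instead of "
    "predefined rules."
)
bart_summary = (
    "Machine learning, a subset of artificial intelligence, uses algorithms and statistical "
    "models to perform tasks without explicit instructions. It is widely used in image "
    "recognition, natural language processing, and autonomous driving, relying on patterns "
    "and inference instead of predefined rules."
)
t5_summary = (
    "Machine learning is a subset of artificial intelligence that uses algorithms and "
    "statistical models to perform tasks without explicit instructions. It is widely used in "
    "image recognition, natural language processing, and autonomous driving, relying on "
    "patterns and inference."
)

for name, summary in [("BART", bart_summary), ("T5", t5_summary)]:
    precision, recall, f1 = overlap_scores(source, summary)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Recall measures how much of the source wording a summary retains, so it rewards detail, while precision penalizes words the source does not contain. Neither number captures coherence or factual accuracy, so scores like these only supplement a manual reading.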

These exercises provide hands-on experience with extractive and abstractive summarization techniques, reinforcing the concepts covered in this chapter.