Chapter 1: Introduction to NLP and Its Evolution

1.4 Practical Exercises for Chapter 1

Now that we have explored the fundamentals of NLP, its historical development, and traditional approaches, let's consolidate your understanding with practical exercises. Each exercise is designed to help you apply the concepts covered in this chapter. Take your time working through them, and consult the solutions when needed.

Exercise 1: Tokenization and Stopword Removal

Task:

Write a Python program that tokenizes a given sentence into words and removes common stopwords using the NLTK library.

Example input:

"I enjoy learning about natural language processing."

Steps:

Tokenize the sentence.
Remove the English stopwords.

Solution:

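# Requires the NLTK tokenizer and stopword data, installed once with e.g.
# nltk.download('punkt') and nltk.download('stopwords')
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords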

# Input sentence
sentence = "I enjoy learning about natural language processing."

# Tokenize
tokens = word_tokenize(sentence)

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered_tokens)

Expected output:

Original Tokens: ['I', 'enjoy', 'learning', 'about', 'natural', 'language', 'processing', '.']
Filtered Tokens: ['enjoy', 'learning', 'natural', 'language', 'processing', '.']
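
To see the two steps in isolation, without NLTK's downloadable resources, here is a minimal sketch. It assumes a regular expression as a stand-in tokenizer and a tiny hand-picked stopword set; `simple_tokenize`, `remove_stopwords`, and `SMALL_STOPWORDS` are illustrative names, not part of NLTK, and NLTK's real tokenizer and stopword list are considerably richer.

```python
import re

# Hypothetical, deliberately tiny stopword set (NLTK's English list is much longer)
SMALL_STOPWORDS = {"i", "about", "the", "a", "an", "is", "and"}

def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

def remove_stopwords(tokens, stop_words=SMALL_STOPWORDS):
    """Drop every token whose lowercase form is in the stopword set."""
    return [tok for tok in tokens if tok.lower() not in stop_words]

tokens = simple_tokenize("I enjoy learning about natural language processing.")
filtered = remove_stopwords(tokens)

print("Original Tokens:", tokens)
print("Filtered Tokens:", filtered)
```

Punctuation is not a stopword, so the final period survives this filter. If you also want to drop punctuation, one option is to add a `tok.isalpha()` condition to the list comprehension.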

Exercise 2: Rule-Based Sentiment Analysis

Task:

Create a rule-based sentiment analyzer that classifies a sentence as Positive, Negative, or Neutral based on predefined lists of positive and negative words.

Example input:

"This movie was excellent and truly inspiring."

Solution:

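import string

def rule_based_sentiment(sentence):
    positive_words = ["excellent", "great", "inspiring", "good", "amazing"]
    negative_words = ["bad", "terrible", "poor", "awful", "sad"]

    # Lowercase, split on whitespace, and strip surrounding punctuation
    # so that "inspiring." still matches "inspiring"
    words = [word.strip(string.punctuation) for word in sentence.lower().split()]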

    # Count positive and negative words
    positive_count = sum(1 for word in words if word in positive_words)
    negative_count = sum(1 for word in words if word in negative_words)

    # Determine sentiment
    if positive_count > negative_count:
        return "Positive"
    elif negative_count > positive_count:
        return "Negative"
    else:
        return "Neutral"

# Test the analyzer
sentence = "This movie was excellent and truly inspiring."
print("Sentiment:", rule_based_sentiment(sentence))

Expected output:

Sentiment: Positive

Exercise 3: Building a Bag-of-Words Model

Task:

Using scikit-learn's CountVectorizer, build a Bag-of-Words (BoW) representation for the following sentences:

"I love programming in Python."
"Python is an excellent programming language."
"Programming in Python is fun."

Steps:

Tokenize the sentences and create a vocabulary.
Represent each sentence as a vector.

Solution:

from sklearn.feature_extraction.text import CountVectorizer

# Sample sentences
documents = [
    "I love programming in Python.",
    "Python is an excellent programming language.",
    "Programming in Python is fun."
]

# Create a BoW representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform(documents)

# Display vocabulary and matrix
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Matrix:\n", bow_matrix.toarray())

Expected output:

Vocabulary: {'love': 6, 'programming': 7, 'in': 3, 'python': 8, 'is': 4, 'an': 0, 'excellent': 1, 'language': 5, 'fun': 2}
BoW Matrix:
 [[0 0 0 1 0 0 1 1 1]
  [1 1 0 0 1 1 0 1 1]
  [0 0 1 1 1 0 0 1 1]]
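
The two steps (build a vocabulary, then count terms per sentence) can also be sketched by hand. The version below is an approximation, not scikit-learn's implementation: it assumes the same defaults CountVectorizer uses (lowercasing, and tokens of at least two word characters, which is why "I" is dropped). The names `tokenize` and `to_vector` are illustrative.

```python
import re
from collections import Counter

documents = [
    "I love programming in Python.",
    "Python is an excellent programming language.",
    "Programming in Python is fun.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]

# The vocabulary is the alphabetically sorted set of all distinct tokens
vocabulary = sorted({tok for doc in tokenized for tok in doc})

def to_vector(tokens):
    """Count how often each vocabulary term occurs in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocabulary]

bow = [to_vector(doc) for doc in tokenized]

print("Vocabulary:", vocabulary)
for row in bow:
    print(row)
```

Each column position corresponds to one vocabulary term in alphabetical order, so the vectors have nine entries, one per distinct token across the three sentences.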

Exercise 4: Generating N-Grams

Task:

Write a Python program to generate bigrams from the given text:

"Natural language processing is fascinating."

Steps:

Tokenize the sentence into words.
Generate bigrams (n=2).

Solution:

from nltk.util import ngrams
from nltk.tokenize import word_tokenize

# Input sentence
sentence = "Natural language processing is fascinating."

# Tokenize and generate bigrams
tokens = word_tokenize(sentence)
bigrams = list(ngrams(tokens, 2))

print("Bigrams:", bigrams)

Expected output:

Bigrams: [('Natural', 'language'), ('language', 'processing'), ('processing', 'is'), ('is', 'fascinating'), ('fascinating', '.')]
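
An n-gram is simply a sliding window of n consecutive tokens, which a few lines of plain Python can illustrate. This is a sketch rather than NLTK's implementation: `make_ngrams` is a hypothetical helper, and the token list is written out by hand on the assumption that the tokenizer separates the final period from "fascinating".

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Tokens written out by hand (assumed tokenizer output for the sentence)
tokens = ["Natural", "language", "processing", "is", "fascinating", "."]

bigrams = make_ngrams(tokens, 2)
trigrams = make_ngrams(tokens, 3)

print("Bigrams:", bigrams)
print("Trigrams:", trigrams)
```

A sequence of k tokens yields k - n + 1 n-grams, so the six tokens here produce five bigrams and four trigrams.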

Exercise 5: Calculating TF-IDF

Task:

Use scikit-learn's TfidfVectorizer to calculate the TF-IDF scores for the following sentences:

"I love Python programming."
"Python is a great programming language."
"Programming in Python is fun."

Solution:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample sentences
documents = [
    "I love Python programming.",
    "Python is a great programming language.",
    "Programming in Python is fun."
]

# Calculate TF-IDF
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)

# Display vocabulary and TF-IDF matrix
print("TF-IDF Vocabulary:", vectorizer.vocabulary_)
print("TF-IDF Matrix:\n", tfidf_matrix.toarray())

Expected output (matrix values rounded to three decimals):

TF-IDF Vocabulary: {'love': 5, 'python': 7, 'programming': 6, 'is': 3, 'great': 1, 'language': 4, 'in': 2, 'fun': 0}
TF-IDF Matrix:
 [[0.    0.    0.    0.    0.    0.767 0.453 0.453]
  [0.    0.552 0.    0.42  0.552 0.    0.326 0.326]
  [0.552 0.    0.552 0.42  0.    0.    0.326 0.326]]
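
To see where TF-IDF scores come from, it can help to compute them by hand. The sketch below assumes TfidfVectorizer's default settings: raw term counts, a smoothed inverse document frequency idf(t) = ln((1 + n) / (1 + df(t))) + 1, and L2 normalization of each row. It is an illustration of that formula, not scikit-learn's code, and the helper names `tokenize` and `tfidf_vector` are hypothetical.

```python
import math
import re
from collections import Counter

documents = [
    "I love Python programming.",
    "Python is a great programming language.",
    "Programming in Python is fun.",
]

def tokenize(text):
    """Lowercase the text and keep tokens of two or more word characters."""
    return re.findall(r"\b\w\w+\b", text.lower())

tokenized = [tokenize(doc) for doc in documents]
vocabulary = sorted({tok for doc in tokenized for tok in doc})
n_docs = len(documents)

# Document frequency: in how many documents does each term appear?
df = {term: sum(term in doc for doc in tokenized) for term in vocabulary}

# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + df[term])) + 1 for term in vocabulary}

def tfidf_vector(tokens):
    """Weight raw counts by idf, then scale the row to unit L2 length."""
    counts = Counter(tokens)
    raw = [counts[term] * idf[term] for term in vocabulary]
    norm = math.sqrt(sum(value * value for value in raw))
    return [value / norm for value in raw]

tfidf = [tfidf_vector(doc) for doc in tokenized]

print("Vocabulary:", vocabulary)
for row in tfidf:
    print([round(value, 3) for value in row])
```

Terms that occur in every sentence ("python", "programming") get the lowest idf and therefore the smallest weights, while terms unique to one sentence ("love", "great", "fun") get the largest.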

These exercises are designed to reinforce your understanding of tokenization, rule-based methods, Bag-of-Words, n-grams, and TF-IDF. These foundational techniques are essential building blocks for the more advanced NLP methods discussed in later chapters. Keep experimenting with different inputs and datasets to deepen your understanding!
