Chapter 7: Topic Modeling
Practical Exercises
Exercise 1: Latent Semantic Analysis (LSA)
Task: Perform Latent Semantic Analysis (LSA) on the following text corpus and identify the top terms for each topic:
- "Data science is an interdisciplinary field."
- "Machine learning is a subset of data science."
- "Artificial intelligence is a broader concept than machine learning."
- "Deep learning is a subset of machine learning."
Solution:
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# Sample text corpus
corpus = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a subset of data science.",
    "Artificial intelligence is a broader concept than machine learning.",
    "Deep learning is a subset of machine learning."
]
# Create a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Apply LSA using TruncatedSVD
lsa = TruncatedSVD(n_components=2, random_state=42)
X_reduced = lsa.fit_transform(X)
# Print the terms and their corresponding components
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:5]
    print(f"Topic {i}:")
    for term, weight in sorted_terms:
        print(f" - {term}: {weight:.4f}")
Output:
Topic 0:
- machine: 0.5126
- learning: 0.5126
- is: 0.3561
- of: 0.2715
- science: 0.2121
Topic 1:
- data: 0.5462
- science: 0.5462
- interdisciplinary: 0.3583
- field: 0.3583
- is: -0.0000
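As a quick follow-up (a minimal sketch reusing the variables defined above; it is not part of the original exercise), you can inspect how much variance each LSA component captures and how each document projects onto the two topics through X_reduced:
# Sketch: component variance and document-topic projections
print("Explained variance ratio:", lsa.explained_variance_ratio_)
for doc, vec in zip(corpus, X_reduced):
    print(f"{doc[:45]:<45} -> {vec.round(3)}")
Based on the term weights above, the machine-learning sentences should load mainly on Topic 0, while the interdisciplinary data-science sentence should load on Topic 1.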
Exercise 2: Latent Dirichlet Allocation (LDA)
Task: Perform Latent Dirichlet Allocation (LDA) on the following text corpus and identify the top terms for each topic:
- "Natural language processing enables computers to understand human language."
- "Computer vision allows machines to interpret and make decisions based on visual data."
- "Robotics combines engineering and computer science to create intelligent machines."
- "Quantum computing leverages quantum mechanics to perform complex calculations."
Solution:
import gensim
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint
# Sample text corpus
corpus = [
    "Natural language processing enables computers to understand human language.",
    "Computer vision allows machines to interpret and make decisions based on visual data.",
    "Robotics combines engineering and computer science to create intelligent machines.",
    "Quantum computing leverages quantum mechanics to perform complex calculations."
]
# Tokenize the text with a simple lowercase whitespace split (no stop-word removal is applied here)
texts = [document.lower().split() for document in corpus]
# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)
# Convert each tokenized document to a bag-of-words representation
corpus_bow = [dictionary.doc2bow(text) for text in texts]
# Train the LDA model
lda_model = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=2, random_state=42, passes=10)
# Print the topics
print("Topics:")
pprint(lda_model.print_topics(num_topics=2, num_words=5))
Output:
Topics:
[(0,
'0.067*"language" + 0.067*"natural" + 0.067*"processing" + 0.067*"enables" + 0.067*"computers"'),
(1,
'0.070*"machines" + 0.070*"computer" + 0.070*"science" + 0.070*"engineering" + 0.070*"combines"')]
Exercise 3: Hierarchical Dirichlet Process (HDP)
Task: Run a Hierarchical Dirichlet Process (HDP) on the following text corpus and identify the top terms for each topic:
- "Climate change impacts global weather patterns."
- "Renewable energy sources reduce carbon emissions."
- "Biodiversity is essential for ecosystem balance."
- "Conservation efforts protect endangered species."
Solution:
import gensim
from gensim import corpora
from gensim.models import HdpModel
from pprint import pprint
# Sample text corpus
corpus = [
    "Climate change impacts global weather patterns.",
    "Renewable energy sources reduce carbon emissions.",
    "Biodiversity is essential for ecosystem balance.",
    "Conservation efforts protect endangered species."
]
# Tokenize the text with a simple lowercase whitespace split (no stop-word removal is applied here)
texts = [document.lower().split() for document in corpus]
# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)
# Convert each tokenized document to a bag-of-words representation
corpus_bow = [dictionary.doc2bow(text) for text in texts]
# Train the HDP model
hdp_model = HdpModel(corpus=corpus_bow, id2word=dictionary)
# Print the topics
print("Topics:")
pprint(hdp_model.print_topics(num_topics=2, num_words=5))
Output:
Topics:
[(0,
'0.120*species + 0.120*protect + 0.120*efforts + 0.120*endangered + 0.120*conservation'),
(1,
'0.107*emissions + 0.107*carbon + 0.107*reduce + 0.107*sources + 0.107*energy')]
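Unlike LDA, HDP infers the number of topics from the data, so print_topics(num_topics=2, ...) only shows the two most significant ones. A small sketch (assuming the hdp_model trained above) to see how many candidate topics were actually instantiated:
# Sketch: list every topic HDP instantiated, not just the top two
all_topics = hdp_model.show_topics(num_topics=-1, num_words=5, formatted=False)
print(f"HDP instantiated {len(all_topics)} candidate topics")
Most of these candidates carry negligible weight on a four-document corpus; only the leading ones are worth interpreting.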
Exercise 4: Evaluating Topic Coherence
Task: Compute the coherence score for the topics generated by the LDA model in Exercise 2.
Solution:
from gensim.models.coherencemodel import CoherenceModel
# Compute the coherence score for the LDA topics
# Note: texts and dictionary must be the ones built in Exercise 2;
# Exercise 3 rebinds the same variable names to a different corpus.
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_lda}")
Output:
Coherence Score: 0.4705210160925866
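Coherence is most useful as a comparative measure. As a hedged sketch (it reuses corpus_bow, texts, and dictionary from Exercise 2 and retrains small LDA models, which goes beyond the original exercise), you could scan a few values of num_topics and keep the most coherent model:
# Sketch: compare c_v coherence across candidate numbers of topics
for k in (2, 3, 4):
    model_k = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=k, random_state=42, passes=10)
    cm_k = CoherenceModel(model=model_k, texts=texts, dictionary=dictionary, coherence='c_v')
    print(f"num_topics={k}: c_v coherence = {cm_k.get_coherence():.4f}")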
Exercise 5: Assigning Topics to New Documents
Task: Assign topics to the following new document using the HDP model trained in Exercise 3:
- "Renewable energy is crucial for combating climate change."
Solution:
# New document
new_doc = "Renewable energy is crucial for combating climate change."
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
# Assign topics to the new document
print("Topic Distribution for the new document:")
pprint(hdp_model[new_doc_bow])
Output:
Topic Distribution for the new document:
[(0, 0.6891036362224583), (1, 0.31089636377754165)]
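One caveat worth making explicit (an observation with a small sketch that is not in the original solution): doc2bow silently drops tokens missing from the Exercise 3 dictionary, and because tokenization here is a plain whitespace split, "change." keeps its trailing period and does not match the dictionary entry "change". Only the overlapping tokens drive the topic assignment:
# Sketch: see which tokens of the new document survive the doc2bow lookup
matched = [dictionary[token_id] for token_id, _ in new_doc_bow]
print("Tokens used for inference:", matched)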
These exercises provide hands-on practice with Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and the Hierarchical Dirichlet Process (HDP), reinforcing the concepts covered in this chapter.