Chapter 7: Topic Modeling
Practical Exercises
Exercise 1: Latent Semantic Analysis (LSA)
Task: Perform Latent Semantic Analysis (LSA) on the following text corpus and identify the top terms for each topic:
- "Data science is an interdisciplinary field."
- "Machine learning is a subset of data science."
- "Artificial intelligence is a broader concept than machine learning."
- "Deep learning is a subset of machine learning."
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
# Sample text corpus
corpus = [
    "Data science is an interdisciplinary field.",
    "Machine learning is a subset of data science.",
    "Artificial intelligence is a broader concept than machine learning.",
    "Deep learning is a subset of machine learning."
]
# Create a TF-IDF Vectorizer
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
# Apply LSA using TruncatedSVD
lsa = TruncatedSVD(n_components=2, random_state=42)
X_reduced = lsa.fit_transform(X)
# Print the terms and their corresponding components
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    terms_comp = zip(terms, comp)
    sorted_terms = sorted(terms_comp, key=lambda x: x[1], reverse=True)[:5]
    print(f"Topic {i}:")
    for term, weight in sorted_terms:
        print(f" - {term}: {weight:.4f}")
Output:
Topic 0:
- machine: 0.5126
- learning: 0.5126
- is: 0.3561
- of: 0.2715
- science: 0.2121
Topic 1:
- data: 0.5462
- science: 0.5462
- interdisciplinary: 0.3583
- field: 0.3583
- is: -0.0000
Exercise 2: Latent Dirichlet Allocation (LDA)
Task: Perform Latent Dirichlet Allocation (LDA) on the following text corpus and identify the top terms for each topic:
- "Natural language processing enables computers to understand human language."
- "Computer vision allows machines to interpret and make decisions based on visual data."
- "Robotics combines engineering and computer science to create intelligent machines."
- "Quantum computing leverages quantum mechanics to perform complex calculations."
Solution:
from gensim import corpora
from gensim.models import LdaModel
from pprint import pprint
# Sample text corpus
corpus = [
    "Natural language processing enables computers to understand human language.",
    "Computer vision allows machines to interpret and make decisions based on visual data.",
    "Robotics combines engineering and computer science to create intelligent machines.",
    "Quantum computing leverages quantum mechanics to perform complex calculations."
]
# Tokenize by lowercasing and splitting on whitespace (no stop-word or punctuation removal is performed here)
texts = [document.lower().split() for document in corpus]
# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)
# Convert each tokenized document to a bag-of-words representation
corpus_bow = [dictionary.doc2bow(text) for text in texts]
# Train the LDA model
lda_model = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=2, random_state=42, passes=10)
# Print the topics
print("Topics:")
pprint(lda_model.print_topics(num_topics=2, num_words=5))
Output:
Topics:
[(0,
'0.067*"language" + 0.067*"natural" + 0.067*"processing" + 0.067*"enables" + 0.067*"computers"'),
(1,
'0.070*"machines" + 0.070*"computer" + 0.070*"science" + 0.070*"engineering" + 0.070*"combines"')]
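The preprocessing step above only lowercases and splits, so stop words and trailing punctuation survive into the dictionary. A small self-contained tokenizer that actually drops stop words might look like the sketch below; the stop-word set is a hypothetical minimal one for illustration, not gensim's or NLTK's curated list:

```python
# Hypothetical minimal stop-word set for illustration only; real pipelines
# typically use a curated list (e.g. NLTK's) instead.
STOP_WORDS = {"a", "an", "and", "is", "of", "on", "to", "the", "than"}

def preprocess(document):
    """Lowercase, strip sentence punctuation, split, and drop stop words."""
    cleaned = document.lower().replace(".", "").replace(",", "")
    return [token for token in cleaned.split() if token not in STOP_WORDS]

print(preprocess("Machine learning is a subset of data science."))
# ['machine', 'learning', 'subset', 'data', 'science']
```

Feeding such filtered tokens into corpora.Dictionary would keep stop words out of the learned topics.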
Exercise 3: Hierarchical Dirichlet Process (HDP)
Task: Perform Hierarchical Dirichlet Process (HDP) on the following text corpus and identify the top terms for each topic:
- "Climate change impacts global weather patterns."
- "Renewable energy sources reduce carbon emissions."
- "Biodiversity is essential for ecosystem balance."
- "Conservation efforts protect endangered species."
Solution:
from gensim import corpora
from gensim.models import HdpModel
from pprint import pprint
# Sample text corpus
corpus = [
    "Climate change impacts global weather patterns.",
    "Renewable energy sources reduce carbon emissions.",
    "Biodiversity is essential for ecosystem balance.",
    "Conservation efforts protect endangered species."
]
# Tokenize by lowercasing and splitting on whitespace (no stop-word or punctuation removal is performed here)
texts = [document.lower().split() for document in corpus]
# Create a dictionary representation of the documents
dictionary = corpora.Dictionary(texts)
# Convert each tokenized document to a bag-of-words representation
corpus_bow = [dictionary.doc2bow(text) for text in texts]
# Train the HDP model (unlike LDA, HDP infers the number of topics from the data;
# without a fixed random_state, results vary between runs)
hdp_model = HdpModel(corpus=corpus_bow, id2word=dictionary)
# Print the topics
print("Topics:")
pprint(hdp_model.print_topics(num_topics=2, num_words=5))
Output:
Topics:
[(0,
'0.120*species + 0.120*protect + 0.120*efforts + 0.120*endangered + 0.120*conservation'),
(1,
'0.107*emissions + 0.107*carbon + 0.107*reduce + 0.107*sources + 0.107*energy')]
Exercise 4: Evaluating Topic Coherence
Task: Compute the coherence score for the topics generated by the LDA model in Exercise 2.
Solution:
from gensim.models.coherencemodel import CoherenceModel
# Compute the coherence score; texts and dictionary must be the ones built in
# Exercise 2 (Exercise 3 rebinds the same names, so re-run Exercise 2's preprocessing if needed)
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print(f"Coherence Score: {coherence_lda}")
Output:
Coherence Score: 0.4705210160925866
Exercise 5: Assigning Topics to New Documents
Task: Assign topics to the following new document using the HDP model trained in Exercise 3:
- "Renewable energy is crucial for combating climate change."
Solution:
# New document
new_doc = "Renewable energy is crucial for combating climate change."
# doc2bow silently ignores tokens absent from the Exercise 3 dictionary
# (e.g. "change." keeps its trailing period here and so does not match "change")
new_doc_bow = dictionary.doc2bow(new_doc.lower().split())
# Assign topics to the new document
print("Topic Distribution for the new document:")
pprint(hdp_model[new_doc_bow])
Output:
Topic Distribution for the new document:
[(0, 0.6891036362224583), (1, 0.31089636377754165)]
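The distribution above can be reduced to a single dominant-topic label by taking the pair with the highest probability; a small sketch using (rounded) values from the output above:

```python
# (topic_id, probability) pairs, as returned by hdp_model[new_doc_bow]
topic_dist = [(0, 0.6891), (1, 0.3109)]

# max over the probability element gives the dominant topic
dominant_topic, prob = max(topic_dist, key=lambda pair: pair[1])
print(f"Dominant topic: {dominant_topic} (probability {prob:.2f})")
# Dominant topic: 0 (probability 0.69)
```

This is a common final step when a downstream task needs one label per document rather than a full topic mixture.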
These exercises provide hands-on experience with Latent Semantic Analysis (LSA), Latent Dirichlet Allocation (LDA), and Hierarchical Dirichlet Process (HDP), reinforcing the concepts covered in this chapter.