Natural Language Processing with Python

Chapter 8: Topic Modelling

8.5 Practical Exercises of Chapter 8: Topic Modelling

8.5.1 Practical Exercise 1: Implement LSA for Topic Modeling

In this exercise, we will implement Latent Semantic Analysis (LSA) to extract topics from a corpus of text.

from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# assuming documents is our preprocessed corpus
vectorizer = CountVectorizer(max_df=0.5, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(documents)

lsa = TruncatedSVD(n_components=10, n_iter=100)
lsa.fit(dtm)

# Print the top words for each topic
terms = vectorizer.get_feature_names_out()
for i, comp in enumerate(lsa.components_):
    terms_in_comp = zip(terms, comp)
    sorted_terms = sorted(terms_in_comp, key=lambda x: x[1], reverse=True)[:10]
    print("Topic %d:" % i)
    for term in sorted_terms:
        print(term[0])
    print()
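The exercise assumes a preprocessed `documents` corpus already exists. As a self-contained sketch (the tiny corpus and the choice of 2 components are made up purely for illustration), you can also inspect `explained_variance_ratio_` to see how much of the corpus variance each LSA component captures, which is useful when deciding on `n_components`:

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed `documents` variable
documents = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stocks and bonds moved higher today",
    "investors bought stocks after the market news",
    "pets need food and regular care",
    "the market rewarded bond investors",
]

vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(documents)

lsa = TruncatedSVD(n_components=2, n_iter=100, random_state=0)
lsa.fit(dtm)

# Each entry is the share of variance captured by one component
print(lsa.explained_variance_ratio_)
print(lsa.explained_variance_ratio_.sum())
```

If the cumulative ratio plateaus well below your chosen number of components, fewer topics may describe the corpus just as well.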

8.5.2 Practical Exercise 2: Implement LDA for Topic Modeling

In this exercise, you will implement Latent Dirichlet Allocation (LDA) to extract topics from a corpus of text.

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# assuming documents is our preprocessed corpus
vectorizer = CountVectorizer(max_df=0.5, min_df=2, stop_words='english')
dtm = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=10, random_state=0)
lda.fit(dtm)

# Print the top words for each topic
terms = vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    print(f"Top 10 words for topic #{topic_idx}:")
    print([terms[i] for i in topic.argsort()[-10:]])
    print()
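Beyond the topic-word lists, LDA also gives each document a probability distribution over topics via `transform`. A minimal sketch, assuming a tiny made-up corpus and 2 topics in place of the real `documents` and 10 topics:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus standing in for the preprocessed `documents` variable
documents = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stocks and bonds moved higher today",
    "investors bought stocks after the market news",
    "pets need food and regular care",
    "the market rewarded bond investors",
]

vectorizer = CountVectorizer(stop_words='english')
dtm = vectorizer.fit_transform(documents)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(dtm)

# Rows are documents, columns are topics; each row sums to 1
doc_topics = lda.transform(dtm)
for doc, dist in zip(documents, doc_topics):
    print(f"{doc[:30]!r} -> topic {dist.argmax()} ({dist.max():.2f})")
```

This document-topic matrix is what you would feed into downstream tasks such as clustering or document similarity.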

8.5.3 Practical Exercise 3: Implement NMF for Topic Modeling

In this exercise, you will implement Non-negative Matrix Factorization (NMF) to extract topics from a corpus of text.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# assuming documents is our preprocessed corpus
tfidf_vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)

nmf = NMF(n_components=10, random_state=1).fit(tfidf)

# Print the top words for each topic
feature_names = tfidf_vectorizer.get_feature_names_out()

for topic_idx, topic in enumerate(nmf.components_):
    print("Topic #%d:" % topic_idx)
    print(" ".join(feature_names[i] for i in topic.argsort()[:-11:-1]))
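NMF factors the tf-idf matrix into two non-negative matrices: document-topic weights (from `fit_transform`) and topic-term weights (`components_`), whose product approximates the original matrix. A sketch with a tiny made-up corpus and 2 components (both assumptions for illustration only):

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the preprocessed `documents` variable
documents = [
    "cats and dogs are popular pets",
    "dogs chase cats in the garden",
    "stocks and bonds moved higher today",
    "investors bought stocks after the market news",
    "pets need food and regular care",
    "the market rewarded bond investors",
]

tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf = tfidf_vectorizer.fit_transform(documents)

nmf = NMF(n_components=2, random_state=1)
W = nmf.fit_transform(tfidf)   # document-topic weights
H = nmf.components_            # topic-term weights

# Both factors are non-negative, and W @ H approximates the tf-idf matrix
print(W.shape, H.shape)
print(np.linalg.norm(tfidf.toarray() - W @ H))
```

The non-negativity of both factors is what makes NMF topics easy to read: a document is an additive mixture of topics, never a cancellation of them.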

In each of these exercises, the goal is to familiarize yourself with different topic modeling techniques. Experiment with different parameters and preprocessing steps to see how they affect the results; the quality and cleanliness of your data play a crucial role in how well these models perform.
