Natural Language Processing with Python Updated Edition

Chapter 3: Feature Engineering for NLP

Practical Exercises

Exercise 1: Bag of Words

Task: Use the CountVectorizer from scikit-learn to transform the following text corpus into a Bag of Words representation:

documents = [
    "Text processing is important for NLP.",
    "Bag of Words is a simple text representation method.",
    "Feature engineering is essential in machine learning."
]

Solution:

from sklearn.feature_extraction.text import CountVectorizer

# Sample text corpus
documents = [
    "Text processing is important for NLP.",
    "Bag of Words is a simple text representation method.",
    "Feature engineering is essential in machine learning."
]

# Initialize the CountVectorizer
vectorizer = CountVectorizer()

# Transform the text data
X = vectorizer.fit_transform(documents)

# Convert the result to an array
bow_array = X.toarray()

# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()

print("Vocabulary:")
print(vocab)

print("\nBag of Words Array:")
print(bow_array)

Output:

Vocabulary:
['bag' 'engineering' 'essential' 'feature' 'for' 'important' 'in' 'is' 'learning' 'machine' 'method' 'nlp' 'of' 'processing' 'representation' 'simple' 'text' 'words']

Bag of Words Array:
[[0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0]
 [1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1]
 [0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0]]
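
A readability tip (optional, not part of the exercise): if pandas is installed, you can label the columns of the Bag of Words matrix with the vocabulary terms, which makes the counts much easier to read.

import pandas as pd

# Label the Bag of Words matrix with the vocabulary terms as column headers
bow_df = pd.DataFrame(bow_array, columns=vocab)
print(bow_df)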

Exercise 2: TF-IDF

Task: Use the TfidfVectorizer from scikit-learn to transform the following text corpus into a TF-IDF representation:

documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]

Solution:

from sklearn.feature_extraction.text import TfidfVectorizer

# Sample text corpus
documents = [
    "Natural language processing is fun.",
    "Language models are important in NLP.",
    "Machine learning and NLP are closely related."
]

# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()

# Transform the text data
X = vectorizer.fit_transform(documents)

# Convert the result to an array
tfidf_array = X.toarray()

# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()

print("Vocabulary:")
print(vocab)

print("\nTF-IDF Array:")
print(tfidf_array)

Output:

Vocabulary:
['and' 'are' 'closely' 'fun' 'important' 'in' 'is' 'language' 'learning' 'machine' 'models' 'natural' 'nlp' 'processing' 'related']

TF-IDF Array:
[[0.         0.         0.         0.61335554 0.         0.         0.45985392 0.45985392 0.         0.         0.         0.61335554 0.         0.45985392 0.        ]
 [0.         0.51667466 0.         0.         0.51667466 0.40016875 0.         0.40016875 0.         0.         0.51667466 0.         0.51667466 0.         0.        ]
 [0.42544054 0.         0.42544054 0.         0.         0.32907473 0.         0.         0.42544054 0.42544054 0.         0.         0.32907473 0.         0.42544054]]
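
An optional follow-up (not part of the printed output above): because TfidfVectorizer L2-normalizes each row by default, the dot products of the document vectors are cosine similarities, which scikit-learn exposes directly.

from sklearn.metrics.pairwise import cosine_similarity

# Pairwise cosine similarity between the three TF-IDF document vectors
similarity_matrix = cosine_similarity(X)
print(similarity_matrix)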

Exercise 3: Word2Vec

Task: Use the Gensim library to train a Word2Vec model on the following text corpus and obtain the vector representation of the word "NLP":

text = "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."

Solution:

from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')

# Sample text corpus
text = "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."

# Tokenize the text into sentences
sentences = sent_tokenize(text)

# Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]

# Train a Word2Vec model using the Skip-Gram method
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, sg=1, min_count=1)

# Get the vector representation of the word "NLP"
vector = model.wv['NLP']
print("Vector representation of 'NLP':")
print(vector)

Output:

Vector representation of 'NLP':
[ 0.00108454  0.00081147  0.00066077 -0.00083519 -0.00101529  0.00038379
 -0.00082032  0.00100171 -0.00088565  0.00073626 -0.00122429  0.00096242
 -0.00111406 -0.00123854  0.00100034 -0.00077961  0.00096802  0.00078719
  0.00105227  0.00073937 -0.00060208  0.00095493  0.00071789  0.00106717
 -0.00066244  0.0008531   0.0008968  -0.00100709  0.00064267 -0.00112498
  0.00068149 -0.00111595  0.00089455  0.00101183  0.0010019   0.00110677
  0.00095552 -0.00093644  0.0008572   0.0010945  -0.00070414 -0.0011382
 -0.00081751  0.00098473 -0.00085791 -0.00113419  0.00101645 -0.00100282
  0.00089448 -0.00064674  0.00110842 -0.00092487 -0.00067508  0.00070424
  0.00086933 -0.00089283  0.00088363  0.00078919 -0.00066615 -0.0007838
 -0.00113935  0.00087029  0.00090597 -0.00078957  0.00101272 -0.00085719
  0.00100524  0.00110658  0.00099108 -0.00091036  0.0010968  -0.00099529
  0.00083599  0.00096766 -0.00110607  0.00089033 -0.00084635 -0.00112344
 -0.00097501 -0.0009346  -0.0007863   0.00105309  0.00074119  0.00086922
  0.00076521  0.00110706 -0.00086727  0.00073638 -0.00096967  0.00087226
  0.00078748 -0.00112267  0.00087029  0.00084228 -0.00099867 -0.00102169
  0.00096897 -0.00089062  0.0008063  -0.00095315]
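
An optional follow-up: you can also query the trained model for the nearest neighbors of "NLP". Keep in mind the corpus here is only three sentences, so the neighbors are essentially noise; the call simply illustrates the Gensim API.

# Words closest to "NLP" in this toy model (results are unstable on such a small corpus)
print(model.wv.most_similar('NLP', topn=5))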

Exercise 4: GloVe

Task: Use the Gensim library to load pre-trained GloVe embeddings and find the most similar words to "machine":

Solution:

import gensim.downloader as api

# Load pre-trained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")

# Get the vector representation of the word "machine"
vector = glove_model['machine']
print("Vector representation of 'machine':")
print(vector)

# Find the most similar words to "machine"
similar_words = glove_model.most_similar('machine')
print("\nMost similar words to 'machine':")
print(similar_words)

Output:

Vector representation of 'machine':
[-0.23018  0.27781  1.6578  -0.2664   0.59017 -0.09809  0.21556 -1.0702
 -0.51023  0.004074 -0.53847  0.72427 -0.66767  0.17668  0.83044 -0.3345
 -0.45815  0.15345 -0.34493  0.40692  0.60971 -0.026589 -0.52439  0.067242
  0.37054 -1.8746   0.013166 -0.24643  0.41886 -0.1307   3.3217  -0.31071
 -0.074828 -0.47413 -0.31597  0.26609  0.6809  -0.48664 -0.20447 -0.68974
 -0.058929 -0.41725 -0.008158  0.43926  0.2323   0.18486  0.22253  0.17425
 -0.14147 -0.10755 -0.2233  -0.35748  0.006029 -0.083254 -0.24511 -0.37266
  0.20585  0.38008 -0.31521  0.03487 -0.052502  0.18567  0.16777  0.48368
 -0.30388  0.098093 -0.39322  0.1282  -0.18809  0.18469  0.13002  0.43231
  0.20506 -0.007157 -0.59448 -0.075445 -0.054158  0.078224  0.2763
  0.28371  0.48713 -0.25013 -0.060455  0.17036 -0.50412  0.24818  0.3285
  0.073748  0.39866  0.3705  -0.39499 -0.062568 -0.14089  0.030146  0.028165
 -0.50927  0.26688  0.17416  0.28888]

Most similar words to 'machine':
[('machines', 0.8496959805488586), ('machinery', 0.8417460918426514), ('engine', 0.7669742107391357), ('robot', 0.7560766339302063), ('motor', 0.7522883415222168), ('device', 0.7312614917755127), ('equipment', 0.7073280811309814), ('mechanism', 0.7040226459503174), ('robotic', 0.6898794779777527), ('machine.', 0.684887170791626)]
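
Beyond nearest neighbors, the same Gensim API supports simple analogy queries over the GloVe vectors. A small sketch (exact neighbors and scores depend on the embedding version you load):

# Analogy query: "king" - "man" + "woman" should rank "queen" highly
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print(result)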

Exercise 5: BERT Embeddings

Task: Use the transformers library by Hugging Face to generate BERT embeddings for the following text:

text = "Transformers are powerful models for NLP tasks."

Solution:

from transformers import BertTokenizer, BertModel
import torch

# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')

# Sample text
text = "Transformers are powerful models for NLP tasks."

# Tokenize the text
inputs = tokenizer(text, return_tensors='pt')

# Generate BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)

# Get the embeddings for the [CLS] token (representing the entire input text)
cls_embeddings = outputs.last_hidden_state[:, 0, :]

print("BERT Embeddings for the text:")
print(cls_embeddings)

Output (abbreviated; the full [CLS] embedding from bert-base-uncased has 768 dimensions):

BERT Embeddings for the text:
tensor([[ 0.2644,  0.0887, -0.1956,  0.1046, -0.1695,  0.2912, -0.3764,  0.1233,
         -0.1794, -0.2818,  0.3648,  0.2492,  0.0874, -0.1927, -0.2225,  0.0744,
          0.2607,  0.1034, -0.0918, -0.2758,  0.2947,  0.0984, -0.0928,  0.1705,
         -0.1679,  0.3067,  0.0933, -0.2891, -0.1136, -0.0272,  0.1306, -0.0547,
         -0.1995,  0.2993,  0.1393,  0.0639, -0.1272, -0.1601, -0.2635,  0.2862,
         -0.0982, -0.1278, -0.1729,  0.0863, -0.2179, -0.0582,  0.0631,  0.2939,
         -0.1768, -0.2678, -0.1227,  0.2783,  0.3065, -0.1985,  0.1976,  0.1528,
          0.0546,  0.1673,  0.1807,  0.2327,  0.1239, -0.2132,  0.0819, -0.1739,
          0.1491,  0.1143,  0.1217,  0.0973,  0.1536, -0.2159, -0.1508, -0.2149,
          0.0656,  0.1626, -0.0677, -0.2843, -0.2022,  0.2256, -0.1652,  0.0655,
          0.0904,  0.2793, -0.2922,  0.1608, -0.0888, -0.0786,  0.0928, -0.2629,
          0.1867,  0.2021, -0.0618, -0.2493,  0.1797, -0.1498, -0.1377,  0.0926,
         -0.2492,  0.1151, -0.1357,  0.1986]])
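
The [CLS] vector is one common sentence representation, but another widely used option is to mean-pool the token embeddings, weighting by the attention mask so padding positions are ignored. A minimal sketch using the inputs and outputs already computed above:

# Mean-pool the token embeddings, masking out padding positions
mask = inputs['attention_mask'].unsqueeze(-1).float()      # shape: (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)     # sum over the token axis
mean_embeddings = summed / mask.sum(dim=1)                 # divide by the number of real tokens
print(mean_embeddings.shape)                               # torch.Size([1, 768])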
