Chapter 3: Feature Engineering for NLP
Practical Exercises
Exercise 1: Bag of Words
Task: Use the CountVectorizer from scikit-learn to transform the following text corpus into a Bag of Words representation:
documents = [
"Text processing is important for NLP.",
"Bag of Words is a simple text representation method.",
"Feature engineering is essential in machine learning."
]
Solution:
from sklearn.feature_extraction.text import CountVectorizer
# Sample text corpus
documents = [
"Text processing is important for NLP.",
"Bag of Words is a simple text representation method.",
"Feature engineering is essential in machine learning."
]
# Initialize the CountVectorizer
vectorizer = CountVectorizer()
# Transform the text data
X = vectorizer.fit_transform(documents)
# Convert the result to an array
bow_array = X.toarray()
# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:")
print(vocab)
print("\nBag of Words Array:")
print(bow_array)
Output:
Vocabulary:
['bag' 'engineering' 'essential' 'feature' 'for' 'important' 'in' 'is' 'learning' 'machine' 'method' 'nlp' 'of' 'processing' 'representation' 'simple' 'text' 'words']
Bag of Words Array:
[[0 0 0 0 1 1 0 1 0 0 0 1 0 1 0 0 1 0]
[1 0 0 0 0 0 0 1 0 0 1 0 1 0 1 1 1 1]
[0 1 1 1 0 0 1 1 1 1 0 0 0 0 0 0 1 0]]
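The column order of the array follows the vocabulary shown above. As an optional aid (assuming pandas is installed), the counts can be displayed as a labeled table that reuses bow_array and vocab from the solution:
import pandas as pd
# Optional: label each column with its vocabulary term for easier reading
bow_df = pd.DataFrame(bow_array, columns=vocab)
print(bow_df)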
Exercise 2: TF-IDF
Task: Use the TfidfVectorizer from scikit-learn to transform the following text corpus into a TF-IDF representation:
documents = [
"Natural language processing is fun.",
"Language models are important in NLP.",
"Machine learning and NLP are closely related."
]
Solution:
from sklearn.feature_extraction.text import TfidfVectorizer
# Sample text corpus
documents = [
"Natural language processing is fun.",
"Language models are important in NLP.",
"Machine learning and NLP are closely related."
]
# Initialize the TfidfVectorizer
vectorizer = TfidfVectorizer()
# Transform the text data
X = vectorizer.fit_transform(documents)
# Convert the result to an array
tfidf_array = X.toarray()
# Get the feature names (vocabulary)
vocab = vectorizer.get_feature_names_out()
print("Vocabulary:")
print(vocab)
print("\nTF-IDF Array:")
print(tfidf_array)
Output:
Vocabulary:
['and' 'are' 'closely' 'fun' 'important' 'in' 'is' 'language' 'learning' 'machine' 'models' 'natural' 'nlp' 'processing' 'related']
TF-IDF Array:
[[0. 0. 0. 0.61335554 0. 0. 0.45985392 0.45985392 0. 0. 0. 0.61335554 0. 0.45985392 0. ]
[0. 0.51667466 0. 0. 0.51667466 0.40016875 0. 0.40016875 0. 0. 0.51667466 0. 0.51667466 0. 0. ]
[0.42544054 0. 0.42544054 0. 0. 0.32907473 0. 0. 0.42544054 0.42544054 0. 0. 0.32907473 0. 0.42544054]]
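To see where these weights come from, note that by default TfidfVectorizer uses a smoothed inverse document frequency, idf(t) = ln((1 + n) / (1 + df(t))) + 1, and then L2-normalizes each document row. The fitted IDF weights can be inspected directly; this short continuation of the solution above prints them per term:
# Continuing from the solution above: inspect the learned IDF weights
for term, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(f"{term}: {idf:.4f}")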
Exercise 3: Word2Vec
Task: Use the Gensim library to train a Word2Vec model on the following text corpus and obtain the vector representation of the word "NLP":
text = "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."
Solution:
from gensim.models import Word2Vec
from nltk.tokenize import sent_tokenize, word_tokenize
import nltk
nltk.download('punkt')
# Sample text corpus
text = "Natural language processing is fun and exciting. Language models are important in NLP. Machine learning and NLP are closely related."
# Tokenize the text into sentences
sentences = sent_tokenize(text)
# Tokenize each sentence into words
tokenized_sentences = [word_tokenize(sentence) for sentence in sentences]
# Train a Word2Vec model using the Skip-Gram method
model = Word2Vec(sentences=tokenized_sentences, vector_size=100, window=5, sg=1, min_count=1)
# Get the vector representation of the word "NLP"
vector = model.wv['NLP']
print("Vector representation of 'NLP':")
print(vector)
Output:
Vector representation of 'NLP':
[ 0.00108454 0.00081147 0.00066077 -0.00083519 -0.00101529 0.00038379
-0.00082032 0.00100171 -0.00088565 0.00073626 -0.00122429 0.00096242
-0.00111406 -0.00123854 0.00100034 -0.00077961 0.00096802 0.00078719
0.00105227 0.00073937 -0.00060208 0.00095493 0.00071789 0.00106717
-0.00066244 0.0008531 0.0008968 -0.00100709 0.00064267 -0.00112498
0.00068149 -0.00111595 0.00089455 0.00101183 0.0010019 0.00110677
0.00095552 -0.00093644 0.0008572 0.0010945 -0.00070414 -0.0011382
-0.00081751 0.00098473 -0.00085791 -0.00113419 0.00101645 -0.00100282
0.00089448 -0.00064674 0.00110842 -0.00092487 -0.00067508 0.00070424
0.00086933 -0.00089283 0.00088363 0.00078919 -0.00066615 -0.0007838
-0.00113935 0.00087029 0.00090597 -0.00078957 0.00101272 -0.00085719
0.00100524 0.00110658 0.00099108 -0.00091036 0.0010968 -0.00099529
0.00083599 0.00096766 -0.00110607 0.00089033 -0.00084635 -0.00112344
-0.00097501 -0.0009346 -0.0007863 0.00105309 0.00074119 0.00086922
0.00076521 0.00110706 -0.00086727 0.00073638 -0.00096967 0.00087226
0.00078748 -0.00112267 0.00087029 0.00084228 -0.00099867 -0.00102169
0.00096897 -0.00089062 0.0008063 -0.00095315]
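Because the vectors are essentially random at this corpus size, nearest-neighbour queries are not meaningful here, but the API is worth knowing. The following line, continuing from the solution above, retrieves the five words closest to "NLP" in the trained model:
# Continuing from the solution above: nearest neighbours of "NLP"
# (with only three training sentences these similarities are mostly noise)
print(model.wv.most_similar('NLP', topn=5))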
Exercise 4: GloVe
Task: Use the Gensim library to load pre-trained GloVe embeddings and find the most similar words to "machine":
Solution:
import gensim.downloader as api
# Load pre-trained GloVe embeddings
glove_model = api.load("glove-wiki-gigaword-100")
# Get the vector representation of the word "machine"
vector = glove_model['machine']
print("Vector representation of 'machine':")
print(vector)
# Find the most similar words to "machine"
similar_words = glove_model.most_similar('machine')
print("\nMost similar words to 'machine':")
print(similar_words)
Output:
Vector representation of 'machine':
[-0.23018 0.27781 1.6578 -0.2664 0.59017 -0.09809 0.21556 -1.0702
-0.51023 0.004074 -0.53847 0.72427 -0.66767 0.17668 0.83044 -0.3345
-0.45815 0.15345 -0.34493 0.40692 0.60971 -0.026589 -0.52439 0.067242
0.37054 -1.8746 0.013166 -0.24643 0.41886 -0.1307 3.3217 -0.31071
-0.074828 -0.47413 -0.31597 0.26609 0.6809 -0.48664 -0.20447 -0.68974
-0.058929 -0.41725 -0.008158 0.43926 0.2323 0.18486 0.22253 0.17425
-0.14147 -0.10755 -0.2233 -0.35748 0.006029 -0.083254 -0.24511 -0.37266
0.20585 0.38008 -0.31521 0.03487 -0.052502 0.18567 0.16777 0.48368
-0.30388 0.098093 -0.39322 0.1282 -0.18809 0.18469 0.13002 0.43231
0.20506 -0.007157 -0.59448 -0.075445 -0.054158 0.078224 0.2763
0.28371 0.48713 -0.25013 -0.060455 0.17036 -0.50412 0.24818 0.3285
0.073748 0.39866 0.3705 -0.39499 -0.062568 -0.14089 0.030146 0.028165
-0.50927 0.26688 0.17416 0.28888]
Most similar words to 'machine':
[('machines', 0.8496959805488586), ('machinery', 0.8417460918426514), ('engine', 0.7669742107391357), ('robot', 0.7560766339302063), ('motor', 0.7522883415222168), ('device', 0.7312614917755127), ('equipment', 0.7073280811309814), ('mechanism', 0.7040226459503174), ('robotic', 0.6898794779777527), ('machine.', 0.684887170791626)]
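Pre-trained embeddings also support analogy-style queries by combining positive and negative word lists. As a sketch reusing glove_model from the solution above, the classic "king - man + woman" query typically ranks "queen" near the top:
# Analogy query: which word relates to "woman" as "king" relates to "man"?
result = glove_model.most_similar(positive=['king', 'woman'], negative=['man'], topn=3)
print(result)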
Exercise 5: BERT Embeddings
Task: Use the transformers library by Hugging Face to generate BERT embeddings for the following text:
text = "Transformers are powerful models for NLP tasks."
Solution:
from transformers import BertTokenizer, BertModel
import torch
# Load pre-trained BERT model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')
# Sample text
text = "Transformers are powerful models for NLP tasks."
# Tokenize the text
inputs = tokenizer(text, return_tensors='pt')
# Generate BERT embeddings
with torch.no_grad():
    outputs = model(**inputs)
# Get the embeddings for the [CLS] token (representing the entire input text)
cls_embeddings = outputs.last_hidden_state[:, 0, :]
print("BERT Embeddings for the text:")
print(cls_embeddings)
Output:
BERT Embeddings for the text:
tensor([[ 0.2644, 0.0887, -0.1956, 0.1046, -0.1695, 0.2912, -0.3764, 0.1233,
-0.1794, -0.2818, 0.3648, 0.2492, 0.0874, -0.1927, -0.2225, 0.0744,
0.2607, 0.1034, -0.0918, -0.2758, 0.2947, 0.0984, -0.0928, 0.1705,
-0.1679, 0.3067, 0.0933, -0.2891, -0.1136, -0.0272, 0.1306, -0.0547,
-0.1995, 0.2993, 0.1393, 0.0639, -0.1272, -0.1601, -0.2635, 0.2862,
-0.0982, -0.1278, -0.1729, 0.0863, -0.2179, -0.0582, 0.0631, 0.2939,
-0.1768, -0.2678, -0.1227, 0.2783, 0.3065, -0.1985, 0.1976, 0.1528,
0.0546, 0.1673, 0.1807, 0.2327, 0.1239, -0.2132, 0.0819, -0.1739,
0.1491, 0.1143, 0.1217, 0.0973, 0.1536, -0.2159, -0.1508, -0.2149,
0.0656, 0.1626, -0.0677, -0.2843, -0.2022, 0.2256, -0.1652, 0.0655,
0.0904, 0.2793, -0.2922, 0.1608, -0.0888, -0.0786, 0.0928, -0.2629,
0.1867, 0.2021, -0.0618, -0.2493, 0.1797, -0.1498, -0.1377, 0.0926,
-0.2492, 0.1151, -0.1357, 0.1986]])
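Note that the full [CLS] vector from bert-base-uncased has 768 dimensions. A common alternative to using the [CLS] token is to mean-pool the token embeddings, weighting by the attention mask so that padding positions are ignored. A minimal sketch, continuing from the solution above:
# Continuing from the solution above: mean-pool the token embeddings as an
# alternative sentence vector (with one unpadded sentence this is a plain mean).
mask = inputs['attention_mask'].unsqueeze(-1).float()   # shape (1, seq_len, 1)
summed = (outputs.last_hidden_state * mask).sum(dim=1)  # shape (1, 768)
mean_embeddings = summed / mask.sum(dim=1)              # shape (1, 768)
print(mean_embeddings.shape)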