Chapter 2: Fundamentals of Machine Learning for NLP
Practical Exercises for Chapter 2
This practical exercise section consolidates your understanding of the topics covered in Chapter 2. Each exercise is designed to provide hands-on experience with key concepts such as machine learning basics, neural networks, and transformer-based embeddings. Solutions with detailed code are included for each task.
Exercise 1: Preprocessing Text Data
Task: Write a Python program to preprocess text by:
- Tokenizing it into words.
- Removing stopwords.
- Converting the text into a Bag-of-Words (BoW) representation.
Input Example:
"Natural language processing is a fascinating field of artificial intelligence."
Solution:
import nltk
from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
# Download the required NLTK resources (only needed once)
nltk.download('punkt')
nltk.download('stopwords')
# Input text
text = "Natural language processing is a fascinating field of artificial intelligence."
# Tokenize
tokens = word_tokenize(text.lower())
# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
# Convert to Bag-of-Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform([" ".join(filtered_tokens)])
print("Filtered Tokens:", filtered_tokens)
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Matrix:\n", bow_matrix.toarray())
Expected Output:
Filtered Tokens: ['natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence']
Vocabulary: {'natural': 5, 'language': 4, 'processing': 6, 'fascinating': 1, 'field': 2, 'artificial': 0, 'intelligence': 3}
BoW Matrix:
[[1 1 1 1 1 1 1]]
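To check how the fitted vectorizer generalizes, you can transform a new sentence against the same vocabulary; words that were not seen during fitting are simply ignored. This follow-up is a small sketch, and the example sentence below is hypothetical rather than part of the original exercise.
# Reuse the fitted vectorizer on unseen text (hypothetical example sentence)
new_text = "Artificial intelligence is a fascinating field."
new_bow = vectorizer.transform([new_text])
print("BoW for new text:", new_bow.toarray())
Only the four in-vocabulary words ('artificial', 'fascinating', 'field', 'intelligence') contribute non-zero counts; everything else maps to zero.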
Exercise 2: Training a Feedforward Neural Network for Sentiment Analysis
Task: Train a simple feedforward neural network to classify reviews as positive or negative.
Dataset:
Reviews = [
    "I love this movie; it's fantastic!",
    "This film was terrible and boring.",
    "Amazing acting and a great story.",
    "The plot was awful, and I hated it."
]
Labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative
Solution:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Sample dataset
texts = [
    "I love this movie; it's fantastic!",
    "This film was terrible and boring.",
    "Amazing acting and a great story.",
    "The plot was awful, and I hated it."
]
labels = np.array([1, 0, 1, 0])  # 1 = Positive, 0 = Negative
# Preprocess text using Bag-of-Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()
# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)
# Define the feedforward neural network
model = Sequential([
    Dense(8, input_dim=X_train.shape[1], activation='relu'),  # Hidden layer
    Dense(1, activation='sigmoid')                             # Output layer
])
# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=2, verbose=1)
# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")
Expected Output:
Epoch 10/10
Test Accuracy: 1.00
Note that with only four reviews, the test split contains a single example, so the reported accuracy will be either 0.00 or 1.00 and can vary between runs.
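As a quick sanity check on the trained network, you can push a new review through the same Bag-of-Words vectorizer and read the sigmoid output as a probability of the positive class. This is a minimal sketch; the review text below is a hypothetical example.
# Classify a new review (hypothetical example text)
new_review = ["The story was great and the acting was amazing."]
new_X = vectorizer.transform(new_review).toarray()
probability = model.predict(new_X)[0][0]
print("Positive" if probability >= 0.5 else "Negative", f"(p={probability:.2f})")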
Exercise 3: Extracting Word Embeddings with BERT
Task: Extract contextualized embeddings for a word in a sentence using BERT.
Input Sentence:
"The bank is located near the river."
Solution:
from transformers import AutoTokenizer, AutoModel
import torch
# Load BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
# Input sentence
sentence = "The bank is located near the river."
# Tokenize input
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)
# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
embeddings = outputs.last_hidden_state  # Shape: [batch_size, seq_length, hidden_dim]
# Display embedding for the word 'bank'
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
bank_index = tokens.index("bank")  # index into the model input, which includes the [CLS] token
bank_embedding = embeddings[0, bank_index, :]
print(f"Embedding for 'bank': {bank_embedding}")
Exercise 4: Sentence Embeddings with Sentence Transformers
Task: Generate sentence embeddings for semantic similarity.
Sentences:
- "I love natural language processing."
- "NLP is a fascinating field."
Solution:
from sentence_transformers import SentenceTransformer
# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')
# Input sentences
sentences = [
    "I love natural language processing.",
    "NLP is a fascinating field."
]
# Generate sentence embeddings
embeddings = model.encode(sentences)
# Display embeddings
print("Embedding for sentence 1:", embeddings[0])
print("Embedding for sentence 2:", embeddings[1])
Expected Output:
Two 384-dimensional vectors (the embedding size of all-MiniLM-L6-v2), each representing the semantic meaning of its sentence.
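Because the exercise is about semantic similarity, it is worth going one step further and comparing the two vectors directly. A minimal sketch using the util helpers bundled with sentence-transformers:
from sentence_transformers import util
# Cosine similarity between the two sentence embeddings
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.4f}")
Scores close to 1.0 indicate closely related meanings; for these two sentences you should see a clearly positive value.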
Exercise 5: Fine-Tuning BERT for Text Classification
Task: Fine-tune BERT on a small text classification dataset.
Dataset Example:
Texts = ["I love this movie!", "The movie was awful.", "What a great film!", "I disliked the plot."]
Labels = [1, 0, 1, 0] # 1 = Positive, 0 = Negative
Solution:
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset
# Prepare dataset
texts = ["I love this movie!", "The movie was awful.", "What a great film!", "I disliked the plot."]
labels = [1, 0, 1, 0] # 1 = Positive, 0 = Negative
data = {"text": texts, "label": labels}
dataset = Dataset.from_dict(data)
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Tokenize dataset
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length")
tokenized_dataset = dataset.map(tokenize_function, batched=True)
# Training arguments (evaluation is left disabled because no eval_dataset is passed to the Trainer)
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)
# Fine-tune
trainer.train()
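After training completes, you can run the fine-tuned model on a new sentence to confirm it learned the labeling convention (1 = Positive, 0 = Negative). This is a minimal sketch, and the input text below is a hypothetical example.
import torch
# Classify a new sentence with the fine-tuned model (hypothetical example text)
new_text = "What a wonderful movie!"
inputs = tokenizer(new_text, return_tensors="pt", truncation=True).to(model.device)
with torch.no_grad():
    logits = model(**inputs).logits
prediction = logits.argmax(dim=-1).item()
print("Positive" if prediction == 1 else "Negative")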
These exercises guide you through preprocessing, building and training models, and working with embeddings. Completing them will give you hands-on experience with the techniques discussed in this chapter, building a strong foundation for tackling real-world NLP tasks.