NLP with Transformers: Fundamentals and Core Applications

Chapter 2: Fundamentals of Machine Learning for NLP

Practical Exercises for Chapter 2

This practical exercise section consolidates your understanding of the topics covered in Chapter 2. Each exercise is designed to provide hands-on experience with key concepts such as machine learning basics, neural networks, and transformer-based embeddings. Solutions with detailed code are included for each task.

Exercise 1: Preprocessing Text Data

Task: Write a Python program to preprocess text by:

  1. Tokenizing it into words.
  2. Removing stopwords.
  3. Converting the text into a Bag-of-Words (BoW) representation.

Input Example:

"Natural language processing is a fascinating field of artificial intelligence."

Solution:

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer

# Download the NLTK resources needed for tokenization and stopword removal (first run only)
nltk.download('punkt')
nltk.download('stopwords')

# Input text
text = "Natural language processing is a fascinating field of artificial intelligence."

# Tokenize
tokens = word_tokenize(text.lower())

# Remove stopwords
stop_words = set(stopwords.words('english'))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]

# Convert to Bag-of-Words representation
vectorizer = CountVectorizer()
bow_matrix = vectorizer.fit_transform([" ".join(filtered_tokens)])

print("Filtered Tokens:", filtered_tokens)
print("Vocabulary:", vectorizer.vocabulary_)
print("BoW Matrix:\n", bow_matrix.toarray())

Expected Output:

Filtered Tokens: ['natural', 'language', 'processing', 'fascinating', 'field', 'artificial', 'intelligence']
Vocabulary: {'natural': 5, 'language': 4, 'processing': 6, 'fascinating': 1, 'field': 2, 'artificial': 0, 'intelligence': 3}
BoW Matrix:
 [[1 1 1 1 1 1 1]]
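
Note that CountVectorizer assigns feature indices in alphabetical order of the vocabulary, which is why 'artificial' maps to index 0. As an optional follow-up (a small sketch that assumes the vectorizer fitted above is still in memory, and uses a made-up sentence), you can reuse the same vocabulary to encode new text:

# Optional follow-up: transform a new (hypothetical) sentence with the fitted vectorizer.
# Words outside the fitted vocabulary (e.g. "research", "exciting") are simply ignored.
new_text = "language processing research is exciting"
new_bow = vectorizer.transform([new_text])
print("New BoW vector:", new_bow.toarray())
# Only the 'language' and 'processing' columns are non-zero.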

Exercise 2: Training a Feedforward Neural Network for Sentiment Analysis

Task: Train a simple feedforward neural network to classify reviews as positive or negative.

Dataset:

Reviews = [
    "I love this movie; it's fantastic!",
    "This film was terrible and boring.",
    "Amazing acting and a great story.",
    "The plot was awful, and I hated it."
]
Labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

Solution:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# Sample dataset
texts = [
    "I love this movie; it's fantastic!",
    "This film was terrible and boring.",
    "Amazing acting and a great story.",
    "The plot was awful, and I hated it."
]
labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

# Preprocess text using Bag-of-Words
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()

# Train-test split
X_train, X_test, y_train, y_test = train_test_split(X, labels, test_size=0.25, random_state=42)

# Define the feedforward neural network
model = Sequential([
    Dense(8, input_dim=X_train.shape[1], activation='relu'),  # Hidden layer
    Dense(1, activation='sigmoid')  # Output layer
])

# Compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=10, batch_size=2, verbose=1)

# Evaluate the model
loss, accuracy = model.evaluate(X_test, y_test)
print(f"Test Accuracy: {accuracy:.2f}")

Expected Output:

Epoch 10/10
Test Accuracy: 1.00

Note: with only four reviews (one of which is held out for testing), the reported accuracy will be either 0.00 or 1.00 and can vary from run to run.
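
Once trained, the model can score unseen text. The snippet below is an optional sketch (the review string is invented, and it assumes the `vectorizer` and `model` from the solution above are still in scope):

# Optional: classify a new review with the trained network.
new_review = ["What a fantastic story and great acting!"]
new_X = vectorizer.transform(new_review).toarray()
prob = model.predict(new_X)[0][0]  # sigmoid output in [0, 1]
print("Positive" if prob >= 0.5 else "Negative", f"(p={prob:.2f})")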

Exercise 3: Extracting Word Embeddings with BERT

Task: Extract contextualized embeddings for a word in a sentence using BERT.

Input Sentence:

"The bank is located near the river."

Solution:

from transformers import AutoTokenizer, AutoModel
import torch

# Load BERT model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# Input sentence
sentence = "The bank is located near the river."

# Tokenize input
inputs = tokenizer(sentence, return_tensors="pt", truncation=True, padding=True)

# Generate embeddings
with torch.no_grad():
    outputs = model(**inputs)
    embeddings = outputs.last_hidden_state  # Shape: [batch_size, seq_length, hidden_dim]

# Display the embedding for the word 'bank'
# The model input begins with a [CLS] token, so locate 'bank' in the full token
# sequence (including special tokens) rather than in tokenizer.tokenize(sentence).
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
bank_index = tokens.index("bank")
bank_embedding = embeddings[0, bank_index, :]
print(f"Embedding for 'bank': {bank_embedding}")

Exercise 4: Sentence Embeddings with Sentence Transformers

Task: Generate sentence embeddings for semantic similarity.

Sentences:

  1. "I love natural language processing."
  2. "NLP is a fascinating field."

Solution:

from sentence_transformers import SentenceTransformer

# Load a pre-trained sentence transformer model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Input sentences
sentences = [
    "I love natural language processing.",
    "NLP is a fascinating field."
]

# Generate sentence embeddings
embeddings = model.encode(sentences)

# Display embeddings
print("Embedding for sentence 1:", embeddings[0])
print("Embedding for sentence 2:", embeddings[1])

Expected Output:

Two vectors representing the semantic meaning of each sentence.
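
Since the task is about semantic similarity, a natural next step is to score the two embeddings against each other. This short addition uses the sentence-transformers cosine-similarity utility and assumes the `embeddings` array from the solution above:

from sentence_transformers import util

# Cosine similarity between the two sentence embeddings (values closer to 1 mean more similar).
similarity = util.cos_sim(embeddings[0], embeddings[1])
print(f"Cosine similarity: {similarity.item():.3f}")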

Exercise 5: Fine-Tuning BERT for Text Classification

Task: Fine-tune BERT on a small text classification dataset.

Dataset Example:

Texts = ["I love this movie!", "The movie was awful.", "What a great film!", "I disliked the plot."]
Labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative

Solution:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import Dataset

# Prepare dataset
texts = ["I love this movie!", "The movie was awful.", "What a great film!", "I disliked the plot."]
labels = [1, 0, 1, 0]  # 1 = Positive, 0 = Negative
data = {"text": texts, "label": labels}
dataset = Dataset.from_dict(data)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize dataset
def tokenize_function(example):
    return tokenizer(example["text"], truncation=True, padding="max_length")

tokenized_dataset = dataset.map(tokenize_function, batched=True)

# Training arguments
# Note: no evaluation strategy is set because no eval_dataset is passed to the
# Trainer below; requesting per-epoch evaluation without one raises an error.
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
)

# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset,
)

# Fine-tune
trainer.train()
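
After fine-tuning, you can run the classifier on new text. The snippet below is an optional inference sketch (the example sentence is invented; it assumes the `tokenizer` and fine-tuned `model` from above are still in memory):

import torch

# Optional: predict the sentiment of a new sentence with the fine-tuned model.
model.eval()
inputs = tokenizer("What a great film!", return_tensors="pt", truncation=True, padding=True)
inputs = {k: v.to(model.device) for k, v in inputs.items()}  # match the model's device
with torch.no_grad():
    logits = model(**inputs).logits
predicted_label = logits.argmax(dim=-1).item()  # 1 = Positive, 0 = Negative
print("Predicted label:", predicted_label)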

These exercises guide you through preprocessing, building and training models, and working with embeddings. Completing them will give you hands-on experience with the techniques discussed in this chapter, building a strong foundation for tackling real-world NLP tasks.
