NLP con Transformers, técnicas avanzadas y aplicaciones multimodales

Chapter 2: Hugging Face and Other NLP Libraries

2.4 Practical Exercises

This section provides practical exercises to reinforce your understanding of the Hugging Face ecosystem and its integration with TensorFlow and PyTorch. Each exercise includes a clear explanation and a complete code solution to guide your learning.

Exercise 1: Using the Hugging Face Pipeline

Task: Use the Hugging Face pipeline to perform named entity recognition (NER) on a given text.

Instructions:

  1. Import the Hugging Face pipeline.
  2. Load the NER pipeline.
  3. Process a sample text to identify named entities.

Solution:

from transformers import pipeline

# Step 1: Load the NER pipeline
ner_pipeline = pipeline("ner", grouped_entities=True)

# Step 2: Define the input text
text = "Hugging Face Inc. is a company based in New York City. Its headquarters are in DUMBO, across the Manhattan Bridge."

# Step 3: Perform named entity recognition
entities = ner_pipeline(text)

# Step 4: Print the results
print("Named Entities:")
for entity in entities:
    print(f"Entity: {entity['word']}, Type: {entity['entity_group']}, Score: {entity['score']:.2f}")

Expected Output (approximate; exact entity grouping and scores depend on the pipeline's default model):

Named Entities:
Entity: Hugging Face Inc., Type: ORG, Score: 0.99
Entity: New York City, Type: LOC, Score: 0.99
Entity: DUMBO, Type: LOC, Score: 0.97
Entity: Manhattan Bridge, Type: LOC, Score: 0.96
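
Note that pipeline("ner") without an explicit model downloads whatever default checkpoint the installed transformers version ships with, and grouped_entities=True has been superseded by the aggregation_strategy argument in recent releases. A minimal, more version-stable variant of the same exercise might look like this (dslim/bert-base-NER is just one example checkpoint; any token-classification model from the Hub works):

from transformers import pipeline

# Pin an explicit checkpoint instead of relying on the pipeline's default model.
# aggregation_strategy="simple" is the newer equivalent of grouped_entities=True.
ner_pipeline = pipeline(
    "ner",
    model="dslim/bert-base-NER",  # example checkpoint, not a requirement
    aggregation_strategy="simple"
)

for entity in ner_pipeline("Hugging Face Inc. is a company based in New York City."):
    print(entity["word"], entity["entity_group"], round(float(entity["score"]), 2))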

Exercise 2: Fine-Tuning a Transformer Model

Task: Fine-tune a BERT model for text classification using the Hugging Face Trainer API and the IMDB dataset.

Instructions:

  1. Load the IMDB dataset and preprocess it.
  2. Tokenize the text using the BERT tokenizer.
  3. Fine-tune the model on a small subset of the dataset.
  4. Evaluate the model.

Solution:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments
from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score

# Step 1: Load and preprocess the dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenized_datasets = dataset.map(preprocess_function, batched=True)

# Step 2: Prepare the data for training
train_dataset = tokenized_datasets["train"].shuffle(seed=42).select(range(2000))
test_dataset = tokenized_datasets["test"].shuffle(seed=42).select(range(500))

# Step 3: Load the model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Step 4: Define the training arguments and trainer
training_args = TrainingArguments(
    output_dir="./results",
    evaluation_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=2,
    weight_decay=0.01
)

def compute_metrics(eval_pred):
    predictions, labels = eval_pred
    predictions = np.argmax(predictions, axis=1)
    return {"accuracy": accuracy_score(labels, predictions)}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=test_dataset,
    compute_metrics=compute_metrics
)

# Step 5: Train the model
trainer.train()

# Step 6: Evaluate the model
results = trainer.evaluate()
print("Evaluation Results:", results)

Expected Output (approximate; exact values vary between runs):

Evaluation Results: {'eval_loss': 0.36, 'eval_accuracy': 0.88}
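
Two practical notes on this solution: depending on your transformers version, the evaluation_strategy argument may be named eval_strategy in TrainingArguments, and the dictionary returned by trainer.evaluate() also contains runtime statistics beyond the two keys shown above. Once training finishes, you will usually want to persist the fine-tuned weights and reuse them for inference; a minimal sketch (the ./imdb-bert directory name is an arbitrary choice):

# Save the fine-tuned model and tokenizer to a local directory
trainer.save_model("./imdb-bert")
tokenizer.save_pretrained("./imdb-bert")

# Reload them in an inference pipeline
from transformers import pipeline

classifier = pipeline("text-classification", model="./imdb-bert", tokenizer="./imdb-bert")
print(classifier("A surprisingly touching film with brilliant performances."))
# e.g. [{'label': 'LABEL_1', 'score': 0.97}] -- labels stay generic unless id2label is set in the config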

Exercise 3: PyTorch Training Loop

Task: Implement a PyTorch training loop to fine-tune a BERT model for text classification.

Instructions:

  1. Load the IMDB dataset and preprocess it.
  2. Convert the dataset to PyTorch tensors.
  3. Write a training loop for fine-tuning the model.
  4. Evaluate the model's accuracy.

Solution:

import torch
from torch.utils.data import DataLoader
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from datasets import load_dataset
from torch.optim import AdamW

# Step 1: Load and preprocess the dataset
dataset = load_dataset("imdb")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def preprocess_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=256)

tokenized_datasets = dataset.map(preprocess_function, batched=True)
# The model's forward() expects the targets under "labels", so rename the column
tokenized_datasets = tokenized_datasets.rename_column("label", "labels")
tokenized_datasets.set_format("torch", columns=["input_ids", "attention_mask", "labels"])

train_dataloader = DataLoader(tokenized_datasets["train"].shuffle(seed=42).select(range(2000)), batch_size=8)
test_dataloader = DataLoader(tokenized_datasets["test"].shuffle(seed=42).select(range(500)), batch_size=8)

# Step 2: Load the model
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
optimizer = AdamW(model.parameters(), lr=5e-5)

# Step 3: Training loop
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

epochs = 2
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch in train_dataloader:
        optimizer.zero_grad()
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    print(f"Epoch {epoch+1} Loss: {total_loss/len(train_dataloader)}")

# Step 4: Evaluation
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for batch in test_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}
        outputs = model(**batch)
        predictions = torch.argmax(outputs.logits, dim=-1)
        correct += (predictions == batch["labels"]).sum().item()
        total += batch["labels"].size(0)

print(f"Accuracy: {correct / total:.2f}")

Expected Output (approximate; exact values vary between runs):

Epoch 1 Loss: 0.45
Epoch 2 Loss: 0.36
Accuracy: 0.88
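
The loop above keeps the learning rate constant. Fine-tuning BERT usually benefits from a linear decay schedule, often with a short warmup; here is a minimal sketch of how the loop could be extended, reusing the optimizer and dataloaders defined above (the schedule name and warmup length are illustrative choices):

from transformers import get_scheduler

num_training_steps = epochs * len(train_dataloader)
lr_scheduler = get_scheduler(
    "linear",                 # linear decay of the learning rate to zero
    optimizer=optimizer,
    num_warmup_steps=0,
    num_training_steps=num_training_steps
)

# Inside the training loop, step the scheduler right after the optimizer:
#     loss.backward()
#     optimizer.step()
#     lr_scheduler.step()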

These practical exercises demonstrate how to effectively use the Hugging Face ecosystem and integrate it with TensorFlow and PyTorch for NLP workflows. By completing these exercises, you gain hands-on experience with pipelines, fine-tuning, and custom training loops. Keep experimenting with other datasets and tasks to deepen your understanding!
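
As a closing pointer, the same fine-tuning workflow carries over to TensorFlow through the TF model classes in transformers. The sketch below assumes TensorFlow is installed and reuses the tokenizer and the train_dataset/test_dataset subsets from Exercise 2; the batch size and learning rate are illustrative:

import tensorflow as tf
from transformers import TFAutoModelForSequenceClassification

# Load the TensorFlow variant of the same checkpoint
tf_model = TFAutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Convert the tokenized Hugging Face datasets into batched tf.data.Dataset objects
tf_train = tf_model.prepare_tf_dataset(train_dataset, batch_size=8, shuffle=True, tokenizer=tokenizer)
tf_test = tf_model.prepare_tf_dataset(test_dataset, batch_size=8, shuffle=False, tokenizer=tokenizer)

# Transformers models compute their own loss when compile() receives no loss argument
tf_model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5), metrics=["accuracy"])
tf_model.fit(tf_train, validation_data=tf_test, epochs=2)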
