NLP con Transformers: fundamentos y aplicaciones principales

Chapter 5: Key Transformer Models and Innovations

Practical Exercises for Chapter 5

The following exercises are designed to solidify your understanding of specialized Transformer models such as BioBERT, LegalBERT, and CLIP. Each exercise states a task and provides a solution with code examples.

Exercise 1: Using BioBERT for Named Entity Recognition (NER)

Task:

Extract named entities from biomedical text using BioBERT. Identify entities related to diseases, chemicals, and genes.

Solution:

from transformers import AutoTokenizer, AutoModelForTokenClassification, pipeline

# Load pre-trained BioBERT model and tokenizer.
# Note: this base checkpoint ships without a fine-tuned NER head, so the
# token-classification layer is randomly initialized; for meaningful labels,
# load a BioBERT checkpoint fine-tuned on a biomedical NER dataset.
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForTokenClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.1")

# Biomedical text
text = "BRCA1 is a gene associated with breast cancer. Aspirin is a chemical used for pain relief."

# Initialize the NER pipeline; aggregation merges subword tokens
# into whole entities such as "breast cancer"
ner_pipeline = pipeline("ner", model=model, tokenizer=tokenizer, aggregation_strategy="simple")

# Perform named entity recognition
results = ner_pipeline(text)

# Display entities (aggregated results use the 'entity_group' key)
print("Extracted Entities:")
for entity in results:
    print(f"Entity: {entity['word']}, Label: {entity['entity_group']}")

Expected Output (illustrative; assumes a NER-fine-tuned checkpoint):

Extracted Entities:
Entity: BRCA1, Label: B-Gene
Entity: breast cancer, Label: B-Disease
Entity: Aspirin, Label: B-Chemical

Exercise 2: Fine-Tuning BioBERT for Relation Extraction

Task:

Fine-tune BioBERT on a relation extraction dataset to predict the relationship between a gene and a disease.

Solution:

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification, Trainer, TrainingArguments

# Load BioBERT tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.1")
model = AutoModelForSequenceClassification.from_pretrained("dmis-lab/biobert-base-cased-v1.1", num_labels=2)

# Example dataset (a real run would use hundreds of labeled pairs)
texts = ["BRCA1 is associated with breast cancer.", "EGFR is unrelated to heart disease."]
labels = [1, 0]  # 1: Related, 0: Unrelated

# Tokenize dataset
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

# Trainer expects each example as a dict with 'input_ids', 'attention_mask',
# and 'labels' keys, so wrap the encodings in a small Dataset class
# (a plain TensorDataset yields tuples, which Trainer cannot collate)
class RelationDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

dataset = RelationDataset(inputs, labels)

# Define training arguments (no eval set here, so no evaluation strategy)
training_args = TrainingArguments(
    output_dir="./results",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    weight_decay=0.01,
)

# Initialize Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)

# Train the model
trainer.train()
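
Once training finishes, the fine-tuned model can score new gene-disease sentences. A minimal inference sketch, assuming the training cell above has already run so that model and tokenizer are in scope (the example sentence is invented for illustration):

# Classify a new sentence with the fine-tuned model
sentence = "TP53 mutations are linked to many cancers."
encoded = tokenizer(sentence, return_tensors="pt")

model.eval()
with torch.no_grad():
    logits = model(**encoded).logits

prediction = logits.argmax(dim=-1).item()
print("Related" if prediction == 1 else "Unrelated")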

Exercise 3: Classifying Contract Clauses Using LegalBERT

Task:

Classify contract clauses into categories such as "Payment Clause" or "Termination Clause" using LegalBERT.

Solution:

from transformers import AutoTokenizer, AutoModelForSequenceClassification, pipeline

# Load LegalBERT tokenizer and model.
# Note: the base checkpoint has no clause-classification head, so the head
# added here (num_labels=2) is randomly initialized; fine-tune it on labeled
# clauses before trusting the predictions.
tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("nlpaueb/legal-bert-base-uncased", num_labels=2)

# Legal text example
text = "The tenant shall pay rent on the first day of each month without demand."

# Define clause types
labels = {0: "Payment Clause", 1: "Termination Clause"}

# Use pipeline for classification
classification_pipeline = pipeline("text-classification", model=model, tokenizer=tokenizer)
result = classification_pipeline(text)

# Map prediction to label (default labels look like "LABEL_0", "LABEL_1")
predicted_label = labels[int(result[0]['label'].split('_')[-1])]
print(f"Predicted Clause Type: {predicted_label}")

Expected Output (illustrative; assumes the model has been fine-tuned on labeled clauses):

Predicted Clause Type: Payment Clause
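
To make the classifier produce output like this, the randomly initialized head first needs supervision. A minimal fine-tuning sketch, assuming the Exercise 3 cell above has run so that model and tokenizer are in scope, and reusing the dict-style Dataset pattern from Exercise 2 (the two training clauses are invented for illustration):

import torch
from transformers import Trainer, TrainingArguments

# Tiny illustrative training set (hypothetical clauses)
train_texts = [
    "The tenant shall pay rent on the first day of each month.",
    "Either party may terminate this agreement with 30 days' notice.",
]
train_labels = [0, 1]  # 0: Payment Clause, 1: Termination Clause

encodings = tokenizer(train_texts, padding=True, truncation=True, return_tensors="pt")

class ClauseDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {k: v[idx] for k, v in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

# Fine-tune the classification head on the labeled clauses
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="./legalbert-clauses", num_train_epochs=3),
    train_dataset=ClauseDataset(encodings, train_labels),
)
trainer.train()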

Exercise 4: Zero-Shot Classification Using CLIP

Task:

Classify an image based on textual descriptions using CLIP. Match an image of a dog with appropriate text labels.

Solution:

from transformers import CLIPProcessor, CLIPModel
from PIL import Image

# Load pre-trained CLIP model and processor
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Input image and text descriptions
image = Image.open("dog.jpg")  # Replace with your image file
texts = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Process inputs
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# Compute similarity
logits_per_image = outputs.logits_per_image  # Image-to-text similarity scores
probs = logits_per_image.softmax(dim=1)  # Probabilities
print("Similarity Scores:", probs)

Expected Output (illustrative; exact scores depend on the image):

Similarity Scores: [[0.80, 0.15, 0.05]]
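
To turn the scores into a single prediction, take the argmax over the candidate descriptions, continuing from the code above:

# Pick the best-matching description
best = probs.argmax(dim=1).item()
print("Best match:", texts[best])  # e.g., "a photo of a dog"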

Exercise 5: Generating Legal Summaries with LegalBERT

Task:

Use LegalBERT to summarize a legal document, extracting the main points.

Solution:

Note: LegalBERT is an encoder-only model, so it cannot generate text, and the Hugging Face summarization pipeline will not accept it. The sketch below substitutes a general-purpose sequence-to-sequence summarizer (facebook/bart-large-cnn); any seq2seq summarization checkpoint would work.

from transformers import pipeline

# Load a summarization pipeline. The summarization task requires an
# encoder-decoder model; encoder-only LegalBERT cannot be used here.
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

# Input legal text
legal_text = """
The tenant agrees to pay rent on the first day of each month. Failure to do so will result in penalties as outlined in Section 5.
Additionally, the landlord may terminate this agreement if the tenant violates any clauses.
"""

# Summarize text
summary = summarizer(legal_text, max_length=50, min_length=20, do_sample=False)
print("Legal Summary:", summary[0]['summary_text'])

Expected Output (illustrative):

Legal Summary: The tenant agrees to pay rent monthly and faces penalties or termination for non-compliance.
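
If you want LegalBERT itself in the loop, an encoder-only model can still support extractive summarization: embed each sentence, score it against the whole document, and keep the top sentences. A minimal sketch, assuming mean-pooled embeddings and cosine similarity as the scoring rule:

import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("nlpaueb/legal-bert-base-uncased")
encoder = AutoModel.from_pretrained("nlpaueb/legal-bert-base-uncased")

def embed(text):
    # Mean-pool the last hidden states into one vector per text
    enc = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = encoder(**enc).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)

sentences = [
    "The tenant agrees to pay rent on the first day of each month.",
    "Failure to do so will result in penalties as outlined in Section 5.",
    "Additionally, the landlord may terminate this agreement if the tenant violates any clauses.",
]

# Score each sentence by similarity to the full document
doc_vec = embed(" ".join(sentences))
scores = [torch.cosine_similarity(embed(s), doc_vec, dim=0).item() for s in sentences]

# Keep the single highest-scoring sentence as the extractive summary
best = max(range(len(sentences)), key=lambda i: scores[i])
print("Extractive Summary:", sentences[best])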

These exercises provide hands-on experience with BioBERT, LegalBERT, and CLIP, demonstrating their practical utility in specialized domains. By completing these tasks, you gain deeper insights into how Transformer-based models can be adapted for biomedical and legal tasks, as well as multimodal learning.
