Project 4: Named Entity Recognition (NER) Pipeline with Custom Fine-Tuning
Step 3: Tokenize the Dataset
Use a pretrained tokenizer to prepare the dataset for the transformer model. The tokenizer converts raw text into a form the model can process: it splits words into subword tokens, adds special tokens such as [CLS] and [SEP], and maps each token to a numerical ID.
This step is crucial because transformer models operate on numerical token IDs rather than raw text, and every sequence in a batch must share the same length. For NER specifically, the tokenizer must also handle word-to-token alignment carefully so that entity labels stay correctly mapped to the tokenized input. This matters most when a single word is split into several tokens, because each resulting token still needs the right entity label.
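Before writing the full tokenization function, it helps to see the word-to-token mismatch directly. The following is a minimal sketch using the same bert-base-uncased tokenizer as below; the example sentence is invented and the exact sub-word split depends on the vocabulary:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical pre-split sentence; any out-of-vocabulary word is broken into word pieces
words = ["Angela", "Merkel", "visited", "Washington"]
encoding = tokenizer(words, is_split_into_words=True)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # sub-word tokens plus [CLS]/[SEP]
print(encoding.word_ids())  # None for special tokens, otherwise the index of the source word
The word_ids() mapping, which returns None for special tokens and the source word index for everything else, is exactly what the alignment function below relies on.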
from transformers import AutoTokenizer

# Initialize tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the dataset and align the NER labels with the resulting sub-word tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        padding=True,              # pads each mapped batch; padding can also be deferred to a data collator
        is_split_into_words=True,  # the dataset provides pre-split words, not raw strings
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # Map each token back to the index of the word it came from
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens ([CLS], [SEP], padding) get -100 so the loss ignores them;
        # every sub-word token inherits the label of its source word
        aligned_labels = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(aligned_labels)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
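Applying the function with dataset.map produces a tokenized copy of every split. A quick sanity check is to look at one processed example; this assumes the dataset has a "train" split, as in the CoNLL-style datasets typically used for this project:
sample = tokenized_datasets["train"][0]
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))  # tokens the model will actually see
print(sample["labels"])  # -100 marks positions ignored by the loss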
Code breakdown:
1. Initial Setup
The code begins by importing the AutoTokenizer from transformers and initializing a BERT tokenizer:
- Uses 'bert-base-uncased' as the pre-trained model
- Creates a tokenizer instance that will convert text into tokens the model can understand
2. Tokenization Function
The 'tokenize_and_align_labels' function performs two crucial tasks:
- Converts words into tokens while handling cases where single words might be split into multiple tokens
- Ensures entity labels are correctly aligned with the tokenized input
3. Function Details
The function processes the data in these steps:
- Tokenizes the input text with these parameters:
  - truncation=True: truncates sequences longer than the model's maximum input length
  - padding=True: pads sequences so every example in the batch has the same length
  - is_split_into_words=True: tells the tokenizer the input is already split into words
- Creates aligned labels by:
  - Using word_ids to track which tokens came from which original word
  - Assigning -100 to special tokens (such as [CLS], [SEP], and padding) so they are ignored during training
  - Mapping each word's NER tag onto every sub-word token produced from that word
4. Final Step
The code applies this tokenization function to the entire dataset using the map function, processing all examples in batches for efficiency.
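A common variant of this alignment, used for instance in the Hugging Face token-classification examples, keeps the label only on the first sub-token of each word and masks the remaining sub-tokens with -100, so that long words are not over-weighted in the loss. A minimal sketch of that variant, under the same assumptions about the dataset columns as the function above:
def tokenize_and_align_labels_first_subtoken(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_id = None
        aligned = []
        for word_id in word_ids:
            if word_id is None:
                aligned.append(-100)            # special tokens
            elif word_id != previous_word_id:
                aligned.append(label[word_id])  # first sub-token keeps the word's tag
            else:
                aligned.append(-100)            # later sub-tokens are masked out
            previous_word_id = word_id
        labels.append(aligned)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
Either strategy works in practice; what matters is that the loss and the evaluation metrics ignore the -100 positions.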