Project 4: Named Entity Recognition (NER) Pipeline with Custom Fine-Tuning
Step 3: Tokenize the Dataset
Use a pretrained tokenizer to prepare the dataset for the transformer model. The tokenizer converts raw text into a form the model can process: it splits words into subword tokens, adds special tokens such as [CLS] and [SEP], and maps each token to a numerical ID.
This step is crucial because transformer models operate on numerical token IDs rather than raw text, and every sequence in a batch must share the same length. For NER specifically, the tokenizer must also handle word-to-token alignment carefully so that entity labels stay correctly mapped to the tokenized input. This matters most when a single word is split into several tokens, because each resulting token still needs the right entity label.
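Before writing the full tokenization function, it helps to see the word-to-token mismatch directly. The following is a minimal sketch using the same bert-base-uncased tokenizer as below; the example sentence is invented and the exact sub-word split depends on the vocabulary:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical pre-split sentence; any out-of-vocabulary word is broken into word pieces
words = ["Angela", "Merkel", "visited", "Washington"]
encoding = tokenizer(words, is_split_into_words=True)

print(tokenizer.convert_ids_to_tokens(encoding["input_ids"]))  # sub-word tokens plus [CLS]/[SEP]
print(encoding.word_ids())  # None for special tokens, otherwise the index of the source word
The word_ids() mapping, which returns None for special tokens and the source word index for everything else, is exactly what the alignment function below relies on.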
from transformers import AutoTokenizer

# Initialize tokenizer
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize the dataset and align the NER labels with the resulting sub-word tokens
def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"],
        truncation=True,
        padding=True,              # pads each mapped batch; padding can also be deferred to a data collator
        is_split_into_words=True,  # the dataset provides pre-split words, not raw strings
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # Map each token back to the index of the word it came from
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        # Special tokens ([CLS], [SEP], padding) get -100 so the loss ignores them;
        # every sub-word token inherits the label of its source word
        aligned_labels = [-100 if word_id is None else label[word_id] for word_id in word_ids]
        labels.append(aligned_labels)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

tokenized_datasets = dataset.map(tokenize_and_align_labels, batched=True)
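Applying the function with dataset.map produces a tokenized copy of every split. A quick sanity check is to look at one processed example; this assumes the dataset has a "train" split, as in the CoNLL-style datasets typically used for this project:
sample = tokenized_datasets["train"][0]
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))  # tokens the model will actually see
print(sample["labels"])  # -100 marks positions ignored by the loss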
Code breakdown:
1. Initial Setup
The code begins by importing the AutoTokenizer from transformers and initializing a BERT tokenizer:
- Uses 'bert-base-uncased' as the pre-trained model
- Creates a tokenizer instance that will convert text into tokens the model can understand
2. Tokenization Function
The 'tokenize_and_align_labels' function performs two crucial tasks:
- Converts words into tokens while handling cases where single words might be split into multiple tokens
- Ensures entity labels are correctly aligned with the tokenized input
3. Function Details
The function processes the data in these steps:
- Tokenizes the input text with these parameters:
  - truncation=True: truncates sequences longer than the model's maximum input length
  - padding=True: pads sequences so every example in the batch has the same length
  - is_split_into_words=True: tells the tokenizer the input is already split into words
- Creates aligned labels by:
  - Using word_ids to track which tokens came from which original word
  - Assigning -100 to special tokens (such as [CLS], [SEP], and padding) so they are ignored during training
  - Mapping each word's NER tag onto every sub-word token produced from that word
4. Final Step
The code applies this tokenization function to the entire dataset using the map function, processing all examples in batches for efficiency.
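A common variant of this alignment, used for instance in the Hugging Face token-classification examples, keeps the label only on the first sub-token of each word and masks the remaining sub-tokens with -100, so that long words are not over-weighted in the loss. A minimal sketch of that variant, under the same assumptions about the dataset columns as the function above:
def tokenize_and_align_labels_first_subtoken(examples):
    tokenized_inputs = tokenizer(
        examples["tokens"], truncation=True, is_split_into_words=True
    )
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_id = None
        aligned = []
        for word_id in word_ids:
            if word_id is None:
                aligned.append(-100)            # special tokens
            elif word_id != previous_word_id:
                aligned.append(label[word_id])  # first sub-token keeps the word's tag
            else:
                aligned.append(-100)            # later sub-tokens are masked out
            previous_word_id = word_id
        labels.append(aligned)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs
Either strategy works in practice; what matters is that the loss and the evaluation metrics ignore the -100 positions.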