Generative Deep Learning Updated Edition

Chapter 8: Project: Text Generation with Autoregressive Models

8.4 Evaluating the Model

Evaluating a text generation model is crucial to ensure that it produces high-quality, coherent, and contextually appropriate text. In this section, we discuss several methods for evaluating our fine-tuned GPT-2 model, covering both quantitative metrics and qualitative assessments, and provide example code for each technique.

8.4.1 Quantitative Evaluation Metrics

Quantitative metrics provide objective, repeatable measures of the model's performance. For text generation, common metrics include perplexity, BLEU score, and ROUGE score. Perplexity measures how well the model predicts a given text, while BLEU and ROUGE measure how much the generated text overlaps with one or more reference texts.

Perplexity

Perplexity measures how well a probability distribution or probability model predicts a sample. Lower perplexity indicates better performance, as it means the model assigns higher probabilities to the actual data.

Example: Calculating Perplexity

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

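# Load the pre-trained GPT-2 model and tokenizer.
# To evaluate a fine-tuned model, pass the path of its saved checkpoint
# directory instead of "gpt2".
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")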

# Define a function to calculate perplexity
def calculate_perplexity(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        perplexity = torch.exp(loss)
    return perplexity.item()

# Example text for perplexity calculation
text = "The quick brown fox jumps over the lazy dog."
perplexity = calculate_perplexity(text)
print(f"Perplexity: {perplexity}")

First, the pre-trained GPT-2 model and its corresponding tokenizer are loaded. The tokenizer is used to convert input text into a format that the model can understand, while the model itself is used to generate predictions.

Next, a function named calculate_perplexity is defined, which takes in a piece of text as input. Inside this function, the input text is tokenized and converted into PyTorch tensors using the loaded tokenizer. These tensors are then fed into the model, which generates predictions in the form of logits.

The model is called with the input IDs and the labels, which are the same input IDs in this case. Internally, the model shifts the labels by one position so that each token is predicted from the tokens before it, and it returns the loss: the average cross-entropy per predicted token. In the context of language modeling, a lower loss means that the model assigned higher probabilities to the tokens that actually appear in the text.

The loss is then used to calculate the perplexity by taking its exponential. Because the loss is the average negative log-probability of each token, the perplexity can be read as a measure of uncertainty: roughly, the number of equally likely tokens the model is choosing between at each step. A lower perplexity is better, as it means the model is more certain of the correct next token.
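The relationship between token probabilities, loss, and perplexity can be sketched without a model at all. The following is a minimal illustration, not the library's implementation; the function name perplexity_from_probs and the probability values are assumptions invented for this example.

```python
import math

def perplexity_from_probs(token_probs):
    """Return exp(mean negative log-probability) of the observed tokens."""
    negative_log_probs = [-math.log(p) for p in token_probs]
    mean_loss = sum(negative_log_probs) / len(negative_log_probs)
    return math.exp(mean_loss)

# Hypothetical probabilities a model assigned to each actual next token
confident = [0.5, 0.5, 0.5, 0.5]
uncertain = [0.1, 0.1, 0.1, 0.1]

print(f"Confident model: {perplexity_from_probs(confident):.2f}")
print(f"Uncertain model: {perplexity_from_probs(uncertain):.2f}")
```

A model that gives every correct token a probability of 0.5 has a perplexity of about 2, while one that gives 0.1 has a perplexity of about 10, which matches the intuition of choosing among 2 or 10 equally likely options.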

Finally, an example text is provided ("The quick brown fox jumps over the lazy dog.") to demonstrate how to use the calculate_perplexity function, and the calculated perplexity is printed. This shows how well the model predicts that one sentence. A single sentence gives only a rough indication, so for a reliable estimate of overall performance the perplexity should be computed over a held-out evaluation set.

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score evaluates the quality of text that has been machine-translated from one language to another. It is also used to evaluate text generation models by comparing the generated text to reference texts.

Example: Calculating BLEU Score

from nltk.translate.bleu_score import sentence_bleu

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox jumps over the lazy dog."

# Calculate BLEU score
bleu_score = sentence_bleu([reference.split()], candidate.split())
print(f"BLEU Score: {bleu_score}")

In this example, the first line imports the sentence_bleu function from NLTK.

Then, it defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. In this case, the reference and candidate sentences are identical.

The sentence_bleu function is then called with the reference and the candidate as its arguments. Both sentences are split into individual words using the split() method because BLEU score calculation requires tokenized input. The reference is additionally wrapped in a list because sentence_bleu accepts several reference sentences for a single candidate.

The result of the function, bleu_score, is the BLEU score of the candidate sentence relative to the reference sentence. The BLEU score is a number between 0 and 1 - a score of 1 means that the candidate sentence perfectly matches the reference sentence, while a score of 0 means that there is no match at all.

In this case, since the reference and candidate sentences are identical, the BLEU score should be 1, indicating a perfect match.

Finally, the BLEU score is printed out with a formatted string.
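To see what BLEU measures when the sentences differ, the core ingredient, clipped n-gram precision, can be sketched in a few lines. This is a simplified illustration rather than NLTK's implementation: it handles a single reference and omits the brevity penalty and the geometric mean that combine the precisions into the final score. The helper names ngrams and modified_precision are assumptions made for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(reference, candidate, n):
    """Fraction of candidate n-grams found in the reference, with counts clipped."""
    candidate_counts = Counter(ngrams(candidate, n))
    reference_counts = Counter(ngrams(reference, n))
    if not candidate_counts:
        return 0.0
    clipped = sum(
        min(count, reference_counts[gram])
        for gram, count in candidate_counts.items()
    )
    return clipped / sum(candidate_counts.values())

reference = "The quick brown fox jumps over the lazy dog.".split()
candidate = "The quick brown fox leaps over the lazy dog.".split()

for n in range(1, 5):
    print(f"{n}-gram precision: {modified_precision(reference, candidate, n):.3f}")
```

Changing a single word lowers the precision more for longer n-grams, because every n-gram that spans the changed word stops matching.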

ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between the generated text and reference texts, focusing on recall. It is commonly used for summarization tasks.

Example: Calculating ROUGE Score

from rouge_score import rouge_scorer

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox leaps over the lazy dog."

# Calculate ROUGE score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1 Score: {scores['rouge1'].fmeasure}")
print(f"ROUGE-L Score: {scores['rougeL'].fmeasure}")

The first step in the code is to import the rouge_scorer module from the rouge_score package. This module provides the RougeScorer class used to calculate ROUGE scores.

Next, the code defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. Here, the reference is "The quick brown fox jumps over the lazy dog." and the candidate is "The quick brown fox leaps over the lazy dog."

To calculate the ROUGE score, the code creates an instance of RougeScorer, which is initialized with the types of ROUGE scores we want to calculate. In this case, 'rouge1' and 'rougeL' are used. 'rouge1' measures the overlap of unigrams (single words) between the reference and candidate texts. 'rougeL' is based on the longest common subsequence (LCS): the longest sequence of words that appears in both texts in the same order, though not necessarily contiguously.

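The use_stemmer=True argument means that the scorer applies stemming to the words before calculating the scores. Stemming reduces words to their root form, so that variants such as "jumps" and "jumping" are treated as the same word. It does not match synonyms, so "jumps" and "leaps" still count as different words in this example.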

The scorer.score(reference, candidate) line is what actually calculates the ROUGE scores. The resulting scores variable is a dictionary keyed by metric name ('rouge1' and 'rougeL'), and each value holds the precision, recall, and F-measure for that metric.

The final two lines of the code print the F-measure for 'rouge1' and 'rougeL'. The F-measure, or F1 score, is the harmonic mean of precision and recall, providing a balance between these two measures.
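The arithmetic behind these numbers can be reproduced by hand. The sketch below recomputes ROUGE-1 and ROUGE-L F-measures for the same two sentences in plain Python. It assumes a simplified tokenization (lowercasing and splitting on whitespace, with no stemming), so it is an approximation of the idea rather than the rouge_score implementation, and all function names are hypothetical.

```python
from collections import Counter

def tokenize(text):
    # Assumption: lowercase and split on whitespace, no stemming
    return text.lower().split()

def f_measure(matches, reference_length, candidate_length):
    """Harmonic mean of precision and recall."""
    if matches == 0:
        return 0.0
    precision = matches / candidate_length
    recall = matches / reference_length
    return 2 * precision * recall / (precision + recall)

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]

def rouge1_f(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    overlap = sum((Counter(ref) & Counter(cand)).values())
    return f_measure(overlap, len(ref), len(cand))

def rougeL_f(reference, candidate):
    ref, cand = tokenize(reference), tokenize(candidate)
    return f_measure(lcs_length(ref, cand), len(ref), len(cand))

reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox leaps over the lazy dog."
print(f"ROUGE-1 F-measure: {rouge1_f(reference, candidate):.3f}")
print(f"ROUGE-L F-measure: {rougeL_f(reference, candidate):.3f}")
```

Eight of the nine words match and appear in the same order, so precision, recall, and F-measure all equal 8/9 for both metrics under this tokenization.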

8.4.2 Qualitative Evaluation

Qualitative evaluation involves manually inspecting the generated text to assess its fluency, coherence, and relevance. This method is subjective but provides valuable insights into the model's performance.

Visual Inspection

Visual inspection involves generating a set of texts and examining them for grammatical correctness, coherence, and relevance to the prompt. This can help identify any obvious issues such as repetitive phrases, lack of coherence, or inappropriate content.

Example: Visual Inspection

# Define a prompt
prompt = "In the quiet village of Rivendell,"

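# Generate text using the fine-tuned GPT-2 model
# (reuses the tokenizer and model loaded in the perplexity example)
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(
    input_ids,
    max_length=100,
    num_return_sequences=1,
    do_sample=True,  # temperature only takes effect when sampling is enabled
    temperature=0.7,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)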

# Print the generated text
print(generated_text)

This example begins with a predefined prompt, "In the quiet village of Rivendell,". The prompt is encoded into token IDs, and the model then generates a continuation. The max_length of 100 tokens counts the prompt tokens as well as the newly generated ones. The output is then decoded back into human-readable text and printed.
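Repetitive phrases are one of the issues that is easy to miss when reading long outputs, so a small helper can flag them before a manual read-through. The following sketch counts word n-grams that occur more than once in a text. The function name repeated_ngrams and the sample sentence are made up for illustration; the sample is not actual model output.

```python
from collections import Counter

def repeated_ngrams(text, n=3, min_count=2):
    """Return word n-grams that appear at least min_count times in text."""
    tokens = text.lower().split()
    grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(grams)
    return {" ".join(gram): c for gram, c in counts.items() if c >= min_count}

# Hand-written sample text, standing in for a generated passage
sample = "the elves sang in the hall and the elves sang in the night"
for phrase, count in repeated_ngrams(sample).items():
    print(f"{count}x: {phrase}")
```

A long list of repeated trigrams suggests the model is looping, which is worth confirming by reading the passage itself.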

Human Evaluation

Human evaluation involves asking a group of people to rate the generated texts based on criteria such as coherence, fluency, and relevance. This method provides a more robust assessment of the model's performance but can be time-consuming and resource-intensive.

Example: Human Evaluation Criteria

  • Coherence: Does the text make logical sense and flow naturally?
  • Fluency: Is the text grammatically correct and easy to read?
  • Relevance: Does the text stay on topic and respond appropriately to the prompt?
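Once ratings have been collected, they can be summarized per criterion. The sketch below assumes each rater scores a text from 1 to 5 on the three criteria above; the rating values and the function name average_ratings are hypothetical and serve only to show the aggregation.

```python
from statistics import mean

# Hypothetical ratings on a 1-5 scale, one dictionary per rater
ratings = [
    {"coherence": 4, "fluency": 5, "relevance": 3},
    {"coherence": 3, "fluency": 4, "relevance": 3},
    {"coherence": 5, "fluency": 4, "relevance": 4},
]

def average_ratings(ratings):
    """Return the mean score for each criterion across all raters."""
    criteria = ratings[0].keys()
    return {c: mean(r[c] for r in ratings) for c in criteria}

summary = average_ratings(ratings)
for criterion, score in summary.items():
    print(f"{criterion}: {score:.2f}")
```

Averaging hides disagreement between raters, so it can also be useful to look at the spread of scores for each criterion.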

8.4.3 Evaluating Diversity and Creativity

To assess the diversity and creativity of the generated text, we can analyze the variation in outputs given different prompts or slight variations of the same prompt. This helps ensure that the model does not produce repetitive or overly similar texts.

Example: Evaluating Diversity

# Define a set of similar prompts
prompts = [
    "Once upon a time in a faraway land,",
    "Long ago in a distant kingdom,",
    "In a realm beyond the mountains,",
]

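# Generate and print text for each prompt
# (reuses the tokenizer and model loaded in the perplexity example)
for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(
        input_ids,
        max_length=100,
        num_return_sequences=1,
        do_sample=True,  # temperature only takes effect when sampling is enabled
        temperature=0.7,
        pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; reuse EOS
    )
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt {i+1}:\n{generated_text}\n")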

This example first defines a list of prompts, each being the opening of a potential story. It then uses the tokenizer and model loaded earlier to generate and print a story for each prompt.

The tokenizer.encode function converts the prompt into a format that the model can understand (i.e., a tensor of integer IDs). The model.generate function then generates a continuation of the prompt, up to a total length of 100 tokens including the prompt. The temperature parameter controls the randomness of the output, with higher values leading to more random output. It only has an effect when sampling is enabled with do_sample=True; with the default greedy decoding, the model always picks the most likely token and the temperature is ignored.

Finally, the tokenizer.decode function is used to convert the output from the model back into human-readable text, and this text is printed to the console.
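Reading the outputs side by side is subjective, so it can help to put a number on how varied they are. One common choice is the distinct-n ratio: the number of unique n-grams divided by the total number of n-grams across all generated texts. The sketch below is a minimal version assuming whitespace tokenization; the function name distinct_n and the sample texts are invented for illustration and are not model outputs.

```python
def distinct_n(texts, n=2):
    """Ratio of unique word n-grams to total word n-grams across texts."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(grams)
        unique.update(grams)
    return len(unique) / total if total else 0.0

# Hand-written stand-ins for generated texts
samples = ["the king rode north", "the king rode south", "a dragon slept"]
print(f"distinct-1: {distinct_n(samples, n=1):.3f}")
print(f"distinct-2: {distinct_n(samples, n=2):.3f}")
```

Values close to 1 indicate that the outputs rarely reuse the same phrases, while values close to 0 indicate heavy repetition across outputs.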

8.4 Evaluating the Model

Evaluating the performance of a text generation model is crucial to ensure that it generates high-quality, coherent, and contextually appropriate text. In this section, we will discuss various methods for evaluating our fine-tuned GPT-2 model, including both quantitative metrics and qualitative assessments. We will also provide example codes to demonstrate these evaluation techniques.

8.4.1 Quantitative Evaluation Metrics

Quantitative metrics provide objective measures of the model's performance. For text generation, common metrics include Perplexity, BLEU score, and ROUGE score. These metrics help assess the fluency, coherence, and relevance of the generated text.

Perplexity

Perplexity measures how well a probability distribution or probability model predicts a sample. Lower perplexity indicates better performance, as it means the model assigns higher probabilities to the actual data.

Example: Calculating Perplexity

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define a function to calculate perplexity
def calculate_perplexity(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        perplexity = torch.exp(loss)
    return perplexity.item()

# Example text for perplexity calculation
text = "The quick brown fox jumps over the lazy dog."
perplexity = calculate_perplexity(text)
print(f"Perplexity: {perplexity}")

First, the pre-trained GPT-2 model and its corresponding tokenizer are loaded. The tokenizer is used to convert input text into a format that the model can understand, while the model itself is used to generate predictions.

Next, a function named calculate_perplexity is defined, which takes in a piece of text as input. Inside this function, the input text is tokenized and converted into PyTorch tensors using the loaded tokenizer. These tensors are then fed into the model, which generates predictions in the form of logits.

The model function is called with the input ids and the labels (which are also the input ids in this case), and it returns the model's loss. The loss is a measure of how well the model's predictions match the actual outcomes. In the context of language modeling, a lower loss means that the model's predicted probabilities for the sequence of words are closer to the actual sequence.

The loss is then used to calculate the perplexity, which is a measure of uncertainty. It is calculated by taking the exponential of the loss. In the context of language models, a lower perplexity is better, as it means the model is more certain of its predictions.

Finally, an example text is provided ("The quick brown fox jumps over the lazy dog.") to demonstrate how to use the calculate_perplexity function. The calculated perplexity is then printed out. This allows users to see how well the model predicts the example text and gives an idea of the overall performance of the model.

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score evaluates the quality of text that has been machine-translated from one language to another. It is also used to evaluate text generation models by comparing the generated text to reference texts.

Example: Calculating BLEU Score

from nltk.translate.bleu_score import sentence_bleu

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox jumps over the lazy dog."

# Calculate BLEU score
bleu_score = sentence_bleu([reference.split()], candidate.split())
print(f"BLEU Score: {bleu_score}")

In this example, the from nltk.translate.bleu_score import sentence_bleu line is importing the required function sentence_bleu from NLTK.

Then, it defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. In this case, the reference and candidate sentences are identical.

The sentence_bleu function is then called with the reference sentence and the candidate sentence as its arguments. The reference sentence is split into individual words using the split() method because BLEU score calculation requires the sentences to be tokenized (i.e., split into individual words).

The result of the function, bleu_score, is the BLEU score of the candidate sentence relative to the reference sentence. The BLEU score is a number between 0 and 1 - a score of 1 means that the candidate sentence perfectly matches the reference sentence, while a score of 0 means that there is no match at all.

In this case, since the reference and candidate sentence are identical, the BLEU score should be 1, indicating a perfect match.

Finally, the BLEU score is printed out with a formatted string.

ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between the generated text and reference texts, focusing on recall. It is commonly used for summarization tasks.

Example: Calculating ROUGE Score

from rouge_score import rouge_scorer

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox leaps over the lazy dog."

# Calculate ROUGE score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1 Score: {scores['rouge1'].fmeasure}")
print(f"ROUGE-L Score: {scores['rougeL'].fmeasure}")

The first step in the code is to import the rouge_scorer from the rouge_score module. This scorer is a tool for calculating the ROUGE scores.

Next, the code defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. Here, the reference is "The quick brown fox jumps over the lazy dog." and the candidate is "The quick brown fox leaps over the lazy dog."

To calculate the ROUGE score, the code creates an instance of RougeScorer, which is initialized with the types of ROUGE scores we want to calculate. In this case, 'rouge1' and 'rougeL' are used. 'rouge1' refers to the overlap of unigrams (single words) between the reference and candidate texts. 'rougeL' uses the longest common subsequence (LCS) based statistics. LCS refers to the longest sequence of words that are the same between the reference and candidate texts, in the same order.

The use_stemmer=True argument means that the scorer will apply stemming to the words before calculating the scores. Stemming is a process of reducing words to their root form, which can help in matching similar words.

The scorer.score(reference, candidate) line is what actually calculates the ROUGE scores. The resulting scores variable is a dictionary that contains the calculated scores for 'rouge1' and 'rougeL'.

The final two lines of the code print the F-measure for 'rouge1' and 'rougeL'. The F-measure, or F1 score, is the harmonic mean of precision and recall, providing a balance between these two measures.

8.4.2 Qualitative Evaluation

Qualitative evaluation involves manually inspecting the generated text to assess its fluency, coherence, and relevance. This method is subjective but provides valuable insights into the model's performance.

Visual Inspection

Visual inspection involves generating a set of texts and examining them for grammatical correctness, coherence, and relevance to the prompt. This can help identify any obvious issues such as repetitive phrases, lack of coherence, or inappropriate content.

Example: Visual Inspection

# Define a prompt
prompt = "In the quiet village of Rivendell,"

# Generate text using the fine-tuned GPT-2 model
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=100, num_return_sequences=1, temperature=0.7)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

This example begins with a predefined prompt, "In the quiet village of Rivendell,". The prompt is encoded into tokens suitable for the model, and then the model generates text up to a maximum length of 100 tokens based on this input. The generated text is then decoded back into human-readable text and printed out.

Human Evaluation

Human evaluation involves asking a group of people to rate the generated texts based on criteria such as coherence, fluency, and relevance. This method provides a more robust assessment of the model's performance but can be time-consuming and resource-intensive.

Example: Human Evaluation Criteria

  • Coherence: Does the text make logical sense and flow naturally?
  • Fluency: Is the text grammatically correct and easy to read?
  • Relevance: Does the text stay on topic and respond appropriately to the prompt?

8.4.3 Evaluating Diversity and Creativity

To assess the diversity and creativity of the generated text, we can analyze the variation in outputs given different prompts or slight variations of the same prompt. This helps ensure that the model does not produce repetitive or overly similar texts.

Example: Evaluating Diversity

# Define a set of similar prompts
prompts = [
    "Once upon a time in a faraway land,",
    "Long ago in a distant kingdom,",
    "In a realm beyond the mountains,",
]

# Generate and print text for each prompt
for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(input_ids, max_length=100, num_return_sequences=1, temperature=0.7)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt {i+1}:\\n{generated_text}\\n")

This example code first defines a list of prompts, each being the starting sentence of a potential story. It then uses a pre-existing tokenizer and model to generate and print out a story for each prompt.

The tokenizer.encode function is used to convert the prompt into a format that the model can understand (i.e., a tensor of integer IDs). The model.generate function is then used to generate a continuation of the prompt up to a length of 100 tokens. The temperature parameter is used to control the randomness of the output (with higher values leading to more random output).

Finally, the tokenizer.decode function is used to convert the output from the model back into human-readable text, and this text is printed to the console.

8.4 Evaluating the Model

Evaluating the performance of a text generation model is crucial to ensure that it generates high-quality, coherent, and contextually appropriate text. In this section, we will discuss various methods for evaluating our fine-tuned GPT-2 model, including both quantitative metrics and qualitative assessments. We will also provide example codes to demonstrate these evaluation techniques.

8.4.1 Quantitative Evaluation Metrics

Quantitative metrics provide objective measures of the model's performance. For text generation, common metrics include Perplexity, BLEU score, and ROUGE score. These metrics help assess the fluency, coherence, and relevance of the generated text.

Perplexity

Perplexity measures how well a probability distribution or probability model predicts a sample. Lower perplexity indicates better performance, as it means the model assigns higher probabilities to the actual data.

Example: Calculating Perplexity

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the pre-trained GPT-2 model and tokenizer
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Define a function to calculate perplexity
def calculate_perplexity(text):
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
        outputs = model(input_ids, labels=input_ids)
        loss = outputs.loss
        perplexity = torch.exp(loss)
    return perplexity.item()

# Example text for perplexity calculation
text = "The quick brown fox jumps over the lazy dog."
perplexity = calculate_perplexity(text)
print(f"Perplexity: {perplexity}")

First, the pre-trained GPT-2 model and its corresponding tokenizer are loaded. The tokenizer is used to convert input text into a format that the model can understand, while the model itself is used to generate predictions.

Next, a function named calculate_perplexity is defined, which takes in a piece of text as input. Inside this function, the input text is tokenized and converted into PyTorch tensors using the loaded tokenizer. These tensors are then fed into the model, which generates predictions in the form of logits.

The model function is called with the input ids and the labels (which are also the input ids in this case), and it returns the model's loss. The loss is a measure of how well the model's predictions match the actual outcomes. In the context of language modeling, a lower loss means that the model's predicted probabilities for the sequence of words are closer to the actual sequence.

The loss is then used to calculate the perplexity, which is a measure of uncertainty. It is calculated by taking the exponential of the loss. In the context of language models, a lower perplexity is better, as it means the model is more certain of its predictions.

Finally, an example text is provided ("The quick brown fox jumps over the lazy dog.") to demonstrate how to use the calculate_perplexity function. The calculated perplexity is then printed out. This allows users to see how well the model predicts the example text and gives an idea of the overall performance of the model.

BLEU Score

The BLEU (Bilingual Evaluation Understudy) score evaluates the quality of text that has been machine-translated from one language to another. It is also used to evaluate text generation models by comparing the generated text to reference texts.

Example: Calculating BLEU Score

from nltk.translate.bleu_score import sentence_bleu

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox jumps over the lazy dog."

# Calculate BLEU score
bleu_score = sentence_bleu([reference.split()], candidate.split())
print(f"BLEU Score: {bleu_score}")

In this example, the from nltk.translate.bleu_score import sentence_bleu line is importing the required function sentence_bleu from NLTK.

Then, it defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. In this case, the reference and candidate sentences are identical.

The sentence_bleu function is then called with the reference sentence and the candidate sentence as its arguments. The reference sentence is split into individual words using the split() method because BLEU score calculation requires the sentences to be tokenized (i.e., split into individual words).

The result of the function, bleu_score, is the BLEU score of the candidate sentence relative to the reference sentence. The BLEU score is a number between 0 and 1 - a score of 1 means that the candidate sentence perfectly matches the reference sentence, while a score of 0 means that there is no match at all.

In this case, since the reference and candidate sentence are identical, the BLEU score should be 1, indicating a perfect match.

Finally, the BLEU score is printed out with a formatted string.

ROUGE Score

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures the overlap between the generated text and reference texts, focusing on recall. It is commonly used for summarization tasks.

Example: Calculating ROUGE Score

from rouge_score import rouge_scorer

# Reference and candidate texts
reference = "The quick brown fox jumps over the lazy dog."
candidate = "The quick brown fox leaps over the lazy dog."

# Calculate ROUGE score
scorer = rouge_scorer.RougeScorer(['rouge1', 'rougeL'], use_stemmer=True)
scores = scorer.score(reference, candidate)
print(f"ROUGE-1 Score: {scores['rouge1'].fmeasure}")
print(f"ROUGE-L Score: {scores['rougeL'].fmeasure}")

The first step in the code is to import the rouge_scorer from the rouge_score module. This scorer is a tool for calculating the ROUGE scores.

Next, the code defines two sentences - the 'reference' sentence and the 'candidate' sentence. The reference sentence is the text that we consider the correct version, while the candidate sentence is the machine-generated text that we want to evaluate. Here, the reference is "The quick brown fox jumps over the lazy dog." and the candidate is "The quick brown fox leaps over the lazy dog."

To calculate the ROUGE score, the code creates an instance of RougeScorer, which is initialized with the types of ROUGE scores we want to calculate. In this case, 'rouge1' and 'rougeL' are used. 'rouge1' refers to the overlap of unigrams (single words) between the reference and candidate texts. 'rougeL' uses the longest common subsequence (LCS) based statistics. LCS refers to the longest sequence of words that are the same between the reference and candidate texts, in the same order.

The use_stemmer=True argument means that the scorer will apply stemming to the words before calculating the scores. Stemming is a process of reducing words to their root form, which can help in matching similar words.

The scorer.score(reference, candidate) line is what actually calculates the ROUGE scores. The resulting scores variable is a dictionary that contains the calculated scores for 'rouge1' and 'rougeL'.

The final two lines of the code print the F-measure for 'rouge1' and 'rougeL'. The F-measure, or F1 score, is the harmonic mean of precision and recall, providing a balance between these two measures.

8.4.2 Qualitative Evaluation

Qualitative evaluation involves manually inspecting the generated text to assess its fluency, coherence, and relevance. This method is subjective but provides valuable insights into the model's performance.

Visual Inspection

Visual inspection involves generating a set of texts and examining them for grammatical correctness, coherence, and relevance to the prompt. This can help identify any obvious issues such as repetitive phrases, lack of coherence, or inappropriate content.

Example: Visual Inspection

# Define a prompt
prompt = "In the quiet village of Rivendell,"

# Generate text using the fine-tuned GPT-2 model
input_ids = tokenizer.encode(prompt, return_tensors='pt')
output = model.generate(input_ids, max_length=100, num_return_sequences=1, temperature=0.7)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)

# Print the generated text
print(generated_text)

This example begins with a predefined prompt, "In the quiet village of Rivendell,". The prompt is encoded into tokens suitable for the model, and then the model generates text up to a maximum length of 100 tokens based on this input. The generated text is then decoded back into human-readable text and printed out.

Human Evaluation

Human evaluation involves asking a group of people to rate the generated texts based on criteria such as coherence, fluency, and relevance. This method provides a more robust assessment of the model's performance but can be time-consuming and resource-intensive.

Example: Human Evaluation Criteria

  • Coherence: Does the text make logical sense and flow naturally?
  • Fluency: Is the text grammatically correct and easy to read?
  • Relevance: Does the text stay on topic and respond appropriately to the prompt?

8.4.3 Evaluating Diversity and Creativity

To assess the diversity and creativity of the generated text, we can analyze the variation in outputs given different prompts or slight variations of the same prompt. This helps ensure that the model does not produce repetitive or overly similar texts.

Example: Evaluating Diversity

# Define a set of similar prompts
prompts = [
    "Once upon a time in a faraway land,",
    "Long ago in a distant kingdom,",
    "In a realm beyond the mountains,",
]

# Generate and print text for each prompt
for i, prompt in enumerate(prompts):
    input_ids = tokenizer.encode(prompt, return_tensors='pt')
    output = model.generate(input_ids, max_length=100, num_return_sequences=1, temperature=0.7)
    generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
    print(f"Prompt {i+1}:\\n{generated_text}\\n")

This example code first defines a list of prompts, each being the starting sentence of a potential story. It then uses a pre-existing tokenizer and model to generate and print out a story for each prompt.

The tokenizer.encode function is used to convert the prompt into a format that the model can understand (i.e., a tensor of integer IDs). The model.generate function is then used to generate a continuation of the prompt up to a length of 100 tokens. The temperature parameter is used to control the randomness of the output (with higher values leading to more random output).

Finally, the tokenizer.decode function is used to convert the output from the model back into human-readable text, and this text is printed to the console.
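
Reading the outputs side by side is informative, but a single number can help when comparing many generations. A common choice is distinct-n, the ratio of unique n-grams to total n-grams across a set of texts. It is not part of the original example. The sketch below assumes simple whitespace tokenization and uses a hypothetical `generated_texts` list in place of real model output.

```python
def distinct_n(texts, n=2):
    """Ratio of unique n-grams to total n-grams over all texts (0.0 if none)."""
    ngrams = []
    for text in texts:
        tokens = text.lower().split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0

# Hypothetical stand-ins for model generations (illustration only)
generated_texts = [
    "the king rode to the castle",
    "the king rode to the forest",
]

print(f"distinct-1: {distinct_n(generated_texts, n=1):.2f}")
print(f"distinct-2: {distinct_n(generated_texts, n=2):.2f}")
```

Values close to 1 mean that almost every n-gram is unique, while values close to 0 indicate heavy repetition within or across the generated texts.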
