NLP con Transformers, técnicas avanzadas y aplicaciones multimodales

Project 2: Text Summarization with T5

Base Project Implementation

Text summarization stands as a crucial application within Natural Language Processing (NLP), revolutionizing how we handle and process large volumes of information. This technology enables both individuals and organizations to automatically distill key points from extensive documents, making information processing more efficient and accessible. The applications are vast and growing, from condensing academic literature and legal documents to creating brief versions of news articles and corporate reports.

What makes text summarization particularly powerful is its ability to identify and extract the most salient information while maintaining context and coherence. Modern summarization systems can process multiple languages, understand complex contexts, and even adapt their output style based on the target audience. This versatility has made it an essential tool in today's information-rich environment.

In this project, we will dive deep into text summarization using T5 (Text-to-Text Transfer Transformer), a cutting-edge model that has redefined the landscape of natural language processing. T5's unique architecture treats every NLP task as a text-to-text conversion problem, which provides remarkable flexibility and effectiveness for summarization tasks. The model can generate both extractive summaries (selecting and combining existing sentences) and abstractive summaries (creating new, condensed text that captures the essence of the original content).

We'll be working with Hugging Face's Transformers library, a powerful toolkit that makes it easier to implement and fine-tune state-of-the-art transformer models. This library provides a robust framework for customizing the summarization process to meet specific needs and requirements.

While T5 is capable of both extractive and abstractive summarization, this project focuses specifically on abstractive summarization, where the model generates new text that captures the essence of the input rather than selecting and combining existing sentences. This approach showcases T5's advanced natural language generation capabilities and allows for more flexible and concise summaries. Extractive techniques, while valuable, are typically built on different approaches (such as sentence scoring and ranking) and will not be covered in this implementation.
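To make the distinction concrete, here is a minimal, hypothetical sketch of an extractive baseline (the "lead-N" heuristic, which simply returns the first sentences verbatim); an abstractive model like T5 would instead generate new wording. The function and example text below are illustrative, not part of the project code:

```python
import re

def lead_n_extractive_summary(text, n=1):
    """A naive extractive baseline: return the first n sentences verbatim."""
    # Split on sentence-ending punctuation followed by whitespace (simplistic).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:n])

doc = ("T5 treats every NLP task as text-to-text. "
       "It was introduced by Google Research. "
       "The model is trained on the C4 corpus.")

print(lead_n_extractive_summary(doc, n=1))
# → "T5 treats every NLP task as text-to-text."
# An abstractive system, by contrast, would produce new phrasing, e.g.
# "Google's T5, trained on C4, frames all NLP tasks as text-to-text."
```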

Project Goals

By completing this project, you will:

  1. Understand the principles of text summarization, including:
    • The difference between extractive and abstractive summarization
    • Key algorithms and techniques used in modern summarization systems
    • Evaluation metrics for measuring summary quality
  2. Learn how to load and use a pretrained T5 model for summarization tasks, covering:
    • Model architecture and components
    • Data preprocessing techniques
    • Implementation of the summarization pipeline
  3. Experiment with hyperparameters to adjust the style and conciseness of summaries, including:
    • Length control mechanisms
    • Beam search optimization
    • Output quality enhancement techniques
  4. Gain practical experience generating summaries for various types of content, such as:
    • News articles and blog posts
    • Technical documentation
    • Academic papers
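Goal 1 above mentions evaluation metrics; the most common family for summarization is ROUGE. As a rough illustration only (in practice you would use a maintained package such as `rouge-score`), ROUGE-1 F1 can be computed from clipped unigram overlap between a candidate summary and a reference:

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Toy ROUGE-1 F1: clipped unigram overlap between candidate and reference."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # unigram matches, clipped per word
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(round(rouge1_f1("ai transforms industries", "ai is transforming industries"), 3))
# → 0.571  (2 overlapping unigrams; precision 2/3, recall 2/4)
```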

This comprehensive project provides a hands-on opportunity to master text summarization technology, from theoretical foundations to practical implementation. You'll develop skills that are increasingly valuable in today's data-driven world, where efficient information processing is crucial for decision-making and knowledge management.

Here is the base project code for text summarization using T5. This implementation includes preprocessing, summarizing text, handling long inputs, and optimizing output with detailed explanations.

# Importing the required libraries
from transformers import T5Tokenizer, T5ForConditionalGeneration

# Step 1: Load the T5 tokenizer and model
model_name = "t5-small"  # Choose the T5 variant; 't5-small' is lightweight and efficient
tokenizer = T5Tokenizer.from_pretrained(model_name)
model = T5ForConditionalGeneration.from_pretrained(model_name)
print("T5 model and tokenizer loaded successfully!")

# Step 2: Function to summarize text
def summarize_text(text, max_length=50, min_length=20, length_penalty=2.0, num_beams=4):
    """
    Summarizes the input text using the T5 model.

    Parameters:
    - text (str): The input text to summarize.
    - max_length (int): The maximum length of the summary.
    - min_length (int): The minimum length of the summary.
    - length_penalty (float): Length penalty (higher values prioritize longer summaries).
    - num_beams (int): Number of beams for beam search.

    Returns:
    - str: The summarized text.
    """
    input_text = "summarize: " + text  # Prefix with the task-specific token
    inputs = tokenizer(input_text, return_tensors="pt", max_length=512, truncation=True)  # Tokenize input (truncated to 512 tokens)
    summary_ids = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,  # Pass the mask explicitly so padding is handled correctly
        max_length=max_length,
        min_length=min_length,
        length_penalty=length_penalty,
        num_beams=num_beams,
        early_stopping=True
    )  # Generate summary
    summary = tokenizer.decode(summary_ids[0], skip_special_tokens=True)  # Decode summary
    return summary

# Step 3: Function to handle long text inputs
def summarize_long_text(text, chunk_size=512, max_length=50, min_length=20, length_penalty=2.0, num_beams=4):
    """
    Summarizes long text by splitting it into smaller chunks.

    Parameters:
    - text (str): The long input text to summarize.
    - chunk_size (int): The maximum size of each text chunk.
    - max_length (int): The maximum length of the summary for each chunk.
    - min_length (int): The minimum length of the summary for each chunk.
    - length_penalty (float): Length penalty for beam search.
    - num_beams (int): Number of beams for beam search.

    Returns:
    - str: The concatenated summaries of all chunks.
    """
    chunks = [text[i:i+chunk_size] for i in range(0, len(text), chunk_size)]  # Split into chunks of chunk_size characters (characters, not tokens)
    summaries = []
    for chunk in chunks:
        summary = summarize_text(
            chunk,
            max_length=max_length,
            min_length=min_length,
            length_penalty=length_penalty,
            num_beams=num_beams
        )
        summaries.append(summary)
    return " ".join(summaries)

# Step 4: Example usage
if __name__ == "__main__":
    # Sample text
    long_text = """
    Artificial intelligence (AI) is transforming industries by automating tasks, improving efficiency,
    and creating new opportunities. In healthcare, AI-driven systems assist doctors in diagnostics and
    treatment recommendations. Meanwhile, in finance, AI-powered algorithms analyze market trends and
    optimize investment strategies. Across industries, AI is becoming a cornerstone of innovation.
    """

    # Summarize a shorter text
    short_summary = summarize_text(long_text)
    print("Short Summary:")
    print(short_summary)

    # Summarize a longer text
    longer_text = long_text * 5  # Simulating a longer document
    long_summary = summarize_long_text(longer_text)
    print("\nLong Summary:")
    print(long_summary)

Let's break down this code:

Core Components and Setup

The code uses the T5 (Text-to-Text Transfer Transformer) model for text summarization. It starts by importing and initializing two main components:

  • The T5 tokenizer, which converts text to and from token IDs
  • The T5ForConditionalGeneration model itself, here the 't5-small' variant, which is lightweight and efficient

Key Functions

The code implements two main functions:

  1. summarize_text():
  • Takes input text and summarization parameters
  • Prefixes the input with "summarize:" to indicate the task
  • Processes text using parameters like max_length, min_length, length_penalty, and num_beams to control summary generation
  2. summarize_long_text():
  • Handles longer texts by breaking them into manageable chunks
  • Processes each chunk separately and combines the results
  • Uses the same parameters as summarize_text() plus a chunk_size parameter
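Note that summarize_long_text() splits on raw characters, which can cut a word (or sentence) in half at a chunk boundary. A slightly more careful, hypothetical variant splits on whitespace-delimited words instead; chunk_by_words below is an illustrative helper, not part of the project code:

```python
def chunk_by_words(text, max_words=200):
    """Split text into chunks of at most max_words whitespace-delimited words,
    so no word is cut in half at a chunk boundary."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

print(chunk_by_words("one two three four five", max_words=2))
# → ['one two', 'three four', 'five']
```

Each chunk could then be passed to summarize_text() exactly as in the character-based version.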

Key Parameters

The code uses several important parameters to control summarization:

  • max_length/min_length: Control summary length
  • num_beams: Controls beam search for better quality output
  • length_penalty: Influences whether to favor shorter or longer summaries
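To see why length_penalty behaves this way: in beam search, candidate scores are (roughly) the sum of token log-probabilities normalized by the sequence length raised to length_penalty. Since log-probabilities are negative, a larger exponent makes longer candidates score relatively better. A simplified sketch of that scoring rule (the exact formula in Transformers differs slightly):

```python
def normalized_score(token_logprobs, length_penalty):
    """Simplified beam-search score: sum of token log-probs divided by
    length ** length_penalty."""
    return sum(token_logprobs) / (len(token_logprobs) ** length_penalty)

short = [-0.5, -0.5]              # 2 tokens, total log-prob -1.0
long_ = [-0.5, -0.5, -0.5, -0.5]  # 4 tokens, total log-prob -2.0

for lp in (0.5, 2.0):
    s, l = normalized_score(short, lp), normalized_score(long_, lp)
    winner = "longer" if l > s else "shorter"
    print(f"length_penalty={lp}: short={s:.3f}, long={l:.3f} -> {winner} wins")
```

With a low penalty the shorter candidate wins; with a high penalty the longer one does, which matches the docstring's note that higher values prioritize longer summaries.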

The code includes example usage with both short and long texts, demonstrating how to handle different input lengths and generate appropriate summaries.

Example Usage

The script demonstrates usage with both shorter and longer texts:

short_summary = summarize_text(long_text)
print("Short Summary:")
print(short_summary)

long_summary = summarize_long_text(longer_text)
print("\nLong Summary:")
print(long_summary)
  • The shorter text is summarized directly.
  • Longer text is expanded for demonstration and summarized chunk by chunk.
