Chapter 13: Project: Sentiment Analysis Dashboard

13.2 Data Collection and Preprocessing

Data collection and preprocessing are crucial steps in building a sentiment analysis dashboard. The quality of the data collected and how it is processed directly impact the performance and reliability of the sentiment analysis model. In this section, we will discuss how to collect text data from various sources and preprocess it to make it suitable for sentiment analysis.

13.2.1 Collecting Data

To perform sentiment analysis, we need a dataset consisting of text samples with labeled sentiments (positive, negative, or neutral). There are several sources where we can collect such data:

  1. Public Datasets: There are many publicly available datasets for sentiment analysis. Some popular ones include the IMDB Movie Reviews dataset, the Twitter Sentiment140 dataset, and the Yelp Reviews dataset.
  2. User-Generated Data: Allow users to upload their own text data for sentiment analysis. This could include customer reviews, social media posts, survey responses, etc.

Example: Using the IMDB Movie Reviews Dataset

For this project, we will use the IMDB Movie Reviews dataset, which contains 50,000 movie reviews labeled as positive or negative. You can download the dataset from Kaggle.

Downloading and Loading the Dataset

Let's download the dataset and load it into our project.

data_preprocessing.py:

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the IMDB Movie Reviews dataset
data = pd.read_csv('data/raw_data/IMDB_Dataset.csv')

# Display the first few rows of the dataset
print(data.head())

# Split the dataset into training and test sets
train_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

# Save the training and test sets
train_data.to_csv('data/processed_data/train_data.csv', index=False)
test_data.to_csv('data/processed_data/test_data.csv', index=False)

In this script, we load the IMDB Movie Reviews dataset, display the first few rows to understand its structure, and split it into training and test sets. The training set will be used to train the sentiment analysis model, while the test set will be used to evaluate its performance.

13.2.2 Preprocessing Data

Preprocessing is essential for converting raw text data into a format suitable for training machine learning models. The preprocessing pipeline includes text normalization, tokenization, stop word removal, lemmatization, and vectorization.

Text Normalization and Tokenization

Text normalization involves converting text to lowercase and removing punctuation. Tokenization is the process of splitting text into individual words or tokens.
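
As a quick illustration (a minimal sketch, assuming the NLTK punkt tokenizer data has already been downloaded), normalizing and tokenizing a single sentence looks like this:

import nltk
import string

nltk.download('punkt')

sentence = "This movie was GREAT -- I loved it!"
# Normalize: lowercase the text and strip punctuation characters
normalized = sentence.lower().translate(str.maketrans('', '', string.punctuation))
# Tokenize: split the normalized text into word tokens
tokens = nltk.word_tokenize(normalized)
print(tokens)  # e.g. ['this', 'movie', 'was', 'great', 'i', 'loved', 'it']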

Stop Word Removal

Stop words are common words that do not contribute significantly to the meaning of the text. Removing them helps focus on the important words.
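
For example (a small sketch, assuming the NLTK stopwords corpus has been downloaded), the tokens from the previous snippet can be filtered against NLTK's English stop word list:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
tokens = ['this', 'movie', 'was', 'great', 'i', 'loved', 'it']
# Keep only tokens that are not in the stop word list
filtered = [word for word in tokens if word not in stop_words]
print(filtered)  # e.g. ['movie', 'great', 'loved']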

Lemmatization

Lemmatization reduces words to their base or dictionary form (the lemma), so that different inflected forms of a word are treated as a single term.
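
For instance (assuming the WordNet corpus has been downloaded), the WordNet lemmatizer maps plurals and inflected verb forms back to a common base form; note that it treats every word as a noun unless a part-of-speech tag is supplied:

from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('movies'))            # 'movie' (noun is the default POS)
print(lemmatizer.lemmatize('running', pos='v'))  # 'run'
print(lemmatizer.lemmatize('better', pos='a'))   # 'good'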

Vectorization

Vectorization converts text into numerical representations, which are used as input for machine learning models. We will use the TF-IDF vectorizer for this purpose.
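
As a small sketch of the idea, TF-IDF turns each document into a weighted word-count vector; on a toy corpus of two documents:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'great movie loved it',
    'terrible movie hated it',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())  # vocabulary learned from the corpus (scikit-learn 1.0+)
print(X.toarray())                         # one TF-IDF-weighted row per document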

Preprocessing Implementation

Let's implement the preprocessing steps in Python.

data_preprocessing.py (continued):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
import string
import pickle

# Download NLTK resources
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

# Initialize the lemmatizer and build the stop word set once
# (much faster than calling stopwords.words inside the token loop)
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# Define preprocessing function
def preprocess_text(text):
    # Convert text to lowercase
    text = text.lower()
    # Tokenize text
    tokens = nltk.word_tokenize(text)
    # Remove punctuation and stop words
    tokens = [word for word in tokens if word not in string.punctuation and word not in stop_words]
    # Lemmatize tokens
    tokens = [lemmatizer.lemmatize(word) for word in tokens]
    return ' '.join(tokens)

# Apply preprocessing to training and test sets
train_data['review'] = train_data['review'].apply(preprocess_text)
test_data['review'] = test_data['review'].apply(preprocess_text)

# Save preprocessed data
train_data.to_csv('data/processed_data/train_data_preprocessed.csv', index=False)
test_data.to_csv('data/processed_data/test_data_preprocessed.csv', index=False)

# Vectorize the preprocessed text
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_data['review']).toarray()
X_test = vectorizer.transform(test_data['review']).toarray()

# Save the vectorizer and vectorized data
with open('models/vectorizer.pickle', 'wb') as file:
    pickle.dump(vectorizer, file)
with open('data/processed_data/X_train.pickle', 'wb') as file:
    pickle.dump(X_train, file)
with open('data/processed_data/X_test.pickle', 'wb') as file:
    pickle.dump(X_test, file)

In this script, we define a preprocessing function that normalizes, tokenizes, removes stop words, and lemmatizes the text. We apply this function to the training and test sets, then vectorize the preprocessed text using the TF-IDF vectorizer. The vectorized data and vectorizer are saved for future use.
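
As a quick sanity check (using a made-up review, not one from the dataset), you can call preprocess_text on a single string and inspect the output:

sample = "The movies were surprisingly good, and the actors were brilliant!"
print(preprocess_text(sample))
# e.g. 'movie surprisingly good actor brilliant'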

13.2.3 Handling Imbalanced Data

In real-world datasets, the distribution of sentiment labels might be imbalanced, meaning there are more instances of one sentiment (e.g., positive) than another (e.g., negative). Handling imbalanced data is crucial for training a robust sentiment analysis model.
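
Before resampling, it is worth checking how skewed the labels actually are. A minimal check (the IMDB dataset itself is roughly balanced, but user-supplied data may not be):

# Inspect the label distribution of the training set
print(train_data['sentiment'].value_counts())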

Example: Handling Imbalanced Data using SMOTE

We can use the Synthetic Minority Over-sampling Technique (SMOTE) to balance the dataset by generating synthetic samples for the minority class.

data_preprocessing.py (continued):

from imblearn.over_sampling import SMOTE

# Load preprocessed data
train_data = pd.read_csv('data/processed_data/train_data_preprocessed.csv')
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')

# Extract features and labels
# (reviews that became empty strings are read back from CSV as NaN, so fill them)
X_train = train_data['review'].fillna('')
y_train = train_data['sentiment']
X_test = test_data['review'].fillna('')
y_test = test_data['sentiment']

# Vectorize the preprocessed text
X_train_vectorized = vectorizer.fit_transform(X_train).toarray()
X_test_vectorized = vectorizer.transform(X_test).toarray()

# Balance the dataset using SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train_vectorized, y_train)

# Save the balanced data
with open('data/processed_data/X_train_balanced.pickle', 'wb') as file:
    pickle.dump(X_train_balanced, file)
with open('data/processed_data/y_train_balanced.pickle', 'wb') as file:
    pickle.dump(y_train_balanced, file)

In this script, we use SMOTE to balance the training dataset by generating synthetic samples for the minority class. The balanced dataset is then saved for future use.
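
To confirm that the resampling worked, you can compare the label counts before and after (a quick check using Python's Counter):

from collections import Counter

print('Before SMOTE:', Counter(y_train))
print('After SMOTE: ', Counter(y_train_balanced))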

To this point, we have covered the essential steps of data collection and preprocessing for building a sentiment analysis dashboard. We discussed how to collect text data from various sources, using the IMDB Movie Reviews dataset as an example.

We implemented a comprehensive preprocessing pipeline that includes text normalization, tokenization, stop word removal, lemmatization, and vectorization. Additionally, we addressed handling imbalanced data using SMOTE to ensure our sentiment analysis model is trained on a balanced dataset.

These steps ensure that the text data is clean, balanced, and suitable for training a sentiment analysis model.
