
Chapter 13: Project: Sentiment Analysis Dashboard

13.3 Building and Training Sentiment Analysis Models

Building and training sentiment analysis models is a crucial step in developing a sentiment analysis dashboard. These models analyze the sentiment of text data and classify it as positive, negative, or neutral. In this section, we will discuss how to build and train sentiment analysis models using machine learning and deep learning techniques. We will also provide example code to guide you through the process.

13.3.1 Choosing the Right Model

Choosing the right model for sentiment analysis depends on several factors, including the size of the dataset, the complexity of the text data, and the desired accuracy. We will explore two approaches: traditional machine learning models and deep learning models.

1. Traditional Machine Learning Models

Traditional machine learning models, such as Logistic Regression, Support Vector Machines (SVM), and Naive Bayes, are effective for text classification tasks and are relatively easy to implement and interpret (a minimal baseline sketch appears just after this list).

2. Deep Learning Models

Deep learning models, such as Recurrent Neural Networks (RNNs), Long Short-Term Memory Networks (LSTMs), and Bidirectional Encoder Representations from Transformers (BERT), can capture complex patterns in text data and often achieve higher accuracy. However, they require more computational resources and training time.
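
To make the "easy to implement" point concrete, here is a minimal sketch of a Naive Bayes baseline built as a single scikit-learn Pipeline. The file name, the toy reviews, and the raw_texts/labels variables are illustrative placeholders, not part of the project's data files.

nb_baseline.py (illustrative sketch):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# A few toy reviews and labels (1 = positive, 0 = negative) just to show the workflow
raw_texts = [
    'I absolutely loved this movie, great acting and story',
    'What a waste of time, the plot made no sense',
    'Brilliant film, I would happily watch it again',
    'Terrible pacing and wooden performances'
]
labels = [1, 0, 1, 0]

# A Pipeline bundles vectorization and classification into one estimator
baseline = Pipeline([
    ('tfidf', TfidfVectorizer()),   # convert raw text to TF-IDF features
    ('clf', MultinomialNB())        # simple, fast probabilistic classifier
])

baseline.fit(raw_texts, labels)
print(baseline.predict(['the acting was wonderful']))

Because the whole pipeline is a single estimator, it can be cross-validated, tuned, and pickled as one object, which is a large part of what makes these models easy to work with.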

13.3.2 Implementing Machine Learning Models

Let's start by implementing a machine learning model for sentiment analysis. We will use Logistic Regression as an example.

logistic_regression.py:

import pandas as pd
import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

# Load balanced training data
with open('data/processed_data/X_train_balanced.pickle', 'rb') as file:
    X_train = pickle.load(file)
with open('data/processed_data/y_train_balanced.pickle', 'rb') as file:
    y_train = pickle.load(file)

# Load test data
with open('data/processed_data/X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')
y_test = test_data['sentiment']

# Train a Logistic Regression model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Save the trained model
with open('models/logistic_regression_model.pickle', 'wb') as file:
    pickle.dump(model, file)

# Evaluate the model on the test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

In this script, we train a Logistic Regression model on the balanced training data and evaluate its performance on the test set. The trained model is saved for future use.
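
The script above assumes that X_train_balanced.pickle and X_test.pickle already contain numeric feature matrices produced during preprocessing in the earlier sections of this chapter. As a reminder of what that step might look like, the following is a hedged sketch that vectorizes preprocessed review texts with TF-IDF; the train_texts and test_texts lists and the output file names are placeholders for illustration.

vectorize_features.py (illustrative sketch):

import pickle
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder lists of preprocessed review texts; in the project these would
# come from train_data_preprocessed.csv and test_data_preprocessed.csv
train_texts = ['great movie loved it', 'boring and predictable plot']
test_texts = ['an instant classic', 'not worth the ticket price']

# Fit the vectorizer on the training texts only, then transform both splits
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)

# Persist the matrices and the fitted vectorizer so later scripts can reuse them
with open('data/processed_data/X_train.pickle', 'wb') as file:
    pickle.dump(X_train, file)
with open('data/processed_data/X_test.pickle', 'wb') as file:
    pickle.dump(X_test, file)
with open('models/tfidf_vectorizer.pickle', 'wb') as file:
    pickle.dump(vectorizer, file)

Fitting the vectorizer on the training split only prevents information from the test set from leaking into the features.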

13.3.3 Implementing Deep Learning Models

Next, let's implement a deep learning model for sentiment analysis. We will use an LSTM model as an example.

lstm_model.py:

import numpy as np
import pandas as pd
import pickle
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import accuracy_score, classification_report

# Load preprocessed data
train_data = pd.read_csv('data/processed_data/train_data_preprocessed.csv')
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')

# Extract features and labels
X_train = train_data['review']
y_train = train_data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)
X_test = test_data['review']
y_test = test_data['sentiment'].apply(lambda x: 1 if x == 'positive' else 0)

# Tokenize and pad sequences
tokenizer = Tokenizer(num_words=5000, oov_token='<OOV>')
tokenizer.fit_on_texts(X_train)
word_index = tokenizer.word_index

X_train_sequences = tokenizer.texts_to_sequences(X_train)
X_test_sequences = tokenizer.texts_to_sequences(X_test)

max_length = 200
X_train_padded = pad_sequences(X_train_sequences, maxlen=max_length, padding='post', truncating='post')
X_test_padded = pad_sequences(X_test_sequences, maxlen=max_length, padding='post', truncating='post')

# Build the LSTM model
embedding_dim = 100
model = Sequential([
    Embedding(input_dim=5000, output_dim=embedding_dim, input_length=max_length),
    LSTM(128, return_sequences=True),
    Dropout(0.2),
    LSTM(64),
    Dropout(0.2),
    Dense(1, activation='sigmoid')
])

# Compile the model
model.compile(optimizer=Adam(learning_rate=0.001), loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
history = model.fit(X_train_padded, y_train, epochs=10, batch_size=32, validation_split=0.2)

# Save the trained model and tokenizer
model.save('models/lstm_model.h5')
with open('models/tokenizer.pickle', 'wb') as file:
    pickle.dump(tokenizer, file)

# Evaluate the model on the test set
y_pred_prob = model.predict(X_test_padded)
y_pred = (y_pred_prob > 0.5).astype(int).flatten()  # threshold probabilities at 0.5 and flatten to a 1-D label array
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

In this script, we build and train an LSTM model for sentiment analysis. The text data is tokenized and padded to a fixed length, and the model is trained on the padded sequences. The trained model and tokenizer are saved for future use. The model is then evaluated on the test set to measure its performance.
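
Once the model and tokenizer are saved, the dashboard can load them to score new text. The snippet below is a minimal inference sketch; the example review is made up, and the file paths and maximum sequence length match those used in lstm_model.py.

predict_sentiment.py (illustrative sketch):

import pickle
from tensorflow.keras.models import load_model
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained LSTM model and the tokenizer fitted during training
model = load_model('models/lstm_model.h5')
with open('models/tokenizer.pickle', 'rb') as file:
    tokenizer = pickle.load(file)

# Preprocess new reviews the same way the training data was preprocessed
new_reviews = ['The film was surprisingly good and the cast was excellent']
sequences = tokenizer.texts_to_sequences(new_reviews)
padded = pad_sequences(sequences, maxlen=200, padding='post', truncating='post')

# Probabilities above 0.5 are treated as positive sentiment
probabilities = model.predict(padded)
labels = ['positive' if p > 0.5 else 'negative' for p in probabilities.flatten()]
print(list(zip(new_reviews, labels)))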

13.3.4 Hyperparameter Tuning

Hyperparameter tuning is essential for optimizing the performance of machine learning and deep learning models. It involves selecting the best combination of hyperparameters for the model.

Example: Hyperparameter Tuning using GridSearchCV

We can use GridSearchCV from the sklearn library to perform hyperparameter tuning for the Logistic Regression model.

hyperparameter_tuning.py:

import pickle
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Load the balanced training data prepared earlier
with open('data/processed_data/X_train_balanced.pickle', 'rb') as file:
    X_train = pickle.load(file)
with open('data/processed_data/y_train_balanced.pickle', 'rb') as file:
    y_train = pickle.load(file)

# Define hyperparameters to tune
param_grid = {
    'C': [0.01, 0.1, 1, 10, 100],
    'solver': ['lbfgs', 'liblinear']
}

# Initialize GridSearchCV with 5-fold cross-validation
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='accuracy', verbose=1)

# Perform hyperparameter tuning
grid_search.fit(X_train, y_train)

# Print the best parameters and the corresponding score
print(f'Best Parameters: {grid_search.best_params_}')
print(f'Best Score: {grid_search.best_score_}')

# Save the best model
best_model = grid_search.best_estimator_
with open('models/best_logistic_regression_model.pickle', 'wb') as file:
    pickle.dump(best_model, file)

In this script, we define a grid of hyperparameters for the Logistic Regression model and use GridSearchCV to find the best combination. The best model is saved for future use.
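
When the grid grows, an exhaustive search quickly becomes expensive. A common alternative is RandomizedSearchCV, which evaluates a fixed number of randomly sampled parameter combinations. The sketch below assumes the same X_train and y_train as in hyperparameter_tuning.py; the distribution bounds and the n_iter value are illustrative choices, not project requirements.

randomized_tuning.py (illustrative sketch):

from scipy.stats import loguniform
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

# Sample C from a log-uniform distribution instead of a fixed grid
param_distributions = {
    'C': loguniform(1e-3, 1e3),
    'solver': ['lbfgs', 'liblinear']
}

random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions,
    n_iter=20,          # number of sampled parameter combinations
    cv=5,
    scoring='accuracy',
    random_state=42,
    verbose=1
)

# X_train and y_train are the balanced training data loaded in hyperparameter_tuning.py
random_search.fit(X_train, y_train)
print(f'Best Parameters: {random_search.best_params_}')
print(f'Best Score: {random_search.best_score_}')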

13.3.5 Evaluating Model Performance

Evaluating the performance of sentiment analysis models is essential to understand their strengths and weaknesses. We will use various metrics, including accuracy, precision, recall, F1-score, and confusion matrix.

Example: Model Evaluation

evaluate_model.py:

import pickle
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, ConfusionMatrixDisplay

# Load the best Logistic Regression model
with open('models/best_logistic_regression_model.pickle', 'rb') as file:
    best_model = pickle.load(file)

# Load the test features and labels
with open('data/processed_data/X_test.pickle', 'rb') as file:
    X_test = pickle.load(file)
test_data = pd.read_csv('data/processed_data/test_data_preprocessed.csv')
y_test = test_data['sentiment']

# Predict on the test set
y_pred = best_model.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
print(classification_report(y_test, y_pred))

# Plot confusion matrix
cm = confusion_matrix(y_test, y_pred, labels=best_model.classes_)  # use the label order learned by the model
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=best_model.classes_)
disp.plot(cmap=plt.cm.Blues)
plt.show()

In this script, we evaluate the best Logistic Regression model on the test set and print various metrics. We also plot the confusion matrix to visualize the performance.
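
classification_report already prints precision, recall, and F1-score per class, but a dashboard often needs these numbers as standalone values. The sketch below computes them directly, reusing y_test and y_pred from evaluate_model.py; set pos_label to whichever value marks the positive class in your label encoding (1 is assumed here).

from sklearn.metrics import precision_score, recall_score, f1_score

# Scores for the positive class (assumed to be encoded as 1)
precision = precision_score(y_test, y_pred, pos_label=1)
recall = recall_score(y_test, y_pred, pos_label=1)
f1 = f1_score(y_test, y_pred, pos_label=1)

print(f'Precision: {precision:.3f}')
print(f'Recall: {recall:.3f}')
print(f'F1-score: {f1:.3f}')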

In this section, we covered the essential steps of building and training sentiment analysis models. We discussed how to choose the right model, implemented both traditional machine learning models (Logistic Regression) and deep learning models (LSTM), and performed hyperparameter tuning using GridSearchCV.

Additionally, we evaluated the model performance using various metrics and visualizations. By following these steps, we have developed robust sentiment analysis models that can classify the sentiment of text data. 
