Machine Learning with Python

Chapter 13: Practical Machine Learning Projects

13.2 Project 2: Sentiment Analysis with Naive Bayes

In this project, we will develop a sentiment analysis model using the Naive Bayes algorithm. Sentiment analysis is a common application of Natural Language Processing (NLP) and Machine Learning, and it involves determining the sentiment expressed in a piece of text, such as a review or tweet.

13.2.1 Problem Statement

The goal of this project is to build a model that can accurately classify text as positive or negative based on the sentiment expressed in it. This can be useful in a variety of contexts, such as understanding customer feedback or analyzing social media posts.

13.2.2 Dataset

We will use the IMDB movie reviews dataset for this project. This dataset consists of 50,000 movie reviews from the Internet Movie Database (IMDB), each labeled as either positive (1) or negative (0). The dataset is divided evenly with 25,000 reviews intended for training and 25,000 for testing.

13.2.3 Implementation

Let's start by loading the dataset and examining its structure.

from sklearn.datasets import load_files
import numpy as np

# Make sure the path points to the correct location where your training data is stored
# If the data is in the same directory as the script, you can use "aclImdb/train/"
# categories skips the unlabeled "unsup" folder, if present;
# encoding decodes each review from bytes to str so the regex preprocessing below works
reviews_train = load_files("aclImdb/train/", categories=["neg", "pos"], encoding="utf-8")

# Extract text data and labels from the loaded dataset
text_train, y_train = reviews_train.data, reviews_train.target

# Print the number of documents in the training data
print("Number of documents in training data: {}".format(len(text_train)))

# Print the distribution of samples per class
print("Samples per class (training): {}".format(np.bincount(y_train)))

Code breakdown:

The code imports load_files from sklearn.datasets, along with numpy. load_files() reads the reviews from the aclImdb/train/ directory and uses each subfolder name as a class label. Folders are numbered in alphabetical order, so neg becomes 0 and pos becomes 1. The categories argument restricts loading to those two folders, and encoding="utf-8" returns each review as a string rather than raw bytes. The loaded object is then unpacked into text_train, a list holding the text of each review, and y_train, an array holding the corresponding labels. Finally, the code prints the number of documents in the training data and the number of samples per class.
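To see how load_files() derives labels from folder names without downloading the full dataset, the following minimal sketch builds a tiny directory with the same neg/pos layout. The helper build_toy_corpus, the variable toy, and the sample sentences are made up for illustration and are not part of the IMDB data.

```python
import tempfile
from pathlib import Path

import numpy as np
from sklearn.datasets import load_files


def build_toy_corpus(root):
    """Create a tiny neg/pos folder layout that mimics aclImdb/train/ (hypothetical data)."""
    samples = {
        "neg": ["A dull, lifeless film.", "I wanted my money back."],
        "pos": ["A wonderful, moving story.", "Great cast and a sharp script."],
    }
    for label, texts in samples.items():
        folder = Path(root) / label
        folder.mkdir(parents=True)
        for i, text in enumerate(texts):
            (folder / f"{i}.txt").write_text(text, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    build_toy_corpus(tmp)
    # File contents are read during this call, so the data survives the cleanup
    toy = load_files(tmp, categories=["neg", "pos"], encoding="utf-8")

print(toy.target_names)         # folder names, in alphabetical order
print(np.bincount(toy.target))  # number of documents per class
print(type(toy.data[0]))        # str, because an encoding was given
```

Each label in toy.target is an index into toy.target_names, which is why the neg folder maps to 0 and the pos folder maps to 1.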


Next, we will preprocess the data by removing HTML tags, converting all text to lowercase, and stripping punctuation while keeping emoticons.

import re

def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    
    # Find emoticons such as :) ;-( =D before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    
    # Lowercase, replace runs of non-word characters with a space,
    # then append the emoticons (without the "-" nose) after a separating space
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    
    return text

# Preprocess each text in text_train
text_train = [preprocess_text(text) for text in text_train]

Code breakdown:

The code imports the re module, which provides regular expression operations, and defines preprocess_text(), which takes a string and returns a cleaned string. The function first strips HTML tags with re.sub(). It then collects emoticons such as :), ;-( and =D with re.findall(), because the next step would otherwise destroy them. Next, it lowercases the text and replaces every run of non-word characters with a single space. Finally, it appends the collected emoticons, separated by spaces and with any hyphen removed, so that :-) and :) end up as the same token. The patterns are written as raw strings (r'...') so that Python does not interpret backslashes such as \W as string escapes. A list comprehension then applies preprocess_text() to every review in text_train.
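A quick way to check the preprocessing is to run it on a single made-up review. The sketch below restates preprocess_text() so that it runs on its own, with raw-string patterns and an explicit space before the appended emoticons. The sample sentence is invented for illustration.

```python
import re


def preprocess_text(text):
    # Remove HTML tags
    text = re.sub(r'<[^>]*>', '', text)
    # Collect emoticons before punctuation is stripped
    emoticons = re.findall(r'(?::|;|=)(?:-)?(?:\)|\(|D|P)', text)
    # Lowercase, collapse non-word characters to spaces, then append the emoticons
    text = re.sub(r'[\W]+', ' ', text.lower()) + ' ' + ' '.join(emoticons).replace('-', '')
    return text


sample = "<br />This movie was GREAT :-) I loved it!"
cleaned = preprocess_text(sample)

print(cleaned.split())
# ['this', 'movie', 'was', 'great', 'i', 'loved', 'it', ':)']
```

The HTML tag and the exclamation mark are gone, the text is lowercase, and the emoticon has been moved to the end with its hyphen removed.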


We will then split the data into training and testing sets.

from sklearn.model_selection import train_test_split

# Split the preprocessed text data and corresponding labels into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(text_train, y_train, test_size=0.2, random_state=42)

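Code breakdown:

The code imports train_test_split from sklearn.model_selection and uses it to split the preprocessed reviews and their labels into training and testing subsets. The test_size=0.2 argument holds out 20% of the data for testing. The data is shuffled before splitting by default, and random_state=42 fixes the seed of that shuffle so that the same split is produced on every run. The four resulting subsets are assigned to X_train, X_test, y_train, and y_test. Note that this splits the 25,000 reviews loaded from aclImdb/train/; the dataset's separate test folder is not used here.


Next, we will convert the text data into numerical feature vectors using the Bag of Words technique.

from sklearn.feature_extraction.text import CountVectorizer

# Initialize CountVectorizer with stop_words='english' to remove common English words
vectorizer = CountVectorizer(stop_words='english')

# Fit and transform the training data
X_train = vectorizer.fit_transform(X_train)

# Transform the testing data
X_test = vectorizer.transform(X_test)

Code breakdown:

The code imports CountVectorizer from sklearn.feature_extraction.text and creates a vectorizer that ignores common English stop words. fit_transform() learns the vocabulary from the training reviews and returns a sparse matrix in which each row is a review and each column holds the count of one vocabulary word. transform() then encodes the testing reviews with that same vocabulary, so words that appear only in the testing data are ignored. Fitting on the training data alone keeps information from the testing set out of the features. With its default settings, CountVectorizer keeps only tokens of two or more word characters, so the emoticons appended during preprocessing are not counted unless a custom token_pattern or tokenizer is supplied.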
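The difference between fit_transform() and transform() in the Bag of Words step is easier to see on a toy corpus. The following is a minimal sketch. The two training sentences, the single testing sentence, and the variable names ending in _toy are made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical miniature corpus standing in for the movie reviews
train_docs = ["the movie was great great fun", "the plot was dull"]
test_docs = ["great plot but unexpected ending"]

vectorizer = CountVectorizer(stop_words="english")

# Learn the vocabulary from the training documents and count their words
X_train_toy = vectorizer.fit_transform(train_docs)

# Reuse the learned vocabulary; words unseen during fitting are dropped
X_test_toy = vectorizer.transform(test_docs)

train_counts = X_train_toy.toarray()
test_counts = X_test_toy.toarray()
great_column = vectorizer.vocabulary_["great"]

print(sorted(vectorizer.vocabulary_))  # learned vocabulary, without stop words
print(train_counts)                    # one row per training document
print(test_counts)                     # same columns as the training matrix
print(train_counts[0, great_column])   # "great" appears twice in the first document
```

Both matrices have the same number of columns because the testing data is encoded with the vocabulary learned from the training data. A word such as "unexpected", which occurs only in the testing sentence, has no column at all.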


Finally, we will train a Naive Bayes classifier on the training data and evaluate its performance on the testing data.

from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

# Initialize Multinomial Naive Bayes classifier
clf = MultinomialNB()

# Train the classifier on the training data
clf.fit(X_train, y_train)

# Predict labels for the testing data
y_pred = clf.predict(X_test)

# Compute accuracy score
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: {:.2f}".format(accuracy))

Code breakdown:

The code imports the MultinomialNB class from sklearn.naive_bayes and the accuracy_score function from sklearn.metrics. It creates a MultinomialNB classifier called clf, a Naive Bayes variant suited to word-count features, and fits it to the training data with clf.fit(). clf.predict() then predicts the sentiment of each review in the testing data, and accuracy_score() compares those predictions with the true labels to give the fraction classified correctly, which the code prints to two decimal places.
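Once the vectorizer and the classifier work together, they can also be chained so that raw text goes in and a label comes out. The sketch below uses scikit-learn's make_pipeline on six made-up sentences. The training texts, labels, and new reviews are invented for illustration, so the result says nothing about accuracy on the IMDB data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data: 1 = positive, 0 = negative
toy_texts = [
    "a great and wonderful film",
    "wonderful acting, great story",
    "great fun",
    "a terrible and boring film",
    "boring plot, terrible acting",
    "terrible waste",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Chain vectorization and classification into a single estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(toy_texts, toy_labels)

new_reviews = ["great story, wonderful cast", "boring and terrible plot"]
predictions = model.predict(new_reviews)

print(predictions)  # [1 0]
```

The pipeline applies the fitted vectorizer to new text automatically, which removes the risk of calling fit_transform() on the testing data by mistake.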


This project provides a practical application of machine learning in the field of NLP. It demonstrates how to use the Naive Bayes algorithm to perform sentiment analysis on movie reviews. The code provided can be used as a starting point for further exploration and experimentation.
