Chapter 11: Chatbot Project: Personal Assistant Chatbot
11.2 Data Collection and Preprocessing
Data collection and preprocessing are critical steps in building an effective chatbot. The quality and relevance of the data used to train the model directly impact the chatbot's performance. In this section, we will discuss how to collect and preprocess data for our personal assistant chatbot.
11.2.1 Collecting Data
For our personal assistant chatbot, we need data that covers a wide range of user intents and entities. We can start with the intents and patterns defined in our intents.json file and expand it with additional data sources:
- Manual Data Collection: Manually create a list of common user queries and responses.
- Public Datasets: Use publicly available datasets that contain conversational data, such as the Cornell Movie Dialogs Corpus or the ChatterBot dataset.
- API Documentation: For specific tasks like weather updates or setting reminders, refer to API documentation to understand the data format and sample queries.
Let's enhance our intents.json file with more patterns and responses to make the chatbot more robust.
{
  "intents": [
    {
      "tag": "greeting",
      "patterns": ["Hi", "Hello", "Hey"],
      "responses": ["Hello! How can I assist you today?", "Hi there! What can I do for you?", "Hey! How can I help?"]
    },
    {
      "tag": "goodbye",
      "patterns": ["Bye", "Goodbye", "See you later"],
      "responses": ["Goodbye! Have a great day!", "See you later! Take care!"]
    },
    {
      "tag": "weather",
      "patterns": ["What's the weather like?", "Tell me the weather", "How's the weather today?"],
      "responses": ["Let me check the weather for you.", "Fetching the weather details..."]
    },
    {
      "tag": "reminder",
      "patterns": ["Set a reminder", "Remind me to", "Add a reminder"],
      "responses": ["Sure, what would you like to be reminded about?", "When would you like the reminder to be set?"]
    }
  ]
}
This file defines a few basic intents: greeting, goodbye, weather, and reminder. Each intent has patterns (possible user inputs) and responses (predefined chatbot replies).
If you want a deeper understanding of handling JSON files, we recommend reading this blog post: https://www.cuantum.tech/post/mastering-json-creating-handling-and-working-with-json-files
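As you collect new phrasings manually or from public datasets, it helps to fold them back into intents.json programmatically rather than editing the file by hand. Here is a minimal sketch of that idea; the add_patterns helper and the example phrases are illustrative additions of ours, not part of the project code:
import json

def add_patterns(path, tag, new_patterns):
    """Append extra training patterns to an existing intent in intents.json."""
    with open(path) as f:
        data = json.load(f)
    for intent in data['intents']:
        if intent['tag'] == tag:
            # Skip phrases that are already present to avoid duplicates
            intent['patterns'].extend(p for p in new_patterns if p not in intent['patterns'])
    with open(path, 'w') as f:
        json.dump(data, f, indent=2)

# Example: enrich the "weather" intent with manually collected phrasings
add_patterns('data/intents.json', 'weather',
             ["Is it going to rain today?", "Do I need an umbrella?"])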
11.2.2 Building the NLP Engine
Next, we'll build the NLP engine to process user inputs and recognize intents; entity extraction can be layered on top of this later. We'll use TensorFlow to train a simple model for intent recognition.
nlp_engine.py:
import json
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Load the intents file
with open('data/intents.json') as file:
    intents = json.load(file)

# Extract patterns and corresponding tags
patterns = []
tags = []
for intent in intents['intents']:
    for pattern in intent['patterns']:
        patterns.append(pattern)
        tags.append(intent['tag'])

# Encode the tags
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(tags)

# Vectorize the patterns
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(patterns).toarray()
y = np.array(labels)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build the model
model = Sequential()
model.add(Dense(128, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(label_encoder.classes_), activation='softmax'))

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Train the model
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=1, validation_data=(X_test, y_test))

# Save the model, vectorizer, and label encoder
model.save('models/nlp_model.h5')
with open('models/tokenizer.pickle', 'wb') as file:
    pickle.dump(vectorizer, file)
with open('models/label_encoder.pickle', 'wb') as file:
    pickle.dump(label_encoder, file)
Here's a detailed breakdown of each part of the script:
- Importing Libraries:
import json
import pickle
import numpy as np
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
This section imports the essential libraries:
- json: For handling JSON files.
- pickle: For saving the fitted vectorizer and label encoder.
- numpy: For numerical operations.
- tensorflow and keras: For building and training the neural network.
- LabelEncoder and TfidfVectorizer from scikit-learn: For encoding labels and vectorizing text data.
- train_test_split from scikit-learn: For splitting the dataset into training and test sets.
- Loading the Intents File:
with open('data/intents.json') as file:
    intents = json.load(file)
This code snippet loads the intents JSON file, which contains various user intents and their corresponding patterns and responses.
- Extracting Patterns and Tags:
patterns = []
tags = []
for intent in intents['intents']:
    for pattern in intent['patterns']:
        patterns.append(pattern)
        tags.append(intent['tag'])
Here, the script iterates through the intents and extracts the patterns (user inputs) and their corresponding tags (intent labels). These are stored in the patterns and tags lists, respectively.
- Encoding the Tags:
label_encoder = LabelEncoder()
labels = label_encoder.fit_transform(tags)
The tags are encoded into numerical values using LabelEncoder, which is necessary for training the neural network.
- Vectorizing the Patterns:
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(patterns).toarray()
y = np.array(labels)
The TfidfVectorizer converts the text patterns into numerical vectors based on the Term Frequency-Inverse Document Frequency (TF-IDF) scheme. This transformation is crucial for feeding the text data into the neural network.
- Splitting the Data:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
The dataset is split into training and test sets using an 80-20 ratio. The random_state parameter ensures reproducibility.
- Building the Neural Network Model:
model = Sequential()
model.add(Dense(128, input_shape=(X_train.shape[1],), activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(len(label_encoder.classes_), activation='softmax'))
A sequential neural network model is built using Keras. It consists of:
- A first dense layer with 128 neurons and ReLU activation, which also defines the input shape for the TF-IDF vectors.
- A dropout layer with a 50% dropout rate to prevent overfitting.
- A hidden layer with 64 neurons and ReLU activation.
- Another dropout layer with a 50% dropout rate.
- An output layer with the number of neurons equal to the number of unique intents, using softmax activation for multi-class classification.
- Compiling the Model:
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
The model is compiled with:
- The sparse_categorical_crossentropy loss function, suitable for multi-class classification with integer labels.
- The adam optimizer, a popular choice for its efficiency.
- accuracy as the evaluation metric.
- Training the Model:
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=1, validation_data=(X_test, y_test))
The model is trained for 100 epochs with a batch size of 8. The training process uses the training data and evaluates the performance on the test data after each epoch.
- Saving the Model and Tokenizer:
model.save('models/nlp_model.h5')
with open('models/tokenizer.pickle', 'wb') as file:
    pickle.dump(vectorizer, file)
with open('models/label_encoder.pickle', 'wb') as file:
    pickle.dump(label_encoder, file)
Once trained, the model is saved to an HDF5 file (nlp_model.h5). Additionally, the TfidfVectorizer and LabelEncoder objects are saved using the pickle module. These saved objects are essential for preprocessing new data during inference.
In summary, this script processes the chatbot's training data, builds a neural network for intent recognition, trains the model, and saves the necessary components for future use.
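To show how the saved artifacts come together at inference time, here is a minimal sketch; the file paths match the script above, while the predict_intent helper and the 0.5 confidence threshold are illustrative assumptions:
import pickle
import numpy as np
from tensorflow.keras.models import load_model

# Load the trained model and the preprocessing objects saved by nlp_engine.py
model = load_model('models/nlp_model.h5')
with open('models/tokenizer.pickle', 'rb') as f:
    vectorizer = pickle.load(f)
with open('models/label_encoder.pickle', 'rb') as f:
    label_encoder = pickle.load(f)

def predict_intent(text, threshold=0.5):
    """Vectorize the input, run the model, and return the predicted intent tag."""
    x = vectorizer.transform([text]).toarray()
    probs = model.predict(x, verbose=0)[0]
    best = int(np.argmax(probs))
    if probs[best] < threshold:
        return None  # low confidence: let the caller ask the user to rephrase
    return label_encoder.inverse_transform([best])[0]

print(predict_intent("How's the weather today?"))  # expected: 'weather'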
In this section, we collected and preprocessed training data for the personal assistant chatbot, expanded the intents and patterns in intents.json, and built the NLP engine for intent recognition. This lays the foundation for developing a fully functional personal assistant chatbot that can handle various tasks and enhance user productivity.
11.2.3 Handling Missing or Imbalanced Data
In real-world applications, data may be missing or imbalanced. It's important to address these issues during preprocessing to ensure the model performs well.
- Handling Missing Data: When dealing with missing data, either replace the missing values with a placeholder, such as the mean or median of the column, or remove the instances that contain them. This keeps the dataset clean and usable for analysis or model training (a mean-imputation sketch appears at the end of this section).
- Addressing Imbalanced Data: Imbalanced data can adversely affect model performance. Common remedies include oversampling the minority class, undersampling the majority class, or generating synthetic samples with methods such as SMOTE (Synthetic Minority Over-sampling Technique). Balancing the dataset in this way leads to more reliable and accurate results; a quick check of the class distribution, shown below, is a useful first step.
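Before choosing a balancing strategy, it is worth checking how many training patterns each intent actually has. A minimal sketch, assuming the tags list built in nlp_engine.py:
from collections import Counter

# Count how many training patterns belong to each intent tag
distribution = Counter(tags)
for tag, count in distribution.most_common():
    print(f"{tag}: {count} patterns")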
Example: Handling Missing Data and Imbalanced Data:
from imblearn.over_sampling import SMOTE
# Check for missing data
print(f"Missing values: {np.isnan(X).sum()}")
# Handle missing data (if any)
X = np.nan_to_num(X)
# Balance the dataset using SMOTE (Synthetic Minority Over-sampling Technique)
# Note: if any class has only a handful of samples, pass k_neighbors=2 (or fewer)
# so SMOTE can find enough neighbors; the default is 5
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X, y)
# Split the resampled data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X_resampled, y_resampled, test_size=0.2, random_state=42)
# Train the model with the balanced dataset
model.fit(X_train, y_train, epochs=100, batch_size=8, verbose=1, validation_data=(X_test, y_test))
This example snippet demonstrates a workflow for handling imbalanced datasets with SMOTE (Synthetic Minority Over-sampling Technique). It first checks for and handles any missing values in the feature set X. Then, it applies SMOTE to balance the dataset by generating synthetic samples for the minority classes. After balancing, it splits the data into training and test sets and retrains the model on the balanced dataset.
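The snippet above replaces missing values with zeros via np.nan_to_num; if the mean or median imputation mentioned earlier is preferred, scikit-learn's SimpleImputer is a drop-in alternative. A minimal sketch, assuming X is the NumPy feature matrix from the training script:
from sklearn.impute import SimpleImputer
import numpy as np

# Replace NaN entries with the mean of each feature column
# (use strategy='median' for median imputation instead)
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
print(f"Missing values after imputation: {np.isnan(X_imputed).sum()}")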