Machine Learning with Python

Chapter 3: Data Preprocessing

3.5 Train-Test Split

Welcome to the final stage of our data preprocessing journey - the Train-Test Split! This is an important step in preparing our dataset for machine learning analysis. At this stage, we divide our dataset into two parts: a training set and a test set. The training set is used to train our machine learning model, and the test set is used to evaluate the model's performance. This process is similar to a dress rehearsal before the actual performance.

By splitting the dataset into two parts, we can train our model on one part and test it on another, ensuring that our model is accurate and generalizable. In this section, we will explore how to perform a train-test split using Scikit-learn, an open-source machine learning library for Python. We will provide a step-by-step guide on how to split your dataset into training and testing sets using Scikit-learn's built-in functions, and we will also discuss best practices for splitting your data. So, let's get started on this crucial step towards building a successful machine learning model!


3.5.1 Performing a Train-Test Split

The train-test split is an essential technique in machine learning. It is used to evaluate the performance of a machine learning algorithm by splitting the dataset into two parts: the training set and the test set.

The training set is used to train the model and is a crucial part of the process. By using the training set, the learning algorithm can learn from the data and improve the model's performance. On the other hand, the test set is used to evaluate the model's performance and measure how well it can generalize to new, unseen data.

Evaluating on held-out data is an important step because it helps to ensure that the model is not simply overfitting to the training data. In summary, the train-test split is a simple but powerful technique for assessing the performance of a machine learning algorithm and for checking that the model is both accurate and robust.

Example:

Here's how we can perform a train-test split using Scikit-learn:

import pandas as pd
from sklearn.model_selection import train_test_split

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# Create a target variable
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# Perform a train-test split
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42)

print("X_train:")
print(X_train)
print("\nX_test:")
print(X_test)
print("\ny_train:")
print(y_train)
print("\ny_test:")
print(y_test)

Output:

X_train:

    A    B
5   6   60
0   1   10
7   8   80
2   3   30
9  10  100
4   5   50
3   4   40
6   7   70

X_test:

   A   B
8  9  90
1  2  20

y_train:

[1, 0, 1, 0, 1, 0, 0, 1]

y_test:

[1, 0]

The code first imports the train_test_split function from the sklearn.model_selection module. It then creates a DataFrame called df with two columns, A and B, and a target variable y with the values [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]. The train_test_split function splits the data: test_size=0.2 reserves 20% of the samples (2 of the 10) for testing and leaves the remaining 80% (8 samples) for training, while random_state=42 ensures that the same split is produced every time the code is run. Note that the rows are shuffled before splitting, which is why the training and testing sets contain a random mix of rows rather than simply the first eight and the last two. Finally, the code prints the four resulting sets.

The output shows that the training set contains 8 of the 10 samples (80%) and the testing set contains the remaining 2 (20%). The target variable y has been split using the same indices as the features, so every row stays paired with its label. This ensures that the model is never trained on the testing set and can be evaluated on data it has not seen before.

As noted above, a test size of 0.2 sends 20% of the data to the test set and 80% to the training set, and the random_state parameter makes the results reproducible.
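
To see what reproducibility means in practice, here is a small sketch of our own (not part of the book's example) that runs the identical split twice and confirms that both runs agree:

import pandas as pd
from sklearn.model_selection import train_test_split

# A toy dataset, similar in shape to the example above.
df = pd.DataFrame({'A': range(10), 'B': range(0, 100, 10)})
y = [0]*5 + [1]*5

# Run the same split twice with the same random_state.
first = train_test_split(df, y, test_size=0.2, random_state=42)
second = train_test_split(df, y, test_size=0.2, random_state=42)

# train_test_split returns [X_train, X_test, y_train, y_test], and both
# runs produce identical partitions, row for row.
assert first[0].equals(second[0])  # X_train matches
assert first[2] == second[2]       # y_train matches
print("Same random_state, same split.")

Omitting random_state (or setting it to None) would instead draw a fresh shuffle on every run, which can be useful for checking that your results do not depend on one lucky split.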

3.5.2 Stratified Sampling

When splitting a dataset into training and testing sets, it is crucial to ensure that both sets accurately represent the original distribution of classes. This is especially important when the dataset has a significant class imbalance.

To ensure that the proportions of each class are maintained, we can use a technique called stratified sampling. With Scikit-learn, this means calling the train_test_split function and passing the array of class labels to the stratify parameter. The function then splits the dataset in a manner that preserves the original class distribution in both the training and testing sets.

By using stratified sampling, we can ensure that our models are trained and evaluated on a representative sample of the original dataset. This can help prevent issues such as overfitting to the majority class or underestimating the importance of minority classes.
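
To make the risk concrete before the stratified example below, here is a small sketch of our own (the data here is illustrative and not part of the example that follows) showing how a plain shuffled split treats a minority class:

from sklearn.model_selection import train_test_split

# Eight samples of class 0 and two of class 1: a 4-to-1 imbalance.
X = list(range(10))
y = [0]*8 + [1]*2

# Without stratification, the number of minority samples that lands in the
# test set varies with the seed -- it can be zero, one, or even both.
for seed in range(5):
    _, _, _, y_test = train_test_split(X, y, test_size=0.3, random_state=seed)
    print(f"seed={seed}: minority samples in test set = {y_test.count(1)}")

Passing stratify=y removes this lottery by allocating each class proportionally to both sides of the split.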

Example:

Here's how we can perform a train-test split with stratified sampling using Scikit-learn:

import pandas as pd
from sklearn.model_selection import train_test_split

# Create a DataFrame
df = pd.DataFrame({
    'A': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'B': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
})

# Create a target variable with imbalanced class distribution
y = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]

# Perform a train-test split with stratified sampling
X_train, X_test, y_train, y_test = train_test_split(df, y, test_size=0.2, random_state=42, stratify=y)

print("y_train:")
print(y_train)
print("\ny_test:")
print(y_test)

Output (the exact ordering of the elements depends on the shuffle; the class counts are what matter):

y_train:

[0, 0, 0, 1, 0, 0, 0, 1]

y_test:

[0, 1]

The code again imports the train_test_split function from the sklearn.model_selection module, builds the same DataFrame df, and creates a target variable y with seven samples of class 0 and three of class 1 -- an imbalanced distribution. As before, test_size=0.2 reserves 2 of the 10 samples for testing and random_state=42 makes the split reproducible. The new part is stratify=y, which tells the function to allocate the two classes proportionally, so that the training and testing sets mirror the class distribution of the original data as closely as possible. The code then prints the resulting target splits.

The output shows that the training set contains 8 samples (six of class 0 and two of class 1, i.e. 75%/25%) and the testing set contains 2 samples (one of each class). These are the closest whole-number approximations to the original 70%/30% distribution that the split sizes allow. Without stratification, a test set this small could easily have contained no minority samples at all.

In this example, despite the imbalance in the class distribution in y, the y_train and y_test splits preserve approximately the same proportion of class 0 and class 1 samples as the original data, thanks to stratified sampling (some rounding is unavoidable with so few samples).
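
If you want to verify this yourself, a quick sanity check (our own addition, assuming y, y_train, and y_test from the example above are still in scope) is to count the classes in each split:

from collections import Counter

# Compare the class counts of the full target with those of each split.
print("Full dataset:", Counter(y))        # 7 samples of class 0, 3 of class 1
print("Training set:", Counter(y_train))  # 6 samples of class 0, 2 of class 1
print("Test set:", Counter(y_test))       # 1 sample of each class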

With this, we come to the end of our journey through the land of data preprocessing. Throughout this journey, we have learned various techniques, such as data cleaning, normalization, and transformation, that will help us make sense of our data. In the upcoming chapters, we will apply these techniques to prepare our data for various machine learning algorithms.

We will explore different types of algorithms such as linear regression, decision trees, and support vector machines, and see how preprocessing plays a crucial role in the performance of these algorithms. So, stay tuned for an exciting journey ahead!
