Data Engineering Foundations

Chapter 4: Techniques for Handling Missing Data

4.3 Practical Exercises for Chapter 4

Now that you’ve completed Chapter 4, it's time to apply what you've learned through hands-on exercises. These exercises focus on handling missing data in different contexts, using both simple and advanced techniques. The exercises will help you reinforce the concepts of KNN imputation, MICE, handling missing data in large datasets, and distributed imputation techniques.

Exercise 1: KNN Imputation

You are given a dataset containing information about employees, including their Age, Salary, and Experience. The dataset has some missing values. Your task is to:

Use KNN imputation to fill in the missing values for the dataset.

Solution:

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Sample data with missing values
data = {'Age': [25, np.nan, 22, 35, np.nan],
        'Salary': [50000, 60000, 52000, np.nan, 58000],
        'Experience': [2, 4, 1, np.nan, 3]}

df = pd.DataFrame(data)

# Initialize the KNN Imputer with k=2
knn_imputer = KNNImputer(n_neighbors=2)

# Apply KNN imputation
df_imputed = pd.DataFrame(knn_imputer.fit_transform(df), columns=df.columns)

# View the imputed dataframe
print(df_imputed)
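KNNImputer picks neighbors using a NaN-aware Euclidean distance, so a column on a large scale (here Salary, in the tens of thousands) can dominate Age and Experience when neighbors are chosen. The following is a minimal sketch of one common workaround, assuming that standardizing the columns before imputation suits your data. The scale, impute, unscale sequence and the name `df_scaled_knn` are illustrative and not part of the solution above.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'Age': [25, np.nan, 22, 35, np.nan],
    'Salary': [50000, 60000, 52000, np.nan, 58000],
    'Experience': [2, 4, 1, np.nan, 3],
})

# StandardScaler disregards NaNs when fitting and keeps them in its output
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Impute in the scaled space so that no single column dominates the distance
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)

# Map the imputed values back to the original units
df_scaled_knn = pd.DataFrame(
    scaler.inverse_transform(imputed_scaled),
    columns=df.columns,
    index=df.index,
)
print(df_scaled_knn)
```

Comparing this output with the unscaled result shows whether the choice of neighbors, and therefore the filled values, depends on the column scales in your dataset.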

Exercise 2: MICE Imputation

You are working with a dataset that contains missing values in multiple columns. The dataset includes Age, Salary, and Experience. Your task is to:

Use MICE (Multivariate Imputation by Chained Equations) to impute the missing values.

Solution:

import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # Required to enable IterativeImputer
from sklearn.impute import IterativeImputer

# Sample data with missing values
data = {'Age': [25, np.nan, 22, 35, np.nan],
        'Salary': [50000, 60000, 52000, np.nan, 58000],
        'Experience': [2, 4, 1, np.nan, 3]}

df = pd.DataFrame(data)

# Initialize the MICE imputer
mice_imputer = IterativeImputer()

# Apply MICE imputation
df_mice_imputed = pd.DataFrame(mice_imputer.fit_transform(df), columns=df.columns)

# View the imputed dataframe
print(df_mice_imputed)
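With its default settings, IterativeImputer returns a single completed dataset, whereas MICE is usually described as producing several. The sketch below is one way to approximate that, assuming you draw each imputation from the posterior (`sample_posterior=True`) with a different `random_state`. The number of imputations (3) and the names `imputations` and `spread` are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

df = pd.DataFrame({
    'Age': [25, np.nan, 22, 35, np.nan],
    'Salary': [50000, 60000, 52000, np.nan, 58000],
    'Experience': [2, 4, 1, np.nan, 3],
})

# Draw several completed datasets, each with a different random seed
imputations = []
for seed in range(3):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed, max_iter=10)
    completed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)
    imputations.append(completed)

# Standard deviation of each cell across the imputations:
# zero for observed cells, larger where the imputed value is less certain
spread = pd.concat(imputations).groupby(level=0).std()
print(spread)
```

In a full multiple-imputation workflow you would typically fit your analysis on each completed dataset and pool the estimates, rather than averaging the imputed values themselves.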

Exercise 3: Dropping Columns with High Missingness

You are working with a large dataset that contains several columns with varying levels of missing values. Your task is to:

Drop any column where more than 50% of the values are missing.

Solution:

import pandas as pd
import numpy as np

# Sample large dataset with missing values
data = {'Age': [25, np.nan, 22, 35, np.nan],
        'Salary': [50000, np.nan, 52000, np.nan, 58000],
        'Experience': [2, 4, 1, np.nan, 3],
        'JobTitle': [np.nan, np.nan, 'Engineer', 'Analyst', 'Manager']}

df = pd.DataFrame(data)

# Calculate the proportion of missing values in each column
missing_proportion = df.isnull().mean()

# Drop columns with more than 50% missing values
# (in this sample every column is at most 40% missing, so none are dropped)
df_cleaned = df.drop(columns=missing_proportion[missing_proportion > 0.5].index)

# View the cleaned dataframe
print(df_cleaned)
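To see the threshold take effect, it can help to wrap the logic in a small function and try more than one cutoff. This is a minimal sketch. The helper name `drop_sparse_columns` and the 0.3 cutoff are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def drop_sparse_columns(frame, threshold=0.5):
    """Keep only the columns whose share of missing values is at most `threshold`."""
    missing_share = frame.isna().mean()
    return frame.loc[:, missing_share <= threshold]

df = pd.DataFrame({
    'Age': [25, np.nan, 22, 35, np.nan],
    'Salary': [50000, np.nan, 52000, np.nan, 58000],
    'Experience': [2, 4, 1, np.nan, 3],
    'JobTitle': [np.nan, np.nan, 'Engineer', 'Analyst', 'Manager'],
})

# Share of missing values per column: Age 0.4, Salary 0.4, Experience 0.2, JobTitle 0.4
print(df.isna().mean())

# With a 50% cutoff every column survives
print(drop_sparse_columns(df, threshold=0.5).columns.tolist())

# With a 30% cutoff only Experience survives
print(drop_sparse_columns(df, threshold=0.3).columns.tolist())
```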

Exercise 4: Simple Imputation for Large Datasets

You are given a large dataset with numerical features, including Age, Salary, and Experience. The dataset contains some missing values, and you want to fill them in efficiently with simple imputation. Your task is to:

Apply SimpleImputer to fill missing values with the mean of each column.

Solution:

import pandas as pd
from sklearn.impute import SimpleImputer

# Sample large dataset with missing values
data = {'Age': [25, None, 22, 35, None] * 200000,
        'Salary': [50000, 60000, None, 80000, 58000] * 200000,
        'Experience': [2, 4, 1, None, 3] * 200000}

df_large = pd.DataFrame(data)

# Use SimpleImputer to impute missing values for numeric columns
simple_imputer = SimpleImputer(strategy='mean')
df_large_imputed = pd.DataFrame(simple_imputer.fit_transform(df_large), columns=df_large.columns)

# View the first few rows of the imputed dataframe
print(df_large_imputed.head())
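After fitting, a mean-strategy SimpleImputer stores one fill value per column in its `statistics_` attribute. The sketch below inspects those values on a five-row version of the data and checks them against a pandas-only equivalent. The names `df_small`, `imputed`, and `filled` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df_small = pd.DataFrame({
    'Age': [25, None, 22, 35, None],
    'Salary': [50000, 60000, None, 80000, 58000],
    'Experience': [2, 4, 1, None, 3],
})

imputer = SimpleImputer(strategy='mean')
imputed = pd.DataFrame(
    imputer.fit_transform(df_small),
    columns=df_small.columns,
    index=df_small.index,
)

# statistics_ holds the fill value learned for each column (here, the column means)
print(dict(zip(df_small.columns, imputer.statistics_)))

# The same result using pandas alone
filled = df_small.fillna(df_small.mean())
print(np.allclose(imputed.values, filled.values))
```

One reason to prefer the fitted imputer over `fillna` is that it can be reused: calling `imputer.transform` on new data fills it with the means learned from the data it was fitted on.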

Exercise 5: Distributed Imputation with Dask

You are working with an extremely large dataset that contains missing values in several columns. To handle the missing data efficiently, you decide to use Dask to distribute the computation. Your task is to:

Convert the dataset to a Dask dataframe and use SimpleImputer to impute the missing values.

Solution:

import dask.dataframe as dd
from sklearn.impute import SimpleImputer
import pandas as pd

# Sample large dataset with missing values
data = {'Age': [25, None, 22, 35, None] * 200000,
        'Salary': [50000, 60000, None, 80000, 58000] * 200000,
        'Experience': [2, 4, 1, None, 3] * 200000}

df_large = pd.DataFrame(data)

# Convert the Pandas dataframe to a Dask dataframe
df_dask = dd.from_pandas(df_large, npartitions=10)

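# Impute a single partition with a fresh SimpleImputer.
# Note: each partition is filled with its own column means, not the global means.
def impute_partition(part):
    imputer = SimpleImputer(strategy='mean')
    return pd.DataFrame(imputer.fit_transform(part), columns=part.columns, index=part.index)

# Apply the imputer to every partition of the Dask dataframe
df_dask_imputed = df_dask.map_partitions(impute_partition)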

# Compute the result
df_dask_imputed = df_dask_imputed.compute()

# View the first few rows of the imputed dataframe
print(df_dask_imputed.head())
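Because the imputer is fitted inside `map_partitions`, each partition is filled with its own column means rather than the means of the whole dataset. In the sample data the same five-row pattern repeats throughout, so the partition means should be nearly identical, but on sorted or skewed data they can differ noticeably. The pandas-only sketch below mimics two partitions with slices to show the effect. The tiny dataset and the split point are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [20.0, 22.0, np.nan, 60.0, 62.0, np.nan]})

# Two stand-in "partitions" with very different age ranges
partitions = [df.iloc[:3], df.iloc[3:]]

# Per-partition imputation: each slice is filled with its own mean
per_partition = pd.concat([part.fillna(part.mean()) for part in partitions])

# Global imputation: every gap is filled with the mean of the full column
global_fill = df.fillna(df.mean())

print(per_partition['Age'].tolist())  # gaps filled with 21.0 and 61.0
print(global_fill['Age'].tolist())    # gaps filled with 41.0
```

If global means are what you want in Dask, one option is to compute the column means once (for example with `df_dask.mean().compute()`) and then fill every partition with those values.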

These practical exercises give you hands-on experience with a range of techniques for handling missing data, from KNN and MICE imputation to dropping sparse columns and distributing simple imputation with Dask. Practicing them on both small and large datasets will help you choose a method that suits the data at hand and keep missing values from distorting your models. Keep exploring these methods as you encounter different datasets in your work!
