Data Analysis Foundations with Python

Chapter 9: Data Preprocessing

9.4 Practical Exercises: Chapter 9: Data Preprocessing

Exercise 9.1: Data Cleaning

You have a dataset with missing values in the Name, Age, and Experience columns:

import pandas as pd
import numpy as np

# Create a sample DataFrame
data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
        'Age': [25, np.nan, 35, 40, 50],
        'Salary': [50000, 70000, 120000, 110000, 90000],
        'Experience': [2, 10, np.nan, 7, 15]}

df = pd.DataFrame(data)

  1. Remove rows where Name is missing.
  2. Fill the missing values in the Age and Experience columns with the respective column means, computed after step 1.

Solution

# Remove rows where Name is missing
df.dropna(subset=['Name'], inplace=True)

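# Fill missing values with each column's mean.
# Assign the result back: fillna(inplace=True) on a selected column triggers a
# warning in recent pandas versions and can leave df unchanged under copy-on-write.
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Experience'] = df['Experience'].fillna(df['Experience'].mean())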
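
As a quick check on the solution above, the following sketch rebuilds the sample DataFrame and shows which values the means fill in. It assigns the filled columns back instead of using inplace=True, and the names `cleaned`, `age_mean`, and `experience_mean` are introduced here only for illustration. Because the row with the missing Name is dropped first, the means are taken over the remaining four rows.

```python
import numpy as np
import pandas as pd

data = {'Name': ['Alice', 'Bob', 'Charlie', None, 'Eve'],
        'Age': [25, np.nan, 35, 40, 50],
        'Salary': [50000, 70000, 120000, 110000, 90000],
        'Experience': [2, 10, np.nan, 7, 15]}
df = pd.DataFrame(data)

# Step 1: drop the row whose Name is missing (index 3)
cleaned = df.dropna(subset=['Name']).copy()

# Step 2: fill the gaps with means computed over the remaining rows
age_mean = cleaned['Age'].mean()                # (25 + 35 + 50) / 3
experience_mean = cleaned['Experience'].mean()  # (2 + 10 + 15) / 3 = 9.0
cleaned['Age'] = cleaned['Age'].fillna(age_mean)
cleaned['Experience'] = cleaned['Experience'].fillna(experience_mean)

print(cleaned)
```

If the means were computed before dropping the row, they would be 37.5 for Age and 8.5 for Experience, so the order of the two steps affects the result.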

Exercise 9.2: Feature Engineering

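Create a new feature called AgeGroup in the above DataFrame based on the Age column. Use the following groups: "Young" for Age <= 30, "Middle-aged" for Age above 30 and up to 45, and "Senior" for Age > 45. (After mean imputation, Age may no longer be a whole number, so the groups must cover every value.)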

Solution

# pd.cut uses right-closed bins by default: (0, 30], (30, 45], (45, inf)
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 30, 45, np.inf], labels=['Young', 'Middle-aged', 'Senior'])
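
To see how pd.cut treats values at the bin edges, here is a small sketch using a few hypothetical ages that are not part of the exercise data. By default pd.cut uses right-closed intervals, so the bins are (0, 30], (30, 45], and (45, inf).

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to probe the bin edges
ages = pd.Series([25, 30, 30.5, 45, 45.1, 50])

groups = pd.cut(ages,
                bins=[0, 30, 45, np.inf],
                labels=['Young', 'Middle-aged', 'Senior'])

# Pair each age with its assigned label
result = list(zip(ages, groups))
print(result)
```

An Age of exactly 30 lands in "Young" and an Age of exactly 45 lands in "Middle-aged", which matches the groups in the exercise. A value of 0 or below would fall outside every bin and come back as NaN.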

Exercise 9.3: Data Transformation

Apply Min-Max scaling to the Salary column of the DataFrame.

Solution

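from sklearn.preprocessing import MinMaxScaler

# MinMaxScaler expects 2-D input, hence the double brackets around 'Salary'
scaler = MinMaxScaler()
# fit_transform returns an (n, 1) array; flatten it before assigning to the column
df['Salary'] = scaler.fit_transform(df[['Salary']]).ravel()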
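
Min-Max scaling maps each value x to (x - min) / (max - min), so the smallest value becomes 0 and the largest becomes 1. The sketch below applies that formula by hand with pandas, assuming the four Salary values that remain after Exercise 9.1 drops the row with the missing Name. With its default settings, MinMaxScaler should produce the same numbers.

```python
import pandas as pd

# Salary values left after Exercise 9.1 removed the row with a missing Name
salary = pd.Series([50000, 70000, 120000, 90000], name='Salary')

# Min-Max formula: (x - min) / (max - min)
scaled = (salary - salary.min()) / (salary.max() - salary.min())

print(scaled.round(4).tolist())
```

Here the minimum is 50000 and the range is 70000, so 70000 maps to 20000 / 70000 (about 0.2857) and 90000 maps to 40000 / 70000 (about 0.5714).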
