Chapter 13: Practical Machine Learning Projects
13.1 Project 1: Predicting House Prices with Regression
In this project, we will develop a machine learning model to predict house prices. This is a common real-world application of regression, a type of supervised learning. We will use the Boston Housing dataset, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.
13.1.1 Problem Statement
The goal of this project is to build a model that can predict the median value of owner-occupied homes in Boston, given a set of features such as crime rate, average number of rooms per dwelling, and others.
13.1.2 Dataset
The dataset used in this project comes from the UCI Machine Learning Repository. The data was collected in 1978, and each of the 506 entries represents aggregated information about 14 features of homes from various suburbs of Boston.
The features can be summarized as follows (a short snippet for inspecting these fields follows the list):
- CRIM: per capita crime rate by town
- ZN: proportion of residential land zoned for lots larger than 25,000 sq. ft.
- INDUS: proportion of non-retail business acres per town
- CHAS: Charles River dummy variable (1 if the tract bounds the river; 0 otherwise)
- NOX: nitric oxides concentration (parts per 10 million)
- RM: average number of rooms per dwelling
- AGE: proportion of owner-occupied units built prior to 1940
- DIS: weighted distances to five Boston employment centers
- RAD: index of accessibility to radial highways
- TAX: full-value property-tax rate per $10,000
- PTRATIO: pupil-teacher ratio by town
- B: 1000(Bk − 0.63)², where Bk is the proportion of people of African American descent by town
- LSTAT: percentage of the population considered lower status
- MEDV: median value of owner-occupied homes in $1000s
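As a quick sanity check, the short snippet below (a minimal sketch, assuming the data is available as the housing.csv file used in the implementation that follows) lists which of these fields are present in the file and prints their summary statistics. Note that the version of the dataset used in this project has been reduced to the 'RM', 'LSTAT', 'PTRATIO', and 'MEDV' columns.
import pandas as pd
# Load the dataset (assumed to be available as housing.csv)
data = pd.read_csv('housing.csv')
# Show which of the fields listed above are present, and summarize them
print("Columns:", list(data.columns))
print(data.describe())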
13.1.3 Implementation
Step 1
Let's start by loading the dataset and separating the features from the target variable.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
# Import supplementary visualizations code visuals.py
import visuals as vs
# Pretty display for notebooks
%matplotlib inline
# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis = 1)
# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
Code breakdown:
The first line imports the NumPy library, which provides efficient numerical arrays and mathematical functions. The second line imports the Pandas library, which provides high-level data structures and data-analysis tools. The third line imports the ShuffleSplit class from scikit-learn, which is used later to create cross-validation splits of the data. The fourth import brings in the supplementary visualization code from the visuals.py file, and the %matplotlib inline magic configures the notebook to display plots inline. The remaining lines load the Boston housing dataset from housing.csv, create the prices variable, which contains the median value of owner-occupied homes, and create the features variable, which contains the remaining feature columns. The final line prints the number of data points and variables in the dataset.
The dataset has now been separated into features and the target variable. The features 'RM', 'LSTAT', and 'PTRATIO' give us quantitative information about each data point, while the target variable, 'MEDV', is the variable we seek to predict.
Next, we will calculate some descriptive statistics about the Boston housing prices.
import numpy as np
import pandas as pd
# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
# Minimum price of the data
minimum_price = np.min(prices)
# Maximum price of the data
maximum_price = np.max(prices)
# Mean price of the data
mean_price = np.mean(prices)
# Median price of the data
median_price = np.median(prices)
# Standard deviation of prices of the data
std_price = np.std(prices)
# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${}".format(minimum_price))
print("Maximum price: ${}".format(maximum_price))
print("Mean price: ${}".format(mean_price))
print("Median price ${}".format(median_price))
print("Standard deviation of prices: ${:.2f}".format(std_price))
Code breakdown:
The code imports the NumPy and Pandas libraries, reloads the dataset, and defines a variable called prices, which contains the median home values from the Boston housing dataset. It then uses the NumPy functions min(), max(), mean(), median(), and std() to calculate the minimum, maximum, mean, median, and standard deviation of the prices, respectively. Finally, it prints the calculated statistics.
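The same summary can be obtained more concisely with Pandas. The sketch below is a minimal example, assuming prices is the Series loaded above; note that Pandas computes the sample standard deviation (ddof=1) by default, whereas np.std() uses the population standard deviation (ddof=0), so the two values differ slightly.
# Equivalent summary using the Pandas built-in describe()
print(prices.describe())
# Match NumPy's population standard deviation if needed
print("Std (ddof=0): {:.2f}".format(prices.std(ddof=0)))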
We can form some initial hypotheses about the data. For example, houses with more rooms (a higher 'RM' value) should be worth more; neighborhoods with a higher percentage of lower-status residents (a higher 'LSTAT' value) should be worth less; and neighborhoods with a higher pupil-teacher ratio (a higher 'PTRATIO' value) should be worth less.
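These hypotheses can be checked by looking at how each feature correlates with the target. Below is a minimal sketch, assuming data is the DataFrame loaded earlier with the 'RM', 'LSTAT', 'PTRATIO', and 'MEDV' columns:
# Pearson correlation of each feature with the median home value
correlations = data.corr()['MEDV'].drop('MEDV')
print(correlations.sort_values(ascending=False))
# A positive correlation for 'RM' and negative correlations for 'LSTAT'
# and 'PTRATIO' would support the assumptions above.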
Next, we will split the data into training and testing subsets.
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)
# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)
# Success
print("Training and testing split was successful.")
Code breakdown:
The code imports the train_test_split function from the sklearn.model_selection module and defines two variables, features and prices, which contain the features and target values of the Boston housing dataset, respectively. It then uses train_test_split to shuffle the data and split it into training and testing subsets. The test_size parameter specifies that 20% of the data should be held out for testing, and the random_state parameter fixes the random seed so that the shuffle, and therefore the split, is reproducible. Finally, the code prints a message indicating that the split was successful.
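As a quick verification (a minimal sketch using the variables created above), we can print the sizes of the two subsets to confirm the 80/20 split:
# Confirm the 80/20 training/testing split
print("Training set: {} samples".format(X_train.shape[0]))
print("Testing set: {} samples".format(X_test.shape[0]))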
We will then train a model using the decision tree algorithm. To ensure we produce an optimized model, we will use grid search with cross-validation to tune the 'max_depth' parameter of the decision tree.
# Import the model, metric, and tuning utilities
from sklearn.model_selection import ShuffleSplit, GridSearchCV
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import make_scorer, r2_score

def performance_metric(y_true, y_predict):
    # Score predictions with R^2 (the coefficient of determination)
    return r2_score(y_true, y_predict)

def fit_model(X, y):
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()
    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': list(range(1, 11))}
    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)
    # Create the grid search cv object --> GridSearchCV()
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)
    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)
    # Return the optimal model after fitting the data
    return grid.best_estimator_
Code breakdown:
The code imports the DecisionTreeRegressor class from sklearn.tree, the make_scorer and r2_score functions from sklearn.metrics, and the ShuffleSplit and GridSearchCV classes from sklearn.model_selection. The performance_metric function defines how candidate models are scored; here it returns the R² (coefficient of determination) between the true and predicted values. The fit_model() function takes two arguments, X and y, which represent the training data and the target values, respectively. It first creates a ShuffleSplit object called cv_sets, which generates 10 shuffled train/validation splits of the training data, holding out 20% of the data for validation in each split. It then creates a DecisionTreeRegressor object called regressor and a dictionary called params, which maps the parameter name max_depth to the values 1 through 10. The make_scorer() function converts performance_metric into a scoring function, scoring_fnc, that GridSearchCV can use to compare candidate models. Finally, a GridSearchCV object called grid is created from the regressor, params, scoring_fnc, and cv_sets objects and fit to the data; the grid search evaluates each value of max_depth, and the best-performing model is returned from fit_model() as grid.best_estimator_.
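A short usage sketch (assuming the X_train and y_train subsets created earlier) shows how to call fit_model and inspect which 'max_depth' the grid search selected:
# Fit the training data to the model using grid search
reg = fit_model(X_train, y_train)
# Report the value of 'max_depth' chosen by the grid search
print("Parameter 'max_depth' is {} for the optimal model.".format(reg.get_params()['max_depth']))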
Finally, we will make predictions on new sets of input data.
# reg is the optimized model returned by fit_model above
# Produce a matrix for client data: [RM, LSTAT, PTRATIO]
client_data = [[5, 17, 15],  # Client 1
               [4, 32, 22],  # Client 2
               [8, 3, 12]]   # Client 3
# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i + 1, price))
Code breakdown:
The code creates a matrix called client_data, in which each row holds the 'RM', 'LSTAT', and 'PTRATIO' values for one hypothetical client. It then calls reg.predict() to estimate a selling price for each client, uses enumerate() to pair each predicted price with its client number, and prints the predicted selling price for each client.
This project demonstrates a practical, real-world application of machine learning: using regression to predict house prices from a small set of features. The code provided can serve as a starting point for further exploration and experimentation.