# Chapter 13: Practical Machine Learning Projects

## 13.1 Project 1: Predicting House Prices with Regression

In this project, we will develop a machine learning model to predict house prices. This is a common real-world application of regression, which is a form of supervised learning. We will use the Boston Housing dataset, which contains information collected by the U.S. Census Service concerning housing in the area of Boston, Massachusetts.

**13.1.1 Problem Statement**

The goal of this project is to build a model that can predict the median value of owner-occupied homes in Boston, given a set of features such as crime rate, average number of rooms per dwelling, and others.

**13.1.2 Dataset**

The dataset used in this project comes from the UCI Machine Learning Repository. It was first published in 1978, and each of its 506 entries holds aggregate information about homes in one suburb or town of the Boston area, described by 14 attributes: 13 features plus the target, MEDV.

The features can be summarized as follows:

- CRIM: This is the per capita crime rate by town
- ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
- INDUS: This is the proportion of non-retail business acres per town.
- CHAS: This is the Charles River dummy variable (this is equal to 1 if tract bounds river; 0 otherwise)
- NOX: This is the nitric oxides concentration (parts per 10 million)
- RM: This is the average number of rooms per dwelling
- AGE: This is the proportion of owner-occupied units built prior to 1940
- DIS: This is the weighted distances to five Boston employment centers
- RAD: This is the index of accessibility to radial highways
- TAX: This is the full-value property-tax rate per $10,000
- PTRATIO: This is the pupil-teacher ratio by town
- B: This is calculated as 1000(Bk - 0.63)², where Bk is the proportion of people of African American descent by town
- LSTAT: This is the percentage of the population classed as lower status
- MEDV: This is the median value of owner-occupied homes in $1000s

**13.1.3 Implementation**

**Step 1**

Let's start by loading the dataset and separating the features from the target variable.

```python
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
```

**Code breakdown:**

The first two statements import NumPy, for numerical computing, and pandas, for data structures and data analysis tools. The third imports the `ShuffleSplit` class from scikit-learn, which generates randomized train/test splits for cross-validation. The fourth imports the supplementary visualization code from `visuals.py`, and the fifth, `%matplotlib inline`, is a notebook magic command that renders plots inline. The sixth statement loads the Boston housing dataset from `housing.csv`. The seventh stores the target, the median value of owner-occupied homes (`MEDV`), in `prices`, and the eighth stores every remaining column in `features`. The ninth prints a success message with the number of data points and variables in the dataset.

The code above splits the dataset into features and the target variable. The features 'RM', 'LSTAT', and 'PTRATIO' give us quantitative information about each data point. The target variable, 'MEDV', is the variable we seek to predict.
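The loading code above keeps every column except `MEDV`. If `housing.csv` held more columns than the three features named here, they could be selected explicitly. The following is a minimal sketch of that case; the in-memory CSV and its values are made up for illustration and stand in for the real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for housing.csv: made-up rows, not real data.
csv_text = """CRIM,RM,LSTAT,PTRATIO,MEDV
0.1,6.5,5.0,15.0,24.0
0.2,6.0,9.0,18.0,21.5
0.3,7.1,4.0,17.5,34.5
"""
data = pd.read_csv(io.StringIO(csv_text))

# Keep only the three features used in this project, plus the target.
feature_names = ["RM", "LSTAT", "PTRATIO"]
features = data[feature_names]
prices = data["MEDV"]

print(features.shape)
print(list(features.columns))
```

Selecting columns by name, rather than dropping the unwanted ones, keeps the feature order explicit, which matters later when new rows are passed to the model for prediction.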

Next, we will calculate some descriptive statistics about the Boston housing prices.

```python
import numpy as np
import pandas as pd

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']

# Minimum price of the data
minimum_price = np.min(prices)

# Maximum price of the data
maximum_price = np.max(prices)

# Mean price of the data
mean_price = np.mean(prices)

# Median price of the data
median_price = np.median(prices)

# Standard deviation of prices of the data
std_price = np.std(prices)

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${}".format(minimum_price))
print("Maximum price: ${}".format(maximum_price))
print("Mean price: ${}".format(mean_price))
print("Median price: ${}".format(median_price))
print("Standard deviation of prices: ${:.2f}".format(std_price))
```

**Code breakdown:**

The code first imports NumPy, which provides functions for working with numerical data, and pandas, which loads the CSV file. It then defines `prices`, which holds the median home prices from the Boston housing dataset. The NumPy functions `np.min()`, `np.max()`, `np.mean()`, `np.median()`, and `np.std()` calculate the minimum, maximum, mean, median, and standard deviation of the prices, respectively. Finally, the code prints the calculated statistics.
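One detail is easy to miss when computing the standard deviation: by default `np.std()` divides by N (the population formula), whereas the pandas `Series.std()` method divides by N - 1 (the sample formula). The sketch below shows the difference on a small list of made-up numbers; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Made-up values chosen so the arithmetic is easy to follow.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# NumPy default: population standard deviation (ddof=0).
population_std = np.std(values)

# pandas default: sample standard deviation (ddof=1).
sample_std = pd.Series(values).std()

# Passing ddof=1 to NumPy reproduces the pandas default.
numpy_sample_std = np.std(values, ddof=1)

print(population_std, sample_std, numpy_sample_std)
```

With 506 rows the two results are close, but it is worth stating which one is being reported.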

We can make some working assumptions about the data before modeling. For example, we would expect houses with more rooms (a higher 'RM' value) to be worth more. We would expect neighborhoods with a larger share of lower-status residents (a higher 'LSTAT' value) to be worth less, and likewise neighborhoods with a higher student-to-teacher ratio (a higher 'PTRATIO' value).
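Assumptions like these can be checked rather than taken on faith, for instance by looking at the correlation between each feature and the target. The sketch below illustrates the idea on synthetic data only: the coefficients and ranges are invented, so the numbers it prints say nothing about the real Boston dataset.

```python
import numpy as np
import pandas as pd

# Synthetic data only: the ranges and coefficients are invented for illustration.
rng = np.random.default_rng(0)
n = 500
data = pd.DataFrame({
    "RM": rng.uniform(4, 9, n),
    "LSTAT": rng.uniform(2, 35, n),
    "PTRATIO": rng.uniform(12, 22, n),
})
data["MEDV"] = (5.0 * data["RM"] - 0.5 * data["LSTAT"]
                - 1.5 * data["PTRATIO"] + rng.normal(0, 1, n))

# Pearson correlation of each feature with the target.
correlations = data.corr()["MEDV"].drop("MEDV")
print(correlations.round(2))
```

On the real data, the same `data.corr()["MEDV"]` call would show whether the signs match the expectations stated above.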

Next, we will split the data into training and testing subsets.

```python
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

# Success
print("Training and testing split was successful.")
```

**Code breakdown:**

The code first imports the `train_test_split` function from the `sklearn.model_selection` module. Next, it defines two variables, `features` and `prices`, which hold the features and prices of the Boston housing dataset, respectively. It then calls `train_test_split` to shuffle the data and split it into training and testing subsets. The `test_size` parameter specifies that 20% of the data is held out for testing, and the `random_state` parameter seeds the shuffle so that the same split is produced on every run. Finally, the code prints a message indicating that the split was successful.
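To make the role of `random_state` concrete, the sketch below splits a made-up array twice with the same seed and once with a different seed. The data is synthetic (506 rows, to match the size of the dataset), and the variable names are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data with the same number of rows as the Boston dataset.
X = np.arange(506 * 3).reshape(506, 3)
y = np.arange(506)

# Two calls with the same random_state produce identical splits.
X_train_a, X_test_a, y_train_a, y_test_a = train_test_split(
    X, y, test_size=0.2, random_state=42)
X_train_b, X_test_b, y_train_b, y_test_b = train_test_split(
    X, y, test_size=0.2, random_state=42)
same_seed_match = np.array_equal(y_test_a, y_test_b)

# A different seed selects a different set of test rows.
_, _, _, y_test_c = train_test_split(X, y, test_size=0.2, random_state=7)
different_seed_match = np.array_equal(y_test_a, y_test_c)

print(len(y_train_a), len(y_test_a))
print(same_seed_match, different_seed_match)
```

Fixing the seed makes results reproducible between runs; leaving `random_state` unset gives a different split each time.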

We will then train a model using the decision tree algorithm. To ensure that we are producing an optimized model, we will train the model using the grid search technique to optimize the 'max_depth' parameter for the decision tree.

```python
# Import the tools used by 'fit_model'
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor


def performance_metric(y_true, y_predict):
    # ASSUMPTION: this section never defines 'performance_metric';
    # R^2 (coefficient of determination) is used here as a stand-in.
    return r2_score(y_true, y_predict)


def fit_model(X, y):
    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': list(range(1, 11))}

    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)

    # Create the grid search cv object --> GridSearchCV()
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_
```

**Code breakdown:**

The code relies on `DecisionTreeRegressor`, `make_scorer`, and `GridSearchCV`, which live in the `sklearn.tree`, `sklearn.metrics`, and `sklearn.model_selection` modules, respectively, along with `ShuffleSplit`. It defines a function called `fit_model()`, which takes two arguments, `X` and `y`, representing the training data and the target values. Inside the function, a `ShuffleSplit` object called `cv_sets` generates 10 independent random splits of the training data, each holding out 20% of the samples for validation. A `DecisionTreeRegressor` object called `regressor` is created, followed by a dictionary called `params` that maps the parameter name `max_depth` to the list of values from 1 to 10. The `make_scorer()` function wraps `performance_metric`, a metric function that must be defined before `fit_model()` is called, into a scoring function called `scoring_fnc`, which is used to evaluate each candidate model. Finally, a `GridSearchCV` object called `grid` is created from `regressor`, `params`, `scoring_fnc`, and `cv_sets`. Fitting `grid` to the data evaluates every value of `max_depth` across the splits, and `fit_model()` returns the best-scoring model, `grid.best_estimator_`, which is refit on all of the data passed in.
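`ShuffleSplit` is sometimes confused with K-fold cross-validation. The sketch below, on 50 made-up samples, shows what it does with the settings used in `fit_model()`: each of the 10 splits is drawn independently, so a given sample may land in several validation sets or in none. The counter array is an illustrative helper, not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# 50 made-up samples; only the row count matters here.
X = np.arange(50).reshape(-1, 1)

cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

split_sizes = []
times_in_validation = np.zeros(len(X), dtype=int)
for train_idx, val_idx in cv_sets.split(X):
    # Record the (train, validation) sizes of this split.
    split_sizes.append((len(train_idx), len(val_idx)))
    # Count how often each sample is held out for validation.
    times_in_validation[val_idx] += 1

print(split_sizes[0])
print(times_in_validation.min(), times_in_validation.max())
```

With K-fold, every sample would be held out exactly once; here the counts vary from sample to sample, which is the practical difference between the two schemes.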

Finally, we will make predictions on new sets of input data.

```python
# Assume reg is the trained model obtained from fit_model,
# e.g. reg = fit_model(X_train, y_train)

# Produce a matrix for client data
# Assumed column order: RM, LSTAT, PTRATIO
client_data = [[5, 17, 15],  # Client 1
               [4, 32, 22],  # Client 2
               [8, 3, 12]]   # Client 3

# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i + 1, price))
```

**Code breakdown:**

The code first creates a matrix called `client_data`, with one row of feature values per client. It then calls the `reg.predict()` method to predict the selling price for each client. The `enumerate()` function pairs each predicted price with its index, so adding 1 to the index gives the client number. The loop prints the predicted selling price for each client.
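Because `housing.csv` may not be at hand, the following is a self-contained sketch of the whole workflow on synthetic data. Several things here are assumptions rather than part of the original project: the data-generating formula is invented, R² is assumed as the scoring metric, and `random_state=0` is passed to the tree only to make the run repeatable.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for housing.csv; the relationship below is invented.
rng = np.random.default_rng(0)
n = 300
features = pd.DataFrame({
    "RM": rng.uniform(4, 9, n),
    "LSTAT": rng.uniform(2, 35, n),
    "PTRATIO": rng.uniform(12, 22, n),
})
prices = (300_000 + 50_000 * features["RM"] - 5_000 * features["LSTAT"]
          - 8_000 * features["PTRATIO"] + rng.normal(0, 10_000, n))

X_train, X_test, y_train, y_test = train_test_split(
    features, prices, test_size=0.2, random_state=42)


def fit_model(X, y):
    """Grid-search max_depth for a decision tree (R^2 scoring is an assumption)."""
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    grid = GridSearchCV(
        estimator=DecisionTreeRegressor(random_state=0),
        param_grid={"max_depth": list(range(1, 11))},
        scoring=make_scorer(r2_score),
        cv=cv_sets,
    )
    grid.fit(X, y)
    return grid.best_estimator_


reg = fit_model(X_train, y_train)
best_depth = reg.get_params()["max_depth"]
test_r2 = r2_score(y_test, reg.predict(X_test))

# Wrap client rows in a DataFrame so column names match the training data.
client_data = pd.DataFrame(
    [[5, 17, 15], [4, 32, 22], [8, 3, 12]],
    columns=["RM", "LSTAT", "PTRATIO"],
)
predictions = reg.predict(client_data)

print("Best max_depth:", best_depth)
print("Test R^2: {:.3f}".format(test_r2))
```

Scoring the returned model on `X_test`, which the grid search never saw, gives a more honest estimate of performance than the cross-validation score used to pick `max_depth`.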

This project provides a practical application of machine learning in a real-world setting. It demonstrates how to use regression to predict house prices based on various features. The code provided can be used as a starting point for further exploration and experimentation.

## 13.1 Project 1: Predicting House Prices with Regression

**13.1.1 Problem Statement**

**13.1.2 Dataset**

The features can be summarized as follows:

- CRIM: This is the per capita crime rate by town
- ZN: This is the proportion of residential land zoned for lots larger than 25,000 sq.ft.
- INDUS: This is the proportion of non-retail business acres per town.
- NOX: This is the nitric oxides concentration (parts per 10 million)
- RM: This is the average number of rooms per dwelling
- AGE: This is the proportion of owner-occupied units built prior to 1940
- DIS: This is the weighted distances to five Boston employment centers
- RAD: This is the index of accessibility to radial highways
- TAX: This is the full-value property-tax rate per $10,000
- PTRATIO: This is the pupil-teacher ratio by town
- LSTAT: This is the percentage lower status of the population
- MEDV: This is the median value of owner-occupied homes in $1000s

**13.1.3 Implementation**

**Step 1**

Let's start by loading the dataset and removing the non-essential features.

```python
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit

# Import supplementary visualizations code visuals.py
import visuals as vs

# Pretty display for notebooks
%matplotlib inline

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))
```

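**Code breakdown:**

The code first imports NumPy, pandas, `ShuffleSplit`, and the supplementary `visuals` module. It then reads `housing.csv` into a DataFrame called `data`, stores the `MEDV` column in `prices`, and keeps the remaining columns in `features`. Finally, it prints the number of data points and variables in the dataset.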
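As a quick sanity check on the loading step, the following minimal sketch reads a tiny in-memory CSV instead of `housing.csv`. The column names are borrowed from the feature list, but the three rows of values are made up for illustration, so the output says nothing about the real dataset. It separates the target from the features in the same way and also counts missing values per column.

```python
import io

import pandas as pd

# A tiny made-up CSV standing in for 'housing.csv' (values are illustrative only)
csv_text = """RM,LSTAT,PTRATIO,MEDV
6.5,5.0,15.3,500000
6.4,9.1,17.8,450000
7.2,4.0,17.8,720000
"""

data = pd.read_csv(io.StringIO(csv_text))

# Separate the target column from the input features
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

print("Shape (rows, columns):", data.shape)
print("Feature columns:", list(features.columns))

# Count missing values in each column before modelling
print("Missing values per column:")
print(data.isnull().sum())
```

Running the same checks on the real file is a cheap way to confirm that the target column has been removed from `features` and that no column contains gaps.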

Next, we will calculate some descriptive statistics about the Boston housing prices.

```python
import numpy as np
import pandas as pd

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']

# Minimum price of the data
minimum_price = np.min(prices)

# Maximum price of the data
maximum_price = np.max(prices)

# Mean price of the data
mean_price = np.mean(prices)

# Median price of the data
median_price = np.median(prices)

# Standard deviation of prices of the data
std_price = np.std(prices)

# Show the calculated statistics
print("Statistics for Boston housing dataset:\n")
print("Minimum price: ${}".format(minimum_price))
print("Maximum price: ${}".format(maximum_price))
print("Mean price: ${}".format(mean_price))
print("Median price: ${}".format(median_price))
print("Standard deviation of prices: ${:.2f}".format(std_price))
```

**Code breakdown:**

The code first defines a variable called `prices`, which contains the median home prices in the Boston housing dataset. It then uses the NumPy functions `min()`, `max()`, `mean()`, `median()`, and `std()` to calculate the minimum, maximum, mean, median, and standard deviation of the prices, respectively. Finally, the code prints the calculated statistics.
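One detail is worth keeping in mind when reading the standard deviation: by default, `np.std` computes the population standard deviation (`ddof=0`), whereas the pandas `Series.std` method defaults to the sample standard deviation (`ddof=1`). The sketch below illustrates the difference on a small made-up series; the numbers are not taken from `housing.csv`.

```python
import numpy as np
import pandas as pd

# Made-up prices for illustration only (not from housing.csv)
prices = pd.Series([200000.0, 250000.0, 300000.0, 350000.0, 400000.0])

# NumPy default: population standard deviation (divide by n)
population_std = np.std(prices.to_numpy())

# pandas default: sample standard deviation (divide by n - 1)
sample_std = prices.std()

# NumPy with ddof=1 reproduces the pandas default
matching_std = np.std(prices.to_numpy(), ddof=1)

print("Population std (ddof=0): {:,.2f}".format(population_std))
print("Sample std (ddof=1):     {:,.2f}".format(sample_std))
print("NumPy with ddof=1:       {:,.2f}".format(matching_std))
```

With several hundred data points the two values are close, but it helps to state which one is being reported.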

Next, we will split the data into training and testing subsets.

```python
# Import libraries necessary for this project
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the Boston housing dataset
data = pd.read_csv('housing.csv')
prices = data['MEDV']
features = data.drop('MEDV', axis=1)

# Success
print("Boston housing dataset has {} data points with {} variables each.".format(*data.shape))

# Shuffle and split the data into training and testing subsets
X_train, X_test, y_train, y_test = train_test_split(features, prices, test_size=0.2, random_state=42)

# Success
print("Training and testing split was successful.")
```

**Code breakdown:**

The code first imports the `train_test_split` function from the `sklearn.model_selection` module. Next, the code defines two variables, `features` and `prices`, which contain the features and prices of the Boston housing dataset, respectively. The code then uses the `train_test_split` function to shuffle the data and split it into training and testing subsets. The `test_size` parameter specifies that 20% of the data should be used for testing, and the `random_state` parameter fixes the seed of the random shuffle so that the same split is produced on every run. Finally, the code prints a message indicating that the training and testing split was successful.
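To make the role of `random_state` concrete, the following minimal sketch calls `train_test_split` twice with the same seed and checks that both calls return the same test set. It uses a small synthetic array rather than the housing data, so the specific values carry no meaning.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 10 samples with 2 features each (not the housing data)
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two calls with the same seed produce the same shuffled split
_, X_test_a, _, y_test_a = train_test_split(X, y, test_size=0.2, random_state=42)
_, X_test_b, _, y_test_b = train_test_split(X, y, test_size=0.2, random_state=42)

print("Test targets, first call: ", y_test_a)
print("Test targets, second call:", y_test_b)
print("Identical splits:", np.array_equal(y_test_a, y_test_b))

# With test_size=0.2, 2 of the 10 samples are held out for testing
print("Test set size:", len(y_test_a))
```

Fixing the seed makes results reproducible across runs, which matters when comparing models on the same held-out data.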

```python
# Import 'ShuffleSplit' and the other scikit-learn tools used below
from sklearn.metrics import make_scorer
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor

def fit_model(X, y):
    # Note: 'performance_metric' must already be defined; it is not shown in this listing

    # Create cross-validation sets from the training data
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)

    # Create a decision tree regressor object
    regressor = DecisionTreeRegressor()

    # Create a dictionary for the parameter 'max_depth' with a range from 1 to 10
    params = {'max_depth': list(range(1, 11))}

    # Transform 'performance_metric' into a scoring function using 'make_scorer'
    scoring_fnc = make_scorer(performance_metric)

    # Create the grid search cv object --> GridSearchCV()
    grid = GridSearchCV(estimator=regressor, param_grid=params, scoring=scoring_fnc, cv=cv_sets)

    # Fit the grid search object to the data to compute the optimal model
    grid = grid.fit(X, y)

    # Return the optimal model after fitting the data
    return grid.best_estimator_
```

**Code breakdown:**

The code first imports `DecisionTreeRegressor`, `make_scorer`, and `GridSearchCV` from the `sklearn.tree`, `sklearn.metrics`, and `sklearn.model_selection` modules, respectively. Next, the code defines a function called `fit_model()`, which takes two arguments, `X` and `y`, which represent the training data and the target values, respectively. The code then creates a `ShuffleSplit` object called `cv_sets`, which generates 10 random shuffled splits of the training data, with 20% of the data held out for validation in each split. Next, the code creates a `DecisionTreeRegressor` object called `regressor`. The code then creates a dictionary called `params`, which maps the parameter name `max_depth` to a list of values from 1 to 10. The code then uses the `make_scorer()` function to create a scoring function called `scoring_fnc`, which will be used to evaluate the performance of the different models. Finally, the code creates a `GridSearchCV` object called `grid`, which will be used to search for the optimal model. The `grid` object is passed the `regressor`, `params`, `scoring_fnc`, and `cv_sets` objects. The `grid` object is then fit to the data, which trains one model per `max_depth` value on each split and selects the depth with the best average score. The optimal model is then returned from the `fit_model()` function.
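`fit_model()` relies on a `performance_metric` function that is not shown in this section. The sketch below assumes it returns the coefficient of determination (R²) computed with `sklearn.metrics.r2_score`. That choice is an assumption made for illustration; any function that takes `(y_true, y_predict)` and returns a score where higher is better would work with `make_scorer` as used here. The data is synthetic, and the regressor is given a fixed `random_state` only to make the sketch repeatable.

```python
import numpy as np
from sklearn.metrics import make_scorer, r2_score
from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.tree import DecisionTreeRegressor


def performance_metric(y_true, y_predict):
    """Assumed metric (not confirmed by the text): R^2 between true and predicted values."""
    return r2_score(y_true, y_predict)


def fit_model(X, y):
    """Grid-search 'max_depth' for a decision tree regressor on the data [X, y]."""
    cv_sets = ShuffleSplit(n_splits=10, test_size=0.20, random_state=0)
    # random_state is an addition for repeatability; the original listing omits it
    regressor = DecisionTreeRegressor(random_state=0)
    params = {'max_depth': list(range(1, 11))}
    scoring_fnc = make_scorer(performance_metric)
    grid = GridSearchCV(estimator=regressor, param_grid=params,
                        scoring=scoring_fnc, cv=cv_sets)
    grid.fit(X, y)
    return grid.best_estimator_


# Synthetic regression data (not the housing data)
rng = np.random.RandomState(0)
X = rng.uniform(0, 10, size=(200, 3))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

reg = fit_model(X, y)
print("Selected max_depth:", reg.get_params()['max_depth'])

# A perfect prediction yields the maximum R^2 score
perfect_score = performance_metric([3.0, 1.0, 2.0], [3.0, 1.0, 2.0])
print("Score for a perfect prediction:", perfect_score)
```

Because `GridSearchCV` refits the best configuration on all of the supplied data by default, the returned estimator is ready to call `predict()` on.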

Finally, we will make predictions on new sets of input data.

```python
# Assume reg is the trained model obtained from fit_model

# Produce a matrix for client data
client_data = [[5, 17, 15],  # Client 1
               [4, 32, 22],  # Client 2
               [8, 3, 12]]   # Client 3

# Show predictions
for i, price in enumerate(reg.predict(client_data)):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i + 1, price))
```

**Code breakdown:**

The code first creates a matrix called `client_data`, which contains the client data. It then calls the `reg.predict()` method to predict the selling price for each client and uses `enumerate()` to iterate over the predicted prices together with a running index, which serves as the client number. Finally, it prints the predicted selling price for each client.
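Note that `reg.predict()` expects each row of `client_data` to contain the same number of values, in the same column order, as the data the model was trained on. The following minimal sketch illustrates this with a decision tree trained on synthetic three-feature data. The training values and the price formula are made up, so the printed prices are not meaningful predictions.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic training data with three features (a stand-in for the real training set)
rng = np.random.RandomState(0)
X_train = rng.uniform(1, 40, size=(100, 3))
y_train = 100000 + 50000 * X_train[:, 0] - 2000 * X_train[:, 1]

reg = DecisionTreeRegressor(max_depth=4, random_state=0).fit(X_train, y_train)

client_data = [[5, 17, 15],   # Client 1
               [4, 32, 22],   # Client 2
               [8, 3, 12]]    # Client 3

# Each client row must have as many values as the model has input features
assert all(len(row) == reg.n_features_in_ for row in client_data)

predictions = reg.predict(client_data)
for i, price in enumerate(predictions, start=1):
    print("Predicted selling price for Client {}'s home: ${:,.2f}".format(i, price))

# A row with the wrong number of features is rejected with a ValueError
try:
    reg.predict([[5, 17]])
    mismatch_rejected = False
except ValueError:
    mismatch_rejected = True
print("Mismatched input rejected:", mismatch_rejected)
```

Passing `start=1` to `enumerate()` is an equivalent alternative to adding 1 to the index when printing.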