

20. Linear regression

In previous classes we have used exploratory approaches to visualize and quantify relationships between variables. Now we will use linear regression to start to build models that can make predictions based on these relationships.

Import the needed libraries

import sklearn as sk
import sklearn.metrics #so that sk.metrics is available for the error metrics below
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

import statsmodels.api as sm #for running regression!
import statsmodels.formula.api as smf

20.1. Load the data

Download the ‘bostonHouses.csv’ from the class’s shared data folder and load it into a dataframe.

Note: one way to get the data in is to drag and drop the csv file into the Files tab on the left. However, with this method the data will be removed when we leave our session (i.e., it isn’t saved to your Google Drive).

df_boston = pd.read_csv('/content/bostonHouses.csv')
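
If you would rather the file persist between sessions, one option is to mount your Google Drive and read the csv from there (a sketch; the path below is just an example and depends on where you saved the file):

#mount your Google Drive inside Colab
from google.colab import drive
drive.mount('/content/drive')

#example path - adjust to wherever the csv lives on your Drive
df_boston = pd.read_csv('/content/drive/MyDrive/bostonHouses.csv')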

Q: What kinds of data do you have?

?

Q: Are there missing values anywhere?

?
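
If you get stuck, one possible approach using standard pandas methods (a sketch):

#data types and non-null counts for every column
df_boston.info()

#number of missing values in each column
df_boston.isna().sum()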

20.2. Visualize and Explore

Plot the house prices on the y-axis, with some other variables on the x-axis.

Generally, the value we are trying to predict is called the *response* variable, while the values we are using to make those predictions are the *predictor* variables.

sns.scatterplot(data=df_boston, x="?",y="price")
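
For example, assuming we start with the average number of rooms per dwelling (RM):

sns.scatterplot(data=df_boston, x="RM", y="price")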

Create a heat map to help you identify potentially interesting relationships.

df_boston_corr = df_boston.?()
sns.heatmap(data=?)
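
One possible completion (a sketch): compute the pairwise correlations with pandas’ corr() method and pass the result to the heatmap.

df_boston_corr = df_boston.corr()
sns.heatmap(data=df_boston_corr, annot=True) #annot=True prints the correlation values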

20.3. Build and train a model

Let’s build our first model: linear regression!

How well can we predict the price of a house based on the average number of rooms per dwelling (i.e., the RM value)?

#Build the model
linear_reg_model = smf.ols(formula='price ~ RM', data=df_boston)

#Use the data to fit the model (i.e., find the best intercept and slope parameters)
linear_reg_results = linear_reg_model.fit()

#make predictions using the model
df_boston['price_pred'] = linear_reg_results.predict(df_boston)

Let’s take a look at the predictions!

df_boston

Let’s plot the predictions

sns.scatterplot(data=df_boston,x='RM', y='price')
sns.scatterplot(data=df_boston,x='RM', y='price_pred')

We can see that all the predicted points fall along a line: price = a + b * RM.

Let’s take a look at the values the model estimated for a (the intercept) and b (the slope on RM).

print(linear_reg_results.summary())
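
If you just want the two numbers, the fitted results object also exposes them directly through its params attribute (a sketch; 'Intercept' and 'RM' are the names statsmodels gives the parameters from our formula):

#pull out the intercept (a) and slope (b) by name
a = linear_reg_results.params['Intercept']
b = linear_reg_results.params['RM']
print(f"price = {a:.2f} + {b:.2f} * RM")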

Before moving on let’s drop the predictions we have made so far.

df_boston = df_boston.drop('price_pred',axis=1)

20.4. Training / Testing Split



We will follow a general approach when building models: divide the dataset into *training* and *testing* datasets.
This lets us fit the model to one part of the data and then use the withheld data to test the model’s predictions. This helps us avoid *overfitting* our model!

#load libraries to do the training and testing split
from sklearn.model_selection import train_test_split

#Split the dataframe into 80% training and 20% testing datasets
df_train, df_test = train_test_split(df_boston, test_size=0.20)
#take a look at the shape of the training dataset
df_train.shape
#take a look at the shape of the testing dataset
df_test.shape
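
Note: train_test_split shuffles the rows randomly, so your split (and the error scores later on) will differ run to run. If you want a reproducible split, you can pass a seed (a sketch; 42 is an arbitrary choice):

#fix the random seed so the same rows end up in train/test each run
df_train, df_test = train_test_split(df_boston, test_size=0.20, random_state=42)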

20.5. Fit a model

In general, when fitting a model with statsmodels we will follow these steps:

#define model parameters
linear_reg_split_model = smf.ols(formula='?', data=df_train) #note: using training data

#fit the model to the training data
linear_reg_split_results = linear_reg_split_model.?

#predict values in the training and testing dataset
df_train['price_pred'] = ?(df_train) #note: using train data
df_test['price_pred'] = ?(df_test) #note: using test data

#Get a summary of the model parameters
print(linear_reg_split_results.summary())
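
If you get stuck, here is one possible completion (a sketch, reusing the simple rooms-only model from before):

#define model parameters
linear_reg_split_model = smf.ols(formula='price ~ RM', data=df_train)

#fit the model to the training data
linear_reg_split_results = linear_reg_split_model.fit()

#predict values in the training and testing dataset
df_train['price_pred'] = linear_reg_split_results.predict(df_train)
df_test['price_pred'] = linear_reg_split_results.predict(df_test)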

Visualize predictions on the training dataset

?(data=df_train,x="RM",y="price") # observed price
?(data=df_train,x="RM",y="price_pred") #predicted price

Visualize predictions on the testing dataset

?(data=df_test,x="RM",y="price") # observed price
?(data=df_test,x="RM",y="price_pred") # predicted price

How good is the model at predicting?

Measuring prediction error on the training dataset

#mean squared error
mse_train = sk.metrics.mean_squared_error(df_train['price'], df_train['price_pred']) 

print(" Mean squared error = ", mse_train)

Measuring prediction error on the testing dataset (not used to fit the model)

#mean squared error
mse_test = sk.metrics.mean_squared_error(df_test['price'], df_test['price_pred']) 

print(" Mean squared error = ", mse_test)

Q: Which prediction error is higher?

Q: Is all that error just noise? Or could there be other variables that might explain why the predictions are off?

20.6. Fit a more complex model

This time we will try multiple linear regression

#define model parameters, and the training data to be used
multi_linear_reg = smf.ols(formula='price ~ RM + ZN', data=df_train) #use training data

#fit the model to the training data
results_RM_ZN = multi_linear_reg.fit()

#Predict values in the testing dataset
df_test['price_pred_RM_ZN'] = results_RM_ZN.predict(df_test) #predict on testing data

#Get a summary of the model parameters
print(results_RM_ZN.summary())

Visualize and explore these predictions

df_test

First let’s look at how the model predicts the price of houses in the testing dataset. Now that we have two predictors, we’ll have to look at them one at a time.
Let’s look at the number of rooms (RM) first:

sns.scatterplot(data=df_test,x='RM', y='price')
sns.scatterplot(data=df_test,x='RM', y='price_pred_RM_ZN')

Then at the proportion of large lots in the area (ZN):

sns.scatterplot(data=df_test,x='ZN', y='price')
sns.scatterplot(data=df_test,x='ZN', y='price_pred_RM_ZN')

How good is the model at predicting?

#mean squared error
mse_multi = sk.metrics.mean_squared_error(df_test['price'], df_test['price_pred_RM_ZN']) 

print(" Mean squared error = ", mse_multi.round(2))

Q: How does that compare to our simple model?
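
One way to compare them directly (a sketch; this assumes mse_test from the simple model is still in memory):

print("Simple model (RM):       ", round(mse_test, 2))
print("Multiple model (RM + ZN):", round(mse_multi, 2))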

20.7. Try adding more variables!

Run a linear regression model to predict house prices. Try to beat the MSE of the previous models! Feel free to post your results and MSE scores to Slack! Does the MSE always decrease as you add more variables?

#define model parameters
large_linear_reg = ?

#fit the model to the training data
large_linear_reg_res = ?

#predict with the full model
df_test['price_pred_full'] = ?

#Get a summary of the model parameters
print(?) 

How well does it do on the test data?

#mean squared error
mse_full = ?

print(" Mean squared error = ", ?)

20.8. Explaining how the model is making predictions

With linear regression we can look to see what features are important when making predictions. We can also see the direction and magnitude of the effect of these features.

E.g., more rooms in a house are positively associated with house price

Let’s take a look at how to make it easier to see which features are important when making predictions.

To do this we’ll ensure that all numeric features are on the same scale (i.e., a mean of 0 and a standard deviation of 1).
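
Under the hood, standardizing a column just subtracts its mean and divides by its standard deviation. For a single column it looks like this (a sketch; note that StandardScaler divides by the population standard deviation, so pandas’ default std() gives a very slightly different answer):

#standardize one column by hand: z = (x - mean) / standard deviation
z = (df_boston['RM'] - df_boston['RM'].mean()) / df_boston['RM'].std()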

Data wrangling: preprocessing the data

from sklearn.preprocessing import StandardScaler

Data preprocessing should be done after the training/testing split, so that information from the test set doesn’t leak into the training data.

#Split the dataframe into 80% training and 20% testing datasets
df_train, df_test = train_test_split(df_boston, test_size=0.20)

To make this preprocessing step easier, we will use sklearn’s StandardScaler. The steps to use it are:

  1. Build the scaler

  2. Fit the scaler to the training data and

  3. Use the fitted scaler to transform the data

#create a copy of your dataframe to transform
df_train_scaled = df_train.copy()

#1. build the preprocessing scaler
scal = StandardScaler()

#2. Fit the scaler on the training data, and transform the data
df_train_scaled[:] = scal.fit_transform(df_train)

#take a look
df_train_scaled
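
If you also want a scaled copy of the testing data, transform it with the scaler that was fitted on the training data. Use transform rather than fit_transform, so no information from the test set leaks into the preprocessing (a sketch; this assumes df_test has the same columns the scaler was fitted on):

#create a copy of the testing dataframe to transform
df_test_scaled = df_test.copy()

#transform (don't refit!) with the scaler fitted on the training data
df_test_scaled[:] = scal.transform(df_test)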

Q: Estimate the mean and standard deviation for one of the new transformed features.
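
One way to check your estimate (a sketch): after scaling, the mean should be close to 0 and the standard deviation close to 1.

print(round(df_train_scaled['RM'].mean(), 3))
print(round(df_train_scaled['RM'].std(), 3))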

Now run your best model again, this time with the scaled training data.

#define model parameters, and the training data to be used
best_linear_reg_scaled = smf.ols(formula='price ~ RM', data=df_train_scaled) 

#fit the model to the training data
?

#Get a summary of the model parameters
?
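
One possible completion (a sketch), following the same fit-then-summarize pattern as before:

#fit the model to the scaled training data
best_linear_reg_scaled_res = best_linear_reg_scaled.fit()

#Get a summary of the (now standardized) model parameters
print(best_linear_reg_scaled_res.summary())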

We can now compare the relative effect of each of the features in predicting the price of a house.

e.g., the magnitude and direction of each parameter estimate.

20.9. Further reading

Read more about using statsmodels to run regression models.

If you would like the notebook without missing code check out the full code version.