A linear regression model minimizes the mean squared error on the training set. This means that the parameters obtained after the fit (i.e. coef_ and intercept_) are the optimal parameters that minimize the mean squared error. In other words, any other choice of parameters will yield a model with a higher mean squared error on the training set.
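To make this optimality claim concrete, here is a minimal sketch in plain Python. It assumes a single feature and uses made-up numbers (not the penguins dataset). The helper name mse and the perturbation sizes are illustrative choices. It fits the least-squares line in closed form and checks that nudging the slope or the intercept only increases the training mean squared error.

```python
# Toy data, made up for illustration (not the penguins dataset).
x = [180.0, 190.0, 200.0, 210.0, 220.0]      # e.g. flipper length in mm
y = [3300.0, 3700.0, 4200.0, 4600.0, 5200.0]  # e.g. body mass in g


def mse(slope, intercept):
    """Mean squared error of the line slope * x + intercept on the toy data."""
    return sum((slope * xi + intercept - yi) ** 2 for xi, yi in zip(x, y)) / len(x)


# Closed-form least-squares solution for one feature.
x_mean = sum(x) / len(x)
y_mean = sum(y) / len(y)
slope = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y)) / sum(
    (xi - x_mean) ** 2 for xi in x
)
intercept = y_mean - slope * x_mean

best_mse = mse(slope, intercept)

# Every perturbed (slope, intercept) pair should give a larger training MSE.
perturbed_mse = [
    mse(slope + d_slope, intercept + d_intercept)
    for d_slope in (-1.0, 0.0, 1.0)
    for d_intercept in (-50.0, 0.0, 50.0)
    if (d_slope, d_intercept) != (0.0, 0.0)
]
all_worse = all(value > best_mse for value in perturbed_mse)

print(f"slope={slope:.2f}, intercept={intercept:.2f}, MSE={best_mse:,.2f}")
print(f"All perturbed parameters give a higher MSE: {all_worse}")
```

The check only probes a few neighbouring parameter values, so it illustrates the property rather than proving it.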

However, the mean squared error is difficult to interpret because it is expressed in squared units of the target. The mean absolute error is more intuitive since it is expressed in the same unit as the target.
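The difference in units can be seen on a tiny example. The sketch below uses three made-up body masses in grams (assumed values, not taken from the dataset) and computes both metrics by hand: the mean squared error comes out in grams squared, while the mean absolute error stays in grams.

```python
# Hypothetical true and predicted body masses in grams (made up).
y_true = [3750.0, 3800.0, 3250.0]
y_pred = [3650.0, 3900.0, 3550.0]

# Signed prediction errors in grams: -100, 100 and 300.
errors = [pred - true for true, pred in zip(y_true, y_pred)]

mse = sum(e ** 2 for e in errors) / len(errors)  # unit: grams squared
mae = sum(abs(e) for e in errors) / len(errors)  # unit: grams

print(f"Mean squared error:  {mse:,.2f} g^2")
print(f"Mean absolute error: {mae:,.2f} g")
```

The mean absolute error can be read directly as "how many grams off on average", which is not the case for the mean squared error.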

Preparation

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_absolute_error
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_regression.csv")
myDataFrame.head()
   Flipper Length (mm)  Body Mass (g)
0                181.0         3750.0
1                186.0         3800.0
2                195.0         3250.0
3                193.0         3450.0
4                190.0         3650.0
feature_names = "Flipper Length (mm)"
target_name = "Body Mass (g)"

data, target = myDataFrame[[feature_names]], myDataFrame[target_name]
sns.scatterplot(x=data[feature_names], y=target, color="black", alpha=0.5);
sns.pairplot(myDataFrame);
corr_df = myDataFrame.corr(method='pearson')

plt.figure(figsize=(8, 6))
sns.heatmap(corr_df, annot=True)
plt.show()

Model: linear regression

model = LinearRegression()

R2 coefficient of determination

The R² score represents the proportion of the variance of the target that is explained by the independent variables in the model. The best possible score is 1, but there is no lower bound. A constant model that always predicts the mean of the target, regardless of the input features, would get a score of 0.
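These three reference points (1 for a perfect model, 0 for the constant mean predictor, negative for anything worse) follow from the definition R² = 1 - SS_res / SS_tot. The sketch below computes it by hand on made-up values. The helper r2_by_hand is a hypothetical name introduced for illustration and is not part of scikit-learn.

```python
def r2_by_hand(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares) / (total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up target values (not the penguins dataset).
y_true = [3300.0, 3700.0, 4200.0, 4600.0, 5200.0]
y_mean = sum(y_true) / len(y_true)

r2_perfect = r2_by_hand(y_true, y_true)                  # exact predictions
r2_mean = r2_by_hand(y_true, [y_mean] * len(y_true))     # constant mean predictor
r2_reversed = r2_by_hand(y_true, list(reversed(y_true)))  # worse than the mean

print(f"perfect: {r2_perfect:.2f}, mean: {r2_mean:.2f}, reversed: {r2_reversed:.2f}")
```

A negative score therefore means that the model does worse than ignoring the features and always predicting the mean.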

cv_results = cross_validate(model, data, target, scoring='r2')

scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The R2 is "
      f"{scores.mean():,.3f} +/- {scores.std():,.3f}, for {fit_time.mean():,.3f} seconds")
The R2 is 0.225 +/- 0.341, for 0.002 seconds

Mean squared error of linear regression

cv_results = cross_validate(model, data, target, scoring='neg_mean_squared_error', return_train_score=True)
train_error = -cv_results["train_score"]
print(f"Mean squared error of linear regression model on the train set:\n"
      f"{train_error.mean():,.2f} +/- {train_error.std():,.2f}")
Mean squared error of linear regression model on the train set:
152,698.64 +/- 9,237.95
test_error = -cv_results["test_score"]
print(f"Mean squared error of linear regression model on the test set:\n"
      f"{test_error.mean():,.2f} +/- {test_error.std():,.2f}")
Mean squared error of linear regression model on the test set:
173,026.93 +/- 44,622.06

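We see that the training and testing errors are close: the gap between their means is smaller than the fold-to-fold standard deviation of the testing error. This suggests that our model is not overfitting.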
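One informal way to read these numbers is to compare the train/test gap with the spread of the test error across folds. The sketch below simply redoes that arithmetic with the values reported above. Treating the test standard deviation as the yardstick is an assumption made here for illustration, not a formal statistical test.

```python
# Values copied from the cross-validation output above.
train_mean, train_std = 152_698.64, 9_237.95
test_mean, test_std = 173_026.93, 44_622.06

# How much larger is the testing error than the training error?
gap = test_mean - train_mean
relative_gap = gap / train_mean

# Informal rule of thumb (an assumption): a gap below the test spread is
# not strong evidence of overfitting.
gap_within_test_std = gap < test_std

print(f"Gap between test and train MSE: {gap:,.2f} ({relative_gap:.1%} of train MSE)")
print(f"Gap smaller than test standard deviation: {gap_within_test_std}")
```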

Mean absolute percentage error

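The mean squared error depends on the scale of the target, which makes it hard to judge whether an error is large or small. The mean absolute percentage error introduces a relative scaling: each absolute error is divided by the true value of the target. scikit-learn returns it as a fraction, hence the multiplication by 100 below.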
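The relative scaling can be written out by hand as the mean of |true - predicted| / |true|. The sketch below does this on two made-up body masses. The helper mape_by_hand is a hypothetical name, and it assumes that no true value is zero.

```python
def mape_by_hand(y_true, y_pred):
    """Mean of |true - predicted| / |true|, returned as a fraction."""
    return sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up body masses in grams (not the penguins dataset).
y_true = [4000.0, 5000.0]
y_pred = [3800.0, 5500.0]

# Relative errors are 200 / 4000 = 5 % and 500 / 5000 = 10 %.
mape = mape_by_hand(y_true, y_pred)

print(f"Mean absolute percentage error: {mape * 100:.2f} %")
```

Because each error is divided by the true value, a 200 g error counts for more on a light penguin than on a heavy one.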

cv_results = cross_validate(model, data, target, scoring='neg_mean_absolute_percentage_error', return_train_score=True)
train_error = -cv_results["train_score"]*100
print(f"Mean absolute percentage error of linear regression model on the train set:\n"
      f"{train_error.mean():,.2f} % +/- {train_error.std():,.2f}")
Mean absolute percentage error of linear regression model on the train set:
7.75 % +/- 0.50
test_error = -cv_results["test_score"]*100
print(f"Mean absolute percentage error of linear regression model on the test set:\n"
      f"{test_error.mean():,.2f} % +/- {test_error.std():,.2f}")
Mean absolute percentage error of linear regression model on the test set:
8.21 % +/- 2.18

Predictions

model.fit(data, target);
a = model.coef_[0]
print(f"Optimal first parameter is {a:,.2f}")
Optimal first parameter is 49.69
b = model.intercept_
print(f"Optimal intercept is {b:,.2f}")
Optimal intercept is -5,780.83
# Equivalent to computing a * data[feature_names] + b by hand.
predicted_target = model.predict(data)
model_error = mean_absolute_error(target, predicted_target)

print(f"The mean absolute error of the optimal model is {model_error:,.2f}")
The mean absolute error of the optimal model is 313.00
A mean absolute error of 313 means that, on average, our model is off by about 313 grams (above or below the true value) when predicting the body mass of a penguin from its flipper length, on the data it was fitted on.
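To see what this means for a single prediction, the sketch below plugs the fitted parameters printed above into the line a * x + b for a hypothetical penguin with a 200 mm flipper (the 200 mm value is an assumption chosen for illustration). The mean absolute error is an average magnitude, not a bound, so individual errors can be larger.

```python
# Fitted parameters and error, copied from the outputs above.
a = 49.69        # slope in grams per millimetre
b = -5780.83     # intercept in grams
mae = 313.00     # mean absolute error in grams

# Hypothetical penguin (assumed value, not a row of the dataset).
flipper_length_mm = 200.0

predicted_mass_g = a * flipper_length_mm + b

print(f"Predicted body mass: {predicted_mass_g:,.2f} g")
print(f"Typical size of the error on the fitted data: about {mae:,.0f} g")
```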
sns.scatterplot(x=data[feature_names], y=target, color="black", alpha=0.5)

_ = plt.title("Model using LinearRegression from scikit-learn")
plt.plot(data, predicted_target);