Linear regression
A linear regression model minimizes the mean squared error on the training set. This means that the parameters obtained after the fit (i.e. coef_ and intercept_) are the optimal parameters that minimize the mean squared error. In other words, any other choice of parameters yields a model with a higher mean squared error on the training set.
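The optimality claim above can be checked numerically. The following is a minimal sketch under stated assumptions: the data are synthetic (not the penguins dataset), and np.polyfit with deg=1 is used as a stand-in for LinearRegression, since both solve an ordinary least-squares problem. The helper mse and the perturbation sizes are chosen only for illustration.

```python
import numpy as np

# Synthetic data (hypothetical, not the penguins dataset): a noisy line.
rng = np.random.default_rng(0)
x = rng.uniform(170, 230, size=200)
y = 50.0 * x - 5800.0 + rng.normal(scale=300.0, size=200)


def mse(slope, intercept):
    """Mean squared error of the line slope * x + intercept on (x, y)."""
    return np.mean((y - (slope * x + intercept)) ** 2)


# np.polyfit with deg=1 returns the least-squares slope and intercept.
best_slope, best_intercept = np.polyfit(x, y, deg=1)
best_mse = mse(best_slope, best_intercept)

# Nudge one parameter at a time: every nudge should increase the error.
perturbations = [(0.5, 0.0), (-0.5, 0.0), (0.0, 50.0), (0.0, -50.0)]
perturbed_mse = [
    mse(best_slope + d_slope, best_intercept + d_intercept)
    for d_slope, d_intercept in perturbations
]

print(f"MSE at the least-squares solution: {best_mse:,.2f}")
for (d_slope, d_intercept), value in zip(perturbations, perturbed_mse):
    print(f"MSE with slope {d_slope:+.1f}, intercept {d_intercept:+.1f}: {value:,.2f}")
```

Every perturbed line has a larger training error than the fitted one, which is what "optimal parameters" means here. This says nothing about the error on unseen data.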
However, the mean squared error is difficult to interpret because it is expressed in squared units of the target. The mean absolute error is more intuitive since it is expressed in the same unit as the target.
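To make the difference in units concrete, here is a small worked example. The body masses and predictions are made up for illustration and do not come from the dataset. It also shows the root mean squared error, which is one common way to bring the squared error back to the unit of the target.

```python
import numpy as np

# Hypothetical body masses in grams and made-up predictions.
y_true = np.array([3000.0, 3500.0, 4000.0, 4500.0])
y_pred = np.array([3100.0, 3300.0, 4000.0, 4900.0])

errors = y_true - y_pred          # -100, 200, 0, -400 grams
mse = np.mean(errors ** 2)        # unit: grams squared
mae = np.mean(np.abs(errors))     # unit: grams
rmse = np.sqrt(mse)               # unit: grams again

print(f"MSE:  {mse:,.1f} g^2")
print(f"MAE:  {mae:,.1f} g")
print(f"RMSE: {rmse:,.1f} g")
```

The MSE of 52,500 g^2 is hard to relate to a penguin's weight, whereas the MAE of 175 g reads directly as a typical error size. The RMSE is larger than the MAE here because squaring gives more weight to the single 400 g error.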
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import time
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_validate
from sklearn.metrics import mean_absolute_error
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_regression.csv")
myDataFrame.head()
feature_names = "Flipper Length (mm)"
target_name = "Body Mass (g)"
data, target = myDataFrame[[feature_names]], myDataFrame[target_name]
sns.scatterplot(x=data[feature_names], y=target, color="black", alpha=0.5);
sns.pairplot(myDataFrame);
corr_df = myDataFrame.corr(method='pearson')
plt.figure(figsize=(8, 6))
sns.heatmap(corr_df, annot=True)
plt.show()
model = LinearRegression();
R2 coefficient of determination
The R² score represents the proportion of variance of the target that is explained by the independent variables in the model. The best possible score is 1, and there is no lower bound: a model can be arbitrarily worse than a trivial baseline. A constant model that always predicts the mean of the target, regardless of the features, gets a score of 0.
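The three cases described above (perfect score, zero for the mean predictor, no lower bound) can be reproduced by computing the score by hand. This is a minimal sketch using the usual definition 1 - SS_res / SS_tot on a toy target; the helper name r2 and the toy values are assumptions made for illustration.

```python
import numpy as np


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

r2_perfect = r2(y, y)                            # predictions equal the target
r2_constant = r2(y, np.full_like(y, y.mean()))   # always predict the mean
r2_reversed = r2(y, y[::-1])                     # worse than predicting the mean

print(f"Perfect predictions:  {r2_perfect:.1f}")
print(f"Mean predictor:       {r2_constant:.1f}")
print(f"Reversed predictions: {r2_reversed:.1f}")
```

The three scores are 1.0, 0.0 and -3.0. A negative value simply means the predictions fit worse than the constant mean baseline.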
cv_results = cross_validate(model, data, target, scoring='r2')
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The R2 is "
f"{scores.mean():,.3f} +/- {scores.std():,.3f}, for {fit_time.mean():,.3f} seconds")
# Cross-validate with the negated MSE scorer and keep the train scores as well.
cv_results = cross_validate(model, data, target, scoring='neg_mean_squared_error', return_train_score=True)
# scikit-learn scorers follow "higher is better", so negate to recover the error.
train_error = -cv_results["train_score"]
print(f"Mean squared error of linear regression model on the train set:\n"
      f"{train_error.mean():,.2f} +/- {train_error.std():,.2f}")
test_error = -cv_results["test_score"]
print(f"Mean squared error of linear regression model on the test set:\n"
      f"{test_error.mean():,.2f} +/- {test_error.std():,.2f}")
We see that the training and testing errors are close. This indicates that our model is not overfitting.
# Same procedure with the mean absolute percentage error (MAPE).
cv_results = cross_validate(model, data, target, scoring='neg_mean_absolute_percentage_error', return_train_score=True)
# Negate the scores and multiply by 100 to express the error as a percentage.
train_error = -cv_results["train_score"]*100
print(f"Mean absolute percentage error of linear regression model on the train set:\n"
      f"{train_error.mean():,.2f} % +/- {train_error.std():,.2f} %")
test_error = -cv_results["test_score"]*100
print(f"Mean absolute percentage error of linear regression model on the test set:\n"
      f"{test_error.mean():,.2f} % +/- {test_error.std():,.2f} %")
model.fit(data, target);
a = model.coef_[0]
print(f"Optimal first parameter is {a:,.2f}")
b = model.intercept_
print(f"Optimal intercept is {b:,.2f}")
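# Manual prediction from the fitted line, shown to make the formula explicit.
predicted_target = a * data + b
# model.predict computes the same values; this result is the one used below.
predicted_target = model.predict(data)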
model_error = mean_absolute_error(target, predicted_target)
print(f"The mean absolute error of the optimal model is {model_error:,.2f}")
A mean absolute error of 313 means that, on average, our model makes an error of about 313 grams (above or below the true value) when predicting the body mass of a penguin given its flipper length.
sns.scatterplot(x=data[feature_names], y=target, color="black", alpha=0.5)
_ = plt.title("Model using LinearRegression from scikit-learn")
plt.plot(data, predicted_target);