Underfitting and overfitting (v2)
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns
import time
from sklearn.datasets import fetch_california_housing
housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target
target *= 100  # rescale the target to k$ (the original unit is $100,000)
print(f"The dataset data contains {data.shape[0]} samples and {data.shape[1]} features")
data.dtypes
data.head()
target.head()
target.plot.hist(bins=20, edgecolor="black")
plt.xlabel("Median House Value (k$)")
_ = plt.title("Target distribution")
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
#
model = DecisionTreeRegressor(random_state=0)
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_results = cross_validate(model, data, target, cv=cv, return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
# Passing scoring="r2" makes the metric explicit; R2 is the default score of a
# regressor, so the results match the previous run.
cv_results = cross_validate(model, data, target, cv=cv, scoring="r2", return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
cv_results = cross_validate(model,
                            data, target,
                            cv=cv,
                            scoring="neg_mean_absolute_error",
                            return_train_score=True)
cv_results = pd.DataFrame(cv_results)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
scores = -cv_results["test_score"]
train_scores = -cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The mean absolute error (k$) in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The mean absolute error (k$) in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
from sklearn.model_selection import validation_curve
#
max_depth = list(range(1, 26))  # candidate values for the max_depth hyperparameter
train_scores, test_scores = validation_curve(
    model,
    data, target,
    param_name="max_depth",
    param_range=max_depth,
    cv=cv,
    scoring="neg_mean_absolute_error")
# Flip the sign of the negated MAE so that we plot errors in k$,
# consistent with the discussion below.
train_errors, test_errors = -train_scores, -test_scores
plt.errorbar(max_depth, train_errors.mean(axis=1),
             yerr=train_errors.std(axis=1), label="Training error")
plt.errorbar(max_depth, test_errors.mean(axis=1),
             yerr=test_errors.std(axis=1), label="Testing error")
plt.legend()
plt.xlabel("Maximum depth of decision tree")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Validation curve for decision tree")
The validation curve can be divided into three areas:
For max_depth < 9, the decision tree underfits. The training error and therefore the testing error are both high. The model is too constrained and cannot capture much of the variability of the target variable.
The region around max_depth = 9 corresponds to the parameter for which the decision tree generalizes the best. It is flexible enough to capture a fraction of the variability of the target that generalizes, while not memorizing all of the noise in the target.
For max_depth > 9, the decision tree overfits. The training error becomes very small, while the testing error increases. In this region, the model creates decision rules tailored to noisy samples, which harms its ability to generalize to test data.
Note that for max_depth = 9, the model still overfits a bit, as there is a gap between the training error and the testing error. It may also underfit a bit at the same time, because the training error is still far from zero (more than 30 k$), meaning that the model might still be too constrained to capture interesting parts of the data. However, the testing error is minimal, and this is what really matters: it is the best compromise we can reach by tuning this parameter alone.
We should also look at the standard deviation of the errors to assess their dispersion; it is shown as the error bars in the plot above. We were lucky here that this dispersion is small compared to the error values themselves, so the conclusions above are quite clear. This is not necessarily always the case.
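To make the "best compromise" explicit, here is a minimal sketch that reads the best max_depth off the validation curve by picking the depth with the smallest mean testing error (it simply reuses the max_depth list and the test_errors array computed above):
best_index = test_errors.mean(axis=1).argmin()  # index of the smallest mean testing error
print(f"Best max_depth: {max_depth[best_index]}, "
      f"mean testing error: {test_errors.mean(axis=1)[best_index]:.1f} k$")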
model = DecisionTreeRegressor(random_state=0, max_depth=9)
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_results = cross_validate(model, data, target, cv=cv, return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
from sklearn.model_selection import learning_curve
#
train_sizes = [x / 100 for x in range(1, 101)]  # fractions of the training set, from 1% to 100%
results = learning_curve(
    model, data, target,
    train_sizes=train_sizes,
    cv=cv,
    scoring="neg_mean_absolute_error")
train_size, train_scores, test_scores = results[:3]
# Flip the sign of the negated MAE so that we plot errors in k$,
# consistent with the validation curve above.
train_errors, test_errors = -train_scores, -test_scores
plt.errorbar(train_size, train_errors.mean(axis=1),
             yerr=train_errors.std(axis=1), label="Training error")
plt.errorbar(train_size, test_errors.mean(axis=1),
             yerr=test_errors.std(axis=1), label="Testing error")
plt.legend()
plt.xscale("log")
plt.xlabel("Number of samples in the training set")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Learning curve for decision tree")
If we reach a plateau where adding new samples to the training set no longer reduces the testing error, we may have reached the Bayes error rate (the irreducible error) for the available model; using a more complex model might then be the only way to reduce the testing error further.
Here, this is not the case: the testing error has not reached such a plateau.
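To make the plateau check concrete, here is a minimal sketch (reusing the train_size and test_errors arrays from the learning curve above) that compares the mean testing error at the largest training sizes with the error a few points earlier; a noticeable drop indicates that more data still helps:
mean_test_errors = test_errors.mean(axis=1)
improvement = mean_test_errors[-10] - mean_test_errors[-1]  # gain from the last ~10% of the training data
print(f"Testing error with {train_size[-10]} samples: {mean_test_errors[-10]:.1f} k$")
print(f"Testing error with {train_size[-1]} samples: {mean_test_errors[-1]:.1f} k$")
print(f"Improvement from adding the remaining samples: {improvement:.1f} k$")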