Loading

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# import seaborn as sns
import time
from sklearn.datasets import fetch_california_housing

housing = fetch_california_housing(as_frame=True)
data, target = housing.data, housing.target
target *= 100 # rescale the target in k$
print(f"The dataset data contains {data.shape[0]} samples and {data.shape[1]} features")
The dataset data contains 20640 samples and 8 features
data.dtypes
MedInc        float64
HouseAge      float64
AveRooms      float64
AveBedrms     float64
Population    float64
AveOccup      float64
Latitude      float64
Longitude     float64
dtype: object
data.head()
MedInc HouseAge AveRooms AveBedrms Population AveOccup Latitude Longitude
0 8.3252 41.0 6.984127 1.023810 322.0 2.555556 37.88 -122.23
1 8.3014 21.0 6.238137 0.971880 2401.0 2.109842 37.86 -122.22
2 7.2574 52.0 8.288136 1.073446 496.0 2.802260 37.85 -122.24
3 5.6431 52.0 5.817352 1.073059 558.0 2.547945 37.85 -122.25
4 3.8462 52.0 6.281853 1.081081 565.0 2.181467 37.85 -122.25
target.head()
0    452.6
1    358.5
2    352.1
3    341.3
4    342.2
Name: MedHouseVal, dtype: float64
target.plot.hist(bins=20, edgecolor="black")
plt.xlabel("Median House Value (k$)")
_ = plt.title("Target distribution")

Overfit-generalization-underfit

Overfit

R2

cross_validate uses the R2 score (coefficient of determination) by default for regressors: 1.0 means a perfect fit, and 0.0 corresponds to always predicting the mean of the target.
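
As a quick illustration (my addition, with made-up toy values), the R2 score can be computed by hand and checked against sklearn.metrics.r2_score:

from sklearn.metrics import r2_score

# Toy values, only to illustrate the definition of R2 (np is imported above)
y_true = np.array([100.0, 200.0, 300.0, 400.0])
y_pred = np.array([110.0, 190.0, 330.0, 370.0])

# R2 = 1 - SS_res / SS_tot
ss_res = ((y_true - y_pred) ** 2).sum()
ss_tot = ((y_true - y_true.mean()) ** 2).sum()
print("manual R2 :", 1 - ss_res / ss_tot)
print("sklearn R2:", r2_score(y_true, y_pred))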

from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_validate
# 

model = DecisionTreeRegressor(random_state=0)

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

cv_results = cross_validate(model, data, target, cv=cv, return_train_score=True)

scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]

print("The accuracy in TRAIN is "
      f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST  is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
The accuracy in TRAIN is 1.000 +/- 0.000
The accuracy in TEST  is 0.605 +/- 0.014, for 0.112 seconds
# Requesting the R2 score explicitly gives the same results, since it is the default for regressors
cv_results = cross_validate(model, data, target, cv=cv, scoring="r2", return_train_score=True)

scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]

print("The accuracy in TRAIN is "
      f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST  is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
The accuracy in TRAIN is 1.000 +/- 0.000
The accuracy in TEST  is 0.605 +/- 0.014, for 0.112 seconds

neg_mean_absolute_error

Scikit-learn scorers follow a "higher is better" convention, so the mean absolute error is returned with a negative sign; negating it back gives the error in the target units (k$ here).

cv_results = cross_validate(model, 
                            data, target, 
                            cv=cv, 
                            scoring="neg_mean_absolute_error",
                            return_train_score=True)

cv_results = pd.DataFrame(cv_results)

# The scorer returns the negated MAE ("higher is better");
# flip the sign to get errors expressed in k$
scores = -cv_results["test_score"]
train_scores = -cv_results["train_score"]
fit_time = cv_results["fit_time"]

print("The mean absolute error (k$) in TRAIN is "
      f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The mean absolute error (k$) in TEST  is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
The mean absolute error (k$) in TRAIN is 0.000 +/- 0.000
The mean absolute error (k$) in TEST  is 46.266 +/- 0.961, for 0.114 seconds
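
As a sanity check (my addition, not part of the original runs), the sign convention can be verified on a single train/test split by computing the mean absolute error directly; the split parameters below are arbitrary:

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

data_train, data_test, target_train, target_test = train_test_split(
    data, target, test_size=0.3, random_state=0)

tree = DecisionTreeRegressor(random_state=0).fit(data_train, target_train)
mae = mean_absolute_error(target_test, tree.predict(data_test))
# This should be comparable to the ~46 k$ cross-validated estimate above
print(f"MAE on this single split: {mae:.3f} k$")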

Validation curve = find the best value of a hyperparameter

from sklearn.model_selection import validation_curve
# 
max_depth = list(range(1, 26))

train_scores, test_scores = validation_curve(
                                    model,
                                    data, target,
                                    param_name="max_depth",
                                    param_range=max_depth,
                                    scoring="neg_mean_absolute_error",
                                    cv=cv)

# The scorer returns the negated MAE; flip the sign to get errors in k$
train_errors, test_errors = -train_scores, -test_scores

plt.errorbar(max_depth, train_errors.mean(axis=1),
             yerr=train_errors.std(axis=1), label="Training error")
plt.errorbar(max_depth, test_errors.mean(axis=1),
             yerr=test_errors.std(axis=1), label="Testing error")
plt.legend()

plt.xlabel("Maximum depth of decision tree")
plt.ylabel("Mean absolute error (k$)")
_ = plt.title("Validation curve for decision tree")

The validation curve can be divided into three areas:

For max_depth < 9, the decision tree underfits. Both the training and the testing errors are high: the model is too constrained and cannot capture much of the variability of the target variable.

The region around max_depth = 9 corresponds to the parameter for which the decision tree generalizes the best. It is flexible enough to capture a fraction of the variability of the target that generalizes, while not memorizing all of the noise in the target.

For max_depth > 9, the decision tree overfits. The training error becomes very small, while the testing error increases. In this region, the model creates decision rules specific to noisy samples, which harms its ability to generalize to test data.

Note that for max_depth = 9, the model still overfits a bit, as there is a gap between the training error and the testing error. It may also underfit a bit at the same time, because the training error is still far from zero (more than 30 k$), meaning that the model might still be too constrained to capture interesting parts of the data. However, the testing error is minimal, and this is what really matters: it is the best compromise we can reach by tuning this parameter alone.

We should also look at the standard deviation to assess the dispersion of the errors across the cross-validation splits; it is shown as the error bars in the plot above.

Here, the standard deviation of the errors is small compared to their mean values, so the conclusions above are quite clear. This is not necessarily always the case.

Best value for max_depth
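
A small sketch of my own (assuming the train_errors/test_errors arrays from the validation-curve cell above are still in memory): the best depth can be read off directly from the curve.

best_index = test_errors.mean(axis=1).argmin()
print(f"max_depth minimizing the mean testing error: {max_depth[best_index]}")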

model = DecisionTreeRegressor(random_state=0, max_depth=9)

cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

cv_results = cross_validate(model, data, target, cv=cv, return_train_score=True)

scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]

print("The accuracy in TRAIN is "
      f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST  is "
      f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
The accuracy in TRAIN is 0.800 +/- 0.008
The accuracy in TEST  is 0.688 +/- 0.016, for 0.083 seconds
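
For comparability with the earlier figure of ~46 k$ (my addition, not an original cell), the tuned tree can also be scored with the mean absolute error:

mae_results = cross_validate(model, data, target, cv=cv,
                             scoring="neg_mean_absolute_error")
errors = -mae_results["test_score"]
print("The mean absolute error (k$) in TEST with max_depth=9 is "
      f"{errors.mean():.3f} +/- {errors.std():.3f}")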

Learning curve = effect of the training set size

from sklearn.model_selection import learning_curve
# 
train_sizes = [x/100 for x in range(1, 101)]

results = learning_curve(
                    model, data, target, 
                    train_sizes=train_sizes, 
                    cv=cv)

train_size, train_scores, test_scores = results[:3]

plt.errorbar(train_size, train_scores.mean(axis=1),
             yerr=train_scores.std(axis=1), label="Training score")
plt.errorbar(train_size, test_scores.mean(axis=1),
             yerr=test_scores.std(axis=1), label="Testing score")
plt.legend()

plt.xscale("log")
plt.xlabel("Number of samples in the training set")
plt.ylabel("R2")
_ = plt.title("Learning curve for decision tree")

If we reach a plateau, where adding new samples to the training set no longer improves the testing score, we might have reached the Bayes error rate for the available model: using a more complex model might then be the only way to reduce the testing error further.

Here, this is not the case: the testing score is still improving at the largest training sizes.
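
One rough way to back this up (my addition, reusing the test_scores array returned by learning_curve above) is to compare the mean testing score at roughly half of the data with the score for the full training set:

mean_test_scores = test_scores.mean(axis=1)
print(f"Mean testing score at ~50% of the training data: {mean_test_scores[len(mean_test_scores) // 2]:.3f}")
print(f"Mean testing score with the full training data : {mean_test_scores[-1]:.3f}")
# If these two values were essentially equal, the learning curve would have plateaued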