Hyperparameter tuning
How to optimize hyperparameters using a grid-search or randomized-search approach
- Preparation
- Construction of the model with default hyperparameters
- Search for hyperparameters with a random search and cross-validation
- Search for hyperparameters with a grid search and cross-validation
- Search for hyperparameters with a grid search and without cross-validation (bad)
In the previous notebook, we saw that hyperparameters can affect the statistical performance of a model. In this notebook, we will show how to optimize hyperparameters using randomized-search and grid-search approaches.
import pandas as pd
import matplotlib.pyplot as plt
import time
import random
from sklearn.compose import make_column_selector as selector
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
# for the moment this line is required to import HistGradientBoostingClassifier
from sklearn.experimental import enable_hist_gradient_boosting
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_validate
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import GridSearchCV
import seaborn as sns
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")
myDataFrame = myDataFrame.drop(columns="education-num")
target_column = 'class'
data = myDataFrame.drop(columns=target_column)
target = myDataFrame[target_column]
numerical_columns = selector(dtype_exclude=object)(data)
data_numerical = myDataFrame[numerical_columns]
categorical_columns = selector(dtype_include=object)(data)
data_categorical = myDataFrame[categorical_columns]
all_columns = numerical_columns + categorical_columns
data = data[all_columns]
categorical_preprocessor = OrdinalEncoder(handle_unknown="use_encoded_value",
                                          unknown_value=-1)
preprocessor = ColumnTransformer([
    ('cat-preprocessor', categorical_preprocessor, categorical_columns)],
    remainder='passthrough', sparse_threshold=0)
model = Pipeline([
    ("preprocessor", preprocessor),
    ("classifier",
     HistGradientBoostingClassifier(random_state=42, max_leaf_nodes=4))])
cv_results = cross_validate(model, data, target)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The accuracy is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
for parameter in model.get_params():
    print(parameter)
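The double underscore in names such as classifier__learning_rate separates the pipeline step name from the hyperparameter name; this is the convention used by the searches below. As a minimal sketch (the learning_rate value of 0.5 and the variable names are arbitrary choices for illustration), a single hyperparameter can be changed with set_params and the pipeline re-evaluated:
from sklearn.base import clone
# Illustrative only: work on a clone so the original `model` is left
# untouched for the searches below; the learning_rate value is arbitrary.
model_alt = clone(model).set_params(classifier__learning_rate=0.5)
cv_results_alt = cross_validate(model_alt, data, target)
scores_alt = cv_results_alt["test_score"]
print("The accuracy with learning_rate=0.5 is "
      f"{scores_alt.mean():.3f} +/- {scores_alt.std():.3f}")
Cloning keeps the original model, with its default hyperparameters, intact for the automated searches that follow.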
from scipy.stats import loguniform
from scipy.stats import randint
# max_leaf_nodes must be an integer >= 2, so sample it from a discrete
# (randint) distribution rather than a continuous one
param_distributions = {
    'classifier__learning_rate': loguniform(0.0001, 5),
    'classifier__max_leaf_nodes': randint(2, 100)}
model_random_search = RandomizedSearchCV(model, param_distributions, n_jobs=4, cv=2)
cv_results = cross_validate(model_random_search, data, target, return_estimator=True)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print(f"The accuracy via cross-validation is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
print(f"Best parameter found")
for fold_idx, estimator in enumerate(cv_results["estimator"]):
print(f" on fold #{fold_idx + 1} : {estimator.best_params_}")
param_grid = {
    'classifier__learning_rate': (0.05, 0.1, 0.5, 1, 5),
    'classifier__max_leaf_nodes': (3, 10, 30, 100)}
model_grid_search = GridSearchCV(model, param_grid, n_jobs=4, cv=2)
cv_results = cross_validate(model_grid_search, data, target, return_estimator=True)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print(f"The accuracy via cross-validation is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
print(f"Best parameter found")
for fold_idx, estimator in enumerate(cv_results["estimator"]):
print(f" on fold #{fold_idx + 1} : {estimator.best_params_}")
Search for hyperparameters with a grid search and without cross-validation (bad)
Be aware that the evaluation should normally be performed in a cross-validation framework by providing model_grid_search as a model to the cross_validate function as above.
Here, we are using a single train-test split to highlight the specificities of the model_grid_search instance.
data_train, data_test, target_train, target_test = train_test_split(
    data, target, random_state=42)
model_grid_search.fit(data_train, target_train)
print(f"The best set of parameters is: "
f"{model_grid_search.best_params_}")
cv_results_grid_train = pd.DataFrame(model_grid_search.cv_results_)
column_results = [f"param_{name}" for name in param_grid.keys()]
column_results += [
    "mean_test_score", "std_test_score", "rank_test_score"]
cv_results_grid_train = cv_results_grid_train[column_results]
def shorten_param(param_name):
    if "__" in param_name:
        return param_name.rsplit("__", 1)[1]
    return param_name
cv_results_grid_train = cv_results_grid_train.rename(shorten_param, axis=1)
cv_results_grid_train
pivoted_cv_results_grid_train = cv_results_grid_train.pivot_table(
    values="mean_test_score", index=["learning_rate"],
    columns=["max_leaf_nodes"])
pivoted_cv_results_grid_train
ax = sns.heatmap(pivoted_cv_results_grid_train, annot=True, cmap="YlGnBu",
                 vmin=0.7, vmax=0.9)
ax.invert_yaxis()
The above tables highlight the following things:
- for too high values of learning_rate, the statistical performance of the model is degraded and adjusting the value of max_leaf_nodes cannot fix that problem;
- outside of this pathological region, we observe that the optimal choice of max_leaf_nodes depends on the value of learning_rate;
- in particular, we observe a "diagonal" of good models with an accuracy close to the maximal of 0.87: when the value of max_leaf_nodes is increased, one should increase the value of learning_rate accordingly to preserve a good accuracy.
For now we will note that, in general, there is no unique optimal parameter setting: 6 models out of the 20 parameter configurations reach the maximal accuracy (up to small random fluctuations caused by the sampling of the training set).
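To make this observation concrete, here is a small sketch (the tolerance of 0.002 is an arbitrary choice) that counts how many configurations of the grid search fitted above fall within a narrow margin of the best mean test score:
# Count the configurations whose mean test score is within an (arbitrary)
# tolerance of the best one, reusing the results extracted above.
tolerance = 0.002
best_score = cv_results_grid_train["mean_test_score"].max()
near_best = cv_results_grid_train[
    cv_results_grid_train["mean_test_score"] >= best_score - tolerance]
print(f"{len(near_best)} out of {len(cv_results_grid_train)} configurations "
      f"are within {tolerance} of the best mean test score")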