Set and get hyperparameters (v2)
The process of learning a predictive model is driven by the training data and by a set of parameters that are not learned from the data but set beforehand. These parameters are called hyperparameters and are specific to each family of models. In addition, the optimal set of hyperparameters is specific to each dataset, so they need to be tuned.
- Loading the data
- LogisticRegression with default values
- Setting a hyperparameter ourselves
- Testing a hyperparameter manually
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
import time
myData = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")
myData = myData.drop(columns="education-num")
print(f"The dataset contains {myData.shape[0]} samples and {myData.shape[1]} features")
target_column = 'class'
target = myData[target_column]
data = myData.drop(columns=target_column)
from sklearn.compose import make_column_selector as selector
#
numerical_columns = selector(dtype_exclude=object)(data)
categorical_columns = selector(dtype_include=object)(data)
all_columns = numerical_columns + categorical_columns
data = data[all_columns]
data_numerical = data[numerical_columns]
data_categorical = data[categorical_columns]
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_validate
from sklearn.model_selection import ShuffleSplit
#
model = Pipeline([
("preprocessor", StandardScaler()),
("classifier", LogisticRegression(max_iter=500))])
cv = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
cv_results = cross_validate(model, data_numerical, target, cv=cv, return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
We created a model with the default `C` value, which is equal to 1. If we wanted to use a different `C` parameter, we could have set it when creating the `LogisticRegression` object, e.g. `LogisticRegression(C=1e-3)`.
`C` : Inverse of regularization strength; must be a positive float. Smaller values specify stronger regularization.
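To make the meaning of `C` concrete, here is a small sketch (on a synthetic dataset generated with `make_classification`, not the census data) showing that a smaller `C`, i.e. stronger regularization, shrinks the learned coefficients:

```python
# Illustration: smaller C (stronger regularization) shrinks coefficients.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)

norms = {}
for C in [1e-3, 1.0]:
    clf = LogisticRegression(C=C, max_iter=500).fit(X, y)
    norms[C] = np.abs(clf.coef_).mean()
    print(f"C={C}: mean |coef| = {norms[C]:.3f}")
```

With `C=1e-3` the mean absolute coefficient is much smaller than with the default `C=1.0`.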
for parameter in model.get_params():
print(parameter)
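Note the naming convention visible in the list above: within a `Pipeline`, the parameters of each step are addressed as `<step_name>__<parameter_name>` (double underscore). A minimal standalone sketch (step names chosen for illustration):

```python
# Pipeline parameters follow the <step_name>__<parameter> convention.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([("scaler", StandardScaler()),
                 ("clf", LogisticRegression())])
pipe.set_params(clf__C=0.01)          # set C on the "clf" step
print(pipe.get_params()["clf__C"])    # -> 0.01
```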
model.set_params(classifier__C=1e-3)
cv_results = cross_validate(model, data_numerical, target, cv=cv, return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
model.get_params()['classifier__C']
for C in [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]:
model.set_params(classifier__C=C)
cv_results = cross_validate(model, data_numerical, target, cv=cv, return_train_score=True)
scores = cv_results["test_score"]
train_scores = cv_results["train_score"]
fit_time = cv_results["fit_time"]
print(f"Accuracy score via cross-validation with C={C}:")
print("The accuracy in TRAIN is "
f"{train_scores.mean():.3f} +/- {train_scores.std():.3f}")
print("The accuracy in TEST is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds\n")
We can see that, as long as `C` is high enough (i.e. the regularization is weak enough), the model performs well on this dataset.
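As an alternative to the manual loop above, scikit-learn provides `validation_curve` to compute train and test scores over a range of a single hyperparameter. A hedged sketch on a synthetic dataset (not the census data):

```python
# Scan C values with validation_curve instead of a manual loop.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import validation_curve

X, y = make_classification(n_samples=500, random_state=0)
Cs = [1e-4, 1e-2, 1, 10]
train_scores, test_scores = validation_curve(
    LogisticRegression(max_iter=500), X, y,
    param_name="C", param_range=Cs, cv=5)
for C, s in zip(Cs, test_scores.mean(axis=1)):
    print(f"C={C}: mean test accuracy = {s:.3f}")
```

Each row of `test_scores` holds the cross-validation scores for one `C` value, so the mean over `axis=1` matches the per-`C` averages printed in the loop above.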