Encoding of categorical variables (v2)
Dealing with categorical variables by encoding them, using two strategies: ordinal encoding and one-hot encoding
- Loading the dataset
- Identifying categorical variables
- Encoding ordinal categories
- Encoding nominal categories (without assuming any order)
- Fitting a LogisticRegression on categorical variables
import pandas as pd
import numpy as np
# import matplotlib.pyplot as plt
# import seaborn as sns
import time
myData = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")
myData = myData.drop(columns="education-num")  # redundant with the 'education' column
print(f"The dataset contains {myData.shape[0]} samples and {myData.shape[1]} columns, including the target")
target_column = 'class'
target = myData[target_column]
data = myData.drop(columns=target_column)
from sklearn.compose import make_column_selector as selector
#
numerical_columns = selector(dtype_exclude=object)(data)
categorical_columns = selector(dtype_include=object)(data)
all_columns = numerical_columns + categorical_columns
data = data[all_columns]
data_numerical = data[numerical_columns]
data_categorical = data[categorical_columns]
data_categorical
data_categorical["native-country"].value_counts()
data_categorical["native-country"].value_counts().sort_index()
Encoding ordinal categories
Using an OrdinalEncoder outputs ordinal categories.
This means that there is an order in the resulting categories (e.g. 0 < 1 < 2). The impact of violating this ordering assumption depends on the downstream model: linear models are affected by misordered categories, while tree-based models are not.
OrdinalEncoder is often a good strategy with tree-based models
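As a minimal sketch (assuming scikit-learn >= 1.0, where HistGradientBoostingClassifier is importable without the experimental flag), such a combination could look as follows; it is only illustrative and not evaluated here:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
#
# trees only compare encoded values against thresholds, so the arbitrary integer codes do not hurt them
tree_model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
    HistGradientBoostingClassifier(),
)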
You can still use an OrdinalEncoder with linear models, but you need to be sure that:
- the original categories (before encoding) have a meaningful order;
- the encoded categories follow the same order as the original categories.
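A minimal sketch, using a hypothetical toy column (not part of the adult census data), of how to enforce that ordering explicitly through the categories parameter of OrdinalEncoder:
from sklearn.preprocessing import OrdinalEncoder
#
toy_sizes = pd.DataFrame({"size": ["M", "L", "S", "S", "L"]})  # hypothetical ordered column: S < M < L
size_encoder = OrdinalEncoder(categories=[["S", "M", "L"]])  # one ordered list of categories per encoded column
size_encoder.fit_transform(toy_sizes)
# array([[1.], [2.], [0.], [0.], [2.]]): the codes follow the declared order
Without an explicit list, OrdinalEncoder sorts the categories lexicographically, which is rarely a meaningful order.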
categorical_column = data_categorical[["workclass"]]
categorical_column.value_counts()
from sklearn.preprocessing import OrdinalEncoder
#
encoder = OrdinalEncoder()
categorical_encoded = encoder.fit_transform(categorical_column)
categorical_encoded[:5]
categorical_column[:5]
encoder.categories_
Encoding nominal categories (without assuming any order)
Each category (unique value) becomes a column; for each sample, the encoding puts a 1 in the column corresponding to its category and 0 everywhere else.
OneHotEncoder is the encoding strategy used when the downstream models are linear models
One-hot encoding categorical variables with high cardinality can cause computational inefficiency in tree-based models. Because of this, it is not recommended to use OneHotEncoder in such cases even if the original categories do not have a given order.
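As a quick sketch, the cardinality (number of distinct values) of each categorical column can be inspected with pandas before choosing an encoder:
# number of unique categories per column, largest first
data_categorical.nunique().sort_values(ascending=False)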
from sklearn.preprocessing import OneHotEncoder
#
encoder = OneHotEncoder(sparse_output=False)  # 'sparse_output' replaces the deprecated 'sparse' parameter (scikit-learn >= 1.2)
categorical_column = data_categorical[["workclass"]]
categorical_encoded = encoder.fit_transform(categorical_column)
categorical_encoded[:5]
sparse_output=False is used in the OneHotEncoder for didactic purposes, namely easier visualization of the data. Sparse matrices are efficient data structures when most of the matrix elements are zero.
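A small sketch of what the default sparse output looks like, for comparison; the sparse matrix stores only the non-zero entries:
sparse_encoder = OneHotEncoder()  # sparse output is the default
sparse_encoded = sparse_encoder.fit_transform(categorical_column)
print(type(sparse_encoded).__name__, "stores", sparse_encoded.nnz,
      "non-zero values out of", sparse_encoded.shape[0] * sparse_encoded.shape[1])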
feature_names = encoder.get_feature_names_out(input_features=["workclass"])
categorical_encoded = pd.DataFrame(categorical_encoded, columns=feature_names)
categorical_encoded[:5]
data_categorical.head()
data_encoded = encoder.fit_transform(data_categorical)
data_encoded[:2]
columns_encoded = encoder.get_feature_names_out(data_categorical.columns)
pd.DataFrame(data_encoded, columns=columns_encoded)[:2]
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_validate
#
model = make_pipeline(
    OneHotEncoder(handle_unknown="ignore"),  # categories unseen during fit are encoded as all zeros instead of raising an error
    LogisticRegression(max_iter=500),
)
cv_results = cross_validate(model, data_categorical, target, cv=10)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The accuracy is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")
model = make_pipeline(
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=100),  # categories unseen during fit are encoded as 100
    LogisticRegression(max_iter=500),
)
cv_results = cross_validate(model, data_categorical, target, cv=10)
scores = cv_results["test_score"]
fit_time = cv_results["fit_time"]
print("The accuracy is "
f"{scores.mean():.3f} +/- {scores.std():.3f}, for {fit_time.mean():.3f} seconds")