Preparation

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

The target

target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()
Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64
target.value_counts(normalize=True)
Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64

Preparing and splitting the data

data = myDataFrame.drop(columns=target_column)
data.columns
Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric,
    target,
    # random_state=42,  # left unset here, so the split differs between runs
    test_size=0.25)
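Because the split above is random and unstratified, the class proportions in the train and test sets can drift from the proportions shown by value_counts. A sketch of how stratify would preserve them exactly, using a hypothetical toy target rather than the penguins data:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80% class "A", 20% class "B".
toy_target = pd.Series(["A"] * 80 + ["B"] * 20)
toy_data = pd.DataFrame({"x": range(100)})

# stratify=toy_target forces both splits to keep the 80/20 ratio.
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_data, toy_target, test_size=0.25,
    stratify=toy_target, random_state=0)
```

With 75 training samples, exactly 60 are "A" and 15 are "B", matching the original 80/20 proportions.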
data_train.describe()
       Culmen Length (mm)  Culmen Depth (mm)
count          256.000000         256.000000
mean            43.830469          17.151953
std              5.461854           1.917841
min             32.100000          13.100000
25%             39.200000          15.675000
50%             43.700000          17.300000
75%             48.500000          18.600000
max             59.600000          21.500000
_ = data_train.hist(figsize=(10, 5))

Normalization

scaler = StandardScaler()
data_train_scaled = scaler.fit_transform(data_train)
data_train_scaled = pd.DataFrame(data_train_scaled,
                                 columns=data_train.columns)
data_train_scaled.describe()
       Culmen Length (mm)  Culmen Depth (mm)
count        2.560000e+02        2.560000e+02
mean        -2.151057e-16        7.910339e-16
std          1.001959e+00        1.001959e+00
min         -2.151915e+00       -2.116907e+00
25%         -8.494440e-01       -7.716210e-01
50%         -2.393405e-02        7.734577e-02
75%          8.566099e-01        7.565192e-01
max          2.892868e+00        2.271598e+00
_ = data_train_scaled.hist(figsize=(10, 5))
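The scaler was fitted on the training data only; its learned statistics should then be reused on the test set via transform, never refitted there. A minimal sketch with synthetic stand-in measurements (illustration only, not the penguins file):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the two culmen features.
rng = np.random.default_rng(0)
cols = ["Culmen Length (mm)", "Culmen Depth (mm)"]
train = pd.DataFrame(rng.normal(44, 5, size=(100, 2)), columns=cols)
test = pd.DataFrame(rng.normal(44, 5, size=(30, 2)), columns=cols)

scaler = StandardScaler().fit(train)  # statistics learned on train only
test_scaled = pd.DataFrame(scaler.transform(test),  # reuse them on test
                           columns=cols)
```

Fitting a second scaler on the test set would leak test statistics into the preprocessing and make the evaluation optimistic.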

Conclusion

This transformer shifts and scales each feature independently so that every feature has zero mean and unit standard deviation. Note that it does not change the shape of each feature's distribution, as the histograms before and after scaling show.
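The standardization above can be checked by hand: for each feature, StandardScaler computes z = (x - mean) / std, using the biased (ddof=0) standard deviation. A quick sketch on a few sample culmen lengths:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

x = np.array([[32.1], [43.7], [59.6]])  # sample culmen lengths (mm)
scaled = StandardScaler().fit_transform(x)

# NumPy's std defaults to ddof=0, matching StandardScaler.
manual = (x - x.mean()) / x.std()
```

Both computations yield the same standardized values.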