The Dummy classifier will give us an idea of the "minimum" quality we can achieve.

It returns either a fixed value or the most frequent class of the training set.

Its score will serve as a floor for later estimates: the objective is to do better, or much better, than this dummy!

Preparation

import pandas as pd
import numpy  as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

The target

target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()
Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64

Here we have the weight (proportion) of each class, and therefore the minimum quality of the estimate: always predicting the majority class, 'Adelie', is already right about 44% of the time.

target.value_counts(normalize=True)
Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64
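
Since the 'prior' strategy used below always predicts the majority class, the largest of these proportions is the accuracy we can expect from that baseline. A quick sanity check (the baseline_accuracy name is only for illustration):

# largest class proportion = expected accuracy of a majority-class baseline
baseline_accuracy = target.value_counts(normalize=True).max()
print(f"Expected accuracy of the majority-class baseline: {baseline_accuracy:.3f}")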

Continuation of preparation

data = myDataFrame.drop(columns=target_column)
data.columns
Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, 
    target, 
    #random_state=42, 
    test_size=0.25)

The dummy model

Prior (the default), same as most_frequent for predictions

The value returned is the most frequent class in the training set.

model = DummyClassifier(strategy='prior')
model = DummyClassifier()  # equivalent: 'prior' is the default strategy
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 1.0}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.465
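
Since 'prior' always predicts 'Adelie', this score is simply the proportion of 'Adelie' in the test split, which we can check directly with the variables already defined:

# proportion of the majority class in the test split;
# should match the dummy model's score above
(target_test == 'Adelie').mean()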

Stratified

The value returned is drawn at random, respecting the class distribution of the training set.

# model = DummyClassifier(strategy='stratified', random_state=some_integer)  # optional, for reproducible draws
model = DummyClassifier(strategy='stratified')
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 0.5116279069767442,
 'Chinstrap': 0.16279069767441862,
 'Gentoo': 0.32558139534883723}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.337
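
The expected accuracy of the stratified strategy is the sum of the squared class proportions (the probability of predicting a class times the probability of that class being the true one), roughly 0.44² + 0.36² + 0.20² ≈ 0.36; the observed 0.337 fluctuates around this value. A small check of that expectation (the proportions name is only for illustration):

# expected accuracy of the stratified strategy: sum over classes of p(c)^2
proportions = target.value_counts(normalize=True)
(proportions ** 2).sum()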

Uniform

The value returned is drawn uniformly at random from the classes seen during training.

# model = DummyClassifier(strategy='uniform', random_state=some_integer)  # optional, for reproducible draws
model = DummyClassifier(strategy='uniform')
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 0.3488372093023256,
 'Chinstrap': 0.3023255813953488,
 'Gentoo': 0.3488372093023256}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.372
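
With three classes, a uniform guess is expected to be right about one time in three; the observed 0.372 is just random fluctuation around that value:

# expected accuracy of the uniform strategy: 1 / number of classes
1 / target.nunique()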

Constant

Always predicts a constant label provided by the user. This is useful for metrics that evaluate a non-majority class.

model = DummyClassifier(strategy='constant', constant="oneConstant")
model = DummyClassifier(strategy='constant', constant="Chinstrap")
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Chinstrap': 1.0}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.221
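
As with 'prior', this score is just the proportion of the chosen class, here 'Chinstrap', in the test split:

# proportion of 'Chinstrap' in the test split;
# should match the dummy model's score above
(target_test == 'Chinstrap').mean()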

Conclusion

The best baseline here is the 'prior' (most frequent class) strategy, so any model that does better than it is worth keeping.

We thus have a floor value; beyond that floor, the higher the score, the better the model.
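
For example, a real model trained on the same two numeric features should clear this floor comfortably. A minimal sketch, assuming a standard scaler plus logistic regression pipeline (the real_model name and the resulting score are not part of the original run):

# a real model to compare against the dummy baseline (illustrative sketch)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
real_model = make_pipeline(StandardScaler(), LogisticRegression())
real_model.fit(data_train, target_train)
print(f"Accuracy of logistic regression: {real_model.score(data_test, target_test):.3f}")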