The Dummy classifier will give us an idea of the "minimum" quality we can achieve.

It returns either a fixed value or the most frequent class of the training set.

Its score will serve as a floor for later estimates: the objective is to do better, or much better, than this dummy!

Preparation

import pandas as pd
import numpy  as np
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

The target

target_column = 'Species'
target = myDataFrame[target_column]
target.value_counts()
Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64

Here we have the weight (proportion) of each class, and therefore the minimum quality of the estimate: always predicting the majority class, 'Adelie', is already right about 44% of the time.

target.value_counts(normalize=True)
Adelie       0.441520
Gentoo       0.359649
Chinstrap    0.198830
Name: Species, dtype: float64
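
Since the 'prior' strategy used below always predicts the majority class, the largest of these proportions is the accuracy we can expect from that baseline. A quick sanity check (the baseline_accuracy name is only for illustration):

# largest class proportion = expected accuracy of a majority-class baseline
baseline_accuracy = target.value_counts(normalize=True).max()
print(f"Expected accuracy of the majority-class baseline: {baseline_accuracy:.3f}")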

Continuation of preparation

data = myDataFrame.drop(columns=target_column)
data.columns
Index(['Culmen Length (mm)', 'Culmen Depth (mm)'], dtype='object')
numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
data_numeric = data[numerical_columns]
data_train, data_test, target_train, target_test = train_test_split(
    data_numeric, 
    target, 
    #random_state=42, 
    test_size=0.25)

The dummy model

Prior (the default), same as most_frequent for predictions

The value returned is the most frequent class in the training set.

model = DummyClassifier(strategy='prior')
model = DummyClassifier()  # equivalent: 'prior' is the default strategy
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 1.0}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.465
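
Since 'prior' always predicts 'Adelie', this score is simply the proportion of 'Adelie' in the test split, which we can check directly with the variables already defined:

# proportion of the majority class in the test split;
# should match the dummy model's score above
(target_test == 'Adelie').mean()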

Stratified

The value returned is drawn at random, respecting the class distribution of the training set.

# model = DummyClassifier(strategy='stratified', random_state=some_integer)  # optional, for reproducible draws
model = DummyClassifier(strategy='stratified')
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 0.5116279069767442,
 'Chinstrap': 0.16279069767441862,
 'Gentoo': 0.32558139534883723}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.337
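
The expected accuracy of the stratified strategy is the sum of the squared class proportions (the probability of predicting a class times the probability of that class being the true one), roughly 0.44² + 0.36² + 0.20² ≈ 0.36; the observed 0.337 fluctuates around this value. A small check of that expectation (the proportions name is only for illustration):

# expected accuracy of the stratified strategy: sum over classes of p(c)^2
proportions = target.value_counts(normalize=True)
(proportions ** 2).sum()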

Uniform

The value returned is drawn uniformly at random from the classes seen during training.

# model = DummyClassifier(strategy='uniform', random_state=some_integer)  # optional, for reproducible draws
model = DummyClassifier(strategy='uniform')
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Adelie': 0.3488372093023256,
 'Chinstrap': 0.3023255813953488,
 'Gentoo': 0.3488372093023256}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.372
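
With three classes, a uniform guess is expected to be right about one time in three; the observed 0.372 is just random fluctuation around that value:

# expected accuracy of the uniform strategy: 1 / number of classes
1 / target.nunique()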

Constant

Always predicts a constant label provided by the user. This is useful for metrics that evaluate a non-majority class.

model = DummyClassifier(strategy='constant', constant="oneConstant")
model = DummyClassifier(strategy='constant', constant="Chinstrap")
model.fit(data_train, target_train);
a = model.predict(data_test)
n = a.size
unique, counts = np.unique(a, return_counts=True)
dict(zip(unique, counts/n))
{'Chinstrap': 1.0}
accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.221
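
As with 'prior', this score is just the proportion of the chosen class, here 'Chinstrap', in the test split:

# proportion of 'Chinstrap' in the test split;
# should match the dummy model's score above
(target_test == 'Chinstrap').mean()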

Conclusion

The best baseline here is the 'prior' (most frequent class) strategy, so any model that does better than it is worth keeping.

We thus have a floor value; beyond that floor, the higher the score, the better the model.
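
For example, a real model trained on the same two numeric features should clear this floor comfortably. A minimal sketch, assuming a standard scaler plus logistic regression pipeline (the real_model name and the resulting score are not part of the original run):

# a real model to compare against the dummy baseline (illustrative sketch)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
real_model = make_pipeline(StandardScaler(), LogisticRegression())
real_model.fit(data_train, target_train)
print(f"Accuracy of logistic regression: {real_model.score(data_test, target_test):.3f}")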