Imports

import pandas as pd
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")

First analysis

print(f"The dataset contains {myDataFrame.shape[0]} samples and "
      f"{myDataFrame.shape[1]} columns")
The dataset contains 48842 samples and 14 columns
myDataFrame.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 25 Private 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K
1 38 Private HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K
2 28 Local-gov Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
3 44 Private Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K
4 18 ? Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States <=50K

Which column is our target to predict?

target_column = 'class'

target_y = myDataFrame[target_column]
data_X = myDataFrame.drop(columns=target_column)
target_y.value_counts()
 <=50K    37155
 >50K     11687
Name: class, dtype: int64
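The raw counts above show the classes are imbalanced (37155 vs. 11687, roughly 76% / 24%). Passing `normalize=True` to `value_counts` makes the proportions explicit. A minimal sketch, using a tiny stand-in Series instead of the real `target_y` (note the labels in this dataset carry a leading space):

```python
import pandas as pd

# Stand-in for target_y; the real column is about 76% " <=50K".
target_y = pd.Series([" <=50K"] * 3 + [" >50K"] * 1, name="class")

# normalize=True turns counts into proportions, making the
# class imbalance explicit.
proportions = target_y.value_counts(normalize=True)
print(proportions)
```

On the real data this reports about 0.761 for " <=50K" and 0.239 for " >50K".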
data_X.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 ? Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States

Crosstab

A crosstab is useful to detect columns that encode the same information in two different forms (and are thus perfectly correlated). When this is the case, one of the columns can be dropped. Here we drop "education-num", since each "education" level maps to exactly one "education-num" value.

pd.crosstab(index=data_X['education'],
            columns=data_X['education-num'])
education-num 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
education
10th 0 0 0 0 0 1389 0 0 0 0 0 0 0 0 0 0
11th 0 0 0 0 0 0 1812 0 0 0 0 0 0 0 0 0
12th 0 0 0 0 0 0 0 657 0 0 0 0 0 0 0 0
1st-4th 0 247 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5th-6th 0 0 509 0 0 0 0 0 0 0 0 0 0 0 0 0
7th-8th 0 0 0 955 0 0 0 0 0 0 0 0 0 0 0 0
9th 0 0 0 0 756 0 0 0 0 0 0 0 0 0 0 0
Assoc-acdm 0 0 0 0 0 0 0 0 0 0 0 1601 0 0 0 0
Assoc-voc 0 0 0 0 0 0 0 0 0 0 2061 0 0 0 0 0
Bachelors 0 0 0 0 0 0 0 0 0 0 0 0 8025 0 0 0
Doctorate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 594
HS-grad 0 0 0 0 0 0 0 0 15784 0 0 0 0 0 0 0
Masters 0 0 0 0 0 0 0 0 0 0 0 0 0 2657 0 0
Preschool 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Prof-school 0 0 0 0 0 0 0 0 0 0 0 0 0 0 834 0
Some-college 0 0 0 0 0 0 0 0 0 10878 0 0 0 0 0 0
data_X = data_X.drop(columns="education-num")
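Instead of inspecting the crosstab by eye, the one-to-one mapping can also be checked programmatically: group by one column and count the distinct values of the other. A sketch on a tiny stand-in frame with the same redundancy as the census columns:

```python
import pandas as pd

# Tiny stand-in reproducing the education / education-num redundancy.
df = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters"],
    "education-num": [9, 9, 13, 14],
})

# If every education level maps to exactly one education-num,
# the two columns carry the same information.
n_codes = df.groupby("education")["education-num"].nunique()
is_redundant = (n_codes == 1).all()
print(is_redundant)  # → True
```

On the real data this also returns True, which justifies dropping "education-num".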

Separation between numerical and categorical columns

print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")
The dataset data_X contains 48842 samples and 12 columns
data_X.dtypes
age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

We separate the column names according to their data type

By hand

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 
                       'race', 'sex', 'native-country']

Select columns based on their data type

from sklearn.compose import make_column_selector as selector
categorical_columns = selector(dtype_include="object")(data_X)
categorical_columns
['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
numerical_columns = selector(dtype_include="int64")(data_X)
numerical_columns
['age', 'capital-gain', 'capital-loss', 'hours-per-week']
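Note that `dtype_include="int64"` only matches that exact dtype. The selector also accepts the broader `"number"` alias (anything pandas considers numeric), which is more robust when some numeric columns are floats. A sketch on a tiny stand-in frame:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Tiny stand-in frame mixing numeric and object columns.
df = pd.DataFrame({
    "age": [25, 38],
    "hours-per-week": [40.0, 50.0],  # float64, not int64
    "workclass": ["Private", "Private"],
})

# "number" matches every numeric dtype (int64, float64, ...),
# so float columns are not silently dropped.
numerical = selector(dtype_include="number")(df)
categorical = selector(dtype_include=object)(df)
print(numerical, categorical)
```

With `dtype_include="int64"` on this stand-in frame, "hours-per-week" would be missed.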
all_columns = numerical_columns + categorical_columns

data_X = data_X[all_columns]
print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")
The dataset data_X contains 48842 samples and 12 columns
data_X[numerical_columns].describe()
age capital-gain capital-loss hours-per-week
count 48842.000000 48842.000000 48842.000000 48842.000000
mean 38.643585 1079.067626 87.502314 40.422382
std 13.710510 7452.019058 403.004552 12.391444
min 17.000000 0.000000 0.000000 1.000000
25% 28.000000 0.000000 0.000000 40.000000
50% 37.000000 0.000000 0.000000 40.000000
75% 48.000000 0.000000 0.000000 45.000000
max 90.000000 99999.000000 4356.000000 99.000000
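The quartiles above already reveal that "capital-gain" and "capital-loss" are 0 for most samples (even the 75% quantile is 0). A quick way to quantify this, sketched on a stand-in Series shaped like the quartiles suggest:

```python
import pandas as pd

# Stand-in for the capital-gain column: mostly zeros plus a
# few large values, as the quartiles above suggest.
capital_gain = pd.Series([0, 0, 0, 0, 0, 0, 7688, 99999])

# Fraction of exact zeros; on the real data this exceeds 0.75,
# since even the third quartile is 0.
zero_fraction = (capital_gain == 0).mean()
print(zero_fraction)  # → 0.75
```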
data_X_numerical = data_X[numerical_columns]

The model

Train-test split the dataset

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_X_numerical, 
    target_y, 
    random_state=42, 
    test_size=0.25)
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")
Number of samples in testing: 12211 => 25.0% of the original set
print(f"Number of samples in training: {data_train.shape[0]} => "
      f"{data_train.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")
Number of samples in training: 36631 => 75.0% of the original set
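Because the classes are imbalanced, it can be worth passing `stratify=target_y` so that both subsets keep the original class proportions (plain random splitting only approximates them). A sketch with synthetic 80/20 labels in place of the census target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 80 low-income / 20 high-income samples.
X = pd.DataFrame({"age": range(100)})
y = pd.Series([" <=50K"] * 80 + [" >50K"] * 20)

# stratify=y keeps the 80/20 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25, stratify=y)

print(y_test.value_counts(normalize=True))
```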

To display a nice model diagram

from sklearn import set_config
set_config(display='diagram')

To create a logistic regression model in scikit-learn

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

Use the fit method to train the model using the training data and labels

model.fit(data_train, target_train)
LogisticRegression()

Use the score method to check the model's statistical performance on the test set

accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.807
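For context, 0.807 should be compared against the trivial baseline of always predicting the majority class, which already reaches about 0.76 on this dataset given the class proportions seen earlier. A minimal sketch with synthetic labels mirroring those proportions:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in: features are ignored by the dummy model.
X = np.zeros((100, 1))
y = np.array(["<=50K"] * 76 + [">50K"] * 24)

# Always predicting the majority class gives the baseline
# accuracy that any real model should beat.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))  # → 0.76
```

So the logistic regression on numerical columns alone improves only modestly over the baseline.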
