Imports

import pandas as pd
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/adult-census.csv")

First analysis

print(f"The dataset contains {myDataFrame.shape[0]} samples and "
      f"{myDataFrame.shape[1]} columns")
The dataset contains 48842 samples and 14 columns
myDataFrame.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country class
0 25 Private 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States <=50K
1 38 Private HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States <=50K
2 28 Local-gov Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States >50K
3 44 Private Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States >50K
4 18 ? Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States <=50K

Which column is our target to predict?

target_column = 'class'

target_y = myDataFrame[target_column]
data_X = myDataFrame.drop(columns=target_column)
target_y.value_counts()
 <=50K    37155
 >50K     11687
Name: class, dtype: int64
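The raw counts above show the classes are imbalanced (37155 vs. 11687, roughly 76% / 24%). Passing `normalize=True` to `value_counts` makes the proportions explicit. A minimal sketch, using a tiny stand-in Series instead of the real `target_y` (note the labels in this dataset carry a leading space):

```python
import pandas as pd

# Stand-in for target_y; the real column is about 76% " <=50K".
target_y = pd.Series([" <=50K"] * 3 + [" >50K"] * 1, name="class")

# normalize=True turns counts into proportions, making the
# class imbalance explicit.
proportions = target_y.value_counts(normalize=True)
print(proportions)
```

On the real data this reports about 0.761 for " <=50K" and 0.239 for " >50K".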
data_X.head()
age workclass education education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country
0 25 Private 11th 7 Never-married Machine-op-inspct Own-child Black Male 0 0 40 United-States
1 38 Private HS-grad 9 Married-civ-spouse Farming-fishing Husband White Male 0 0 50 United-States
2 28 Local-gov Assoc-acdm 12 Married-civ-spouse Protective-serv Husband White Male 0 0 40 United-States
3 44 Private Some-college 10 Married-civ-spouse Machine-op-inspct Husband Black Male 7688 0 40 United-States
4 18 ? Some-college 10 Never-married ? Own-child White Female 0 0 30 United-States

Crosstab

A crosstab is useful to detect columns that encode the same information in two different forms (and are thus perfectly correlated). When this is the case, one of the columns can be dropped. Here we drop "education-num", since each "education" level maps to exactly one "education-num" value.

pd.crosstab(index=data_X['education'],
            columns=data_X['education-num'])
education-num 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
education
10th 0 0 0 0 0 1389 0 0 0 0 0 0 0 0 0 0
11th 0 0 0 0 0 0 1812 0 0 0 0 0 0 0 0 0
12th 0 0 0 0 0 0 0 657 0 0 0 0 0 0 0 0
1st-4th 0 247 0 0 0 0 0 0 0 0 0 0 0 0 0 0
5th-6th 0 0 509 0 0 0 0 0 0 0 0 0 0 0 0 0
7th-8th 0 0 0 955 0 0 0 0 0 0 0 0 0 0 0 0
9th 0 0 0 0 756 0 0 0 0 0 0 0 0 0 0 0
Assoc-acdm 0 0 0 0 0 0 0 0 0 0 0 1601 0 0 0 0
Assoc-voc 0 0 0 0 0 0 0 0 0 0 2061 0 0 0 0 0
Bachelors 0 0 0 0 0 0 0 0 0 0 0 0 8025 0 0 0
Doctorate 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 594
HS-grad 0 0 0 0 0 0 0 0 15784 0 0 0 0 0 0 0
Masters 0 0 0 0 0 0 0 0 0 0 0 0 0 2657 0 0
Preschool 83 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
Prof-school 0 0 0 0 0 0 0 0 0 0 0 0 0 0 834 0
Some-college 0 0 0 0 0 0 0 0 0 10878 0 0 0 0 0 0
data_X = data_X.drop(columns="education-num")
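Instead of inspecting the crosstab by eye, the one-to-one mapping can also be checked programmatically: group by one column and count the distinct values of the other. A sketch on a tiny stand-in frame with the same redundancy as the census columns:

```python
import pandas as pd

# Tiny stand-in reproducing the education / education-num redundancy.
df = pd.DataFrame({
    "education": ["HS-grad", "HS-grad", "Bachelors", "Masters"],
    "education-num": [9, 9, 13, 14],
})

# If every education level maps to exactly one education-num,
# the two columns carry the same information.
n_codes = df.groupby("education")["education-num"].nunique()
is_redundant = (n_codes == 1).all()
print(is_redundant)  # → True
```

On the real data this also returns True, which justifies dropping "education-num".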

Separation between numerical and categorical columns

print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")
The dataset data_X contains 48842 samples and 12 columns
data_X.dtypes
age                int64
workclass         object
education         object
marital-status    object
occupation        object
relationship      object
race              object
sex               object
capital-gain       int64
capital-loss       int64
hours-per-week     int64
native-country    object
dtype: object

We separate the column names according to their data type

By hand

numerical_columns = ["age", "capital-gain", "capital-loss", "hours-per-week"]
categorical_columns = ['workclass', 'education', 'marital-status', 'occupation', 'relationship', 
                       'race', 'sex', 'native-country']

Select columns based on their data type

from sklearn.compose import make_column_selector as selector
categorical_columns = selector(dtype_include="object")(data_X)
categorical_columns
['workclass',
 'education',
 'marital-status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native-country']
numerical_columns = selector(dtype_include="int64")(data_X)
numerical_columns
['age', 'capital-gain', 'capital-loss', 'hours-per-week']
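Note that `dtype_include="int64"` only matches that exact dtype. The selector also accepts the broader `"number"` alias (anything pandas considers numeric), which is more robust when some numeric columns are floats. A sketch on a tiny stand-in frame:

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Tiny stand-in frame mixing numeric and object columns.
df = pd.DataFrame({
    "age": [25, 38],
    "hours-per-week": [40.0, 50.0],  # float64, not int64
    "workclass": ["Private", "Private"],
})

# "number" matches every numeric dtype (int64, float64, ...),
# so float columns are not silently dropped.
numerical = selector(dtype_include="number")(df)
categorical = selector(dtype_include=object)(df)
print(numerical, categorical)
```

With `dtype_include="int64"` on this stand-in frame, "hours-per-week" would be missed.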
all_columns = numerical_columns + categorical_columns

data_X = data_X[all_columns]
print(f"The dataset data_X contains {data_X.shape[0]} samples and "
      f"{data_X.shape[1]} columns")
The dataset data_X contains 48842 samples and 12 columns
data_X[numerical_columns].describe()
age capital-gain capital-loss hours-per-week
count 48842.000000 48842.000000 48842.000000 48842.000000
mean 38.643585 1079.067626 87.502314 40.422382
std 13.710510 7452.019058 403.004552 12.391444
min 17.000000 0.000000 0.000000 1.000000
25% 28.000000 0.000000 0.000000 40.000000
50% 37.000000 0.000000 0.000000 40.000000
75% 48.000000 0.000000 0.000000 45.000000
max 90.000000 99999.000000 4356.000000 99.000000
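The quartiles above already reveal that "capital-gain" and "capital-loss" are 0 for most samples (even the 75% quantile is 0). A quick way to quantify this, sketched on a stand-in Series shaped like the quartiles suggest:

```python
import pandas as pd

# Stand-in for the capital-gain column: mostly zeros plus a
# few large values, as the quartiles above suggest.
capital_gain = pd.Series([0, 0, 0, 0, 0, 0, 7688, 99999])

# Fraction of exact zeros; on the real data this exceeds 0.75,
# since even the third quartile is 0.
zero_fraction = (capital_gain == 0).mean()
print(zero_fraction)  # → 0.75
```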
data_X_numerical = data_X[numerical_columns]

The model

Train-test split the dataset

from sklearn.model_selection import train_test_split

data_train, data_test, target_train, target_test = train_test_split(
    data_X_numerical, 
    target_y, 
    random_state=42, 
    test_size=0.25)
print(f"Number of samples in testing: {data_test.shape[0]} => "
      f"{data_test.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")
Number of samples in testing: 12211 => 25.0% of the original set
print(f"Number of samples in training: {data_train.shape[0]} => "
      f"{data_train.shape[0] / data_X_numerical.shape[0] * 100:.1f}% of the"
      f" original set")
Number of samples in training: 36631 => 75.0% of the original set
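Because the classes are imbalanced, it can be worth passing `stratify=target_y` so that both subsets keep the original class proportions (plain random splitting only approximates them). A sketch with synthetic 80/20 labels in place of the census target:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 80 low-income / 20 high-income samples.
X = pd.DataFrame({"age": range(100)})
y = pd.Series([" <=50K"] * 80 + [" >50K"] * 20)

# stratify=y keeps the 80/20 class ratio in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=42, test_size=0.25, stratify=y)

print(y_test.value_counts(normalize=True))
```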

To display a nice model diagram

from sklearn import set_config
set_config(display='diagram')

To create a logistic regression model in scikit-learn

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()

Use the fit method to train the model using the training data and labels

model.fit(data_train, target_train)
LogisticRegression()

Use the score method to check the model's statistical performance on the test set

accuracy = model.score(data_test, target_test)
print(f"Accuracy of logistic regression: {accuracy:.3f}")
Accuracy of logistic regression: 0.807
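For context, 0.807 should be compared against the trivial baseline of always predicting the majority class, which already reaches about 0.76 on this dataset given the class proportions seen earlier. A minimal sketch with synthetic labels mirroring those proportions:

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Synthetic stand-in: features are ignored by the dummy model.
X = np.zeros((100, 1))
y = np.array(["<=50K"] * 76 + [">50K"] * 24)

# Always predicting the majority class gives the baseline
# accuracy that any real model should beat.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
print(baseline.score(X, y))  # → 0.76
```

So the logistic regression on numerical columns alone improves only modestly over the baseline.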
