Auto-scrolling

To disable auto-scrolling of long cell outputs, run this JavaScript in a notebook cell before any other cell is executed (source: Stack Overflow).

%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Imports

import pandas as pd
import seaborn as sns

myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

First analysis

print(f"The dataset contains {myDataFrame.shape[0]} samples and "
      f"{myDataFrame.shape[1]} columns")

The dataset contains 342 samples and 3 columns

myDataFrame.columns

Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')

myDataFrame.head()

Which column is our target to predict?

target_column = 'Species'

myDataFrame[target_column].value_counts()

Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64
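
The counts above are easier to compare as proportions, which show how imbalanced the classes are. A minimal sketch, assuming we rebuild a stand-in Species column from the counts printed above instead of reading the CSV file (the variable names are illustrative):

```python
import pandas as pd

# Stand-in for myDataFrame["Species"], rebuilt from the counts shown above.
species = pd.Series(
    ["Adelie"] * 151 + ["Gentoo"] * 123 + ["Chinstrap"] * 68,
    name="Species",
)

# normalize=True returns the fraction of samples per class instead of raw counts.
proportions = species.value_counts(normalize=True).round(3)
print(proportions)
```

On the real data, the same call would be myDataFrame[target_column].value_counts(normalize=True).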

Separation between numerical and categorical columns

Data type of each column

myDataFrame.dtypes

Culmen Length (mm)    float64
Culmen Depth (mm)     float64
Species                object
dtype: object

We group the column names according to their type

numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
categorical_columns = []
all_columns = numerical_columns + categorical_columns + [target_column]

myDataFrame = myDataFrame[all_columns]
myDataFrame.columns

Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')
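
The two lists above are written by hand from the dtypes output. With more columns, they could be derived from the dtypes instead. A minimal sketch, assuming a hypothetical helper split_columns_by_dtype and a small toy frame with illustrative values (not the real dataset):

```python
import pandas as pd

def split_columns_by_dtype(frame, target):
    """Return (numerical, categorical) feature names, leaving out the target."""
    features = frame.drop(columns=[target])
    numerical = features.select_dtypes(include="number").columns.tolist()
    categorical = features.select_dtypes(exclude="number").columns.tolist()
    return numerical, categorical

# Toy frame with the same column names as the penguins file; values are illustrative.
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 46.5, 49.0],
    "Culmen Depth (mm)": [18.7, 17.9, 15.2],
    "Species": ["Adelie", "Chinstrap", "Gentoo"],
})

numerical_cols, categorical_cols = split_columns_by_dtype(toy, "Species")
print(numerical_cols, categorical_cols)
```

This treats every non-numeric feature as categorical, which is a simplification: integer-coded categories would still need to be moved by hand.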

To inspect the range and distribution of the numerical data

Note: assigning the result to "_" stores it in a throwaway variable that we will not reuse, so the notebook does not print the returned object

myDataFrame[numerical_columns].describe()

_ = myDataFrame.hist(figsize=(10, 5))

Same with seaborn, using seaborn.pairplot

_ = sns.pairplot(myDataFrame)

To detect links between the features and the target column, we color the points by target class

_ = sns.pairplot(myDataFrame, height=4, hue=target_column, corner=True)

Same plot, with density contours overlaid to outline where similar points concentrate

g = sns.pairplot(myDataFrame, height=4, hue=target_column, corner=True)
g.map_lower(sns.kdeplot, levels=3, color=".2");

Crosstab

Useful for detecting columns that carry the same information in two different forms (and are therefore correlated). When that is the case, one of the two columns can be dropped.

Here, we don't see this kind of link between the two culmen measurements

pd.crosstab(index=myDataFrame[numerical_columns[0]],
            columns=myDataFrame[numerical_columns[1]])
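
For comparison, this is what a crosstab looks like when two columns do carry the same information. A minimal sketch on a toy frame: the species_code column and its values are hypothetical and are not part of the penguins file.

```python
import pandas as pd

# Toy frame: "species_code" is just another spelling of "species".
toy = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Adelie", "Chinstrap", "Gentoo"],
    "species_code": ["A", "G", "A", "C", "G"],
})

table = pd.crosstab(index=toy["species"], columns=toy["species_code"])
print(table)

# If each row has exactly one non-zero cell, every value of "species" maps to
# a single value of "species_code": the second column adds no information.
redundant = bool(((table > 0).sum(axis=1) == 1).all())
print(redundant)
```

When the check holds, one of the two columns can be dropped, as described above.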

Output of myDataFrame.head():

	Culmen Length (mm)	Culmen Depth (mm)	Species
0	39.1	18.7	Adelie
1	39.5	17.4	Adelie
2	40.3	18.0	Adelie
3	36.7	19.3	Adelie
4	39.3	20.6	Adelie

Output of myDataFrame[numerical_columns].describe():

	Culmen Length (mm)	Culmen Depth (mm)
count	342.000000	342.000000
mean	43.921930	17.151170
std	5.459584	1.974793
min	32.100000	13.100000
25%	39.225000	15.600000
50%	44.450000	17.300000
75%	48.500000	18.700000
max	59.600000	21.500000

Output of pd.crosstab (truncated display):

Culmen Depth (mm)	13.1	13.2	13.3	13.4	13.5	13.6	13.7	13.8	13.9	14.0	...	20.1	20.2	20.3	20.5	20.6	20.7	20.8	21.1	21.2	21.5
Culmen Length (mm)
32.1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
33.1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
33.5	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
34.0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
34.1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
55.1	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
55.8	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
55.9	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
58.0	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
59.6	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0