Auto-scrolling

To disable auto-scrolling, run this JavaScript in a notebook cell before executing any other cell (source: Stack Overflow)

%%javascript
IPython.OutputArea.prototype._should_scroll = function(lines) {
    return false;
}

Imports

import pandas as pd
import seaborn as sns
myDataFrame = pd.read_csv("../../scikit-learn-mooc/datasets/penguins_classification.csv")

First analysis

print(f"The dataset contains {myDataFrame.shape[0]} samples and "
      f"{myDataFrame.shape[1]} columns")
The dataset contains 342 samples and 3 columns
myDataFrame.columns
Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')
myDataFrame.head()
   Culmen Length (mm)  Culmen Depth (mm) Species
0                39.1               18.7  Adelie
1                39.5               17.4  Adelie
2                40.3               18.0  Adelie
3                36.7               19.3  Adelie
4                39.3               20.6  Adelie

Which column is our target to predict?

target_column = 'Species'
myDataFrame[target_column].value_counts()
Adelie       151
Gentoo       123
Chinstrap     68
Name: Species, dtype: int64
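
The counts above show that the classes are not balanced. As a minimal sketch, the proportions can be computed from those counts; the Series below is rebuilt by hand from the printed output (an assumption made so the snippet runs without the CSV file). On the real data, `myDataFrame[target_column].value_counts(normalize=True)` should give the same proportions.

```python
import pandas as pd

# Class counts copied by hand from the value_counts() output above.
counts = pd.Series({"Adelie": 151, "Gentoo": 123, "Chinstrap": 68}, name="Species")

# Share of each class among all samples.
proportions = counts / counts.sum()
print(proportions.round(3))

# Ratio between the most and least frequent class (a rough imbalance measure).
imbalance_ratio = counts.max() / counts.min()
print(f"Imbalance ratio: {imbalance_ratio:.2f}")
```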

Separation between numerical and categorical columns

Type of the objects

myDataFrame.dtypes
Culmen Length (mm)    float64
Culmen Depth (mm)     float64
Species                object
dtype: object

We group the column names according to their type

numerical_columns = ['Culmen Length (mm)', 'Culmen Depth (mm)']
categorical_columns = []
all_columns = numerical_columns + categorical_columns + [target_column]

myDataFrame = myDataFrame[all_columns]
myDataFrame.columns
Index(['Culmen Length (mm)', 'Culmen Depth (mm)', 'Species'], dtype='object')
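
The lists above are written by hand. A possible alternative, sketched here under assumptions, is to derive them from the dtypes with `select_dtypes`. The `toy` frame is a hypothetical stand-in that reuses the column names and the first two rows shown by `head()`, so the snippet runs without the CSV file.

```python
import pandas as pd

# Hypothetical stand-in for myDataFrame: same column names and dtypes,
# values taken from the first two rows printed by head().
toy = pd.DataFrame({
    "Culmen Length (mm)": [39.1, 39.5],
    "Culmen Depth (mm)": [18.7, 17.4],
    "Species": ["Adelie", "Adelie"],
})
target_column = "Species"

# Leave the target out before splitting the features by dtype.
features = toy.drop(columns=[target_column])
numerical_columns = features.select_dtypes(include="number").columns.tolist()
categorical_columns = features.select_dtypes(exclude="number").columns.tolist()

print(numerical_columns)
print(categorical_columns)
```

With more columns this avoids typing each name, but the result is still worth checking by eye: a category stored as integer codes would land in the numerical list.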

Range and distribution of the data

Note: assigning a result to "_" stores it in a variable that we will not reuse; in a notebook this also keeps the return value of the plotting call from being displayed
myDataFrame[numerical_columns].describe()
       Culmen Length (mm)  Culmen Depth (mm)
count          342.000000         342.000000
mean            43.921930          17.151170
std              5.459584           1.974793
min             32.100000          13.100000
25%             39.225000          15.600000
50%             44.450000          17.300000
75%             48.500000          18.700000
max             59.600000          21.500000
_ = myDataFrame.hist(figsize=(10, 5))

Same with seaborn

seaborn.pairplot

_ = sns.pairplot(myDataFrame)
_ = sns.pairplot(myDataFrame, height=4, hue=target_column, corner=True)

Same plot, with density contour lines overlaid on the scatter plots

g = sns.pairplot(myDataFrame, height=4, hue=target_column, corner=True)
g.map_lower(sns.kdeplot, levels=3, color=".2");

Crosstab

A cross-tabulation helps detect columns that contain the same information in two different forms (and are therefore correlated). When that is the case, one of the two columns can be dropped.

Here, we see no such link between the two columns

pd.crosstab(index=myDataFrame[numerical_columns[0]],
            columns=myDataFrame[numerical_columns[1]])
Culmen Depth (mm) 13.1 13.2 13.3 13.4 13.5 13.6 13.7 13.8 13.9 14.0 ... 20.1 20.2 20.3 20.5 20.6 20.7 20.8 21.1 21.2 21.5
Culmen Length (mm)
32.1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
33.1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
33.5 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
34.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
34.1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
55.1 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
55.8 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
55.9 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
58.0 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
59.6 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

164 rows × 80 columns
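
For contrast, here is a minimal sketch of what a crosstab looks like when two columns do carry the same information. The `toy` frame and its columns `size_label` and `size_code` are hypothetical and are not part of the penguins dataset.

```python
import pandas as pd

# Hypothetical data: "size_code" is just a numeric encoding of "size_label".
toy = pd.DataFrame({
    "size_label": ["small", "large", "medium", "small", "large", "medium"],
    "size_code": [1, 3, 2, 1, 3, 2],
})

table = pd.crosstab(index=toy["size_label"], columns=toy["size_code"])
print(table)

# The columns are redundant when every row and every column of the table
# has exactly one non-zero cell (a one-to-one mapping between the values).
nonzero = table > 0
is_redundant = bool((nonzero.sum(axis=1) == 1).all()
                    and (nonzero.sum(axis=0) == 1).all())
print(is_redundant)
```

In the penguins crosstab above, the non-zero cells are scattered over a 164 × 80 table with no such one-to-one pattern, which is consistent with keeping both columns.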