forked from INRIA/scikit-learn-mooc
-
Notifications
You must be signed in to change notification settings - Fork 0
Expand file tree
/
Copy pathtrees_sol_01.py
More file actions
126 lines (101 loc) · 4.04 KB
/
trees_sol_01.py
File metadata and controls
126 lines (101 loc) · 4.04 KB
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
# coding: utf-8
# %% [markdown]
# # 📃 Solution for Exercise M5.01
#
# In the previous notebook, we showed how a tree with a depth of 1 level was
# working. The aim of this exercise is to repeat part of the previous
# experiment for a depth with 2 levels to show how the process of partitioning
# is repeated over time.
#
# Before to start, we will:
#
# * load the dataset;
# * split the dataset into training and testing dataset;
# * define the function to show the classification decision function.
# %%
import pandas as pd
penguins = pd.read_csv("../datasets/penguins_classification.csv")
culmen_columns = ["Culmen Length (mm)", "Culmen Depth (mm)"]
target_column = "Species"
# %% [markdown]
# ```{note}
# If you want a deeper overview regarding this dataset, you can refer to the
# Appendix - Datasets description section at the end of this MOOC.
# ```
# %%
from sklearn.model_selection import train_test_split
data, target = penguins[culmen_columns], penguins[target_column]
data_train, data_test, target_train, target_test = train_test_split(
data, target, random_state=0
)
range_features = {
feature_name: (data[feature_name].min() - 1, data[feature_name].max() + 1)
for feature_name in data.columns
}
# %%
import numpy as np
import matplotlib.pyplot as plt
def plot_decision_function(fitted_classifier, range_features, ax=None):
"""Plot the boundary of the decision function of a classifier."""
from sklearn.preprocessing import LabelEncoder
feature_names = list(range_features.keys())
# create a grid to evaluate all possible samples
plot_step = 0.02
xx, yy = np.meshgrid(
np.arange(*range_features[feature_names[0]], plot_step),
np.arange(*range_features[feature_names[1]], plot_step),
)
# compute the associated prediction
Z = fitted_classifier.predict(np.c_[xx.ravel(), yy.ravel()])
Z = LabelEncoder().fit_transform(Z)
Z = Z.reshape(xx.shape)
# make the plot of the boundary and the data samples
if ax is None:
_, ax = plt.subplots()
ax.contourf(xx, yy, Z, alpha=0.4, cmap="RdBu")
return ax
# %% [markdown]
# Create a decision tree classifier with a maximum depth of 2 levels and fit
# the training data. Once this classifier trained, plot the data and the
# decision boundary to see the benefit of increasing the depth.
# %%
# solution
from sklearn.tree import DecisionTreeClassifier
tree = DecisionTreeClassifier(max_depth=2)
tree.fit(data_train, target_train)
# %% tags=["solution"]
import seaborn as sns
palette = ["tab:red", "tab:blue", "black"]
ax = sns.scatterplot(data=penguins, x=culmen_columns[0], y=culmen_columns[1],
hue=target_column, palette=palette)
plot_decision_function(tree, range_features, ax=ax)
plt.legend(bbox_to_anchor=(1.05, 1), loc='upper left')
_ = plt.title("Decision boundary using a decision tree")
# %% [markdown]
# Did we make use of the feature "Culmen Length"?
# Plot the tree using the function `sklearn.tree.plot_tree` to find out!
# %%
# solution
from sklearn.tree import plot_tree
_, ax = plt.subplots(figsize=(16, 12))
_ = plot_tree(tree, feature_names=culmen_columns,
class_names=tree.classes_, impurity=False, ax=ax)
# %% [markdown] tags=["solution"]
# We see that the second tree level used the "Culmen Length" to make
# two new decisions. Qualitatively, we saw that such a simple tree was enough
# to classify the penguins' species.
# %% [markdown]
# Compute the accuracy of the decision tree on the testing data.
# %%
# solution
test_score = tree.fit(data_train, target_train).score(data_test, target_test)
print(f"Accuracy of the DecisionTreeClassifier: {test_score:.2f}")
# %% [markdown] tags=["solution"]
# At this stage, we have the intuition that a decision tree is built by
# successively partitioning the feature space, considering one feature at a
# time.
#
# We predict an Adelie penguin if the feature value is below the threshold,
# which is not surprising since this partition was almost pure. If the feature
# value is above the threshold, we predict the Gentoo penguin, the class that
# is most probable.