Minor bugs in 09.md #239

Open: wants to merge 1 commit into master

44 changes: 20 additions & 24 deletions slides/09/09.md
@@ -15,9 +15,7 @@ After this lecture you should be able to

- Implement Decision Trees and Random Forests for classification and regression

- Explain how the splitting criterion depend on optimized loss function

- Tell how Random Forests differ from Gradient Boosted Decision Trees
- Explain how the splitting criterion depends on the optimized loss function

---
section: Decision Trees
@@ -28,7 +26,7 @@ class: section
# Decision Trees

The idea of decision trees is to partition the input space into regions and
solving each region with a simpler model.
solve each region with a simpler model.
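
For a concrete picture, here is a minimal sketch using scikit-learn (the synthetic dataset and the hyperparameters are arbitrary choices, not taken from the slides): a regression tree partitions the 1-D input into a few intervals and predicts a constant in each.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# With 5 leaves, the input interval is partitioned into 5 regions,
# each predicted by a single constant (the mean of its training targets).
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=42).fit(X, y)
print(tree.tree_.n_leaves, "regions with constant predictions:", np.unique(tree.predict(X)))
```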

~~~
We focus on **Classification and Regression Trees** (CART; Breiman et al.,
@@ -53,15 +51,15 @@ We focus on **Classification and Regression Trees** (CART; Breiman et al.,
~~~
## Training

- Training data is stored in tree leaves -- the leaf prediction is based on what is data items are in the leaf.
- Training data is stored in tree leaves -- the leaf prediction is based on what data items are in the leaf.

- At the beginning the tree is a single leaf node.

- Adding a node = leaf $\rightarrow$ decision node + 2 leaves
- Adding a node = leaf $\rightarrow$ decision node + 2 leaves.

- The goal of training = finding the most consistent leaves for the prediction
- The goal of training = to find the most consistent leaves for the prediction.

Later, we will show that the consistency measures follow from the loss function, we are optimizing.
Later, we will show that the consistency measures follow from the loss function we are optimizing.
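
To make the leaf-based prediction concrete, a minimal sketch (the function names are illustrative, not from the course code) of how a leaf's prediction follows from the training examples stored in it:

```python
import numpy as np

def regression_leaf_prediction(targets):
    # Under the squared error loss, the best constant prediction is the mean.
    return np.mean(targets)

def classification_leaf_prediction(targets, n_classes):
    # For classification, the leaf predicts the class distribution (or its argmax).
    counts = np.bincount(targets, minlength=n_classes)
    return counts / counts.sum()

print(regression_leaf_prediction(np.array([1.0, 2.0, 4.0])))       # 2.333...
print(classification_leaf_prediction(np.array([0, 1, 1, 2]), 3))   # [0.25 0.5 0.25]
```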

---
# Regression Decision Trees
@@ -94,9 +92,9 @@ class: middle

To split a node, the goal is to find

1. A feature and <small>(i.e., a for loop over all features)</small>
1. a feature and <small>(i.e., a for-loop over all features)</small>

2. Its value <small>(i.e., a for loop over all unique feature values)</small>
2. its value <small>(i.e., a for-loop over all unique feature values)</small>

such that when splitting a node $𝓣$ into $𝓣_L$ and $𝓣_R$, the resulting regions
decrease the overall criterion value the most, i.e., the difference $c_{𝓣_L} +
@@ -109,7 +107,7 @@ We usually employ several constraints, the most common ones are:
- **maximum tree depth**: we do not split nodes with this depth;

~~~
- **minimum examples to split**: we only split nodes with this many training
- **minimum examples to split**: we do not split nodes with fewer than this many training
examples;
~~~
- **maximum number of leaf nodes**: we split until we reach the given number of
@@ -119,7 +117,7 @@ We usually employ several constraints, the most common ones are:
The tree is usually built in one of two ways:
- if the number of leaf nodes is unlimited, we usually build the tree in
a depth-first manner, recursively splitting every leaf until one
of the above constraints is invalidated;
of the above constraints is met;
~~~
- if the maximum number of leaf nodes is given, we usually split such leaf $𝓣$
where the criterion difference $c_{𝓣_L} + c_{𝓣_R} - c_𝓣$ is the lowest.
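
A minimal sketch of the split search described above, for a regression tree with the sum-of-squares criterion (illustrative code, not the reference implementation; the names and the toy data are arbitrary):

```python
import numpy as np

def sse_criterion(t):
    # c_T for regression: the sum of squared deviations from the leaf mean.
    return float(((t - t.mean()) ** 2).sum()) if len(t) else 0.0

def best_split(X, t):
    # For-loop over all features and all their unique values, keeping the split
    # for which c_{T_L} + c_{T_R} - c_T is the most negative.
    c_parent = sse_criterion(t)
    best = None  # (criterion difference, feature, threshold)
    for feature in range(X.shape[1]):
        for value in np.unique(X[:, feature]):
            left = X[:, feature] <= value
            if left.all() or not left.any():
                continue  # both children must be non-empty
            difference = sse_criterion(t[left]) + sse_criterion(t[~left]) - c_parent
            if best is None or difference < best[0]:
                best = (difference, feature, value)
    return best

rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 3))
t = 2 * X[:, 1] + rng.normal(scale=0.05, size=50)
print(best_split(X, t))  # the chosen feature should almost surely be feature 1
```
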
Expand Down Expand Up @@ -148,10 +146,10 @@ For classification trees, one of the following two criteria is usually used:
- **Gini index**, also called **Gini impurity**, measuring how often a randomly
chosen element would be incorrectly labeled if it was randomly labeled
according to $→p_𝓣$:
$$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big),$$
$$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big);$$

~~~
- **Entropy Criterion**
- **entropy criterion**
$$c_\textrm{entropy}(𝓣) ≝ |I_𝓣| ⋅ H(→p_𝓣) = - |I_𝓣| ∑_{\substack{k\\p_𝓣(k) ≠ 0}} p_𝓣(k) \log p_𝓣(k).$$
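
Both criteria can be computed directly from the per-class counts in a leaf; a short illustrative sketch (not part of the slides):

```python
import numpy as np

def gini_criterion(counts):
    # c_Gini(T) = |I_T| * sum_k p_T(k) * (1 - p_T(k))
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return counts.sum() * np.sum(p * (1 - p))

def entropy_criterion(counts):
    # c_entropy(T) = |I_T| * H(p_T), summing only over classes with p_T(k) != 0
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -counts.sum() * np.sum(p * np.log(p))

print(gini_criterion([10, 10]), entropy_criterion([10, 10]))  # 10.0 and 20*log(2) ≈ 13.86
print(gini_criterion([20, 0]), entropy_criterion([20, 0]))    # both zero for a pure leaf
```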

---
@@ -184,12 +182,12 @@ class: section
---
# Binary Gini as (M)SE Loss

Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$,
let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$.
Let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
number of examples with target value 1, and let $p_𝓣 = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} t_i = \frac{n_𝓣(1)}{n_𝓣(0) + n_𝓣(1)}$.

~~~
Consider sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
Consider the sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.

~~~
By setting the derivative of the loss to zero, we get that the $p$ minimizing
@@ -217,17 +215,17 @@ $\displaystyle \phantom{L(p_𝓣)} = \big(n_𝓣(0) + n_𝓣(1)\big) \textcolor{
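
The correspondence can be checked numerically; a small sketch with arbitrary binary targets (the SSE loss at its minimizer $p_𝓣$ equals $|I_𝓣|\,p_𝓣(1-p_𝓣)$, and the binary Gini criterion is exactly twice that):

```python
import numpy as np

t = np.array([0, 0, 1, 1, 1, 0, 1, 1])       # binary targets of one leaf
p = t.mean()                                  # p_T, the SSE-minimizing constant prediction
sse_loss = np.sum((p - t) ** 2)               # L(p_T) = |I_T| * p_T * (1 - p_T)

n0, n1 = np.sum(t == 0), np.sum(t == 1)
probs = np.array([n0, n1]) / len(t)
gini = len(t) * np.sum(probs * (1 - probs))   # c_Gini(T) = 2 * |I_T| * p_T * (1 - p_T)

print(sse_loss, gini, gini / sse_loss)        # 1.875 3.75 2.0 -- they differ only by a constant factor
```
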
---
# Entropy as NLL Loss

Again let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
Again, let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
let $n_𝓣(k)$ be the number of examples with target value $k$, and let
$p_𝓣(k) = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} [t_i = k] = \frac{n_𝓣(k)}{|I_𝓣|}$.

~~~
Consider a distribution $→p$ on $K$ classes and non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
Consider a distribution $→p$ on $K$ classes and a non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.

~~~
By setting the derivative of the loss with respect to $p_k$ to zero (using
a Lagrangian with constraint $∑_k p_k = 1$), we get that the $→p$ minimizing the
loss fulfills $p_k = p_𝓣(k)$.
a Lagrangian with the constraint $∑_k p_k = 1$), we get that the $→p$ minimizing
the loss fulfills $p_k = p_𝓣(k)$.
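
A small numeric check of this claim with arbitrary targets (the helper is illustrative, not from the slides): the NLL is minimized by the empirical class proportions, and its value there is $|I_𝓣| ⋅ H(→p_𝓣)$.

```python
import numpy as np

t = np.array([0, 0, 1, 2, 2, 2])                     # targets of one leaf, K = 3 classes
counts = np.bincount(t, minlength=3)
p_leaf = counts / len(t)                              # p_T(k) = n_T(k) / |I_T|

def nll(p):
    # Non-averaged NLL: L(p) = sum_i -log p_{t_i}
    return -np.sum(np.log(p[t]))

entropy_criterion = -len(t) * np.sum(p_leaf[p_leaf > 0] * np.log(p_leaf[p_leaf > 0]))

print(nll(p_leaf), entropy_criterion)                 # both ≈ 6.068, i.e., |I_T| * H(p_T)
print(nll(np.array([1/3, 1/3, 1/3])) > nll(p_leaf))   # True: a different distribution gives a larger loss
```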

~~~
The value of the loss with respect to $→p_𝓣$ is then
@@ -259,7 +257,7 @@ called _feature bagging_).

## Bagging

Every decision tree is trained using bagging (on a bootstrapped dataset).
Every decision tree is trained using bagging (i.e., on a bootstrapped dataset).
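
A minimal sketch of this with scikit-learn decision trees (illustrative only; the dataset, the number of trees, and the use of `max_features="sqrt"` for the per-split feature subset described below are my own choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)

trees = []
for _ in range(100):
    # Bagging: every tree is fit on a bootstrap sample drawn with replacement.
    indices = rng.randint(len(X), size=len(X))
    # max_features="sqrt" restricts every split to a random subset of features.
    trees.append(DecisionTreeClassifier(max_features="sqrt", random_state=rng)
                 .fit(X[indices], y[indices]))

# The forest predicts by majority vote of the individual trees.
votes = np.stack([tree.predict(X) for tree in trees])
prediction = np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
print("Training accuracy:", (prediction == y).mean())
```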

~~~
## Random Subset of Features
@@ -289,5 +287,3 @@ After this lecture you should be able to
- Implement Decision Trees and Random Forests for classification and regression

- Explain how the splitting criterion depends on optimized loss function

- Tell how Random Forests differ from Gradient Boosted Decision Trees