diff --git a/slides/09/09.md b/slides/09/09.md
index 56a530c..c6350da 100644
--- a/slides/09/09.md
+++ b/slides/09/09.md
@@ -15,9 +15,7 @@ After this lecture you should be able to
 
 - Implement Decision Trees and Random Forests for classification and regression
 
-- Explain how the splitting criterion depend on optimized loss function
-
-- Tell how Random Forests differ from Gradient Boosted Decision Trees
+- Explain how the splitting criterion depends on the optimized loss function
 
 ---
 section: Decision Trees
@@ -28,7 +26,7 @@ class: section
 # Decision Trees
 
 The idea of decision trees is to partition the input space into regions and
-solving each region with a simpler model.
+solve each region with a simpler model.
 
 ~~~
 We focus on **Classification and Regression Trees** (CART; Breiman et al.,
@@ -53,15 +51,15 @@ We focus on **Classification and Regression Trees** (CART; Breiman et al.,
 ~~~
 ## Training
 
-- Training data is stored in tree leaves -- the leaf prediction is based on what is data items are in the leaf.
+- Training data is stored in tree leaves – the leaf prediction is based on what data items are in the leaf.
 
 - At the beginning the tree is a single leaf node.
 
-- Adding a node = leaf $\rightarrow$ decision node + 2 leaves
+- Adding a node = leaf $\rightarrow$ decision node + 2 leaves.
 
-- The goal of training = finding the most consistent leaves for the prediction
+- The goal of training = to find the most consistent leaves for the prediction.
 
-Later, we will show that the consistency measures follow from the loss function, we are optimizing.
+Later, we will show that the consistency measures follow from the loss function we are optimizing.
 
 ---
 # Regression Decision Trees
@@ -94,9 +92,9 @@ class: middle
 
 To split a node, the goal is to find
 
-1. A feature and (i.e., a for loop over all features)
+1. a feature (i.e., a for-loop over all features) and
 
-2. Its value (i.e., a for loop over all unique feature values)
+2. its value (i.e., a for-loop over all unique feature values)
 
 such that when splitting a node $𝓣$ into $𝓣_L$ and $𝓣_R$, the resulting regions
 decrease the overall criterion value the most, i.e., the difference $c_{𝓣_L} +
@@ -109,7 +107,7 @@ We usually employ several constraints, the most common ones are:
 - **maximum tree depth**: we do not split nodes with this depth;
 
 ~~~
-- **minimum examples to split**: we only split nodes with this many training
+- **minimum examples to split**: we do not split nodes with fewer than this many training
   examples;
 ~~~
 - **maximum number of leaf nodes**: we split until we reach the given number of
@@ -119,7 +117,7 @@ We usually employ several constraints, the most common ones are:
 The tree is usually built in one of two ways:
 - if the number of leaf nodes is unlimited, we usually build the tree in
   a depth-first manner, recursively splitting every leaf until one
-  of the above constraints is invalidated;
+  of the above constraints is met;
 ~~~
 - if the maximum number of leaf nodes is given, we usually split such leaf $𝓣$
   where the criterion difference $c_{𝓣_L} + c_{𝓣_R} - c_𝓣$ is the lowest.
@@ -148,10 +146,10 @@ For classification trees, one of the following two criteria is usually used:
 - **Gini index**, also called **Gini impurity**, measuring how often a randomly
   chosen element would be incorrectly labeled if it was randomly labeled
   according to $→p_𝓣$:
-  $$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big),$$
+  $$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big);$$
 
 ~~~
-- **Entropy Criterion**
+- **entropy criterion**
   $$c_\textrm{entropy}(𝓣) ≝ |I_𝓣| ⋅ H(→p_𝓣) = - |I_𝓣| ∑_{\substack{k\\p_𝓣(k) ≠ 0}} p_𝓣(k) \log p_𝓣(k).$$
 
 ---
@@ -184,12 +182,12 @@ class: section
 ---
 # Binary Gini as (M)SE Loss
 
-Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$,
-let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
+Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$.
+Let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
 number of examples with target value 1, and let $p_𝓣 = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} t_i = \frac{n_𝓣(1)}{n_𝓣(0) + n_𝓣(1)}$.
 
 ~~~
-Consider sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
+Consider the sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
 
 ~~~
 By setting the derivative of the loss to zero, we get that the $p$ minimizing
@@ -217,17 +215,17 @@ $\displaystyle \phantom{L(p_𝓣)} = \big(n_𝓣(0) + n_𝓣(1)\big) \textcolor{
 ---
 # Entropy as NLL Loss
 
-Again let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
+Again, let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
 let $n_𝓣(k)$ be the number of examples with target value $k$, and let
 $p_𝓣(k) = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} [t_i = k] = \frac{n_𝓣(k)}{|I_𝓣|}$.
 
 ~~~
-Consider a distribution $→p$ on $K$ classes and non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
+Consider a distribution $→p$ on $K$ classes and a non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
 
 ~~~
 By setting the derivative of the loss with respect to $p_k$ to zero (using
-a Lagrangian with constraint $∑_k p_k = 1$), we get that the $→p$ minimizing the
-loss fulfills $p_k = p_𝓣(k)$.
+a Lagrangian with the constraint $∑_k p_k = 1$), we get that the $→p$ minimizing
+the loss fulfills $p_k = p_𝓣(k)$.
 
 ~~~
 The value of the loss with respect to $→p_𝓣$ is then
@@ -259,7 +257,7 @@ called _feature bagging_).
 
 ## Bagging
 
-Every decision tree is trained using bagging (on a bootstrapped dataset).
+Every decision tree is trained using bagging (i.e., on a bootstrapped dataset).
 
 ~~~
 ## Random Subset of Features
@@ -289,5 +287,3 @@ After this lecture you should be able to
 - Implement Decision Trees and Random Forests for classification and regression
 
-- Explain how the splitting criterion depends on optimized loss function
-
-- Tell how Random Forests differ from Gradient Boosted Decision Trees
+- Explain how the splitting criterion depends on the optimized loss function
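
As a companion to the patched slides (not part of the patch itself), the following minimal NumPy sketch illustrates the split search the slides describe: a for-loop over features and over their unique values, choosing the candidate that minimizes $c_{𝓣_L} + c_{𝓣_R} - c_𝓣$, here with the Gini criterion $c_\textrm{Gini}(𝓣) = |I_𝓣| ∑_k p_𝓣(k)\big(1 - p_𝓣(k)\big)$. The names `gini` and `best_split` are made up for this example and are not taken from the course code.

```python
# Illustrative sketch only: the split search described in the slides,
# using the Gini criterion. All names are hypothetical.
import numpy as np

def gini(targets):
    """Gini criterion of a leaf: |I_T| * sum_k p_T(k) * (1 - p_T(k))."""
    if len(targets) == 0:
        return 0.0
    p = np.bincount(targets) / len(targets)
    return len(targets) * np.sum(p * (1 - p))

def best_split(data, targets):
    """Return (feature, value, difference) minimizing c_{T_L} + c_{T_R} - c_T."""
    c_parent = gini(targets)
    best = None
    for feature in range(data.shape[1]):             # for-loop over all features
        for value in np.unique(data[:, feature]):    # for-loop over all unique values
            left = data[:, feature] <= value
            if left.all():                           # skip splits with an empty right child
                continue
            difference = gini(targets[left]) + gini(targets[~left]) - c_parent
            if best is None or difference < best[2]:
                best = (feature, value, difference)
    return best

# Tiny usage example with two features and binary targets.
X = np.array([[1., 0.], [2., 1.], [3., 0.], [4., 1.]])
t = np.array([0, 0, 1, 1])
print(best_split(X, t))  # feature 0, threshold 2.0, criterion difference -2.0
```

Since $c_𝓣$ is constant for the node being split, minimizing the difference is equivalent to minimizing $c_{𝓣_L} + c_{𝓣_R}$; the subtraction is kept only to mirror the slides' formulation.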