diff --git a/slides/09/09.md b/slides/09/09.md
index 56a530c..c6350da 100644
--- a/slides/09/09.md
+++ b/slides/09/09.md
@@ -15,9 +15,7 @@ After this lecture you should be able to
- Implement Decision Trees and Random Forests for classification and regression
-- Explain how the splitting criterion depend on optimized loss function
-
-- Tell how Random Forests differ from Gradient Boosted Decision Trees
+- Explain how the splitting criterion depends on the optimized loss function
---
section: Decision Trees
@@ -28,7 +26,7 @@ class: section
# Decision Trees
The idea of decision trees is to partition the input space into regions and
-solving each region with a simpler model.
+solve each region with a simpler model.
~~~
We focus on **Classification and Regression Trees** (CART; Breiman et al.,
@@ -53,15 +51,15 @@ We focus on **Classification and Regression Trees** (CART; Breiman et al.,
~~~
## Training
-- Training data is stored in tree leaves -- the leaf prediction is based on what is data items are in the leaf.
+- Training data is stored in tree leaves -- the leaf prediction is based on the data items in the leaf.
- At the beginning the tree is a single leaf node.
-- Adding a node = leaf $\rightarrow$ decision node + 2 leaves
+- Adding a node = leaf $\rightarrow$ decision node + 2 leaves.
-- The goal of training = finding the most consistent leaves for the prediction
+- The goal of training = to find the most consistent leaves for prediction.
-Later, we will show that the consistency measures follow from the loss function, we are optimizing.
+Later, we will show that the consistency measures follow from the loss function we are optimizing.
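+
+A minimal sketch of one possible tree representation (the class and field names
+are only illustrative, not a reference implementation): a leaf stores the indices
+of its training examples, and splitting replaces it by a decision node with two
+child leaves.
+
+```python
+from dataclasses import dataclass
+
+@dataclass
+class Leaf:
+    indices: list  # indices of the training examples stored in this leaf
+
+@dataclass
+class DecisionNode:
+    feature: int       # which feature is tested in this node
+    threshold: float   # go left if x[feature] <= threshold, else go right
+    left: "Leaf | DecisionNode"
+    right: "Leaf | DecisionNode"
+```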
---
# Regression Decision Trees
@@ -94,9 +92,9 @@ class: middle
To split a node, the goal is to find
-1. A feature and (i.e., a for loop over all features)
+1. a feature (i.e., a for-loop over all features) and
-2. Its value (i.e., a for loop over all unique feature values)
+2. its value (i.e., a for-loop over all unique feature values)
such that when splitting a node $𝓣$ into $𝓣_L$ and $𝓣_R$, the resulting regions
decrease the overall criterion value the most, i.e., the difference $c_{𝓣_L} +
@@ -109,7 +107,7 @@ We usually employ several constraints, the most common ones are:
- **maximum tree depth**: we do not split nodes with this depth;
~~~
-- **minimum examples to split**: we only split nodes with this many training
+- **minimum examples to split**: we only split nodes with at least this many training
examples;
~~~
- **maximum number of leaf nodes**: we split until we reach the given number of
@@ -119,7 +117,7 @@ We usually employ several constraints, the most common ones are:
The tree is usually built in one of two ways:
- if the number of leaf nodes is unlimited, we usually build the tree in
a depth-first manner, recursively splitting every leaf until one
- of the above constraints is invalidated;
+   of the above constraints prevents further splitting;
~~~
- if the maximum number of leaf nodes is given, we usually split such leaf $𝓣$
where the criterion difference $c_{𝓣_L} + c_{𝓣_R} - c_𝓣$ is the lowest.
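+
+A minimal sketch of the greedy split search described above; `best_split`,
+`criterion`, and `min_to_split` are illustrative names, and `criterion` stands
+for whichever per-leaf criterion the task uses (e.g., SSE for regression):
+
+```python
+import numpy as np
+
+def best_split(X, t, indices, criterion, min_to_split=2):
+    """Try every feature and every unique value, returning the split
+    minimizing c_T_L + c_T_R - c_T, or None if the node cannot be split."""
+    if len(indices) < min_to_split:
+        return None
+    indices = np.asarray(indices)
+    c_T = criterion(t[indices])
+    best = None
+    for feature in range(X.shape[1]):
+        values = np.unique(X[indices, feature])
+        # Candidate thresholds halfway between consecutive unique values.
+        for threshold in (values[:-1] + values[1:]) / 2:
+            left = indices[X[indices, feature] <= threshold]
+            right = indices[X[indices, feature] > threshold]
+            difference = criterion(t[left]) + criterion(t[right]) - c_T
+            if best is None or difference < best[0]:
+                best = (difference, feature, threshold, left, right)
+    return best
+```
+
+Splitting a leaf then amounts to replacing it by a decision node with the returned
+feature and threshold, and two new leaves holding the `left` and `right` indices.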
@@ -148,10 +146,10 @@ For classification trees, one of the following two criteria is usually used:
- **Gini index**, also called **Gini impurity**, measuring how often a randomly
chosen element would be incorrectly labeled if it was randomly labeled
according to $→p_𝓣$:
- $$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big),$$
+ $$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big);$$
~~~
-- **Entropy Criterion**
+- **entropy criterion**:
$$c_\textrm{entropy}(𝓣) ≝ |I_𝓣| ⋅ H(→p_𝓣) = - |I_𝓣| ∑_{\substack{k\\p_𝓣(k) ≠ 0}} p_𝓣(k) \log p_𝓣(k).$$
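+
+Both criteria are easily computed from the (integer) class targets of the examples
+in a leaf; a sketch with illustrative function names, using the natural logarithm
+for the entropy:
+
+```python
+import numpy as np
+
+def gini_criterion(targets):
+    """c_Gini(T) = |I_T| * sum_k p_T(k) * (1 - p_T(k))."""
+    p = np.bincount(targets) / len(targets)
+    return len(targets) * np.sum(p * (1 - p))
+
+def entropy_criterion(targets):
+    """c_entropy(T) = -|I_T| * sum_{k: p_T(k) != 0} p_T(k) * log p_T(k)."""
+    p = np.bincount(targets) / len(targets)
+    p = p[p > 0]
+    return -len(targets) * np.sum(p * np.log(p))
+```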
---
@@ -184,12 +182,12 @@ class: section
---
# Binary Gini as (M)SE Loss
-Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$,
-let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
+Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$.
+Let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
number of examples with target value 1, and let $p_𝓣 = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} t_i = \frac{n_𝓣(1)}{n_𝓣(0) + n_𝓣(1)}$.
~~~
-Consider sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
+Consider the sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
~~~
By setting the derivative of the loss to zero, we get that the $p$ minimizing
@@ -217,17 +215,17 @@ $\displaystyle \phantom{L(p_𝓣)} = \big(n_𝓣(0) + n_𝓣(1)\big) \textcolor{
---
# Entropy as NLL Loss
-Again let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
+Again, let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
let $n_𝓣(k)$ be the number of examples with target value $k$, and let
$p_𝓣(k) = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} [t_i = k] = \frac{n_𝓣(k)}{|I_𝓣|}$.
~~~
-Consider a distribution $→p$ on $K$ classes and non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
+Consider a distribution $→p$ on $K$ classes and a non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
~~~
By setting the derivative of the loss with respect to $p_k$ to zero (using
-a Lagrangian with constraint $∑_k p_k = 1$), we get that the $→p$ minimizing the
-loss fulfills $p_k = p_𝓣(k)$.
+a Lagrangian with the constraint $∑_k p_k = 1$), we get that the $→p$ minimizing
+the loss fulfills $p_k = p_𝓣(k)$.
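+
+In more detail: grouping the loss by classes gives $L(→p) = -∑_k n_𝓣(k) \log p_k$, so the
+Lagrangian is $-∑_k n_𝓣(k) \log p_k + λ \big(∑_k p_k - 1\big)$. Setting its derivative with
+respect to $p_k$ to zero gives $-\frac{n_𝓣(k)}{p_k} + λ = 0$, i.e., $p_k = \frac{n_𝓣(k)}{λ}$,
+and the constraint $∑_k p_k = 1$ then yields $λ = ∑_k n_𝓣(k) = |I_𝓣|$, so indeed
+$p_k = \frac{n_𝓣(k)}{|I_𝓣|} = p_𝓣(k)$.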
~~~
The value of the loss with respect to $→p_𝓣$ is then
@@ -259,7 +257,7 @@ called _feature bagging_).
## Bagging
-Every decision tree is trained using bagging (on a bootstrapped dataset).
+Every decision tree is trained using bagging (i.e., on a bootstrapped dataset).
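+
+A bootstrapped dataset is obtained by sampling training examples with replacement;
+a minimal sketch (`bootstrap_dataset` is an illustrative helper, not a library function):
+
+```python
+import numpy as np
+
+def bootstrap_dataset(data, targets, generator):
+    """Sample len(data) examples with replacement -- one such dataset per tree."""
+    indices = generator.choice(len(data), size=len(data), replace=True)
+    return data[indices], targets[indices]
+```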
~~~
## Random Subset of Features
@@ -289,5 +287,3 @@ After this lecture you should be able to
- Implement Decision Trees and Random Forests for classification and regression
-- Explain how the splitting criterion depends on optimized loss function
+- Explain how the splitting criterion depends on the optimized loss function
-
-- Tell how Random Forests differ from Gradient Boosted Decision Trees