Minor bugs in 09.md #239

Open: wants to merge 1 commit into master

44 changes: 20 additions & 24 deletions slides/09/09.md
@@ -15,9 +15,7 @@ After this lecture you should be able to

- Implement Decision Trees and Random Forests for classification and regression

- Explain how the splitting criterion depend on optimized loss function

- Tell how Random Forests differ from Gradient Boosted Decision Trees
- Explain how the splitting criterion depends on the optimized loss function

---
section: Decision Trees
@@ -28,7 +26,7 @@ class: section
# Decision Trees

The idea of decision trees is to partition the input space into regions and
solving each region with a simpler model.
solve each region with a simpler model.
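
For a concrete picture, here is a minimal sketch using scikit-learn (the synthetic dataset and the hyperparameters are arbitrary choices, not taken from the slides): a regression tree partitions the 1-D input into a few intervals and predicts a constant in each.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(42)
X = rng.uniform(0, 6, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

# With 5 leaves, the input interval is partitioned into 5 regions,
# each predicted by a single constant (the mean of its training targets).
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=42).fit(X, y)
print(tree.tree_.n_leaves, "regions with constant predictions:", np.unique(tree.predict(X)))
```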

~~~
We focus on **Classification and Regression Trees** (CART; Breiman et al.,
@@ -53,15 +51,15 @@ We focus on **Classification and Regression Trees** (CART; Breiman et al.,
~~~
## Training

- Training data is stored in tree leaves -- the leaf prediction is based on what is data items are in the leaf.
- Training data is stored in tree leaves -- the leaf prediction is based on what data items are in the leaf.

- At the beginning the tree is a single leaf node.

- Adding a node = leaf $\rightarrow$ decision node + 2 leaves
- Adding a node = leaf $\rightarrow$ decision node + 2 leaves.

- The goal of training = finding the most consistent leaves for the prediction
- The goal of training = to find the most consistent leaves for the prediction.

Later, we will show that the consistency measures follow from the loss function, we are optimizing.
Later, we will show that the consistency measures follow from the loss function we are optimizing.
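
To make the leaf-based prediction concrete, a minimal sketch (the function names are illustrative, not from the course code) of how a leaf's prediction follows from the training examples stored in it:

```python
import numpy as np

def regression_leaf_prediction(targets):
    # Under the squared error loss, the best constant prediction is the mean.
    return np.mean(targets)

def classification_leaf_prediction(targets, n_classes):
    # For classification, the leaf predicts the class distribution (or its argmax).
    counts = np.bincount(targets, minlength=n_classes)
    return counts / counts.sum()

print(regression_leaf_prediction(np.array([1.0, 2.0, 4.0])))       # 2.333...
print(classification_leaf_prediction(np.array([0, 1, 1, 2]), 3))   # [0.25 0.5 0.25]
```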

---
# Regression Decision Trees
@@ -94,9 +92,9 @@ class: middle

To split a node, the goal is to find

1. A feature and <small>(i.e., a for loop over all features)</small>
1. a feature and <small>(i.e., a for-loop over all features)</small>

2. Its value <small>(i.e., a for loop over all unique feature values)</small>
2. its value <small>(i.e., a for-loop over all unique feature values)</small>

such that when splitting a node $𝓣$ into $𝓣_L$ and $𝓣_R$, the resulting regions
decrease the overall criterion value the most, i.e., the difference $c_{𝓣_L} +
@@ -109,7 +107,7 @@ We usually employ several constraints, the most common ones are:
- **maximum tree depth**: we do not split nodes with this depth;

~~~
- **minimum examples to split**: we only split nodes with this many training
- **minimum examples to split**: we do not split nodes with fewer than this many training
examples;
~~~
- **maximum number of leaf nodes**: we split until we reach the given number of
@@ -119,7 +117,7 @@ We usually employ several constraints, the most common ones are:
The tree is usually built in one of two ways:
- if the number of leaf nodes is unlimited, we usually build the tree in
a depth-first manner, recursively splitting every leaf until one
of the above constraints is invalidated;
of the above constraints is met;
~~~
- if the maximum number of leaf nodes is given, we usually split such leaf $𝓣$
where the criterion difference $c_{𝓣_L} + c_{𝓣_R} - c_𝓣$ is the lowest.
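
A minimal sketch of the split search described above, for a regression tree with the sum-of-squares criterion (illustrative code, not the reference implementation; the names and the toy data are arbitrary):

```python
import numpy as np

def sse_criterion(t):
    # c_T for regression: the sum of squared deviations from the leaf mean.
    return float(((t - t.mean()) ** 2).sum()) if len(t) else 0.0

def best_split(X, t):
    # For-loop over all features and all their unique values, keeping the split
    # for which c_{T_L} + c_{T_R} - c_T is the most negative.
    c_parent = sse_criterion(t)
    best = None  # (criterion difference, feature, threshold)
    for feature in range(X.shape[1]):
        for value in np.unique(X[:, feature]):
            left = X[:, feature] <= value
            if left.all() or not left.any():
                continue  # both children must be non-empty
            difference = sse_criterion(t[left]) + sse_criterion(t[~left]) - c_parent
            if best is None or difference < best[0]:
                best = (difference, feature, value)
    return best

rng = np.random.RandomState(0)
X = rng.uniform(size=(50, 3))
t = 2 * X[:, 1] + rng.normal(scale=0.05, size=50)
print(best_split(X, t))  # the chosen feature should almost surely be feature 1
```
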
Expand Down Expand Up @@ -148,10 +146,10 @@ For classification trees, one of the following two criteria is usually used:
- **Gini index**, also called **Gini impurity**, measuring how often a randomly
chosen element would be incorrectly labeled if it was randomly labeled
according to $→p_𝓣$:
$$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big),$$
$$c_\textrm{Gini}(𝓣) ≝ |I_𝓣| ∑_k p_𝓣(k) \big(1 - p_𝓣(k)\big);$$

~~~
- **Entropy Criterion**
- **entropy criterion**
$$c_\textrm{entropy}(𝓣) ≝ |I_𝓣| ⋅ H(→p_𝓣) = - |I_𝓣| ∑_{\substack{k\\p_𝓣(k) ≠ 0}} p_𝓣(k) \log p_𝓣(k).$$
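
Both criteria can be computed directly from the per-class counts in a leaf; a short illustrative sketch (not part of the slides):

```python
import numpy as np

def gini_criterion(counts):
    # c_Gini(T) = |I_T| * sum_k p_T(k) * (1 - p_T(k))
    counts = np.asarray(counts, dtype=float)
    p = counts / counts.sum()
    return counts.sum() * np.sum(p * (1 - p))

def entropy_criterion(counts):
    # c_entropy(T) = |I_T| * H(p_T), summing only over classes with p_T(k) != 0
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -counts.sum() * np.sum(p * np.log(p))

print(gini_criterion([10, 10]), entropy_criterion([10, 10]))  # 10.0 and 20*log(2) ≈ 13.86
print(gini_criterion([20, 0]), entropy_criterion([20, 0]))    # both zero for a pure leaf
```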

---
@@ -184,12 +182,12 @@ class: section
---
# Binary Gini as (M)SE Loss

Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$,
let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
Recall that $I_𝓣$ denotes the set of training example indices belonging to a leaf node $𝓣$.
Let $n_𝓣(0)$ be the number of examples with target value 0, $n_𝓣(1)$ be the
number of examples with target value 1, and let $p_𝓣 = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} t_i = \frac{n_𝓣(1)}{n_𝓣(0) + n_𝓣(1)}$.

~~~
Consider sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.
Consider the sum of squares loss $L(p) = ∑_{i ∈ I_𝓣} (p - t_i)^2$.

~~~
By setting the derivative of the loss to zero, we get that the $p$ minimizing
@@ -217,17 +215,17 @@ $\displaystyle \phantom{L(p_𝓣)} = \big(n_𝓣(0) + n_𝓣(1)\big) \textcolor{
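
The correspondence can be checked numerically; a small sketch with arbitrary binary targets (the SSE loss at its minimizer $p_𝓣$ equals $|I_𝓣|\,p_𝓣(1-p_𝓣)$, and the binary Gini criterion is exactly twice that):

```python
import numpy as np

t = np.array([0, 0, 1, 1, 1, 0, 1, 1])       # binary targets of one leaf
p = t.mean()                                  # p_T, the SSE-minimizing constant prediction
sse_loss = np.sum((p - t) ** 2)               # L(p_T) = |I_T| * p_T * (1 - p_T)

n0, n1 = np.sum(t == 0), np.sum(t == 1)
probs = np.array([n0, n1]) / len(t)
gini = len(t) * np.sum(probs * (1 - probs))   # c_Gini(T) = 2 * |I_T| * p_T * (1 - p_T)

print(sse_loss, gini, gini / sse_loss)        # 1.875 3.75 2.0 -- they differ only by a constant factor
```
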
---
# Entropy as NLL Loss

Again let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
Again, let $I_𝓣$ denote the set of training example indices belonging to a leaf node $𝓣$,
let $n_𝓣(k)$ be the number of examples with target value $k$, and let
$p_𝓣(k) = \frac{1}{|I_𝓣|} ∑_{i ∈ I_𝓣} [t_i = k] = \frac{n_𝓣(k)}{|I_𝓣|}$.

~~~
Consider a distribution $→p$ on $K$ classes and non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.
Consider a distribution $→p$ on $K$ classes and a non-averaged NLL loss $L(→p) = ∑_{i ∈ I_𝓣} - \log p_{t_i}$.

~~~
By setting the derivative of the loss with respect to $p_k$ to zero (using
a Lagrangian with constraint $∑_k p_k = 1$), we get that the $→p$ minimizing the
loss fulfills $p_k = p_𝓣(k)$.
a Lagrangian with the constraint $∑_k p_k = 1$), we get that the $→p$ minimizing
the loss fulfills $p_k = p_𝓣(k)$.
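
A small numeric check of this claim with arbitrary targets (the helper is illustrative, not from the slides): the NLL is minimized by the empirical class proportions, and its value there is $|I_𝓣| ⋅ H(→p_𝓣)$.

```python
import numpy as np

t = np.array([0, 0, 1, 2, 2, 2])                     # targets of one leaf, K = 3 classes
counts = np.bincount(t, minlength=3)
p_leaf = counts / len(t)                              # p_T(k) = n_T(k) / |I_T|

def nll(p):
    # Non-averaged NLL: L(p) = sum_i -log p_{t_i}
    return -np.sum(np.log(p[t]))

entropy_criterion = -len(t) * np.sum(p_leaf[p_leaf > 0] * np.log(p_leaf[p_leaf > 0]))

print(nll(p_leaf), entropy_criterion)                 # both ≈ 6.068, i.e., |I_T| * H(p_T)
print(nll(np.array([1/3, 1/3, 1/3])) > nll(p_leaf))   # True: a different distribution gives a larger loss
```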

~~~
The value of the loss with respect to $→p_𝓣$ is then
@@ -259,7 +257,7 @@ called _feature bagging_).

## Bagging

Every decision tree is trained using bagging (on a bootstrapped dataset).
Every decision tree is trained using bagging (i.e., on a bootstrapped dataset).
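
A minimal sketch of this with scikit-learn decision trees (illustrative only; the dataset, the number of trees, and the use of `max_features="sqrt"` for the per-split feature subset described below are my own choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(42)

trees = []
for _ in range(100):
    # Bagging: every tree is fit on a bootstrap sample drawn with replacement.
    indices = rng.randint(len(X), size=len(X))
    # max_features="sqrt" restricts every split to a random subset of features.
    trees.append(DecisionTreeClassifier(max_features="sqrt", random_state=rng)
                 .fit(X[indices], y[indices]))

# The forest predicts by majority vote of the individual trees.
votes = np.stack([tree.predict(X) for tree in trees])
prediction = np.array([np.bincount(v.astype(int)).argmax() for v in votes.T])
print("Training accuracy:", (prediction == y).mean())
```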

~~~
## Random Subset of Features
@@ -289,5 +287,3 @@ After this lecture you should be able to
- Implement Decision Trees and Random Forests for classification and regression

- Explain how the splitting criterion depends on optimized loss function

- Tell how Random Forests differ from Gradient Boosted Decision Trees