Lecture 1: Introduction to Deep Learning (Extra)
Full course syllabus: Machine Learning 2022 Spring
Notes for lectures (Hung-yi Lee, YouTube):
(1) ML Lecture 6: Brief Introduction of Deep Learning
(2) ML Lecture 7: Backpropagation
(3) ML Lecture 1: Regression - Case Study
(4) ML Lecture 4: Classification
(5) ML Lecture 5: Logistic Regression
Three steps of deep learning:
(1) Define a set of functions
(2) Goodness of function
(3) Pick the best function
Fully Connected Feedforward Network
Layer 1: input layer
Layer 2: hidden layer
...
Layer N-1: hidden layer (feature extractor replacing feature engineering)
Layer N: output layer
Write the calculation as a matrix:
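With a weight matrix $W^l$ and bias vector $b^l$ for each layer, the whole network computes

$$
y = f(x) = \sigma\left(W^L \cdots \sigma\left(W^2 \, \sigma\left(W^1 x + b^1\right) + b^2\right) \cdots + b^L\right)
$$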
Example: Handwriting Digit Recognition
Input: a 16×16 image, i.e. 256 dimensions; output: digits 0 to 9, i.e. 10 dimensions.
Each training example $n$ contributes a cost $C^n$; the total loss

$$
\mathcal{L} = \sum_n C^n
$$

is minimized by gradient descent.
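A minimal NumPy sketch of the forward pass for this digit-recognition setup; the hidden-layer widths, softmax output, and random weights are illustrative assumptions, not fixed by the lecture:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())      # shift for numerical stability
    return e / e.sum()

# Illustrative layer sizes: 256 -> 64 -> 64 -> 10 (hidden widths are assumptions)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (64, 256)), np.zeros(64)
W2, b2 = rng.normal(0, 0.1, (64, 64)), np.zeros(64)
W3, b3 = rng.normal(0, 0.1, (10, 64)), np.zeros(10)

def forward(x):
    """Fully connected feedforward pass: matrix multiply, add bias, apply activation."""
    a1 = sigmoid(W1 @ x + b1)
    a2 = sigmoid(W2 @ a1 + b2)
    return softmax(W3 @ a2 + b3)

x = rng.random(256)              # stand-in for a flattened 16x16 image
y = forward(x)
print(y.shape, y.sum())          # (10,), sums to ~1.0: a distribution over digits 0-9
```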
Universal Approximation Theorem
Any continuous f
$$
f: \mathcal{R}^N \rightarrow \mathcal{R}^M
$$
can be realized by a network with a single hidden layer (given enough hidden neurons)
$$
\mathcal{L}(\theta) = \sum_{n=1}^{N} C^n(\theta)
$$
$$
\frac{\partial}{\partial w} \mathcal{L}(\theta) = \sum_{n=1}^{N} \frac{\partial C^n(\theta)}{\partial w}
$$
The derivative of the cost $c$ with respect to a weight $w$:
$$
\frac{\partial c}{\partial w} = \frac{\partial z}{\partial w} \cdot \frac{\partial c}{\partial z}
$$
Forward pass
$$
z = x_1w_1 + x_2w_2 + b
$$
Partial derivatives:
$$
\frac{\partial z}{\partial w_1} = x_1, \quad \frac{\partial z}{\partial w_2} = x_2
$$
Backward pass
$$
\frac{\partial c}{\partial z} = \frac{\partial a}{\partial z} \cdot \frac{\partial c}{\partial a}
$$
$$
\text{where} \quad \frac{\partial a}{\partial z} \quad \text{ is the sigmoid derivative.}
$$
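Since $a = \sigma(z)$, this derivative has the closed form

$$
\frac{\partial a}{\partial z} = \sigma'(z) = \sigma(z)\left(1 - \sigma(z)\right)
$$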
Gradient of $c$ with respect to $a$:
$$
\frac{\partial c}{\partial a} = \frac{\partial z'}{\partial a} \cdot \frac{\partial c}{\partial z'} + \frac{\partial z''}{\partial a} \cdot \frac{\partial c}{\partial z''}
$$
Given:
$$
\frac{\partial z'}{\partial a} = w_3, \quad \frac{\partial z''}{\partial a} = w_4
$$
Final expression:
$$
\frac{\partial c}{\partial z} = \sigma'(z) \cdot \left[w_3 \cdot \frac{\partial c}{\partial z'} + w_4 \cdot \frac{\partial c}{\partial z''}\right]
$$
$$
\text{where} \quad \sigma'(z) \quad \text{is a constant, because } z \text{ is already determined in the forward pass.}
$$
Case 1. Output Layer
$$
\frac{\partial c}{\partial z'} = \frac{\partial y_1}{\partial z'} \cdot \frac{\partial c}{\partial y_1}
$$
$$
\frac{\partial c}{\partial z''} = \frac{\partial y_2}{\partial z''} \cdot \frac{\partial c}{\partial y_2}
$$
Case 2. Not Output Layer
Continue to the next layer until reaching the Output Layer.
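A small NumPy sketch of the two passes for one hidden neuron that feeds two output neurons, using the notation above ($w_1, w_2$ into the neuron, $w_3, w_4$ out of it). The squared-error cost and all numeric values are illustrative assumptions; the analytic gradient is checked against a finite-difference estimate:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative values (assumed, not from the lecture)
x1, x2 = 1.0, -2.0                    # inputs
w1, w2, b = 0.5, -0.3, 0.1            # weights and bias into the hidden neuron
w3, w4, b3, b4 = 0.8, -0.6, 0.0, 0.0  # weights and biases out of the hidden neuron
t1, t2 = 1.0, 0.0                     # targets; cost c = 1/2 * sum (y - t)^2

def cost(w1_val):
    """Forward pass as a function of w1, used only for the finite-difference check."""
    z = x1 * w1_val + x2 * w2 + b
    a = sigmoid(z)
    y1, y2 = sigmoid(a * w3 + b3), sigmoid(a * w4 + b4)
    return 0.5 * ((y1 - t1) ** 2 + (y2 - t2) ** 2)

# Forward pass: compute z, a, z', z'' (so ∂z/∂w1 = x1 is already known here)
z = x1 * w1 + x2 * w2 + b
a = sigmoid(z)
zp, zpp = a * w3 + b3, a * w4 + b4
y1, y2 = sigmoid(zp), sigmoid(zpp)

# Backward pass: propagate ∂c/∂z from the output layer back to this neuron
dc_dzp = sigmoid(zp) * (1 - sigmoid(zp)) * (y1 - t1)      # output-layer case
dc_dzpp = sigmoid(zpp) * (1 - sigmoid(zpp)) * (y2 - t2)
dc_dz = sigmoid(z) * (1 - sigmoid(z)) * (w3 * dc_dzp + w4 * dc_dzpp)
dc_dw1 = x1 * dc_dz                                        # ∂c/∂w1 = ∂z/∂w1 * ∂c/∂z

# The analytic gradient matches a finite-difference estimate
eps = 1e-6
print(dc_dw1, (cost(w1 + eps) - cost(w1 - eps)) / (2 * eps))
```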
If we want the learned function to be smooth, smaller weights $w_i$ are better. For the linear model
$$
y = b + \sum w_i x_i
$$
we define the loss function with an L2 regularization term:
$$
L = \sum_n \left(\hat{y}^n - \left( b + \sum_i w_i x_i^n \right)\right)^2 + \lambda \sum_i (w_i)^2
$$
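A NumPy sketch of gradient descent on this regularized loss; the toy data, learning rate, and λ are assumptions chosen for illustration (the bias $b$ is left unregularized, as in the formula above):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((50, 3))                          # 50 examples, 3 features x_i
y_hat = X @ np.array([2.0, -1.0, 0.5]) + 0.3     # targets ŷ^n from a known linear rule

w, b = np.zeros(3), 0.0
lam, lr = 0.1, 0.005

def loss(w, b):
    """L = Σ_n (ŷ^n - (b + Σ_i w_i x_i^n))² + λ Σ_i w_i²  (bias not regularized)."""
    err = y_hat - (X @ w + b)
    return np.sum(err ** 2) + lam * np.sum(w ** 2)

for _ in range(3000):
    err = y_hat - (X @ w + b)
    grad_w = -2 * X.T @ err + 2 * lam * w        # ∂L/∂w_i
    grad_b = -2 * np.sum(err)                    # ∂L/∂b
    w -= lr * grad_w
    b -= lr * grad_b

print(loss(w, b))   # w ends up close to [2, -1, 0.5] (slightly shrunk by λ), b close to 0.3
```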
Given
$$
P(x) =P(x \mid C_1) \cdot P(C_1) + P(x \mid C_2) \cdot P(C_2)
$$
Bayes' theorem:
$$
P(C_1 \mid x) = \frac{P(x \mid C_1) \cdot P(C_1)}{P(x \mid C_1) \cdot P(C_1) + P(x \mid C_2) \cdot P(C_2)}
$$
Even if x does not appear among the class-1 training examples, P(x | C1) is not zero: we assume the class-1 samples are drawn from a Gaussian distribution and estimate that distribution from the training data.
$$
L(\mu, \Sigma) = f_{\mu, \Sigma}(x_1) \cdot f_{\mu, \Sigma}(x_2) \cdot \ldots \cdot f_{\mu, \Sigma}(x_N)
$$
Find the parameters $\mu^*$ and $\Sigma^*$ that maximize the likelihood:
$$
\mu^*, \Sigma^* = \arg\max_{\mu, \Sigma} L(\mu, \Sigma)
$$
where
$$
\mu^* = \frac{1}{N} \sum_{n=1}^{N} x_n, \quad \Sigma^* = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu^*)(x_n - \mu^*)^T
$$
After substituting these estimates into Bayes' rule, we obtain $P(C_1 \mid x)$; if $P(C_1 \mid x) > 0.5$, $x$ belongs to class 1.
If class 1 and class 2 share the same covariance matrix $\Sigma$, the model has fewer parameters to estimate.
Class 1: training examples $x_1, \ldots, x_{N_1}$
Class 2: training examples $x_{N_1+1}, \ldots, x_{N_1+N_2}$
The means $\mu_1$ and $\mu_2$ are estimated as before (per-class maximum likelihood), and the shared covariance is their weighted average:
$$
\Sigma^* = \frac{N_1}{N_1 + N_2} \Sigma_1 + \frac{N_2}{N_1 + N_2} \Sigma_2
$$
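A NumPy sketch of this generative classifier on toy 2-D data (the data, class sizes, and test point are assumptions for illustration): estimate the per-class means, form the shared covariance as above, and classify with Bayes' rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 2-D training data for the two classes (sizes and distributions are assumptions)
x1 = rng.multivariate_normal([2.0, 2.0], [[1.0, 0.3], [0.3, 1.0]], size=80)    # class 1
x2 = rng.multivariate_normal([-1.0, 0.0], [[1.0, 0.3], [0.3, 1.0]], size=40)   # class 2
N1, N2 = len(x1), len(x2)

# Maximum-likelihood estimates: class means and per-class covariances
mu1, mu2 = x1.mean(axis=0), x2.mean(axis=0)
S1 = (x1 - mu1).T @ (x1 - mu1) / N1
S2 = (x2 - mu2).T @ (x2 - mu2) / N2
Sigma = (N1 * S1 + N2 * S2) / (N1 + N2)          # shared covariance, weighted as above

def gaussian(x, mu, Sigma):
    """Multivariate Gaussian density."""
    D = len(mu)
    diff = x - mu
    norm = 1.0 / ((2 * np.pi) ** (D / 2) * np.linalg.det(Sigma) ** 0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))

def posterior_c1(x):
    """P(C1|x) via Bayes' rule, with priors N1/(N1+N2) and N2/(N1+N2)."""
    p1 = gaussian(x, mu1, Sigma) * N1 / (N1 + N2)
    p2 = gaussian(x, mu2, Sigma) * N2 / (N1 + N2)
    return p1 / (p1 + p2)

print(posterior_c1(np.array([1.5, 1.0])))        # > 0.5, so classify this point as class 1
```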
If all dimensions are assumed to be independent, we can use the Naive Bayes classifier:
$$
P(x \mid C_1) =P(x_1 \mid C_1) \cdot P(x_2 \mid C_1) \ldots \cdot P(x_k \mid C_1)
$$
$$
P(C_1 \mid x) = \frac{P(x \mid C_1) \cdot P(C_1)}{P(x \mid C_1) \cdot P(C_1) + P(x \mid C_2) \cdot P(C_2)}
$$
$$
= \frac{1}{1 + \frac{P(x \mid C_2) \cdot P(C_2)}{P(x \mid C_1) \cdot P(C_1)}}
$$
let
$$
z = \ln \frac{P(x \mid C_1) \cdot P(C_1)}{P(x \mid C_2) \cdot P(C_2)}
$$
so
$$
P(C_1 \mid x) = \frac{1}{1 + \exp(-z)} = \sigma(z)
$$
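A quick numeric check of this identity, with arbitrary illustrative values for the likelihoods and priors:

```python
import numpy as np

# Arbitrary illustrative values for the two products P(x|C)·P(C)
p1 = 0.02 * 0.6      # P(x|C1) · P(C1)
p2 = 0.05 * 0.4      # P(x|C2) · P(C2)

posterior = p1 / (p1 + p2)                   # P(C1|x) by Bayes' rule
z = np.log(p1 / p2)
print(posterior, 1.0 / (1.0 + np.exp(-z)))   # identical values
```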
calculate z:
$$
z = \ln \frac{P(x \mid C_1) \cdot P(C_1)}{P(x \mid C_2) \cdot P(C_2)}
$$
$$
= \ln \frac{P(x \mid C_1)}{P(x \mid C_2)} + \ln \frac{P(C_1)}{P(C_2)}
$$
and
$$
\frac{P(C_1)}{P(C_2)} = \frac{\frac{N_1}{N_1 + N_2}}{\frac{N_2}{N_1 + N_2}} = \frac{N_1}{N_2}
$$
Modeling each class-conditional probability as a Gaussian distribution:
$$
P(x|C_1) = \frac{1}{(2\pi)^{D/2}} \cdot \frac{1}{|\Sigma_1|^{1/2}} \cdot \exp \left( -\frac{1}{2} (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1)\right)
$$
and
$$
P(x|C_2) = \frac{1}{(2\pi)^{D/2}} \cdot \frac{1}{|\Sigma_2|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2)\right)
$$
$$
\ln \frac{P(x|C_1)}{P(x|C_2)} = \ln \frac{|\Sigma_2|^{1/2}}{|\Sigma_1|^{1/2}} - \frac{1}{2} \left[ (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) - (x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) \right]
$$
Letting $\Sigma_1 = \Sigma_2 = \Sigma$, the determinant term vanishes, the quadratic terms in $x$ cancel, and $z$ becomes linear in $x$:

$$
z = (\mu_1 - \mu_2)^T \Sigma^{-1} x - \frac{1}{2} (\mu_1)^T \Sigma^{-1} \mu_1 + \frac{1}{2} (\mu_2)^T \Sigma^{-1} \mu_2 + \ln \frac{N_1}{N_2}
$$

so
$$
P(C_1 \mid x) = \sigma(w \cdot x + b)
$$
where
$$
w^T = (\mu_1 - \mu_2)^T \Sigma^{-1}, \quad b = - \frac{1}{2} (\mu_1)^T \Sigma^{-1} \mu_1 + \frac{1}{2} (\mu_2)^T \Sigma^{-1} \mu_2 + \ln \frac{N_1}{N_2}
$$
This form is called Logistic Regression: instead of estimating $\mu_1$, $\mu_2$, and $\Sigma$, we can learn $w$ and $b$ directly.
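A short sketch computing $w$ and $b$ from given generative-model parameters and classifying with $\sigma(w \cdot x + b)$; the numeric values of $\mu_1$, $\mu_2$, $\Sigma$, $N_1$, $N_2$ are assumptions for illustration:

```python
import numpy as np

# Illustrative generative-model parameters (assumed values, 2-D case)
mu1, mu2 = np.array([2.0, 2.0]), np.array([-1.0, 0.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])
N1, N2 = 80, 40

Sigma_inv = np.linalg.inv(Sigma)
w = Sigma_inv @ (mu1 - mu2)                      # w^T = (μ1 - μ2)^T Σ^{-1}
b = (-0.5 * mu1 @ Sigma_inv @ mu1
     + 0.5 * mu2 @ Sigma_inv @ mu2
     + np.log(N1 / N2))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.5, 1.0])
print(sigmoid(w @ x + b))                        # P(C1|x), in logistic-regression form
```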
Step 2. Goodness of function
$$
\begin{array}{cc}
\text{Training Data} & \text{Class} \\
x_1 & C_1 \\
x_2 & C_1 \\
x_3 & C_2 \\
\vdots & \vdots \\
x_N & C_1
\end{array}
$$
This table illustrates that the data points belong to either class C1 or class C2.
$$
f_{w,b}(x) = P_{w,b}(C_1|x)
$$
$$
L(w, b) = f_{w,b}(x_1) \cdot f_{w,b}(x_2) \cdot (1 - f_{w,b}(x_3)) \cdots f_{w,b}(x_N)
$$
The best parameters $w^*, b^*$ maximize $L(w, b)$, or equivalently minimize $-\ln L(w, b)$, where
$$
-\ln L(w, b) = -\ln f_{w,b}(x_1) - \ln f_{w,b}(x_2) - \ln\left(1 - f_{w,b}(x_3)\right) - \cdots
$$
To write all the terms in one form, we encode the class as
$$
\hat{y} =
\begin{cases}
1 & \text{if class 1} \\
0 & \text{if class 2}
\end{cases}
$$
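With this encoding both kinds of terms collapse into one expression, and the negative log-likelihood is a sum of cross-entropies between the targets $\hat{y}^n$ and the predictions $f_{w,b}(x^n)$:

$$
-\ln L(w, b) = \sum_n -\left[\hat{y}^n \ln f_{w,b}(x^n) + (1 - \hat{y}^n) \ln\left(1 - f_{w,b}(x^n)\right)\right]
$$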
Step 3. Find the best function
$$
\frac{\partial (-\ln L)}{\partial w_i} = \sum_{n} -\left[\hat{y}^n \frac{\partial \ln f_{w,b}(x^n)}{\partial w_i} + (1 - \hat{y}^n)\frac{\partial \ln\left(1 - f_{w,b}(x^n)\right)}{\partial w_i}\right]
$$
we have
$$
\frac{\partial lnf_{w,b}(x)}{\partial w_i} = \frac{\partial lnf_{w,b}(x)}{\partial z} \cdot \frac{\partial z}{\partial w_i}
$$
where
$$
\frac{\partial z}{\partial w_i} = x_i
$$
and
$$
\frac{\partial \ln f_{w,b}(x)}{\partial z} = \frac{\partial \ln \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \cdot \frac{\partial \sigma(z)}{\partial z} = \frac{1}{\sigma(z)} \cdot \sigma(z) \cdot (1 - \sigma(z)) = 1 - \sigma(z)
$$
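For the second term, $\frac{\partial \ln\left(1 - f_{w,b}(x)\right)}{\partial z} = -\sigma(z)$, so the two terms combine into the compact gradient

$$
\frac{\partial (-\ln L)}{\partial w_i} = \sum_n -\left(\hat{y}^n - f_{w,b}(x^n)\right) x_i^n
$$

A minimal NumPy sketch of gradient descent with this update on toy 2-D data; the data, learning rate, and iteration count are assumptions, and the gradient is averaged over examples for a stable step size:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
# Toy 2-D data: class 1 around (2, 2), class 2 around (-1, 0)
X = np.vstack([rng.normal([2.0, 2.0], 1.0, (80, 2)), rng.normal([-1.0, 0.0], 1.0, (40, 2))])
y = np.concatenate([np.ones(80), np.zeros(40)])          # ŷ^n: 1 for class 1, 0 for class 2

w, b, lr = np.zeros(2), 0.0, 0.05
for _ in range(1000):
    f = sigmoid(X @ w + b)                               # f_{w,b}(x^n)
    grad_w = -(y - f) @ X                                # Σ_n -(ŷ^n - f) x_i^n
    grad_b = -np.sum(y - f)
    w -= lr * grad_w / len(y)                            # averaged for a stable step size
    b -= lr * grad_b / len(y)

print(np.mean((sigmoid(X @ w + b) > 0.5) == y))          # training accuracy
```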