<p><strong>Question.</strong> The third dimension of <code>x</code> is 3. Why?</p>
</blockquote>
- <p>Now we will create a bank 10 of $5 \times 5 \times 3$ filters.</p>
+ <p>Next, we create a bank of 10 filters of dimension $5 \times 5 \times 3$, initialising their coefficients randomly:</p>
<pre><code class="language-matlab">% Create a bank of linear filters
w = randn(5,5,3,10,'single') ;
</code></pre>
- <p>The filters are in single precision as well. Note that <code>w</code> has four dimensions, packing 10 filters. Note also that each filter is not flat, but rather a volume with three layers. The next step is applying the filter to the image. This uses the <code>vl_nnconv</code> function from MatConvNet:</p>
+ <p>The filters are in single precision as well. Note that <code>w</code> has four dimensions, packing 10 filters. Note also that each filter is not flat, but rather a volume containing three slices. The next step is to apply the filters to the image using the <code>vl_nnconv</code> function from MatConvNet:</p>
<pre><code class="language-matlab">% Apply the convolution operator
y = vl_nnconv(x, w, []) ;
</code></pre>
<p><strong>Remark:</strong> You might have noticed that the third argument to the <code>vl_nnconv</code> function is the empty matrix <code>[]</code>. Otherwise, it can be used to pass a vector of bias terms to add to the output of each filter.</p>
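
<p>For instance (a small sketch, assuming the <code>x</code> and <code>w</code> defined above), a bias of one value per filter can be passed as a 10-element vector:</p>
<pre><code class="language-matlab">% Add one (random) bias term to each of the 10 filter outputs
b = randn(10, 1, 'single') ;
y_biased = vl_nnconv(x, w, b) ;
</code></pre>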
- <p>The variable <code>y</code> contains the output of the convolution. Note that the filters are three-dimensional, in the sense that it operates on a map $\bx$ with $K$ channels. Furthermore, there are $K'$ such filters, generating a $K'$ dimensional map $\by$ as follows
+ <p>The variable <code>y</code> contains the output of the convolution. Note that the filters are three-dimensional. This is because they operate on a tensor $\bx$ with $K$ channels. Furthermore, there are $K'$ such filters, generating a $K'$ dimensional map $\by$ as follows:
<script type="math/tex; mode=display">
y_{i'j'k'} = \sum_{ijk} w_{ijkk'}x_{i+i',j+j',k}
</script>
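
<p>As a quick sanity check (a sketch, assuming <code>x</code> is an $H \times W \times 3$ single-precision image and <code>w</code> is the $5 \times 5 \times 3 \times 10$ bank above), the output <code>y</code> should have $K' = 10$ channels and, with <code>vl_nnconv</code>'s default stride of 1 and no padding, a spatial size reduced by the filter support:</p>
<pre><code class="language-matlab">% Compare input and output sizes: y is (H-4) x (W-4) x 10 for 5x5 filters
size(x)
size(y)
</code></pre>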
@@ -287,7 +287,7 @@ <h3 id="part-21-the-theory-of-back-propagation">Part 2.1: the theory of back-pro
\bx_L
</script>
During learning, the last layer of the network is the <em>loss function</em> that should be minimized. Hence, the output $\bx_L = x_L$ of the network is a <strong>scalar</strong> quantity (a single number).</p>
- <p>The gradient is easily computed using using the <strong>chain rule</strong>. If <em>all</em> network variables and parameters are scalar, this is given by[^derivative]:
+ <p>The gradient is easily computed using the <strong>chain rule</strong>. If <em>all</em> network variables and parameters are scalar, this is given by:
<script type="math/tex; mode=display">
\frac{\partial f}{\partial w_l}(x_0;w_1,\dots,w_L)
=
@@ -302,7 +302,7 @@ <h3 id="part-21-the-theory-of-back-propagation">Part 2.1: the theory of back-pro
<blockquote>
<p><strong>Question:</strong> The output derivatives have the same size as the parameters in the network. Why?</p>
</blockquote>
- <p><strong>Back-propagation</strong> allows computing the output derivatives in a memory-efficient manner. To see how, the first step is to generalize the equation above to tensors using a matrix notation. This is done by converting tensors into vectors by using the $\vv$ (stacking)[^stacking] operator:
+ <p><strong>Back-propagation</strong> allows computing the output derivatives in a memory-efficient manner. To see how, the first step is to generalize the equation above to tensors using a matrix notation. This is done by converting tensors into vectors by using the $\vv$ (stacking)<sup id="fnref:stacking"><a class="footnote-ref" href="#fn:stacking" rel="footnote">2</a></sup> operator:
<script type="math/tex; mode=display">
\frac{\partial \vv f}{\partial \vv^\top \bw_l}
=
@@ -532,7 +532,7 @@ <h3 id="part-33-learning-with-gradient-descent">Part 3.3: learning with gradient
<li>Note that the objective enforces a <em>margin</em> between the scores of the positive and negative pixels. How much is this margin?</li>
</ul>
</blockquote>
- <p>We can now train the CNN by minimising the objective function with respect to $\bw$ and $b$. We do so by using an algorithm called <em>gradient descent with momentum</em>. Given the current solution $(\bw_t,b_t)$ and update it , this is updated to $(\bw_{t+1},b_t)$ by following the direction of fastest descent as given by the negative gradient $-\nabla E(\bw_t,b_t)$ of the objective. However, gradient updates are smoothed by considering a <em>momentum</em> term $(\bar\bw_{t}, \bar\mu_t)$, yielding the update equations
+ <p>We can now train the CNN by minimising the objective function with respect to $\bw$ and $b$. We do so by using an algorithm called <em>gradient descent with momentum</em>. Given the current solution $(\bw_t,b_t)$, this is updated to $(\bw_{t+1},b_{t+1})$ by following the direction of fastest descent of the objective $E(\bw_t,b_t)$ as given by the negative gradient $-\nabla E$. However, gradient updates are smoothed by considering a <em>momentum</em> term $(\bar\bw_{t}, \bar\mu_t)$, yielding the update equations
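
<p>The update equations themselves fall outside this hunk; as a rough sketch of the usual momentum form (the names <code>lr</code>, <code>mu</code>, <code>velocity</code> and <code>dE_dparam</code> are illustrative, not necessarily those used in <code>exercise3.m</code>):</p>
<pre><code class="language-matlab">% Generic gradient-descent-with-momentum step on a stand-in parameter tensor
lr = 0.01 ;                                 % learning rate
mu = 0.9 ;                                  % momentum coefficient
param = randn(3, 3, 'single') ;             % stand-in parameter tensor
velocity = zeros(size(param), 'single') ;   % momentum accumulator
dE_dparam = randn(size(param), 'single') ;  % stand-in gradient of the objective
velocity = mu * velocity - lr * dE_dparam ; % smooth the negative gradient
param = param + velocity ;                  % take the update step
</code></pre>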
@@ -558,14 +558,18 @@ <h3 id="part-33-learning-with-gradient-descent">Part 3.3: learning with gradient
<p><strong>Tasks:</strong></p>
<ul>
<li>Inspect the code in the file <code>exercise3.m</code>. Convince yourself that the code is implementing the algorithm described above. Pay particular attention to the forward and backward passes, as well as to how the objective function and its derivatives are computed.</li>
- <li>Run the algorithm and observe the results. Then answer the following questions:</li>
+ <li>Run the algorithm and observe the results. Then answer the following questions:<ul>
<li>The learned filter should resemble the discretisation of a well-known differential operator. Which one? </li>
<li>What is the average of the filter values compared to the average of the absolute values?</li>
- <li>Run the algorithm again and observe the evolution of the histograms of the score of the positive and negative pixels in relation to the values 0 and 1. Answer the following:</li>
+ </ul>
+ </li>
+ <li>Run the algorithm again and observe the evolution of the histograms of the score of the positive and negative pixels in relation to the values 0 and 1. Answer the following:<ul>
<li>Is the objective function minimised monotonically?</li>
<li>As the histograms evolve, can you identify at least two "phases" in the optimisation?</li>
<li>Once converged, do the scores distribute in the manner that you would expect?</li>
</ul>
+ </li>
+ </ul>
<p><strong>Hint:</strong> the <code>plotPeriod</code> option can be changed to plot the diagnostic figure with a higher or lower frequency; this can significantly affect the speed of the algorithm.</p>
</blockquote>
<h3 id="part-34-experimenting-with-the-tiny-cnn">Part 3.4: experimenting with the tiny CNN</h3>
@@ -766,17 +770,8 @@ <h3 id="part-47-training-using-the-gpu">Part 4.7: Training using the GPU</h3>
<p>In MatConvNet this is almost trivial as it builds on the easy-to-use GPU support in MATLAB. You can follow this list of steps to try it out:</p>
<ol>
<li>Clear the models generated and cached in the previous steps. To do this, rename or delete the directories <code>data/characters-experiment</code> and <code>data/characters-jit-experiment</code>.</li>
- <li>
- <p>Make sure that MatConvNet is compiled with GPU support. To do this, use</p>
- <p>```matlab</p>
- <blockquote>
- <p>setup('useGpu', true) ;
- ```</p>
- </blockquote>
- </li>
- <li>
- <p>Try again training the model of <code>exercise4.m</code> switching to <code>true</code> the <code>useGpu</code> flag.</p>
- </li>
+ <li>Make sure that MatConvNet is compiled with GPU support. To do this, use <code>setup('useGpu', true)</code>.</li>
+ <li>Try training the model of <code>exercise4.m</code> again, switching the <code>useGpu</code> flag to <code>true</code>.</li>
</ol>
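
<p>Under the hood, the steps above rely on MATLAB's <code>gpuArray</code> type; a rough sketch of what a GPU-enabled MatConvNet call looks like (assuming a GPU build and the <code>x</code>, <code>w</code> arrays defined earlier):</p>
<pre><code class="language-matlab">% Move data and filters to GPU memory, run the operator, fetch the result
xg = gpuArray(x) ;
wg = gpuArray(w) ;
yg = vl_nnconv(xg, wg, []) ;
y  = gather(yg) ;   % copy the result back to CPU memory
</code></pre>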
<blockquote>
<p><strong>Task:</strong> Follow the steps above and note the speed of training. How many images per second can you process now?</p>
<p>A two-dimensional <em>lattice</em> is a discrete grid embedded in $R^2$, similar for example to a checkerboard. <a class="footnote-backref" href="#fnref:lattice" rev="footnote" title="Jump back to footnote 1 in the text">↩</a></p>
</li>
+ <li id="fn:stacking">
+ <p>The stacking of a tensor $\bx \in \mathbb{R}^{H\times W\times C}$ is the vector <script type="math/tex; mode=display"> \vv \bx = \begin{bmatrix} x_{111} \\ x_{211} \\ \vdots \\ x_{H11} \\ x_{121} \\ \vdots \\ x_{HWC} \end{bmatrix}.</script> <a class="footnote-backref" href="#fnref:stacking" rev="footnote" title="Jump back to footnote 2 in the text">↩</a></p>
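
<p>As an aside (a minimal MATLAB illustration): since MATLAB stores arrays in column-major order, the $\vv$ operator corresponds to colon indexing:</p>
<pre><code class="language-matlab">% vv(t): stack an H x W x C tensor into a column vector (column-major order)
t = randn(4, 3, 2, 'single') ;
v = t(:) ;   % 24x1 vector ordered t(1,1,1), t(2,1,1), ..., t(4,1,1), t(1,2,1), ..., t(4,3,2)
</code></pre>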
doc/instructions.md (+14 −16)
@@ -77,14 +77,14 @@ Use MATLAB `size` command to obtain the size of the array `x`. Note that the arr
> **Question.** The third dimension of `x` is 3. Why?
- Now we will create a bank 10 of $5 \times 5 \times 3$ filters.
+ Next, we create a bank of 10 filters of dimension $5 \times 5 \times 3$, initialising their coefficients randomly:
```matlab
% Create a bank of linear filters
w = randn(5,5,3,10,'single') ;
```
- The filters are in single precision as well. Note that `w` has four dimensions, packing 10 filters. Note also that each filter is not flat, but rather a volume with three layers. The next step is applying the filter to the image. This uses the `vl_nnconv` function from MatConvNet:
+ The filters are in single precision as well. Note that `w` has four dimensions, packing 10 filters. Note also that each filter is not flat, but rather a volume containing three slices. The next step is to apply the filters to the image using the `vl_nnconv` function from MatConvNet:
```matlab
% Apply the convolution operator
@@ -93,7 +93,7 @@ y = vl_nnconv(x, w, []) ;
**Remark:** You might have noticed that the third argument to the `vl_nnconv` function is the empty matrix `[]`. Otherwise, it can be used to pass a vector of bias terms to add to the output of each filter.
- The variable `y` contains the output of the convolution. Note that the filters are three-dimensional, in the sense that it operates on a map $\bx$ with $K$ channels. Furthermore, there are $K'$ such filters, generating a $K'$ dimensional map $\by$ as follows
+ The variable `y` contains the output of the convolution. Note that the filters are three-dimensional. This is because they operate on a tensor $\bx$ with $K$ channels. Furthermore, there are $K'$ such filters, generating a $K'$ dimensional map $\by$ as follows:
$$
y_{i'j'k'} = \sum_{ijk} w_{ijkk'} x_{i+i',j+j',k}
$$
@@ -270,7 +270,7 @@ $$
$$
During learning, the last layer of the network is the *loss function* that should be minimized. Hence, the output $\bx_L = x_L$ of the network is a **scalar** quantity (a single number).
- The gradient is easily computed using using the **chain rule**. If *all* network variables and parameters are scalar, this is given by[^derivative]:
+ The gradient is easily computed using the **chain rule**. If *all* network variables and parameters are scalar, this is given by:
> - What can you say about the score of each pixel if $\lambda=0$ and $E(\bw,b) =0$?
> - Note that the objective enforces a *margin* between the scores of the positive and negative pixels. How much is this margin?
- We can now train the CNN by minimising the objective function with respect to $\bw$ and $b$. We do so by using an algorithm called *gradient descent with momentum*. Given the current solution $(\bw_t,b_t)$ and update it , this is updated to $(\bw_{t+1},b_t)$ by following the direction of fastest descent as given by the negative gradient $-\nabla E(\bw_t,b_t)$ of the objective. However, gradient updates are smoothed by considering a *momentum* term $(\bar\bw_{t}, \bar\mu_t)$, yielding the update equations
+ We can now train the CNN by minimising the objective function with respect to $\bw$ and $b$. We do so by using an algorithm called *gradient descent with momentum*. Given the current solution $(\bw_t,b_t)$, this is updated to $(\bw_{t+1},b_{t+1})$ by following the direction of fastest descent of the objective $E(\bw_t,b_t)$ as given by the negative gradient $-\nabla E$. However, gradient updates are smoothed by considering a *momentum* term $(\bar\bw_{t}, \bar\mu_t)$, yielding the update equations
> - Inspect the code in the file `exercise3.m`. Convince yourself that the code is implementing the algorithm described above. Pay particular attention to the forward and backward passes, as well as to how the objective function and its derivatives are computed.
> - Run the algorithm and observe the results. Then answer the following questions:
- > * The learned filter should resemble the discretisation of a well-known differential operator. Which one?
- > * What is the average of the filter values compared to the average of the absolute values?
+ > * The learned filter should resemble the discretisation of a well-known differential operator. Which one?
+ > * What is the average of the filter values compared to the average of the absolute values?
> - Run the algorithm again and observe the evolution of the histograms of the score of the positive and negative pixels in relation to the values 0 and 1. Answer the following:
- > * Is the objective function minimised monotonically?
- > * As the histograms evolve, can you identify at least two "phases" in the optimisation?
- > * Once converged, do the score distribute in the manner that you would expect?
+ > * Is the objective function minimised monotonically?
+ > * As the histograms evolve, can you identify at least two "phases" in the optimisation?
+ > * Once converged, do the scores distribute in the manner that you would expect?
>
> **Hint:** the `plotPeriod` option can be changed to plot the diagnostic figure with a higher or lower frequency; this can significantly affect the speed of the algorithm.
@@ -794,12 +794,7 @@ A key challenge in deep learning is the sheer amount of computation required to
In MatConvNet this is almost trivial as it builds on the easy-to-use GPU support in MATLAB. You can follow this list of steps to try it out:
1. Clear the models generated and cached in the previous steps. To do this, rename or delete the directories `data/characters-experiment` and `data/characters-jit-experiment`.
- 2. Make sure that MatConvNet is compiled with GPU support. To do this, use
-
- ```matlab
- > setup('useGpu', true) ;
- ```
-
+ 2. Make sure that MatConvNet is compiled with GPU support. To do this, use `setup('useGpu', true)`.
3. Try training the model of `exercise4.m` again, switching the `useGpu` flag to `true`.
> **Task:** Follow the steps above and note the speed of training. How many images per second can you process now?
@@ -875,7 +870,10 @@ That completes this practical.
## History
+ * Used in the Oxford AIMS CDT, 2016-17.
* Used in the Oxford AIMS CDT, 2015-16.
* Used in the Oxford AIMS CDT, 2014-15.
[^lattice]: A two-dimensional *lattice* is a discrete grid embedded in $R^2$, similar for example to a checkerboard.
+
+ [^stacking]: The stacking of a tensor $\bx \in \mathbb{R}^{H\times W\times C}$ is the vector $$ \vv \bx = \begin{bmatrix} x_{111} \\ x_{211} \\ \vdots \\ x_{H11} \\ x_{121} \\ \vdots \\ x_{HWC} \end{bmatrix}.$$