
<!DOCTYPE html>
<html lang="en">
<!-- Produced from a LaTeX source file.  Note that the production is done -->
<!-- by a very rough-and-ready (and buggy) script, so the HTML and other  -->
<!-- code is quite ugly!  Later versions should be better.                -->
<head>
    <meta charset="utf-8">
    <meta name="citation_title" content="Neural networks and deep learning">
    <meta name="citation_author" content="Nielsen, Michael A.">
    <meta name="citation_publication_date" content="2015">
    <meta name="citation_fulltext_html_url" content="http://neuralnetworksanddeeplearning.com">
    <meta name="citation_publisher" content="Determination Press">
    <link rel="icon" href="nnadl_favicon.ICO" />
    <title>Neural networks and deep learning</title>
    <script src="assets/jquery.min.js"></script>
    <script type="text/x-mathjax-config">
      MathJax.Hub.Config({
        tex2jax: {inlineMath: [['$','$']]},
        "HTML-CSS":
          {scale: 92},
        TeX: { equationNumbers: { autoNumber: "AMS" }}});
    </script>
    <script type="text/javascript" src="https://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>


    <link href="assets/style.css" rel="stylesheet">
    <link href="assets/pygments.css" rel="stylesheet">
    <link rel="stylesheet" href="https://code.jquery.com/ui/1.11.2/themes/smoothness/jquery-ui.css">

<style>
/* Adapted from */
/* https://groups.google.com/d/msg/mathjax-users/jqQxrmeG48o/oAaivLgLN90J, */
/* by David Cervone */

@font-face {
    font-family: 'MJX_Math';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Math-Italic.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Math-Italic.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Math-Italic.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Math-Italic.svg#MathJax_Math-Italic') format('svg');
}

@font-face {
    font-family: 'MJX_Main';
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot'); /* IE9 Compat Modes */
    src: url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/eot/MathJax_Main-Regular.eot?iefix') format('eot'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/woff/MathJax_Main-Regular.woff')  format('woff'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/otf/MathJax_Main-Regular.otf')  format('opentype'),
    url('https://cdn.mathjax.org/mathjax/latest/fonts/HTML-CSS/TeX/svg/MathJax_Main-Regular.svg#MathJax_Main-Regular') format('svg');
}
</style>

  </head>
  <body><div class="header"><h1 class="chapter_number">
  <a href="">CHAPTER 6</a></h1>
  <h1 class="chapter_title"><a href="">Deep learning</a></h1></div><div class="section"><div id="toc">
<p class="toc_title"><a href="index.html">Neural networks and deep learning</a></p><p class="toc_not_mainchapter"><a href="about.html">What this book is about</a></p><p class="toc_not_mainchapter"><a href="exercises_and_problems.html">On the exercises and problems</a></p><p class='toc_mainchapter'><a id="toc_using_neural_nets_to_recognize_handwritten_digits_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_using_neural_nets_to_recognize_handwritten_digits" src="images/arrow.png" width="15px"></a><a href="chap1.html">Using neural nets to recognize handwritten digits</a><div id="toc_using_neural_nets_to_recognize_handwritten_digits" style="display: none;"><p class="toc_section"><ul><a href="chap1.html#perceptrons"><li>Perceptrons</li></a><a href="chap1.html#sigmoid_neurons"><li>Sigmoid neurons</li></a><a href="chap1.html#the_architecture_of_neural_networks"><li>The architecture of neural networks</li></a><a href="chap1.html#a_simple_network_to_classify_handwritten_digits"><li>A simple network to classify handwritten digits</li></a><a href="chap1.html#learning_with_gradient_descent"><li>Learning with gradient descent</li></a><a href="chap1.html#implementing_our_network_to_classify_digits"><li>Implementing our network to classify digits</li></a><a href="chap1.html#toward_deep_learning"><li>Toward deep learning</li></a></ul></p></div>
<script>
$('#toc_using_neural_nets_to_recognize_handwritten_digits_reveal').click(function() {
   var src = $('#toc_img_using_neural_nets_to_recognize_handwritten_digits').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_using_neural_nets_to_recognize_handwritten_digits").attr('src', 'images/arrow.png');
   };
   $('#toc_using_neural_nets_to_recognize_handwritten_digits').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_how_the_backpropagation_algorithm_works_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_how_the_backpropagation_algorithm_works" src="images/arrow.png" width="15px"></a><a href="chap2.html">How the backpropagation algorithm works</a><div id="toc_how_the_backpropagation_algorithm_works" style="display: none;"><p class="toc_section"><ul><a href="chap2.html#warm_up_a_fast_matrix-based_approach_to_computing_the_output_from_a_neural_network"><li>Warm up: a fast matrix-based approach to computing the output from a neural network</li></a><a href="chap2.html#the_two_assumptions_we_need_about_the_cost_function"><li>The two assumptions we need about the cost function</li></a><a href="chap2.html#the_hadamard_product_$s_\odot_t$"><li>The Hadamard product, $s \odot t$</li></a><a href="chap2.html#the_four_fundamental_equations_behind_backpropagation"><li>The four fundamental equations behind backpropagation</li></a><a href="chap2.html#proof_of_the_four_fundamental_equations_(optional)"><li>Proof of the four fundamental equations (optional)</li></a><a href="chap2.html#the_backpropagation_algorithm"><li>The backpropagation algorithm</li></a><a href="chap2.html#the_code_for_backpropagation"><li>The code for backpropagation</li></a><a href="chap2.html#in_what_sense_is_backpropagation_a_fast_algorithm"><li>In what sense is backpropagation a fast algorithm?</li></a><a href="chap2.html#backpropagation_the_big_picture"><li>Backpropagation: the big picture</li></a></ul></p></div>
<script>
$('#toc_how_the_backpropagation_algorithm_works_reveal').click(function() {
   var src = $('#toc_img_how_the_backpropagation_algorithm_works').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_how_the_backpropagation_algorithm_works").attr('src', 'images/arrow.png');
   };
   $('#toc_how_the_backpropagation_algorithm_works').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_improving_the_way_neural_networks_learn_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_improving_the_way_neural_networks_learn" src="images/arrow.png" width="15px"></a><a href="chap3.html">Improving the way neural networks learn</a><div id="toc_improving_the_way_neural_networks_learn" style="display: none;"><p class="toc_section"><ul><a href="chap3.html#the_cross-entropy_cost_function"><li>The cross-entropy cost function</li></a><a href="chap3.html#overfitting_and_regularization"><li>Overfitting and regularization</li></a><a href="chap3.html#weight_initialization"><li>Weight initialization</li></a><a href="chap3.html#handwriting_recognition_revisited_the_code"><li>Handwriting recognition revisited: the code</li></a><a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters"><li>How to choose a neural network's hyper-parameters?</li></a><a href="chap3.html#other_techniques"><li>Other techniques</li></a></ul></p></div>
<script>
$('#toc_improving_the_way_neural_networks_learn_reveal').click(function() {
   var src = $('#toc_img_improving_the_way_neural_networks_learn').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_improving_the_way_neural_networks_learn").attr('src', 'images/arrow.png');
   };
   $('#toc_improving_the_way_neural_networks_learn').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_a_visual_proof_that_neural_nets_can_compute_any_function" src="images/arrow.png" width="15px"></a><a href="chap4.html">A visual proof that neural nets can compute any function</a><div id="toc_a_visual_proof_that_neural_nets_can_compute_any_function" style="display: none;"><p class="toc_section"><ul><a href="chap4.html#two_caveats"><li>Two caveats</li></a><a href="chap4.html#universality_with_one_input_and_one_output"><li>Universality with one input and one output</li></a><a href="chap4.html#many_input_variables"><li>Many input variables</li></a><a href="chap4.html#extension_beyond_sigmoid_neurons"><li>Extension beyond sigmoid neurons</li></a><a href="chap4.html#fixing_up_the_step_functions"><li>Fixing up the step functions</li></a><a href="chap4.html#conclusion"><li>Conclusion</li></a></ul></p></div>
<script>
$('#toc_a_visual_proof_that_neural_nets_can_compute_any_function_reveal').click(function() {
   var src = $('#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_a_visual_proof_that_neural_nets_can_compute_any_function").attr('src', 'images/arrow.png');
   };
   $('#toc_a_visual_proof_that_neural_nets_can_compute_any_function').toggle('fast', function() {});
});</script><p class='toc_mainchapter'><a id="toc_why_are_deep_neural_networks_hard_to_train_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_why_are_deep_neural_networks_hard_to_train" src="images/arrow.png" width="15px"></a><a href="chap5.html">Why are deep neural networks hard to train?</a><div id="toc_why_are_deep_neural_networks_hard_to_train" style="display: none;"><p class="toc_section"><ul><a href="chap5.html#the_vanishing_gradient_problem"><li>The vanishing gradient problem</li></a><a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets"><li>What's causing the vanishing gradient problem?  Unstable gradients in deep neural nets</li></a><a href="chap5.html#unstable_gradients_in_more_complex_networks"><li>Unstable gradients in more complex networks</li></a><a href="chap5.html#other_obstacles_to_deep_learning"><li>Other obstacles to deep learning</li></a></ul></p></div>
<script>
$('#toc_why_are_deep_neural_networks_hard_to_train_reveal').click(function() {
   var src = $('#toc_img_why_are_deep_neural_networks_hard_to_train').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_why_are_deep_neural_networks_hard_to_train").attr('src', 'images/arrow.png');
   };
   $('#toc_why_are_deep_neural_networks_hard_to_train').toggle('fast', function() {});
});</script>

<p class='toc_mainchapter'><a id="toc_deep_learning_reveal" class="toc_reveal" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';"><img id="toc_img_deep_learning" src="images/arrow.png" width="15px"></a><a href="chap6.html">Deep learning</a><div id="toc_deep_learning" style="display: none;"><p class="toc_section"><ul><a href="chap6.html#introducing_convolutional_networks"><li>Introducing convolutional networks</li></a><a href="chap6.html#convolutional_neural_networks_in_practice"><li>Convolutional neural networks in practice</li></a><a href="chap6.html#the_code_for_our_convolutional_networks"><li>The code for our convolutional networks</li></a><a href="chap6.html#recent_progress_in_image_recognition"><li>Recent progress in image recognition</li></a><a href="chap6.html#other_approaches_to_deep_neural_nets"><li>Other approaches to deep neural nets</li></a><a href="chap6.html#on_the_future_of_neural_networks"><li>On the future of neural networks</li></a></ul></p></div>
<script>
$('#toc_deep_learning_reveal').click(function() {
   var src = $('#toc_img_deep_learning').attr('src');
   if(src == 'images/arrow.png') {
     $("#toc_img_deep_learning").attr('src', 'images/arrow_down.png');
   } else {
     $("#toc_img_deep_learning").attr('src', 'images/arrow.png');
   };
   $('#toc_deep_learning').toggle('fast', function() {});
});</script>


<p class="toc_not_mainchapter"><a href="sai.html">
Appendix: Is there a <i>simple</i> algorithm for intelligence?</a></p>
<p class="toc_not_mainchapter"><a href="acknowledgements.html">Acknowledgements</a></p><p class="toc_not_mainchapter"><a href="faq.html">Frequently Asked Questions</a></p>
<hr>
<span class="sidebar_title">Sponsors</span>
<br/>
<a href='http://www.ersatz1.com/'><img src='assets/ersatz.png' width='140px' style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://gsquaredcapital.com/'><img src='assets/gsquared.png' width='150px' style="padding: 0px 0px 10px 10px; border-style: none;"></a>

<a href='http://www.tineye.com'><img src='assets/tineye.png' width='150px'
style="padding: 0px 0px 10px 8px; border-style: none;"></a>

<a href='http://www.visionsmarts.com'><img
src='assets/visionsmarts.png' width='160px' style="padding: 0px 0px
0px 0px; border-style: none;"></a> <br/>


<!--
<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>.  </p>

<p class="sidebar">The book is currently a beta release, and is still
under active development.  Please send error reports to
mn@michaelnielsen.org.  For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>
-->

<p class="sidebar">Thanks to all the <a
href="supporters.html">supporters</a> who made the book possible.
Thanks also to all the contributors to the <a
href="bugfinder.html">Bugfinder Hall of Fame</a>,
and to the <a
href="translators.html">translators</a> who made the Japanese edition possible.
</p>


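<p class="sidebar">The book is currently a beta release, and is still under active development.
Please send error reports to mn@michaelnielsen.org, and questions about the Japanese edition to muranushi@gmail.com.
For other enquiries, please see the <a
href="faq.html">FAQ</a> first.</p>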


<hr>
<span class="sidebar_title">Resources</span>

<p class="sidebar">
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning">Code repository</a></p>

<p class="sidebar">
<a href="http://eepurl.com/BYr9L">Mailing list for book announcements</a>
</p>

<p class="sidebar">
<a href="http://eepurl.com/0Xxjb">Michael Nielsen's project announcement mailing list</a>
</p>

<hr>
<a href="http://michaelnielsen.org"><img src="assets/Michael_Nielsen_Web_Small.jpg" width="160px" style="border-style: none;"/></a>

<p class="sidebar">
  By <a href="http://michaelnielsen.org">Michael Nielsen</a> / Sep-Dec 2014 <br >  Translated by the <a href="https://github.com/nnadl-ja/nnadl_site_ja">"Neural networks and deep learning" translation project</a>
</p>
</div>
</p>
<p>
  <!--
In the <a href="chap5.html">last chapter</a> we learned that deep neural networks are often much harder to train than shallow neural networks.That's unfortunate, since we have good reason to believe that<em>if</em> we could train deep nets they'd be much more powerful than shallow nets.  But while the news from the last chapter is discouraging, we won't let it stop us.  In this chapter, we'll develop techniques which can be used to train deep networks, and apply them in practice.  We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.  And we'll take a brief, speculative look at what the future may hold for neural nets, and for artificial intelligence.-->
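In the <a href="chap5.html">last chapter</a> we learned that deep neural networks are often much harder to train than shallow neural networks.
That's unfortunate, since we have good reason to believe that <em>if</em> we could train deep nets, they'd be much more powerful than shallow nets.
But while the news from the last chapter is discouraging, we won't let it stop us.
In this chapter, we'll develop techniques which can be used to train deep networks, and apply them in practice.
We'll also look at the broader picture, briefly reviewing recent progress on using deep nets for image recognition, speech recognition, and other applications.
And we'll take a brief, speculative look at what the future may hold for neural nets, and for artificial intelligence.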
</p>
<p>
<!--The chapter is a long one.  To help you navigate, let's take a tour. The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.-->
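The chapter is a long one.
To help you navigate, let's take a tour.
The sections are only loosely coupled, so provided you have some basic familiarity with neural nets, you can jump to whatever most interests you.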
</p>
<p>
<!--
The <a href="#convolutional_networks">main part of the chapter</a> is an introduction to one of the most widely used types of deep network:deep convolutional networks.  We'll work through a detailed example- code and all - of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set:-->
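The <a href="#convolutional_networks">main part of the chapter</a> is an introduction to one of the most widely used types of deep network: deep convolutional networks.
We'll work through a detailed example, code and all, of using convolutional nets to solve the problem of classifying handwritten digits from the MNIST data set: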
</p><p><center><img src="images/digits.png" width="160px"></center></p>
<p>
<!--We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book.  Through many iterations we'll build up more and more powerful networks.  As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.  The result will be a system that offers near-human performance.  Of the 10,000 MNIST test images - images not seen during training! - our system will classify 9,967 correctly.  Here's a peek at the 33 images which are misclassified.  Note that the correct classification is in the top right; our program's classification is in the bottom right:-->
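We'll start our account of convolutional networks with the shallow networks used to attack this problem earlier in the book.
Through many iterations we'll build up more and more powerful networks.
As we go we'll explore many powerful techniques: convolutions, pooling, the use of GPUs to do far more training than we did with our shallow networks, the algorithmic expansion of our training data (to reduce overfitting), the use of the dropout technique (also to reduce overfitting), the use of ensembles of networks, and others.
The result will be a system that offers near-human performance.
Of the 10,000 MNIST test images (images not seen during training!) our system will classify 9,967 correctly.
Here's a peek at the 33 images which are misclassified.
Note that the correct classification is in the top right of each digit; our program's classification is in the bottom right: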
</p>
<p>
<center><img src="images/ensemble_errors.png" width="580px"></center></p><p>
<!--Many of these are tough even for a human to classify.  Consider, for example, the third image in the top row.  To me it looks more like a "9" than an "8", which is the official classification.  Our network also thinks it's a "9".  This kind of "error" is at the very least understandable, and perhaps even commendable.  We conclude our discussion of image recognition with a <a href="#recent_progress_in_image_recognition">survey of some of the  spectacular recent progress</a> using networks (particularly convolutional nets) to do image recognition.-->
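Many of these are tough even for a human to classify.
Consider, for example, the third image in the top row.
To me it looks more like a "9" than an "8", which is the official classification.
Our network also thinks it's a "9".
This kind of "error" is at the very least understandable, and perhaps even commendable.
We conclude our discussion of image recognition with a <a href="#recent_progress_in_image_recognition">survey of some of the spectacular recent progress</a> using networks (particularly convolutional nets) to do image recognition.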
</p>
<p>
<!--
The remainder of the chapter discusses deep learning from a broader and less detailed perspective.  We'll<a href="#things_we_didn't_cover_but_which_you'll_eventually_want_to_know">briefly  survey other models of neural networks</a>, such as recurrent neuralnets and long short-term memory units, and how such models can be applied to problems in speech recognition, natural language processing, and other areas.  And we'll<a href="#on_the_future_of_neural_networks">speculate about the  future of neural networks and deep learning</a>, ranging from ideas like intention-driven user interfaces, to the role of deep learning inartificial intelligence.-->
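The remainder of the chapter discusses deep learning from a broader and less detailed perspective.
We'll <a href="#things_we_didn't_cover_but_which_you'll_eventually_want_to_know">briefly survey other models of neural networks</a>, such as recurrent neural nets (RNNs) and long short-term memory (LSTM) units,
and how such models can be applied to problems in speech recognition, natural language processing, and other areas.
And we'll <a href="#on_the_future_of_neural_networks">speculate about the future of neural networks and deep learning</a>, ranging from ideas like intention-driven user interfaces, to the role of deep learning in artificial intelligence.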
</p>
<p>
<!--The chapter builds on the earlier chapters in the book, making use of and integrating ideas such as backpropagation, regularization, the softmax function, and so on.  However, to read the chapter you don't need to have worked in detail through all the earlier chapters.  It will, however, help to have read <a href="chap1.html">Chapter 1</a>, on the basics of neural networks.  When I use concepts from Chapters 2 to 5,I provide links so you can familiarize yourself, if necessary.-->
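The chapter builds on the earlier chapters in the book, making use of and integrating ideas such as backpropagation, regularization, the softmax function, and so on.
However, to read the chapter you don't need to have worked in detail through all the earlier chapters.
It will help, though, to have read <a href="chap1.html">Chapter 1</a>, on the basics of neural networks.
When I use concepts from Chapters 2 to 5, I provide links so you can familiarize yourself, if necessary.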
</p>
<p>
<!--
It's worth noting what the chapter is not.  It's not a tutorial on the　latest and greatest neural networks libraries.  Nor are we going to be　training deep networks with dozens of layers to solve problems at the　very leading edge.  Rather, the focus is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem.  Put another way: the chapter is not going to bring you right up to the frontier. Rather, the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of　current work.-->
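It's worth noting what the chapter is not.
It's not a tutorial on the latest and greatest neural networks libraries.
Nor are we going to be training deep networks with dozens of layers to solve problems at the very leading edge.
Rather, the focus is on understanding some of the core principles behind deep neural networks, and applying them in the simple, easy-to-understand context of the MNIST problem.
Put another way: the chapter is not going to bring you right up to the frontier.
Rather, the intent of this and earlier chapters is to focus on fundamentals, and so to prepare you to understand a wide range of current work.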
</p>
<p>
<h3><a name="introducing_convolutional_networks"></a><a href="#introducing_convolutional_networks">Introducing convolutional networks</a></h3></p>
<p><!--In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:
-->
In earlier chapters, we taught our neural networks to do a pretty good job recognizing images of handwritten digits:
</p>
<p>
<center><img src="images/digits.png" width="160px"></center>
</p>
<p>
<!--We did this using networks in which adjacent network layers are fully connected to one another.  That is, every neuron in the network is connected to every neuron in adjacent layers:-->
We did this using networks in which adjacent network layers are fully connected to one another.
That is, every neuron in the network is connected to every neuron in adjacent layers:
</p>
<p>
<center><img src="images/tikz41.png"/></center>
</p>
<p>
<!--
In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer.  For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.  We then trained the network's weights and biases so that the network's output would - we hope! - correctly identify the input image: '0', '1','2', ..., '8', or '9'.-->
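In particular, for each pixel in the input image, we encoded the pixel's intensity as the value for a corresponding neuron in the input layer.
For the $28 \times 28$ pixel images we've been using, this means our network has $784$ ($= 28 \times 28$) input neurons.
We then trained the network's weights and biases so that the network's output would (we hope!) correctly identify the input image: '0', '1', '2', ..., '8', or '9'.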
</p>
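<p>
To make the encoding concrete, here is a minimal sketch in plain Python. The helper name <code>flatten_image</code> and the dummy image are illustrative assumptions, not part of the book's code; the sketch only shows how a $28 \times 28$ grid of intensities becomes a list of $784$ input values, assuming row-major order.
</p>

```python
# Hypothetical helper (not from the book's code): flatten a 28x28 grid of
# pixel intensities into the 784 values fed to the input layer.
def flatten_image(image):
    """Return the rows of `image` joined end to end (row-major order)."""
    return [pixel for row in image for pixel in row]


# A dummy 28x28 "image" whose pixel at (row, col) holds row * 28 + col,
# so each entry records its own position in the flattened list.
image = [[row * 28 + col for col in range(28)] for row in range(28)]
inputs = flatten_image(image)

print(len(inputs))         # 784 input neurons
print(inputs[3 * 28 + 5])  # the pixel from row 3, column 5
```

<p>
Note that after flattening, two pixels that were vertical neighbours in the image sit $28$ positions apart in the list, and nothing in the input layer records that they were adjacent.
</p>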
<p>
<!--
Our earlier networks work pretty well: we've
<a href="chap3.html#98percent">obtained a classification accuracy better
  than 98 percent</a>, using training and test data from the
<a href="chap1.html#learning_with_gradient_descent">MNIST handwritten
  digit data set</a>.  But upon reflection, it's strange to use networks
with fully-connected layers to classify images.  The reason is that
such a network architecture does not take into account the spatial
structure of the images.  For instance, it treats input pixels which
are far apart and close together on exactly the same footing.  Such
concepts of spatial structure must instead be inferred from the
training data.  But what if, instead of starting with a network
architecture which is <em>tabula rasa</em>, we used an architecture
which tries to take advantage of the spatial structure?  In this
section I describe <em>convolutional neural networks</em>*
-->
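Our earlier networks work pretty well:
we've <a href="chap3.html#98percent">obtained a classification accuracy better than 98 percent</a>, using training and test data from the <a href="chap1.html#learning_with_gradient_descent">MNIST handwritten digit data set</a>.
But upon reflection, it's strange to use networks with fully-connected layers to classify images.
The reason is that such a network architecture does not take into account the spatial structure of the images.
For instance, it treats input pixels which are far apart and input pixels which are close together on exactly the same footing.
Such concepts of spatial structure must instead be inferred from the training data.
But what if, instead of starting with a network architecture which is <em>tabula rasa</em>, we used an architecture which tries to take advantage of the spatial structure?
In this section I describe <em>convolutional neural networks</em>*.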
<!--
<span class="marginnote">
*The
  origins of convolutional neural networks go back to the 1970s.  But
  the seminal paper establishing the modern subject of convolutional
  networks was a 1998 paper,
  <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based
    learning applied to document recognition"</a>, by Yann LeCun,
  Léon Bottou, Yoshua Bengio, and Patrick Haffner.
  LeCun has since made an interesting
  <a href="https://www.facebook.com/yann.lecun/posts/10152348155137143">remark</a>
  on the terminology for convolutional nets: "The [biological] neural
  inspiration in models like convolutional nets is very
  tenuous. That's why I call them 'convolutional nets' not
  'convolutional neural nets', and why we call the nodes 'units' and
  not 'neurons' ".  Despite this remark, convolutional nets use many
  of the same ideas as the neural networks we've studied up to now:
  ideas such as backpropagation, gradient descent, regularization,
  non-linear activation functions, and so on.  And so we will follow
  common practice, and consider them a type of neural network.  I will
  use the terms "convolutional neural network" and "convolutional
  net(work)" interchangeably.  I will also use the terms
  "[artificial] neuron" and "unit" interchangeably.</span>.-->
<span class="marginnote">
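  *The origins of convolutional neural networks go back to the 1970s.
  But the seminal paper establishing the modern subject of convolutional networks was a 1998 paper, <a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based learning applied to document recognition"</a>, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner.
  LeCun has since made an interesting <a href="https://www.facebook.com/yann.lecun/posts/10152348155137143">remark</a> on the terminology for convolutional nets:
  "The [biological] neural inspiration in models like convolutional nets is very tenuous. That's why I call them 'convolutional nets' not 'convolutional neural nets', and why we call the nodes 'units' and not 'neurons' ".
  Despite this remark, convolutional nets use many of the same ideas as the neural networks we've studied up to now:
  ideas such as backpropagation, gradient descent, regularization, non-linear activation functions, and so on.
  And so we will follow common practice, and consider them a type of neural network.
  I will use the terms "convolutional neural network" and "convolutional net(work)" interchangeably.
  I will also use the terms "[artificial] neuron" and "unit" interchangeably.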
  </span>
<!--These networks use a special architecture which is particularly well-adapted to classify images.  Using this architecture makes convolutional networks fast to train.  This, in turn, helps us train deep, many-layer networks, which are very good at classifying images.
Today, deep convolutional networks or some close variant are used in
most neural networks for image recognition.-->
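These networks use a special architecture which is particularly well-adapted to classify images.
Using this architecture makes convolutional networks fast to train.
This, in turn, helps us train deep, many-layer networks, which are very good at classifying images.
Today, deep convolutional networks or some close variant are used in most neural networks for image recognition.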
</p>
<p>
<!--Convolutional neural networks use three basic ideas: <em>local
  receptive fields</em>, <em>shared weights</em>, and <em>pooling</em>.  Let's
look at each of these ideas in turn.-->
Convolutional neural networks use three basic ideas:
<em>local receptive fields</em>, <em>shared weights</em>, and <em>pooling</em>.
Let's look at each of these ideas in turn.
</p>
<p>
<!--
<strong>Local receptive fields:</strong> In the fully-connected layers shown
earlier, the inputs were depicted as a vertical line of neurons.  In a
convolutional net, it'll help to think instead of the inputs as a $28
\times 28$ square of neurons, whose values correspond to the $28
\times 28$ pixel intensities we're using as inputs:
-->
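<strong>Local receptive fields:</strong>
In the fully-connected layers shown earlier, the inputs were depicted as a vertical line of neurons.
In a convolutional net, it'll help to think instead of the inputs as a $28 \times 28$ square of neurons,
whose values correspond to the $28 \times 28$ pixel intensities we're using as inputs: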
</p>
<p>
<center>
<img src="images/tikz42.png"/>
</center>
</p>
<p>
<!--
As per usual, we'll connect the input pixels to a layer of hidden
neurons.  But we won't connect every input pixel to every hidden
neuron.  Instead, we only make connections in small, localized regions
of the input image.
-->
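As per usual, we'll connect the input pixels to a layer of hidden neurons.
But we won't connect every input pixel to every hidden neuron.
Instead, we only make connections in small, localized regions of the input image.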
</p>
<p>
<!--
To be more precise, each neuron in the first hidden layer will be
connected to a small region of the input neurons, say, for example, a
$5 \times 5$ region, corresponding to $25$ input pixels.  So, for a
particular hidden neuron, we might have connections that look like
this:
-->
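To be more precise, each neuron in the first hidden layer will be connected to a small region of the input neurons,
say, for example, a $5 \times 5$ region, corresponding to $25$ input pixels.
So, for a particular hidden neuron, we might have connections that look like this: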
<center>
<img src="images/tikz43.png"/>
</center>
</p>
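<p>
As a rough illustration of which inputs a single hidden neuron sees, the sketch below lists the coordinates in a $5 \times 5$ region. The function name <code>receptive_field</code> is hypothetical, and the sketch assumes that the hidden neuron at row $j$, column $k$ is connected to the region whose top-left corner is the input pixel at row $j$, column $k$.
</p>

```python
# Hypothetical helper: the (row, column) input coordinates connected to
# the hidden neuron at position (j, k). Assumes the region's top-left
# corner is the input pixel (j, k).
def receptive_field(j, k, field_size=5):
    """Return the input coordinates inside the field_size x field_size region."""
    return [(j + l, k + m)
            for l in range(field_size)
            for m in range(field_size)]


field = receptive_field(0, 0)
print(len(field))           # 25 input pixels
print(field[0], field[-1])  # corners of the region: (0, 0) and (4, 4)
```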
<p>
<!--
That region in the input image is called the <em>local receptive
  field</em> for the hidden neuron.  It's a little window on the input
pixels.  Each connection learns a weight.  And the hidden neuron
learns an overall bias as well.  You can think of that particular
hidden neuron as learning to analyze its particular local receptive
field.
-->
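That region in the input image is called the <em>local receptive field</em> for the hidden neuron.
It's a little window on the input pixels.
Each connection learns a weight.
And the hidden neuron learns an overall bias as well.
You can think of that particular hidden neuron as learning to analyze its particular local receptive field.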
</p>
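<p>
The following sketch shows how one hidden neuron might combine its $25$ weights and its bias into an output. It assumes a sigmoid activation, as used for the neurons in earlier chapters, and the names <code>hidden_neuron_output</code>, <code>weights</code> and <code>bias</code> are illustrative choices rather than the book's code.
</p>

```python
import math


def sigmoid(z):
    """The sigmoid function (assumed here as the activation)."""
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical helper: output of the hidden neuron at (j, k), given one
# weight per connection into its local receptive field and a single bias.
def hidden_neuron_output(image, weights, bias, j, k):
    field_size = len(weights)
    z = bias
    for l in range(field_size):
        for m in range(field_size):
            z += weights[l][m] * image[j + l][k + m]
    return sigmoid(z)


image = [[0.0] * 28 for _ in range(28)]   # an all-black dummy image
weights = [[0.1] * 5 for _ in range(5)]   # arbitrary illustrative weights
out = hidden_neuron_output(image, weights, 0.0, 0, 0)
print(out)  # 0.5, since every input and the bias are zero
```

<p>
Only the $25$ pixels inside the window contribute to the weighted sum; the rest of the image has no effect on this neuron.
</p>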
<p>
<!--
We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in
the first hidden layer.  To illustrate this concretely, let's start
with a local receptive field in the top-left corner:
-->
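We then slide the local receptive field across the entire input image.
For each local receptive field, there is a different hidden neuron in the first hidden layer.
To illustrate this concretely,
let's start with a local receptive field in the top-left corner: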
<center>
<img src="images/tikz44.png"/>
</center>
</p>
<p>
<!--
Then we slide the local receptive field over by one pixel to the right
(i.e., by one neuron), to connect to a second hidden neuron:
-->
Then we slide the local receptive field over by one pixel to the right (i.e., by one neuron), to connect to a second hidden neuron:
</p>
<p>
<center>
<img src="images/tikz45.png"/>
</center>
</p>
<p>
<!--
And so on, building up the first hidden layer.  Note that if we have a
$28 \times 28$ input image, and $5 \times 5$ local receptive fields,
then there will be $24 \times 24$ neurons in the hidden layer.  This
is because we can only move the local receptive field $23$ neurons
across (or $23$ neurons down), before colliding with the right-hand
side (or bottom) of the input image.
-->
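And so on, building up the first hidden layer.
Note that if we have a $28 \times 28$ input image, and $5 \times 5$ local receptive fields, then there will be $24 \times 24$ neurons in the hidden layer.
This is because we can only move the local receptive field $23$ neurons across (or $23$ neurons down), before colliding with the right-hand side (or bottom) of the input image.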
</p>
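<p>
As a quick check of this count, assuming the field moves one pixel at a time and always stays entirely inside the image, the number of positions along each axis is the $23$ moves plus the starting position:
</p>
<p>
$$28 - 5 + 1 = 24, \qquad 24 \times 24 = 576 \ \text{hidden neurons}.$$
</p>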
<p>
  <!--I've shown the local receptive field being moved by one pixel at a
time.  In fact, sometimes a different <em>stride length</em> is used.
For instance, we might move the local receptive field $2$ pixels to
the right (or down), in which case we'd say a stride length of $2$ is
used.  In this chapter we'll mostly stick with stride length $1$, but
it's worth knowing that people sometimes experiment with different
stride lengths*-->
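I've shown the local receptive field being moved by one pixel at a time.
In fact, sometimes a different <em>stride length</em> is used.
For instance, we might move the local receptive field $2$ pixels to the right (or down),
in which case we'd say a stride length of $2$ is used.
In this chapter we'll mostly stick with stride length $1$,
but it's worth knowing that people sometimes experiment with different stride lengths*.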
<span class="marginnote">
<!--
*As was done in earlier chapters, if we're
  interested in trying different stride lengths then we can use
  validation data to pick out the stride length which gives the best
  performance.  For more details, see the
  <a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">earlier
    discussion</a> of how to choose hyper-parameters in a neural network.
  The same approach may also be used to choose the size of the local
  receptive field - there is, of course, nothing special about using
  a $5 \times 5$ local receptive field.  In general, larger local
  receptive fields tend to be helpful when the input images are
  significantly larger than the $28 \times 28$ pixel MNIST images.
-->
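  *As was done in earlier chapters, if we're interested in trying different stride lengths then we can use validation data to pick out the stride length which gives the best performance.
  For more details, see the <a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">earlier discussion</a> of how to choose hyper-parameters in a neural network.
  The same approach may also be used to choose the size of the local receptive field.
  There is, of course, nothing special about using a $5 \times 5$ local receptive field.
  In general, larger local receptive fields tend to be helpful when the input images are significantly larger than the $28 \times 28$ pixel MNIST images.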
</span>
</p>
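<p>
The counting argument above can be written as a small helper. This is only a sketch: the function name <tt>hidden_layer_size</tt> is made up for illustration and is not part of the book's code, and it assumes no padding, so the local receptive field must stay entirely inside the image.
</p>

```python
def hidden_layer_size(input_size, field_size, stride=1):
    """Number of positions a local receptive field can take along one axis."""
    if field_size > input_size:
        raise ValueError("local receptive field is larger than the input")
    # The field can slide (input_size - field_size) pixels in total, taking
    # one step every `stride` pixels; the +1 counts the starting position.
    return (input_size - field_size) // stride + 1


print(hidden_layer_size(28, 5))            # 24
print(hidden_layer_size(28, 5, stride=2))  # 12
```

<p>
With a stride length of $2$ the field stops at offset $22$, so under these assumptions the last pixel column is never covered. How such leftovers are handled is a design choice this sketch does not address.
</p>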
<p>
<!--
<strong>Shared weights and biases:</strong> I've said that each hidden neuron
has a bias and $5 \times 5$ weights connected to its local receptive
field.  What I did not yet mention is that we're going to use the
<em>same</em> weights and bias for each of the $24 \times 24$ hidden
neurons.  In other words, for the $j, k$th hidden neuron, the output
is:
<a class="displaced_anchor" name="eqtn125"></a>\begin{eqnarray}
  \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right).
\tag{125}\end{eqnarray}
Here, $\sigma$ is the neural activation function - perhaps the
<a href="chap1.html#sigmoid_neurons">sigmoid function</a> we used in
earlier chapters.  $b$ is the shared value for the bias.  $w_{l,m}$ is
a $5 \times 5$ array of shared weights.  And, finally, we use $a_{x,
  y}$ to denote the input activation at position $x, y$.
-->
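<strong>Shared weights and biases:</strong>
I've said that each hidden neuron has a bias and $5 \times 5$ weights connected to its local receptive field.
What I did not yet mention is that we're going to use the <em>same</em> weights and bias for each of the $24 \times 24$ hidden neurons.
In other words, for the $j, k$th hidden neuron, the output is: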
<a class="displaced_anchor" name="eqtn125"></a>\begin{eqnarray}
  \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right).
\tag{125}\end{eqnarray}
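Here, $\sigma$ is the neural activation function, perhaps the <a href="chap1.html#sigmoid_neurons">sigmoid function</a> we used in earlier chapters.
$b$ is the shared value for the bias.
$w_{l,m}$ is a $5 \times 5$ array of shared weights.
And, finally, we use $a_{x, y}$ to denote the input activation at position $x, y$.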
</p>
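<p>
To make Equation (125) concrete, here is a minimal pure-Python sketch that applies one set of shared weights and one shared bias at every position of the input. The names <tt>feature_map</tt>, <tt>image</tt> and <tt>kernel</tt> are chosen for illustration only, the kernel is assumed to be square, and the sigmoid is assumed as the activation function. It is written for readability, not speed, and is not how <tt>network3.py</tt> computes the layer.
</p>

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def feature_map(a, w, b):
    """Output of one feature map: shared weights w and shared bias b
    applied to every local receptive field of the input a."""
    field = len(w)                 # 5 for a 5x5 local receptive field
    rows = len(a) - field + 1      # 24 for a 28x28 input
    cols = len(a[0]) - field + 1
    out = [[0.0] * cols for _ in range(rows)]
    for j in range(rows):
        for k in range(cols):
            z = b
            # The double sum over l and m in Equation (125)
            for l in range(field):
                for m in range(field):
                    z += w[l][m] * a[j + l][k + m]
            out[j][k] = sigmoid(z)
    return out


# Hypothetical data: a blank 28x28 image and an all-zero 5x5 kernel
image = [[0.0] * 28 for _ in range(28)]
kernel = [[0.0] * 5 for _ in range(5)]
hidden = feature_map(image, kernel, b=0.0)
print(len(hidden), len(hidden[0]))  # 24 24
print(hidden[0][0])                 # 0.5
```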
<p>
<!--This means that all the neurons in the first hidden layer detect
exactly the same feature*<span class="marginnote">
*I haven't precisely defined the
  notion of a feature.  Informally, think of the feature detected by a
  hidden neuron as the kind of input pattern that will cause the
  neuron to activate: it might be an edge in the image, for instance,
  or maybe some other type of shape. </span>, just at different locations in
the input image.  To see why this makes sense, suppose the weights and
bias are such that the hidden neuron can pick out, say, a vertical
edge in a particular local receptive field.  That ability is also
likely to be useful at other places in the image.  And so it is useful
to apply the same feature detector everywhere in the image.  To put it
in slightly more abstract terms, convolutional networks are well
adapted to the translation invariance of images: move a picture of a
cat (say) a little ways, and it's still an image of a cat*-->
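This means that all the neurons in the first hidden layer detect exactly the same feature*, just at different locations in the input image.
<span class="marginnote">
  *I haven't precisely defined the notion of a feature.
  Informally, think of the feature detected by a hidden neuron as the kind of input pattern that will cause the neuron to activate:
  it might be an edge in the image, for instance, or maybe some other type of shape.
  </span>
To see why this makes sense,
suppose the weights and bias are such that the hidden neuron can pick out, say, a vertical edge in a particular local receptive field.
That ability is also likely to be useful at other places in the image.
And so it is useful to apply the same feature detector everywhere in the image.
To put it in slightly more abstract terms, convolutional networks are well adapted to the translation invariance of images:
move a picture of a cat (say) a little ways, and it's still an image of a cat*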
<!--
<span class="marginnote">
*In
  fact, for the MNIST digit classification problem we've been
  studying, the images are centered and size-normalized.  So MNIST has
  less translation invariance than images found "in the wild", so to
  speak.  Still, features like edges and corners are likely to be
  useful across much of the input space. </span>.
-->
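<span class="marginnote">
  *In fact, for the MNIST digit classification problem we've been studying, the images are centered and size-normalized.
  So MNIST has less translation invariance than images found "in the wild", so to speak.
  Still, features like edges and corners are likely to be useful across much of the input space.</span>.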
</p>
<p>
<!--
For this reason, we sometimes call the map from the input layer to the
hidden layer a <em>feature map</em>.  We call the weights defining the
feature map the <em>shared weights</em>.  And we call the bias defining
the feature map in this way the <em>shared bias</em>.  The shared
weights and bias are often said to define a <em>kernel</em> or
<em>filter</em>.  In the literature, people sometimes use these terms in
slightly different ways, and for that reason I'm not going to be more
precise; rather, in a moment, we'll look at some concrete examples.
-->
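For this reason, we sometimes call the map from the input layer to the hidden layer a <em>feature map</em>.
We call the weights defining the feature map the <em>shared weights</em>.
And we call the bias defining the feature map in this way the <em>shared bias</em>.
The shared weights and bias are often said to define a <em>kernel</em> or <em>filter</em>.
In the literature, people sometimes use these terms in slightly different ways,
and for that reason I'm not going to be more precise;
rather, in a moment, we'll look at some concrete examples.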
</p>
<p>
<!--The network structure I've described so far can detect just a single
kind of localized feature.  To do image recognition we'll need more
than one feature map.  And so a complete convolutional layer consists
of several different feature maps:-->
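The network structure I've described so far can detect just a single kind of localized feature.
To do image recognition we'll need more than one feature map.
And so a complete convolutional layer consists of several different feature maps: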
</p>
<p>
<center>
<img src="images/tikz46.png"/>
</center>
<!--
In the example shown, there are $3$ feature maps.  Each feature map is
defined by a set of $5 \times 5$ shared weights, and a single shared
bias.  The result is that the network can detect $3$ different kinds
of features, with each feature being detectable across the entire
image.-->
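In the example shown, there are $3$ feature maps.
Each feature map is defined by a set of $5 \times 5$ shared weights and a single shared bias.
The result is that the network can detect $3$ different kinds of features, with each feature being detectable across the entire image.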
</p>
<p>
<!--
I've shown just $3$ feature maps, to keep the diagram above simple.
However, in practice convolutional networks may use more (and perhaps
many more) feature maps.  One of the early convolutional networks,
LeNet-5, used $6$ feature maps, each associated to a $5 \times 5$
local receptive field, to recognize MNIST digits.  So the example
illustrated above is actually pretty close to LeNet-5.  In the
examples we develop later in the chapter we'll use convolutional
layers with $20$ and $40$ feature maps.  Let's take a quick peek at
some of the features which are learned*<span class="marginnote">
*The feature maps
  illustrated come from the final convolutional network we train, see
  <a href="#final_conv">here</a>.</span>:
-->
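I've shown just $3$ feature maps, to keep the diagram above simple.
However, in practice convolutional networks may use more (and perhaps many more) feature maps.
One of the early convolutional networks, LeNet-5, used $6$ feature maps,
each associated to a $5 \times 5$ local receptive field, to recognize MNIST digits.
So the example illustrated above is actually pretty close to LeNet-5.
In the examples we develop later in the chapter we'll use convolutional layers with $20$ and $40$ feature maps.
Let's take a quick peek at some of the features which are learned*<span class="marginnote">
*The feature maps illustrated come from the final convolutional network we train, see <a href="#final_conv">here</a>.</span>: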
</p>
<p>
<center><img src="images/net_full_layer_0.png" width="400px"></center>
</p>
<p>
<!--The $20$ images correspond to $20$ different feature maps (or filters,
or kernels).  Each map is represented as a $5 \times 5$ block image,
corresponding to the $5 \times 5$ weights in the local receptive
field.  Whiter blocks mean a smaller (typically, more negative)
weight, so the feature map responds less to corresponding input
pixels.  Darker blocks mean a larger weight, so the feature map
responds more to the corresponding input pixels.  Very roughly
speaking, the images above show the type of features the convolutional
layer responds to.-->
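The $20$ images correspond to $20$ different feature maps (or filters, or kernels).
Each map is represented as a $5 \times 5$ block image, corresponding to the $5 \times 5$ weights in the local receptive field.
Whiter blocks mean a smaller (typically, more negative) weight, so the feature map responds less to the corresponding input pixels.
Darker blocks mean a larger weight, so the feature map responds more to the corresponding input pixels.
Very roughly speaking, the images above show the type of features the convolutional layer responds to.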
</p>
<p>
<!--
So what can we conclude from these feature maps?  It's clear there is
spatial structure here beyond what we'd expect at random: many of the
features have clear sub-regions of light and dark.  That shows our
network really is learning things related to the spatial structure.
However, beyond that, it's difficult to see what these feature
detectors are learning.  Certainly, we're not learning (say) the
<a href="http://en.wikipedia.org/wiki/Gabor_filter">Gabor filters</a> which
have been used in many traditional approaches to image recognition.
In fact, there's now a lot of work on better understanding the
features learnt by convolutional networks.  If you're interested in
following up on that work, I suggest starting with the paper
<a href="http://arxiv.org/abs/1311.2901">Visualizing and Understanding
  Convolutional Networks</a> by Matthew Zeiler and Rob Fergus (2013).
-->
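So what can we conclude from these feature maps?
It's clear there is spatial structure here beyond what we'd expect at random:
many of the features have clear sub-regions of light and dark.
That shows our network really is learning things related to the spatial structure.
However, beyond that, it's difficult to see what these feature detectors are learning.
Certainly, we're not learning (say) the <a href="http://en.wikipedia.org/wiki/Gabor_filter">Gabor filters</a> which have been used in many traditional approaches to image recognition.
In fact, there's now a lot of work on better understanding the features learnt by convolutional networks.
If you're interested in following up on that work, I suggest starting with the paper <a href="http://arxiv.org/abs/1311.2901">Visualizing and Understanding Convolutional Networks</a> by Matthew Zeiler and Rob Fergus (2013).</p>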
<p></p>
<p>
<!--
A big advantage of sharing weights and biases is that it greatly
reduces the number of parameters involved in a convolutional network.
For each feature map we need $25 = 5 \times 5$ shared weights, plus a
single shared bias. So each feature map requires $26$ parameters.  If
we have $20$ feature maps that's a total of $20 \times 26 = 520$
parameters defining the convolutional layer.  By comparison, suppose
we had a fully connected first layer, with $784 = 28 \times 28$ input
neurons, and a relatively modest $30$ hidden neurons, as we used in
many of the examples earlier in the book.  That's a total of $784
\times 30$ weights, plus an extra $30$ biases, for a total of $23,550$
parameters.  In other words, the fully-connected layer would have more
than $40$ times as many parameters as the convolutional layer.
-->
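A big advantage of sharing weights and biases is that it greatly reduces the number of parameters involved in a convolutional network.
Let's count the parameters in the convolutional layer described above.
For each feature map we need $25 = 5 \times 5$ shared weights, plus a single shared bias.
So each feature map requires $26$ parameters.
If we have $20$ feature maps that's a total of $20 \times 26 = 520$ parameters defining the convolutional layer.
By comparison, let's count the parameters in a fully connected first layer.
Suppose we had $784 = 28 \times 28$ input neurons and a relatively modest $30$ hidden neurons, as we used in many of the examples earlier in the book.
That's a total of $784 \times 30$ weights, plus an extra $30$ biases, for a total of $23,550$
parameters.
In other words, the fully-connected layer would have more than $40$ times as many parameters as the convolutional layer.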
</p>
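<p>
The arithmetic in this comparison is easy to check. The two helper functions below are hypothetical names used only for this illustration; they simply restate the counting rules from the paragraph above.
</p>

```python
def conv_layer_params(num_feature_maps, field_size):
    # Each feature map has field_size x field_size shared weights
    # plus a single shared bias.
    return num_feature_maps * (field_size * field_size + 1)


def fully_connected_params(num_inputs, num_outputs):
    # One weight per input-output pair, plus one bias per output neuron.
    return num_inputs * num_outputs + num_outputs


conv = conv_layer_params(20, 5)              # 520
full = fully_connected_params(28 * 28, 30)   # 23550
print(conv, full, round(full / conv, 1))     # 520 23550 45.3
```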
<p>
<!--
Of course, we can't really do a direct comparison between the number
of parameters, since the two models are different in essential ways.
But, intuitively, it seems likely that the use of translation
invariance by the convolutional layer will reduce the number of
parameters it needs to get the same performance as the fully-connected
model.  That, in turn, will result in faster training for the
convolutional model, and, ultimately, will help us build deep networks
using convolutional layers.-->
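Of course, we can't really do a direct comparison between the number of parameters, since the two models are different in essential ways.
But, intuitively, it seems likely that the use of translation invariance by the convolutional layer will reduce the number of parameters it needs to get the same performance as the fully-connected model.
That, in turn, will result in faster training for the convolutional model, and, ultimately, will help us build deep networks using convolutional layers.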
</p>
<p>
<!--
Incidentally, the name <em>convolutional</em> comes from the fact that
the operation in Equation <span id="margin_903716135329_reveal" class="equation_link">(125)</span><span id="margin_903716135329" class="marginequation" style="display: none;"><a href="chap6.html#eqtn125" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}</a></span><script>$('#margin_903716135329_reveal').click(function() {$('#margin_903716135329').toggle('slow', function() {});});</script> is sometimes known as a
<em>convolution</em>.  A little more precisely, people sometimes write
that equation as $a^1 = \sigma(b + w * a^0)$, where $a^1$ denotes the
set of output activations from one feature map, $a^0$ is the set of
input activations, and $*$ is called a convolution operation.  We're
not going to make any deep use of the mathematics of convolutions, so
you don't need to worry too much about this connection.  But it's
worth at least knowing where the name comes from.
-->
Incidentally, the name <em>convolutional</em> comes from the fact that the operation in Equation <span id="margin_903716135329_reveal" class="equation_link">(125)</span><span id="margin_903716135329" class="marginequation" style="display: none;"><a href="chap6.html#eqtn125" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \sigma\left(b + \sum_{l=0}^4 \sum_{m=0}^4  w_{l,m} a_{j+l, k+m} \right) \nonumber\end{eqnarray}</a></span><script>$('#margin_903716135329_reveal').click(function() {$('#margin_903716135329').toggle('slow', function() {});});</script> is sometimes known as a <em>convolution</em>.
A little more precisely, people sometimes write that equation as $a^1 = \sigma(b + w * a^0)$,
where $a^1$ denotes the set of output activations from one feature map, $a^0$ is the set of input activations, and $*$ is called a convolution operation.
We're not going to make any deep use of the mathematics of convolutions,
so you don't need to worry too much about this connection.
But it's worth at least knowing where the name comes from.
</p><p></p><p></p><p></p><p></p><p>
<!--
<strong>Pooling layers:</strong> In addition to the convolutional layers just
described, convolutional neural networks also contain <em>pooling
  layers</em>.  Pooling layers are usually used immediately after
convolutional layers.  What the pooling layers do is simplify the
information in the output from the convolutional layer.
-->
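<strong>Pooling layers:</strong>
In addition to the convolutional layers just described, convolutional neural networks also contain <em>pooling layers</em>.
Pooling layers are usually used immediately after convolutional layers.
What the pooling layers do is simplify the information in the output from the convolutional layer.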
</p><p></p><p>
<!--
In detail, a pooling layer takes each feature map*<span class="marginnote">
*The
  nomenclature is being used loosely here.  In particular, I'm using
  "feature map" to mean not the function computed by the
  convolutional layer, but rather the activation of the hidden neurons
  output from the layer.  This kind of mild abuse of nomenclature is
  pretty common in the research literature.</span> output from the
convolutional layer and prepares a condensed feature map.  For
instance, each unit in the pooling layer may summarize a region of
(say) $2 \times 2$ neurons in the previous layer.  As a concrete
example, one common procedure for pooling is known as
<em>max-pooling</em>.  In max-pooling, a pooling unit simply outputs the
maximum activation in the $2 \times 2$ input region, as illustrated in
the following diagram:-->
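In detail, a pooling layer takes each feature map* output from the convolutional layer and prepares a condensed feature map.<span class="marginnote">
  *The nomenclature is being used loosely here.
  In particular, I'm using "feature map" to mean not the function computed by the convolutional layer, but rather the activation of the hidden neurons output from the layer.
  This kind of mild abuse of nomenclature is pretty common in the research literature.</span>
For instance, each unit in the pooling layer may summarize a region of (say) $2 \times 2$ neurons in the previous layer.
As a concrete example, one common procedure for pooling is known as <em>max-pooling</em>.
In max-pooling, a pooling unit simply outputs the maximum activation in the $2 \times 2$ input region,
as illustrated in the following diagram: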
</p>
<p>
<center>
<img src="images/tikz47.png"/>
</center>
</p>
<p>
<!--
Note that since we have $24 \times 24$ neurons output from the
convolutional layer, after pooling we have $12 \times 12$ neurons.
-->
Note that since we have $24 \times 24$ neurons output from the convolutional layer, after pooling we have $12 \times 12$ neurons.
</p>
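<p>
As a rough illustration of max-pooling, the sketch below condenses a feature map stored as a list of lists. The function name <tt>max_pool</tt> and the example activations are invented for this illustration, and the sketch assumes non-overlapping regions and a feature map whose side length is a multiple of the region size.
</p>

```python
def max_pool(feature_map, size=2):
    """Condense a 2D list of activations by taking the maximum
    over non-overlapping size x size regions."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            region = [feature_map[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            row.append(max(region))
        pooled.append(row)
    return pooled


# Hypothetical 4x4 feature map
example = [[0.1, 0.9, 0.2, 0.3],
           [0.4, 0.5, 0.8, 0.1],
           [0.0, 0.2, 0.6, 0.7],
           [0.3, 0.1, 0.5, 0.4]]
print(max_pool(example))  # [[0.9, 0.8], [0.3, 0.7]]
```

<p>
Applied to a $24 \times 24$ feature map, the same function returns a $12 \times 12$ result, matching the sizes quoted above.
</p>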
<p>
<!--As mentioned above, the convolutional layer usually involves more than
a  single feature  map.   We  apply max-pooling  to  each feature  map
separately.   So  if  there  were three  feature  maps,  the  combined
convolutional and max-pooling layers would look like:
-->
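As mentioned above, the convolutional layer usually involves more than a single feature map.
We apply max-pooling to each feature map separately.
So if there were three feature maps, the combined convolutional and max-pooling layers would look like: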
</p>
<p>
<center>
<img src="images/tikz48.png"/>
</center>
</p>
<p>
<!--
We can think of max-pooling as a way for the network to ask whether a
given feature is found anywhere in a region of the image.  It then
throws away the exact positional information.  The intuition is that
once a feature has been found, its exact location isn't as important
as its rough location relative to other features.  A big benefit is
that there are many fewer pooled features, and so this helps reduce
the number of parameters needed in later layers.
-->
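We can think of max-pooling as a way for the network to ask whether a given feature is found anywhere in a region of the image.
It then throws away the exact positional information.
The intuition is that once a feature has been found, its exact location isn't as important as its rough location relative to other features.
A big benefit is that there are many fewer pooled features, and so this helps reduce the number of parameters needed in later layers.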
</p>
<p>
<!--
Max-pooling isn't the only technique used for pooling.  Another common
approach is known as <em>L2 pooling</em>.  Here, instead of taking the
maximum activation of a $2 \times 2$ region of neurons, we take the
square root of the sum of the squares of the activations in the $2
\times 2$ region.  While the details are different, the intuition is
similar to max-pooling: L2 pooling is a way of condensing information
from the convolutional layer.  In practice, both techniques have been
widely used.  And sometimes people use other types of pooling
operation.  If you're really trying to optimize performance, you may
use validation data to compare several different approaches to
pooling, and choose the approach which works best.  But we're not
going to worry about that kind of detailed optimization.
-->
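Max-pooling isn't the only technique used for pooling.
Another common approach is known as <em>L2 pooling</em>.
Here, instead of taking the maximum activation of a $2 \times 2$ region of neurons,
we take the square root of the sum of the squares of the activations in the $2 \times 2$ region.
While the details are different, the intuition is similar to max-pooling:
L2 pooling is a way of condensing information from the convolutional layer.
In practice, both techniques have been widely used.
And sometimes people use other types of pooling operation.
If you're really trying to optimize performance, you may use validation data to compare several different approaches to pooling,
and choose the approach which works best.
But we're not going to worry about that kind of detailed optimization.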
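For comparison with max-pooling, here is a minimal sketch of L2 pooling under the same assumptions (non-overlapping regions, activations stored as a list of lists). The name <tt>l2_pool</tt> is hypothetical and used only for illustration.

```python
import math


def l2_pool(feature_map, size=2):
    """Condense a 2D list of activations by taking the square root of the
    sum of squares over non-overlapping size x size regions."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    pooled = []
    for r in range(rows):
        row = []
        for c in range(cols):
            region = [feature_map[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            row.append(math.sqrt(sum(x * x for x in region)))
        pooled.append(row)
    return pooled


print(l2_pool([[3.0, 0.0], [4.0, 0.0]]))  # [[5.0]]
```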
</p><p></p><p>
<!--
<strong>Putting it all together:</strong> We can now put all these ideas
together to form a complete convolutional neural network.  It's
similar to the architecture we were just looking at, but has the
addition of a layer of $10$ output neurons, corresponding to the $10$
possible values for MNIST digits ('0', '1', '2', <em>etc</em>):
-->
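<strong>Putting it all together:</strong>
We can now put all these ideas together to form a complete convolutional neural network.
It's similar to the architecture we were just looking at, but has the addition of a layer of $10$ output neurons,
corresponding to the $10$ possible values for MNIST digits ('0', '1', '2', <em>etc</em>):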
</p>
<p>
<center>
<img src="images/tikz49.png"/>
</center></p><p>
<!--
The network begins with $28 \times 28$ input neurons, which are used
to encode the pixel intensities for the MNIST image.  This is then
followed by a convolutional layer using a $5 \times 5$ local receptive
field and $3$ feature maps.  The result is a layer of $3 \times 24
\times 24$ hidden feature neurons.  The next step is a max-pooling
layer, applied to $2 \times 2$ regions, across each of the $3$ feature
maps.  The result is a layer of $3 \times 12 \times 12$ hidden feature
neurons.
-->
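The network begins with $28 \times 28$ input neurons, which are used to encode the pixel intensities for the MNIST image.
This is then followed by a convolutional layer using a $5 \times 5$ local receptive field and $3$ feature maps.
The result is a layer of $3 \times 24 \times 24$ hidden feature neurons.
The next step is a max-pooling layer,
applied to $2 \times 2$ regions, across each of the $3$ feature maps.
The result is a layer of $3 \times 12 \times 12$ hidden feature neurons.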
</p>
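<p>
The layer sizes quoted above follow mechanically from the earlier counting rules. The sketch below traces them through the network; <tt>network_shapes</tt> is a hypothetical helper, not part of <tt>network3.py</tt>, and it assumes stride length $1$ in the convolutional layer and non-overlapping pooling regions.
</p>

```python
def network_shapes(input_size=28, field_size=5, num_maps=3,
                   pool_size=2, num_outputs=10):
    """Trace the layer sizes of the small convolutional network."""
    conv = input_size - field_size + 1   # stride length 1, no padding
    pooled = conv // pool_size           # non-overlapping pooling regions
    return {
        "input": (input_size, input_size),
        "conv": (num_maps, conv, conv),
        "pool": (num_maps, pooled, pooled),
        "output": (num_outputs,),
    }


shapes = network_shapes()
for name in ("input", "conv", "pool", "output"):
    print(name, shapes[name])

# Number of neurons leaving the max-pooling layer
maps, height, width = shapes["pool"]
print(maps * height * width)  # 432
```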
<p>
<!--
The final layer of connections in the network is a fully-connected
layer.  That is, this layer connects <em>every</em> neuron from the
max-pooled layer to every one of the $10$ output neurons.  This
fully-connected architecture is the same as we used in earlier
chapters.  Note, however, that in the diagram above, I've used a
single arrow, for simplicity, rather than showing all the connections.
Of course, you can easily imagine the connections.
-->
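The final layer of connections in the network is a fully-connected layer.
That is, this layer connects <em>every</em> neuron from the max-pooled layer to every one of the $10$ output neurons.
This fully-connected architecture is the same as we used in earlier chapters.
Note, however, that in the diagram above, I've used a single arrow, for simplicity, rather than showing all the connections.
Of course, you can easily imagine the connections.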
</p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p>
<!--
This convolutional architecture is quite different to the
architectures used in earlier chapters.  But the overall picture is
similar: a network made of many simple units, whose behaviors are
determined by their weights and biases.  And the overall goal is still
the same: to use training data to train the network's weights and
biases so that the network does a good job classifying input digits.
-->
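This convolutional architecture is quite different to the architectures used in earlier chapters.
But the overall picture is similar:
a network made of many simple units, whose behaviors are determined by their weights and biases.
And the overall goal is still the same:
to use training data to train the network's weights and biases so that the network does a good job classifying input digits.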
</p>
<p>
<!--
In particular, just as earlier in the book, we will train our network
using stochastic gradient descent and backpropagation.  This mostly
proceeds in exactly the same way as in earlier chapters.  However, we
do need to make a few modifications to the backpropagation procedure.
The reason is that our earlier <a href="chap2.html">derivation of
  backpropagation</a> was for networks with fully-connected layers.
Fortunately, it's straightforward to modify the derivation for
convolutional and max-pooling layers.  If you'd like to understand the
details, then I invite you to work through the following problem.  Be
warned that the problem will take some time to work through, unless
you've really internalized the <a href="chap2.html">earlier derivation of
  backpropagation</a> (in which case it's easy).
-->
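In particular, just as earlier in the book, we will train our network using stochastic gradient descent and backpropagation.
This mostly proceeds in exactly the same way as in earlier chapters.
However, we do need to make a few modifications to the backpropagation procedure.
The reason is that our earlier <a href="chap2.html">derivation of backpropagation</a> was for networks with fully-connected layers.
Fortunately, it's straightforward to modify the derivation for convolutional and max-pooling layers.
If you'd like to understand the details, then I invite you to work through the following problem.
Be warned that the problem will take some time to work through, unless you've really internalized the <a href="chap2.html">earlier derivation of backpropagation</a> (in which case it's easy).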
</p>
<p>
<h4><a name="problem_214396"></a><a href="#problem_214396">Problem</a></h4><ul>
<li><strong>Backpropagation in a convolutional network:</strong>
  <!--The core equations
  of backpropagation in a network with fully-connected layers
  are-->
  The core equations of backpropagation in a network with fully-connected layers are
  <span id="margin_709574921443_reveal" class="equation_link">(BP1)</span><span id="margin_709574921443" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP1" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j) \nonumber\end{eqnarray}</a></span><script>$('#margin_709574921443_reveal').click(function() {$('#margin_709574921443').toggle('slow', function() {});});</script>-<span id="margin_220452626963_reveal" class="equation_link">(BP4)</span><span id="margin_220452626963" class="marginequation" style="display: none;"><a href="chap2.html#eqtnBP4" style="padding-bottom: 5px;" onMouseOver="this.style.borderBottom='1px solid #2A6EA6';" onMouseOut="this.style.borderBottom='0px';">\begin{eqnarray}
  \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \nonumber\end{eqnarray}</a></span><script>$('#margin_220452626963_reveal').click(function() {$('#margin_220452626963').toggle('slow', function() {});});</script>
  (<a href="chap2.html#backpropsummary">link</a>).
  <!--Suppose we have a
  network containing a convolutional layer, a max-pooling layer, and a
  fully-connected output layer, as in the network discussed above.
  How are the equations of backpropagation modified?-->
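  Suppose we have a network containing a convolutional layer, a max-pooling layer, and a fully-connected output layer, as in the network discussed above.
  How are the equations of backpropagation modified?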
</ul></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p>
  <h3><a name="convolutional_neural_networks_in_practice"></a><a href="#convolutional_neural_networks_in_practice">Convolutional neural networks in practice</a></h3></p>
<p>
<!--
We've now seen the core ideas behind convolutional neural networks.
Let's look at how they work in practice, by implementing some
convolutional networks, and applying them to the MNIST digit
classification problem.  The program we'll use to do this is called
<tt>network3.py</tt>, and it's an improved version of the programs
<tt>network.py</tt> and <tt>network2.py</tt> developed in earlier
chapters*
-->
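We've now seen the core ideas behind convolutional neural networks.
Let's look at how they work in practice, by implementing some convolutional networks, and applying them to the MNIST digit classification problem.
The program we'll use to do this is called <tt>network3.py</tt>,
and it's an improved version of the programs <tt>network.py</tt> and <tt>network2.py</tt> developed in earlier chapters*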
<!--
<span class="marginnote">
*Note also that <tt>network3.py</tt> incorporates ideas
  from the Theano library's documentation on convolutional neural nets
  (notably the implementation of
  <a href="http://deeplearning.net/tutorial/lenet.html">LeNet-5</a>), from
  Misha Denil's
  <a href="https://github.com/mdenil/dropout">implementation of dropout</a>,
  and from <a href="http://colah.github.io">Chris Olah</a>.</span>.-->
<span class="marginnote">
  *Note also that <tt>network3.py</tt> incorporates ideas from the Theano library's documentation on convolutional neural nets
  (notably the implementation of <a href="http://deeplearning.net/tutorial/lenet.html">LeNet-5</a>),
  from Misha Denil's
  <a href="https://github.com/mdenil/dropout">implementation of dropout</a>,
  and from <a href="http://colah.github.io">Chris Olah</a>.</span>.
    <!--If you wish to follow along, the code is available
<a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network3.py">on
  GitHub</a>.  Note that we'll work through the code for
<tt>network3.py</tt> itself in the next section.  In this section, we'll
use <tt>network3.py</tt> as a library to build convolutional networks.
-->
If you wish to follow along, the code is available <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/network3.py">on GitHub</a>.
Note that we'll work through the code for <tt>network3.py</tt> itself in the next section.
In this section, we'll use <tt>network3.py</tt> as a library to build convolutional networks.
</p><p>
</p><p>
<!--
The programs <tt>network.py</tt> and <tt>network2.py</tt> were implemented
using Python and the matrix library Numpy.  Those programs worked from
first principles, and got right down into the details of
backpropagation, stochastic gradient descent, and so on.  But now that
we understand those details, for <tt>network3.py</tt> we're going to use
a machine learning library known as
<a href="http://deeplearning.net/software/theano/">Theano</a>*
-->
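The programs <tt>network.py</tt> and <tt>network2.py</tt> were implemented using Python and the matrix library Numpy.
Those programs worked from first principles, and got right down into the details of backpropagation, stochastic gradient descent, and so on.
But now that we understand those details, for <tt>network3.py</tt> we're going to use a machine learning library known as <a href="http://deeplearning.net/software/theano/">Theano</a>*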
<!--
<span class="marginnote">
*See
  <a href="http://www.iro.umontreal.ca/&#126;lisa/pointeurs/theano_scipy2010.pdf">Theano:
    A CPU and GPU Math Expression Compiler in Python</a>, by James
  Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Ravzan
  Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
  and Yoshua Bengio (2010).  Theano is also the basis for the popular
  <a href="http://deeplearning.net/software/pylearn2/">Pylearn2</a> and
  <a href="http://keras.io/">Keras</a> neural networks libraries. Other
  popular neural nets libraries at the time of this writing include
  <a href="http://caffe.berkeleyvision.org">Caffe</a> and
  <a href="http://torch.ch">Torch</a>. </span>.
-->
<span class="marginnote">
  *See <a href="http://www.iro.umontreal.ca/&#126;lisa/pointeurs/theano_scipy2010.pdf">Theano:
    A CPU and GPU Math Expression Compiler in Python</a>, by James
  Bergstra, Olivier Breuleux, Frederic Bastien, Pascal Lamblin, Razvan
  Pascanu, Guillaume Desjardins, Joseph Turian, David Warde-Farley,
  and Yoshua Bengio (2010).
  Theano is also the basis for the popular <a href="http://deeplearning.net/software/pylearn2/">Pylearn2</a> and
  <a href="http://keras.io/">Keras</a> neural networks libraries.
  Other popular neural nets libraries at the time of this writing include
  <a href="http://caffe.berkeleyvision.org">Caffe</a> and
  <a href="http://torch.ch">Torch</a>.</span>.
<!--
   Using Theano makes it easy to
implement backpropagation for convolutional neural networks, since it
automatically computes all the mappings involved.  Theano is also
quite a bit faster than our earlier code (which was written to be easy
to understand, not fast), and this makes it practical to train more
complex networks.  In particular, one great feature of Theano is that
it can run code on either a CPU or, if available, a GPU.  Running on a
GPU provides a substantial speedup and, again, helps make it practical
to train more complex networks.
-->
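Using Theano makes it easy to implement backpropagation for convolutional neural networks,
since it automatically computes all the mappings involved.
Theano is also quite a bit faster than our earlier code (which was written to be easy to understand, not fast),
and this makes it practical to train more complex networks.
In particular, one great feature of Theano is that it can run code on either a CPU or, if available, a GPU.
Running on a GPU provides a substantial speedup and, again, helps make it practical to train more complex networks.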
</p><p></p><p>
  <!--
If you wish to follow along, then you'll need to get Theano running on
your system.  To install Theano, follow the instructions at the
project's <a href="http://deeplearning.net/software/theano/">homepage</a>.
The examples which follow were run using Theano 0.6*<span class="marginnote">
*As I
  release this chapter, the current version of Theano has changed to
  version 0.7.  I've actually rerun the examples under Theano 0.7 and
  get extremely similar results to those reported in the text.</span>.
-->
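If you wish to follow along, then you'll need to get Theano running on your system.
To install Theano, follow the instructions at the project's <a href="http://deeplearning.net/software/theano/">homepage</a>.
The examples which follow were run using Theano 0.6*
<span class="marginnote">
*As I release this chapter, the current version of Theano has changed to version 0.7.
I've actually rerun the examples under Theano 0.7 and get extremely similar results to those reported in the text.
</span>.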
<!--
Some were run under Mac OS X Yosemite, with no GPU.  Some were run on
Ubuntu 14.04, with an NVIDIA GPU.  And some of the experiments were run
under both.  To get <tt>network3.py</tt> running you'll need to set the
<tt>GPU</tt> flag to either <tt>True</tt> or <tt>False</tt> (as appropriate)
in the <tt>network3.py</tt> source.  Beyond that, to get Theano up and
running on a GPU you may find
<a href="http://deeplearning.net/software/theano/tutorial/using_gpu.html">the
  instructions here</a> helpful.  There are also tutorials on the web,
easily found using Google, which can help you get things working.
-->
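Some were run under Mac OS X Yosemite, with no GPU.
Some were run on Ubuntu 14.04, with an NVIDIA GPU.
And some of the experiments were run under both.
To get <tt>network3.py</tt> running you'll need to set the <tt>GPU</tt> flag to either <tt>True</tt> or <tt>False</tt> (as appropriate) in the <tt>network3.py</tt> source.
Beyond that, to get Theano up and running on a GPU you may find <a href="http://deeplearning.net/software/theano/tutorial/using_gpu.html">the instructions here</a> helpful.
There are also tutorials on the web, easily found using Google,
which can help you get things working.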
<!--
If
you don't have a GPU available locally, then you may wish to look into
<a href="http://aws.amazon.com/ec2/instance-types/">Amazon Web Services</a>
EC2 G2 spot instances.  Note that even with a GPU the code will take
some time to execute.  Many of the experiments take from minutes to
hours to run.  On a CPU it may take days to run the most complex of
the experiments.  As in earlier chapters, I suggest setting things
running, and continuing to read, occasionally coming back to check the
output from the code.  If you're using a CPU, you may wish to reduce
the number of training epochs for the more complex experiments, or
perhaps omit them entirely.
-->
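If you don't have a GPU available locally, then you may wish to look into <a href="http://aws.amazon.com/ec2/instance-types/">Amazon Web Services</a>
EC2 G2 spot instances.
Note that even with a GPU the code will take some time to execute.
Many of the experiments take from minutes to hours to run.
On a CPU it may take days to run the most complex of the experiments.
As in earlier chapters, I suggest setting things running, and continuing to read, occasionally coming back to check the output from the code.
If you're using a CPU, you may wish to reduce the number of training epochs for the more complex experiments, or perhaps omit them entirely.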
</p><p>
<!--
To get a baseline, we'll start with a shallow architecture using just
a single hidden layer, containing $100$ hidden neurons.  We'll train
for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch
size of $10$, and no regularization.  Here we go*
-->
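To get a baseline,
we'll start with a shallow architecture using just a single hidden layer, containing $100$ hidden neurons.
We'll train for $60$ epochs, using a learning rate of $\eta = 0.1$, a mini-batch size of $10$, and no regularization. Here we go*: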
<!--
<span class="marginnote">
*Code for the
  experiments in this section may be found
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/conv.py">in
    this script</a>.  Note that the code in the script simply duplicates
  and parallels the discussion in this section.<br><br>Note also that
  throughout the section I've explicitly specified the number of
  training epochs.  I've done this for clarity about how we're
  training.  In practice, it's worth using
  <a href="chap3.html#early_stopping">early stopping</a>, that is,
  tracking accuracy on the validation set, and stopping training when
  we are confident the validation accuracy has stopped improving.</span>
-->
<span class="marginnote">
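*Code for the experiments in this section may be found
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/conv.py">in this script</a>.
  Note that the code in the script simply duplicates and parallels the discussion in this section.
  <br><br>
  Note also that throughout the section I've explicitly specified the number of training epochs.
  I've done this for clarity about how we're training.
  In practice, it's worth using <a href="chap3.html#early_stopping">early stopping</a>,
  that is, tracking accuracy on the validation set, and stopping training when we are confident the validation accuracy has stopped improving.</span>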
</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="kn">import</span> <span class="nn">network3</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">Network</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">ConvPoolLayer</span><span class="p">,</span> <span class="n">FullyConnectedLayer</span><span class="p">,</span> <span class="n">SoftmaxLayer</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> <span class="n">network3</span><span class="o">.</span><span class="n">load_data_shared</span><span class="p">()</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">mini_batch_size</span> <span class="o">=</span> <span class="mi">10</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">784</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
  </pre></div>
  </p><p></p><p>
<!--
I obtained a best classification accuracy of $97.80$ percent. This is
the classification accuracy on the <tt>test_data</tt>, evaluated at the
training epoch where we get the best classification accuracy on the
<tt>validation_data</tt>. Using the validation data to decide when to
evaluate the test accuracy helps avoid overfitting to the test data
(see this <a href="chap3.html#validation_explanation">earlier
  discussion</a> of the use of validation data).  We will follow this
practice below.  Your results may vary slightly, since the network's
weights and biases are randomly initialized*
-->
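I obtained a best classification accuracy of $97.80$ percent.
This is the classification accuracy on the <tt>test_data</tt>,
evaluated at the training epoch where we get the best classification accuracy on the <tt>validation_data</tt>.
Using the validation data to decide when to evaluate the test accuracy helps avoid overfitting to the test data
(see this <a href="chap3.html#validation_explanation">earlier discussion</a> of the use of validation data).
We will follow this practice below.
Your results may vary slightly, since the network's weights and biases are randomly initialized*.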
<!--
<span class="marginnote">
  *In fact, in this
  experiment I actually did three separate runs training a network
  with this architecture.  I then reported the test accuracy which
  corresponded to the best validation accuracy from any of the three
  runs.  Using multiple runs helps reduce variation in results, which
  is useful when comparing many architectures, as we are doing.  I've
  followed this procedure below, except where noted.  In practice, it
  made little difference to the results obtained.</span>.
-->
<span class="marginnote">
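  *In fact, in this experiment I actually did three separate runs training a network with this architecture.
  I then reported the test accuracy which corresponded to the best validation accuracy from any of the three runs.
  Using multiple runs helps reduce variation in results,
  which is useful when comparing many architectures, as we are doing.
  I've followed this procedure below, except where noted.
  In practice, it made little difference to the results obtained.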
  </span>
</p><p>
<!--
This $97.80$ percent accuracy is close to the $98.04$ percent accuracy
obtained back in <a href="chap3.html#chap3_98_04_percent">Chapter 3</a>,
using a similar network architecture and learning hyper-parameters.
In particular, both examples used a shallow network, with a single
hidden layer containing $100$ hidden neurons.  Both also trained for
$60$ epochs, used a mini-batch size of $10$, and a learning rate of
$\eta = 0.1$.
-->
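This $97.80$ percent accuracy is close to the $98.04$ percent accuracy obtained back in <a href="chap3.html#chap3_98_04_percent">Chapter 3</a>,
using a similar network architecture and learning hyper-parameters.
In particular, both examples used a shallow network, with a single hidden layer containing $100$ hidden neurons.
Both also trained for $60$ epochs, used a mini-batch size of $10$, and a learning rate of $\eta = 0.1$.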
</p>
<p>
<!--
There were, however, two differences in the earlier network.  First,
we <a href="chap3.html#overfitting_and_regularization">regularized</a>
the earlier network, to help reduce the effects of
overfitting. Regularizing the current network does improve the
accuracies, but the gain is only small, and so we'll hold off worrying
about regularization until later.  Second, while the final layer in
the earlier network used sigmoid activations and the cross-entropy
cost function, the current network uses a softmax final layer, and the
log-likelihood cost function.  As
<a href="chap3.html#softmax">explained</a> in Chapter 3 this isn't a big
change.  I haven't made this switch for any particularly deep reason
- mostly, I've done it because softmax plus log-likelihood cost is
more common in modern image classification networks.
-->
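There were, however, two differences in the earlier network.
First, we <a href="chap3.html#overfitting_and_regularization">regularized</a> the earlier network,
to help reduce the effects of overfitting.
Regularizing the current network does improve the accuracies, but the gain is only small,
and so we'll hold off worrying about regularization until later.
Second, while the final layer in the earlier network used sigmoid activations and the cross-entropy cost function, the current network uses a softmax final layer, and the log-likelihood cost function.
As <a href="chap3.html#softmax">explained</a> in Chapter 3 this isn't a big change.
I haven't made this switch for any particularly deep reason.
Mostly, I've done it because softmax plus log-likelihood cost is more common in modern image classification networks.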
</p>
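<p>
As a reminder of what the softmax layer and the log-likelihood cost compute, here is a minimal sketch in plain Python. It is not the Theano code used by <tt>SoftmaxLayer</tt> in <tt>network3.py</tt>; the function names and the example inputs are hypothetical, chosen only for illustration.
</p>

```python
import math


def softmax(weighted_inputs):
    """Turn a list of weighted inputs z_j into activations that sum to 1."""
    largest = max(weighted_inputs)  # subtracted for numerical stability
    exps = [math.exp(z - largest) for z in weighted_inputs]
    total = sum(exps)
    return [e / total for e in exps]


def log_likelihood_cost(activations, correct_label):
    """Cost for one training example: -ln of the activation of the correct class."""
    return -math.log(activations[correct_label])


z = [1.0, 2.0, 0.5]  # made-up weighted inputs for a 3-class problem
a = softmax(z)
cost = log_likelihood_cost(a, correct_label=1)
print(a, cost)
```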
<p>
<!--
Can we do better than these results using a deeper network
architecture?
-->
Can we do better than these results using a deeper network architecture?
</p>
<p>
<!--
Let's begin by inserting a convolutional layer, right at the beginning
of the network.  We'll use $5$ by $5$ local receptive fields, a stride
length of $1$, and $20$ feature maps.  We'll also insert a max-pooling
layer, which combines the features using $2$ by $2$ pooling windows.
So the overall network architecture looks much like the architecture
discussed in the last section, but with an extra fully-connected
layer:
-->
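Let's begin by inserting a convolutional layer, right at the beginning of the network.
We'll use $5$ by $5$ local receptive fields, a stride length of $1$, and $20$ feature maps.
We'll also insert a max-pooling layer,
which combines the features using $2$ by $2$ pooling windows.
So the overall network architecture looks much like the architecture discussed in the last section, but with an extra fully-connected layer: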
</p>
<p>
<center><img src="images/simple_conv.png" width="550px"></center>
</p>
<p>
<!--
In this architecture, we can think of the convolutional and pooling
layers as learning about local spatial structure in the input training
image, while the later, fully-connected layer learns at a more
abstract level, integrating global information from across the entire
image.  This is a common pattern in convolutional neural networks.
-->
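In this architecture, we can think of the convolutional and pooling layers as learning about local spatial structure in the input training image,
while the later, fully-connected layer learns at a more abstract level, integrating global information from across the entire image.
This is a common pattern in convolutional neural networks.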
</p>
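<p>
The layer sizes in this architecture follow from simple arithmetic: a $28 \times 28$ input with $5 \times 5$ local receptive fields and stride length $1$ gives $24 \times 24$ feature maps, and $2 \times 2$ pooling reduces each of them to $12 \times 12$. The sketch below spells this out. The helper name <tt>conv_pool_output_size</tt> is hypothetical, and it assumes no padding and non-overlapping pooling windows.
</p>

```python
def conv_pool_output_size(image_size, field_size, pool_size, stride=1):
    """Side length of each feature map after convolution and then pooling.

    Assumes a square image, no padding, and non-overlapping pooling windows.
    """
    conv_size = (image_size - field_size) // stride + 1
    return conv_size // pool_size


side = conv_pool_output_size(image_size=28, field_size=5, pool_size=2)
n_feature_maps = 20
n_in_fully_connected = n_feature_maps * side * side
print(side, n_in_fully_connected)
```

<p>
The product of the number of feature maps and the pooled map size is the number of inputs that the fully-connected layer receives.
</p>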
<p>
<!--
Let's train such a network, and see how it performs*<span class="marginnote">
*I've
  continued to use a mini-batch size of $10$ here.  In fact, as we
  <a href="chap3.html#mini_batch_size">discussed earlier</a> it may be
  possible to speed up training using larger mini-batches.  I've
  continued to use the same mini-batch size mostly for consistency
  with the experiments in earlier chapters.</span>:
-->
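Let's train such a network, and see how it performs*:
<span class="marginnote">
  *I've continued to use a mini-batch size of $10$ here.
  In fact, as we <a href="chap3.html#mini_batch_size">discussed earlier</a> it may be possible to speed up training using larger mini-batches.
  I've continued to use the same mini-batch size mostly for consistency with the experiments in earlier chapters.
  </span>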
</p><p>
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">20</span><span class="o">*</span><span class="mi">12</span><span class="o">*</span><span class="mi">12</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
  </pre></div>
  </p><p></p><p></p><p>
<!--
That gets us to $98.78$ percent accuracy, which is a considerable
improvement over any of our previous results.  Indeed, we've reduced
our error rate by better than a third, which is a great improvement.
-->
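That gets us to $98.78$ percent accuracy, which is a considerable improvement over any of our previous results.
Indeed, we've reduced our error rate by better than a third,
which is a great improvement.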
</p><p>
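As a quick check on the size of that improvement, the error rates can be compared directly. The sketch below uses only the accuracies quoted in the text; the helper name <tt>relative_error_reduction</tt> is hypothetical.
</p>

```python
def relative_error_reduction(old_accuracy, new_accuracy):
    """Fraction by which the error rate drops, with accuracies given in percent."""
    old_error = 100.0 - old_accuracy
    new_error = 100.0 - new_accuracy
    return (old_error - new_error) / old_error


# The convolutional network (98.78) compared against this chapter's
# baseline (97.80) and against the Chapter 3 result (98.04).
vs_baseline = relative_error_reduction(97.80, 98.78)
vs_chapter3 = relative_error_reduction(98.04, 98.78)
print(round(vs_baseline, 3), round(vs_chapter3, 3))
```

<p>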
<!--
In specifying the network structure, I've treated the convolutional
and pooling layers as a single layer.  Whether they're regarded as
separate layers or as a single layer is to some extent a matter of
taste.  <tt>network3.py</tt> treats them as a single layer because it
makes the code for <tt>network3.py</tt> a little more compact.  However,
it is easy to modify <tt>network3.py</tt> so the layers can be specified
separately, if desired.
-->
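In specifying the network structure, I've treated the convolutional and pooling layers as a single layer.
Whether they're regarded as separate layers or as a single layer is to some extent a matter of taste.
<tt>network3.py</tt> treats them as a single layer
because it makes the code for <tt>network3.py</tt> a little more compact.
However, it is easy to modify <tt>network3.py</tt> so the layers can be specified separately, if desired.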
</p><p>
<h4><a name="exercise_683491"></a><a href="#exercise_683491">Exercise</a></h4><ul>
<li>
  <!--
  What classification accuracy do you get if you omit the
  fully-connected layer, and just use the convolutional-pooling layer
  and softmax layer?  Does the inclusion of the fully-connected layer
  help?
  -->
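  What classification accuracy do you get if you omit the fully-connected layer, and just use the convolutional-pooling layer and softmax layer?
  Does the inclusion of the fully-connected layer help?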
</ul></p><p>
<!--Can we improve on the $98.78$ percent classification accuracy?-->
Can we improve on the $98.78$ percent classification accuracy?
</p><p>
<!--
Let's try inserting a second convolutional-pooling layer.  We'll make
the insertion between the existing convolutional-pooling layer and the
fully-connected hidden layer.  Again, we'll use a $5 \times 5$ local
receptive field, and pool over $2 \times 2$ regions.  Let's see what
happens when we train using similar hyper-parameters to before:
-->
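Let's try inserting a second convolutional-pooling layer.
We'll make the insertion between the existing convolutional-pooling layer and the fully-connected hidden layer.
Again, we'll use a $5 \times 5$ local receptive field, and pool over $2 \times 2$ regions.
Let's see what happens when we train using similar hyper-parameters to before: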
</p><p><div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">)),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.1</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
  </pre></div>
  </p><p></p><p></p><p>
<!--
Once again, we get an improvement: we're now at $99.06$ percent
classification accuracy!
-->
Once again, we get an improvement: we're now at $99.06$ percent classification accuracy!
</p><p>
<!--
There's two natural questions to ask at this point.  The first
question is: what does it even mean to apply a second
convolutional-pooling layer?  In fact, you can think of the second
convolutional-pooling layer as having as input $12 \times 12$
"images", whose "pixels" represent the presence (or absence) of
particular localized features in the original input image.  So you can
think of this layer as having as input a version of the original input
image.  That version is abstracted and condensed, but still has a lot
of spatial structure, and so it makes sense to use a second
convolutional-pooling layer.
-->
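There are two natural questions to ask at this point.
The first question is: what does it even mean to apply a second convolutional-pooling layer?
In fact, you can think of the second convolutional-pooling layer as having as input $12 \times 12$ "images",
whose "pixels" represent the presence (or absence) of particular localized features in the original input image.
So you can think of this layer as having as input a version of the original input image.
That version is abstracted and condensed, but still has a lot of spatial structure,
and so it makes sense to use a second convolutional-pooling layer.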
</p><p>
<!--
That's a satisfying point of view, but gives rise to a second
question.  The output from the previous layer involves $20$ separate
feature maps, and so there are $20 \times 12 \times 12$ inputs to the
second convolutional-pooling layer.  It's as though we've got $20$
separate images input to the convolutional-pooling layer, not a single
image, as was the case for the first convolutional-pooling layer.  How
should neurons in the second convolutional-pooling layer respond to
these multiple input images?  In fact, we'll allow each neuron in this
layer to learn from <em>all</em> $20 \times 5 \times 5$ input neurons in
its local receptive field.  More informally: the feature detectors in
the second convolutional-pooling layer have access to <em>all</em> the
features from the previous layer, but only within their particular
local receptive field*
-->
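That's a satisfying point of view,
but gives rise to a second question.
The output from the previous layer involves $20$ separate feature maps,
and so there are $20 \times 12 \times 12$ inputs to the second convolutional-pooling layer.
It's as though we've got $20$ separate images input to the convolutional-pooling layer,
not a single image, as was the case for the first convolutional-pooling layer.
How should neurons in the second convolutional-pooling layer respond to these multiple input images?
In fact, we'll allow each neuron in this layer to learn from <em>all</em> $20 \times 5 \times 5$ input neurons in its local receptive field.
More informally: the feature detectors in the second convolutional-pooling layer have access to <em>all</em> the features from the previous layer,
but only within their particular local receptive field*.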
<!--
<span class="marginnote">
*This issue would have arisen in the
  first layer if the input images were in color.  In that case we'd
  have 3 input features for each pixel, corresponding to red, green
  and blue channels in the input image.  So we'd allow the feature
  detectors to have access to all color information, but only within a
  given local receptive field.
</span>.
-->
<span class="marginnote">
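  *This issue would have arisen in the first layer if the input images were in color.
  In that case we'd have 3 input features for each pixel, corresponding to red, green and blue channels in the input image.
  So we'd allow the feature detectors to have access to all color information, but only within a given local receptive field.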
</span>
</p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p>
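To make this concrete, here is a minimal sketch in plain Python of how a single feature map in the second convolutional-pooling layer could compute its weighted inputs from all $20$ input feature maps at once. It is an illustration only, not the Theano code in <tt>network3.py</tt>: the function name is hypothetical, the window is slid over the input without flipping the filter, and the activation function and the pooling step are left out.
</p>

```python
def weighted_inputs_for_one_map(inputs, weights, bias):
    """Weighted inputs z for one output feature map.

    inputs:  nested list indexed as [channel][row][col]
    weights: nested list indexed as [channel][row][col], one square filter per channel
    Every output neuron sums over ALL channels inside its local receptive field.
    """
    n_channels = len(inputs)
    height, width = len(inputs[0]), len(inputs[0][0])
    k = len(weights[0])  # side length of the local receptive field
    output = []
    for i in range(height - k + 1):
        row = []
        for j in range(width - k + 1):
            z = bias
            for c in range(n_channels):
                for di in range(k):
                    for dj in range(k):
                        z += weights[c][di][dj] * inputs[c][i + di][j + dj]
            row.append(z)
        output.append(row)
    return output


# Made-up values: 20 input maps of size 12 x 12, one 20 x 5 x 5 filter.
inputs = [[[1.0] * 12 for _ in range(12)] for _ in range(20)]
weights = [[[0.01] * 5 for _ in range(5)] for _ in range(20)]
z = weighted_inputs_for_one_map(inputs, weights, bias=0.5)
print(len(z), len(z[0]))
```

<p>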
<h4><a name="problem_834310"></a><a href="#problem_834310">Problem</a></h4><ul>
<li>
<!--
  <strong>Using the tanh activation function</strong> Several times earlier in the
  book I've mentioned arguments that the
  <a href="chap3.html#other_models_of_artificial_neuron">tanh
    function</a> may be a better activation function than the sigmoid
  function.  We've never acted on those suggestions, since we were
  already making plenty of progress with the sigmoid.  But now let's
  try some experiments with tanh as our activation function.  Try
  training the network with tanh activations in the convolutional and
  fully-connected layers*<span class="marginnote">
*Note that you can pass
    <tt>activation_fn=tanh</tt> as a parameter to the
    <tt>ConvPoolLayer</tt> and <tt>FullyConnectedLayer</tt> classes.</span>.
-->
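<strong>Using the tanh activation function</strong>
  Several times earlier in the book I've mentioned arguments that the <a href="chap3.html#other_models_of_artificial_neuron">tanh function</a> may be a better activation function than the sigmoid function.
  We've never acted on those suggestions, since we were already making plenty of progress with the sigmoid. But now let's try some experiments with tanh as our activation function.
  Try training the network with tanh activations in the convolutional and fully-connected layers*<span class="marginnote">
  *Note that you can pass <tt>activation_fn=tanh</tt> as a parameter to the <tt>ConvPoolLayer</tt> and <tt>FullyConnectedLayer</tt> classes.</span>.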
<!--
  Begin with the same hyper-parameters as for the sigmoid network, but
  train for $20$ epochs instead of $60$.  How well does your network
  perform?  What if you continue out to $60$ epochs?  Try plotting the
  per-epoch validation accuracies for both tanh- and sigmoid-based
  networks, all the way out to $60$ epochs.  If your results are
  similar to mine, you'll find the tanh networks train a little
  faster, but the final accuracies are very similar.  Can you explain
  why the tanh network might train faster?  Can you get a similar
  training speed with the sigmoid, perhaps by changing the learning
  rate, or doing some rescaling*
-->
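  Begin with the same hyper-parameters as for the sigmoid network,
  but train for $20$ epochs instead of $60$.
  How well does your network perform?
  What if you continue out to $60$ epochs?
  Try plotting the per-epoch validation accuracies for both tanh- and sigmoid-based networks, all the way out to $60$ epochs.
  If your results are similar to mine, you'll find the tanh networks train a little faster, but the final accuracies are very similar.
  Can you explain why the tanh network might train faster?
  Can you get a similar training speed with the sigmoid, perhaps by changing the learning rate, or doing some rescaling*?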
<!--
  <span class="marginnote">
*You may perhaps find
    inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.</span>?
-->
  <span class="marginnote">
  *You may perhaps find inspiration in recalling that $\sigma(z) = (1+\tanh(z/2))/2$.</span>
<!--
  Try a half-dozen iterations on the learning hyper-parameters or
  network architecture, searching for ways that tanh may be superior
  to the sigmoid. <em>Note: This is an open-ended problem.
    Personally, I did not find much advantage in switching to tanh,
    although I haven't experimented exhaustively, and perhaps you may
    find a way.  In any case, in a moment we will find an advantage in
    switching to the rectified linear activation function, and so we
    won't go any deeper into the use of tanh.</em>
-->
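  Try a half-dozen iterations on the learning hyper-parameters or network architecture, searching for ways that tanh may be superior to the sigmoid.
  <em>
    Note: This is an open-ended problem.
    Personally, I did not find much advantage in switching to tanh, although I haven't experimented exhaustively,
    and perhaps you may find a way.
    In any case, in a moment we will find an advantage in switching to the rectified linear activation function,
    and so we won't go any deeper into the use of tanh.
  </em>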
</ul></p><p></p><p>
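The identity $\sigma(z) = (1+\tanh(z/2))/2$ mentioned in the problem above is easy to check numerically. The following sketch uses only the Python standard library; the function names and the sample values of $z$ are chosen just for this illustration.
</p>

```python
import math


def sigmoid(z):
    """The sigmoid function 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_via_tanh(z):
    """The same function written using tanh: (1 + tanh(z/2)) / 2."""
    return (1.0 + math.tanh(z / 2.0)) / 2.0


# Largest disagreement between the two forms over a few sample points.
max_difference = max(abs(sigmoid(z) - sigmoid_via_tanh(z))
                     for z in [-5.0, -1.0, 0.0, 0.5, 3.0])
print(max_difference)
```

<p>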
<!--
<strong>Using rectified linear units:</strong> The network we've developed at
this point is actually a variant of one of the networks used in the
seminal 1998
paper*
<span class="marginnote">
*<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based
    learning applied to document recognition"</a>, by Yann LeCun,
  Léon Bottou, Yoshua Bengio, and Patrick Haffner
  (1998).  There are many differences of detail, but broadly speaking
  our network is quite similar to the networks described in the
  paper.</span> introducing the MNIST problem, a network known as LeNet-5.
-->
<strong>Using rectified linear units:</strong>
The network we've developed at this point is actually a variant of one of the networks used in the seminal 1998 paper*
<span class="marginnote">
  *<a href="http://yann.lecun.com/exdb/publis/pdf/lecun-98.pdf">"Gradient-based learning applied to document recognition"</a>, by Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner (1998).
  There are many differences of detail, but broadly speaking our network is quite similar to the networks described in the paper.
</span>
introducing the MNIST problem, a network known as LeNet-5.
<!--
It's a good foundation for further experimentation, and for building
up understanding and intuition.  In particular, there are many ways we
can vary the network in an attempt to improve our results.</p>-->
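Our network is a good foundation for further experimentation, and for building up understanding and intuition.
In particular, there are many ways we can vary the network in an attempt to improve our results.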
</p>
<!--
<p>As a beginning, let's change our neurons so that instead of using a
sigmoid activation function, we use
<a href="chap3.html#other_models_of_artificial_neuron">rectified
  linear units</a>.  That is, we'll use the activation function $f(z)
\equiv \max(0, z)$.  We'll train for $60$ epochs, with a learning rate
of $\eta = 0.03$.  I also found that it helps a little to use some
<a href="chap3.html#overfitting_and_regularization">l2
  regularization</a>, with regularization parameter $\lambda = 0.1$:
-->
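<p>As a beginning, let's change our neurons so that instead of using a sigmoid activation function, we use <a href="chap3.html#other_models_of_artificial_neuron">rectified linear units</a>.
That is, we'll use the activation function $f(z) \equiv \max(0, z)$.
We'll train for $60$ epochs, with a learning rate of $\eta = 0.03$.
I also found that it helps a little to use some <a href="chap3.html#overfitting_and_regularization">l2 regularization</a>, with regularization parameter $\lambda = 0.1$: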
</p><p>
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="kn">from</span> <span class="nn">network3</span> <span class="kn">import</span> <span class="n">ReLU</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
  </pre></div>
  </p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p></p><p>
<!--
I obtained a classification accuracy of $99.23$ percent.  It's a
modest improvement over the sigmoid results ($99.06$).  However,
across all my experiments I found that networks based on rectified
linear units consistently outperformed networks based on sigmoid
activation functions.  There appears to be a real gain in moving to
rectified linear units for this problem.
-->
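I obtained a classification accuracy of $99.23$ percent.
It's a modest improvement over the sigmoid results ($99.06$).
However, across all my experiments I found that networks based on rectified linear units consistently outperformed networks based on sigmoid activation functions.
There appears to be a real gain in moving to rectified linear units for this problem.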
</p>
<p>
<!--
What makes the rectified linear activation function better than the
sigmoid or tanh functions?  At present, we have a poor understanding
of the answer to this question.  Indeed, rectified linear units have
only begun to be widely used in the past few years.  The reason for
that recent adoption is empirical: a few people tried rectified linear
units, often on the basis of hunches or heuristic arguments*
-->
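What makes the rectified linear activation function better than the sigmoid or tanh functions?
At present, we have a poor understanding of the answer to this question.
Indeed, rectified linear units have only begun to be widely used in the past few years.
The reason for that recent adoption is empirical:
a few people tried rectified linear units, often on the basis of hunches or heuristic arguments*.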
<!--
<span class="marginnote">
*A
  common justification is that $\max(0, z)$ doesn't saturate in the
  limit of large $z$, unlike sigmoid neurons, and this helps rectified
  linear units continue learning.  The argument is fine, as far it
  goes, but it's hardly a detailed justification, more of a just-so
  story.  Note that we discussed the problems with saturation back in
  <a href="chap2.html#saturation">Chapter 2</a>.</span>.
-->
<span class="marginnote">
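  *A common justification is that $\max(0, z)$ doesn't saturate in the limit of large $z$,
  unlike sigmoid neurons, and this helps rectified linear units continue learning.
  The argument is fine, as far as it goes,
  but it's hardly a detailed justification, more of a just-so story.
  Note that we discussed the problems with saturation back in <a href="chap2.html#saturation">Chapter 2</a>.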
  </span>
<!--
They got good
results classifying benchmark data sets, and the practice has spread.
In an ideal world we'd have a theory telling us which activation
function to pick for which application.  But at present we're a long
way from such a world.  I should not be at all surprised if further
major improvements can be obtained by an even better choice of
activation function.  And I also expect that in coming decades a
powerful theory of activation functions will be developed.  Today, we
still have to rely on poorly understood rules of thumb and experience.
-->
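They got good results classifying benchmark data sets, and the practice has spread.
In an ideal world we'd have a theory telling us which activation function to pick for which application.
But at present we're a long way from such a world.
I would not be at all surprised if further major improvements can be obtained by an even better choice of activation function.
And I also expect that in coming decades a powerful theory of activation functions will be developed.
Today, we still have to rely on poorly understood rules of thumb and experience.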
</p><p>
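The saturation argument from the margin note can be checked numerically. The following is a minimal sketch, not code from <tt>network3.py</tt>: the helper names are chosen for illustration, and the derivative of the rectified linear function at exactly $z = 0$ is set to zero by convention. It compares the slope of the sigmoid with the slope of $\max(0, z)$ as $z$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # Derivative of the sigmoid; it shrinks towards 0 as |z| grows (saturation).
    s = sigmoid(z)
    return s * (1.0 - s)

def relu(z):
    return np.maximum(0.0, z)

def relu_prime(z):
    # Derivative of max(0, z); the value at z = 0 is taken to be 0 here (an assumption).
    return (z > 0).astype(float)

z = np.array([1.0, 5.0, 10.0])
sigmoid_slopes = sigmoid_prime(z)   # gets smaller and smaller as z grows
relu_slopes = relu_prime(z)         # stays at 1 for every positive z
```

For positive weighted inputs the rectified linear slope stays at 1 while the sigmoid slope shrinks rapidly. That is the heuristic only; it does not by itself explain the accuracy gains reported above.
</p><p>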
<!--
<strong>Expanding the training data:</strong> Another way we may hope to
improve our results is by algorithmically expanding the training data.
A simple way of expanding the training data is to displace each
training image by a single pixel, either up one pixel, down one pixel,
left one pixel, or right one pixel.  We can do this by running the
program <tt>expand_mnist.py</tt> from the shell prompt*
<span class="marginnote">
*The code
  for <tt>expand_mnist.py</tt> is available
  <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/expand_mnist.py">here</a>.</span>:</p><p><div class="highlight"><pre><span></span>
  $ python expand_mnist.py
  </pre></div>
-->
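<strong>Expanding the training data:</strong>
Another way we may hope to improve our results is by algorithmically expanding the training data.
A simple way of expanding the training data is to displace each training image by a single pixel, either up one pixel, down one pixel, left one pixel, or right one pixel.
We can do this by running the program <tt>expand_mnist.py</tt> from the shell prompt*:
<span class="marginnote">
*The code for <tt>expand_mnist.py</tt> is available <a href="https://github.com/mnielsen/neural-networks-and-deep-learning/blob/master/src/expand_mnist.py">here</a>.</span></p><p><div class="highlight"><pre><span></span>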
  $ python expand_mnist.py
  </pre></div>
</p><p>
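To make the one-pixel displacement concrete, here is a minimal NumPy sketch. It is not the code of <tt>expand_mnist.py</tt>: the function names <tt>shift_image</tt> and <tt>expand</tt> are hypothetical, images are assumed to be 2-D arrays, and pixels shifted in from outside the image are assumed to be zero.

```python
import numpy as np

def shift_image(image, dx, dy):
    """Return a copy of a 2-D image shifted by (dx, dy) pixels, padding with zeros."""
    shifted = np.zeros_like(image)
    height, width = image.shape
    # Rows and columns to copy from (src) and where they land (dst).
    src_rows = slice(max(0, -dy), min(height, height - dy))
    dst_rows = slice(max(0, dy), min(height, height + dy))
    src_cols = slice(max(0, -dx), min(width, width - dx))
    dst_cols = slice(max(0, dx), min(width, width + dx))
    shifted[dst_rows, dst_cols] = image[src_rows, src_cols]
    return shifted

def expand(images):
    """Return each image followed by its four one-pixel displacements."""
    displacements = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    return [shift_image(img, dx, dy) for img in images for dx, dy in displacements]

# A toy 28x28 image with a single bright pixel.
image = np.zeros((28, 28))
image[10, 10] = 1.0
expanded = expand([image])
```

Each input image yields five outputs (the original plus four displaced copies), which matches the factor of $5$ in the size of the expanded training set. In a real data set the label of each image would be carried along unchanged.
</p><p>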
<!--
Running this program takes the $50,000$ MNIST training images, and
prepares an expanded training set, with $250,000$ training images.  We
can then use those training images to train our network.  We'll use
the same network as above, with rectified linear units.  In my initial
experiments I reduced the number of training epochs - this made
sense, since we're training with $5$ times as much data.  But, in
fact, expanding the data turned out to considerably reduce the effect
of overfitting.  And so, after some experimentation, I eventually went
back to training for $60$ epochs.  In any case, let's train:
-->
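Running this program takes the $50,000$ MNIST training images, and prepares an expanded training set with $250,000$ training images.
We can then use those training images to train our network.
We'll use the same network as above, with rectified linear units.
In my initial experiments I reduced the number of training epochs.
This made sense, since we're training with $5$ times as much data.
But, in fact, expanding the data turned out to considerably reduce the effect of overfitting.
And so, after some experimentation, I eventually went back to training for $60$ epochs.
In any case, let's train: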
</p><p>
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">expanded_training_data</span><span class="p">,</span> <span class="n">_</span><span class="p">,</span> <span class="n">_</span> <span class="o">=</span> <span class="n">network3</span><span class="o">.</span><span class="n">load_data_shared</span><span class="p">(</span>
          <span class="s2">&quot;../data/mnist_expanded.pkl.gz&quot;</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">expanded_training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
  </pre></div>
  </p><p></p><p>
<!--
Using the expanded training data I obtained a $99.37$ percent training
accuracy.  So this almost trivial change gives a substantial
improvement in classification accuracy.  Indeed, as we
<a href="chap3.html#other_techniques_for_regularization">discussed
  earlier</a> this idea of algorithmically expanding the data can be
taken further.  Just to remind you of the flavour of some of the
results in that earlier discussion: in 2003 Simard, Steinkraus and
Platt*<span class="marginnote">
*<a href="http://dx.doi.org/10.1109/ICDAR.2003.1227801">Best
    Practices for Convolutional Neural Networks Applied to Visual
    Document Analysis</a>, by Patrice Simard, Dave Steinkraus, and John
  Platt (2003).</span> improved their MNIST performance to $99.6$ percent
using a neural network otherwise very similar to ours, using two
convolutional-pooling layers, followed by a hidden fully-connected
layer with $100$ neurons.
-->
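Using the expanded training data I obtained a $99.37$ percent classification accuracy.
So this almost trivial change gives a substantial improvement in classification accuracy.
Indeed, as we <a href="chap3.html#other_techniques_for_regularization">discussed earlier</a>, this idea of algorithmically expanding the data can be taken further.
Just to remind you of the flavour of some of the results in that earlier discussion:
in 2003 Simard, Steinkraus and Platt* improved their MNIST performance to $99.6$ percent.
They used a neural network with two convolutional-pooling layers, followed by a hidden fully-connected layer with $100$ neurons.
That network is otherwise very similar to ours.
<span class="marginnote">
*<a href="http://dx.doi.org/10.1109/ICDAR.2003.1227801">
Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis</a>, by Patrice Simard, Dave Steinkraus, and John Platt (2003).</span>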
<!--
There were a few differences of detail in
their architecture - they didn't have the advantage of using
rectified linear units, for instance - but the key to their improved
performance was expanding the training data.  They did this by
rotating, translating, and skewing the MNIST training images.  They
also developed a process of "elastic distortion", a way of emulating
the random oscillations hand muscles undergo when a person is writing.
By combining all these processes they substantially increased the
effective size of their training data, and that's how they achieved
$99.6$ percent accuracy.
-->
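There were a few differences of detail in their architecture.
They didn't have the advantage of using rectified linear units, for instance.
But the key to their improved performance was expanding the training data.
They did this by rotating, translating, and skewing the MNIST training images.
They also developed a process of "elastic distortion".
This is a way of emulating the random oscillations hand muscles undergo when a person is writing.
By combining all these processes they substantially increased the effective size of their training data, and that's how they achieved $99.6$ percent accuracy.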
</p><p>
<!--
<h4><a name="problem_437600"></a><a href="#problem_437600">Problem</a></h4><ul>
<li> The idea of convolutional layers is to behave in an invariant
  way across images.  It may seem surprising, then, that our network
  can learn more when all we've done is translate the input data.  Can
  you explain why this is actually quite reasonable?
-->
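<h4><a name="problem_437600"></a><a href="#problem_437600">Problem</a></h4><ul>
<li>
  The idea of convolutional layers is to behave in an invariant way across images.
  It may seem surprising, then, that our network can learn more when all we've done is translate the input data.
  Can you explain why this is actually quite reasonable?
</ul>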
</p><p>
<!--
<strong>Inserting an extra fully-connected layer:</strong> Can we do even
better?  One possibility is to use exactly the same procedure as
above, but to expand the size of the fully-connected layer.  I tried
with $300$ and $1,000$ neurons, obtaining results of $99.46$ and
$99.43$ percent, respectively.  That's interesting, but not really a
convincing win over the earlier result ($99.37$ percent).
-->
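<strong>Inserting an extra fully-connected layer:</strong>
Can we do even better?
One possibility is to use exactly the same procedure as above, but to expand the size of the fully-connected layer.
I tried with $300$ and $1,000$ neurons, obtaining results of $99.46$ and $99.43$ percent, respectively.
That's interesting, but not really a convincing win over the earlier result ($99.37$ percent).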
</p><p>
<!--
What about adding an extra fully-connected layer?  Let's try inserting
an extra fully-connected layer, so that we have two $100$-hidden
neuron fully-connected layers:
-->
What about adding an extra fully-connected layer?
Let's try inserting an extra fully-connected layer, so that we have two $100$-hidden neuron fully-connected layers:
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">100</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">)],</span> <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">expanded_training_data</span><span class="p">,</span> <span class="mi">60</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.1</span><span class="p">)</span>
  </pre></div>
  </p><p>
<!--
Doing this, I obtained a test accuracy of $99.43$ percent.  Again, the
expanded net isn't helping so much.  Running similar experiments with
fully-connected layers containing $300$ and $1,000$ neurons yields
results of $99.48$ and $99.47$ percent.  That's encouraging, but still
falls short of a really decisive win.
-->
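Doing this, I obtained a test accuracy of $99.43$ percent.
Again, the expanded net isn't helping so much.
Running similar experiments with fully-connected layers containing $300$ and $1,000$ neurons yields results of $99.48$ and $99.47$ percent.
That's encouraging, but still falls short of a really decisive win.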
</p><p>
<a name="final_conv"></a>
</p><p>
<!--
What's going on here?  Is it that the expanded or extra
fully-connected layers really don't help with MNIST?  Or might it be
that our network has the capacity to do better, but we're going about
learning the wrong way?  For instance, maybe we could use stronger
regularization techniques to reduce the tendency to overfit.  One
possibility is the
<a href="chap3.html#other_techniques_for_regularization">dropout</a>
technique introduced back in Chapter 3.  Recall that the basic idea of
dropout is to remove individual activations at random while training
the network.  This makes the model more robust to the loss of
individual pieces of evidence, and thus less likely to rely on
particular idiosyncracies of the training data.  Let's try applying
dropout to the final fully-connected layers:
-->
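What's going on here?
Is it that the expanded or extra fully-connected layers really don't help with MNIST?
Or might it be that our network has the capacity to do better, but we're going about learning the wrong way?
For instance, maybe we could use stronger regularization techniques to reduce the tendency to overfit.
One possibility is the <a href="chap3.html#other_techniques_for_regularization">dropout</a> technique introduced back in Chapter 3.
Recall the basic idea of dropout.
It is to remove individual activations at random while training the network.
This makes the model more robust to the loss of individual pieces of evidence, and thus less likely to rely on particular idiosyncrasies of the training data.
Let's try applying dropout to the final fully-connected layers: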
</p><p>
<div class="highlight"><pre><span></span><span class="o">&gt;&gt;&gt;</span> <span class="n">net</span> <span class="o">=</span> <span class="n">Network</span><span class="p">([</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">28</span><span class="p">,</span> <span class="mi">28</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">20</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">ConvPoolLayer</span><span class="p">(</span><span class="n">image_shape</span><span class="o">=</span><span class="p">(</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">12</span><span class="p">,</span> <span class="mi">12</span><span class="p">),</span>
                        <span class="n">filter_shape</span><span class="o">=</span><span class="p">(</span><span class="mi">40</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">),</span>
                        <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                        <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span>
              <span class="n">n_in</span><span class="o">=</span><span class="mi">40</span><span class="o">*</span><span class="mi">4</span><span class="o">*</span><span class="mi">4</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">),</span>
          <span class="n">FullyConnectedLayer</span><span class="p">(</span>
              <span class="n">n_in</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">ReLU</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">),</span>
          <span class="n">SoftmaxLayer</span><span class="p">(</span><span class="n">n_in</span><span class="o">=</span><span class="mi">1000</span><span class="p">,</span> <span class="n">n_out</span><span class="o">=</span><span class="mi">10</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.5</span><span class="p">)],</span>
          <span class="n">mini_batch_size</span><span class="p">)</span>
  <span class="o">&gt;&gt;&gt;</span> <span class="n">net</span><span class="o">.</span><span class="n">SGD</span><span class="p">(</span><span class="n">expanded_training_data</span><span class="p">,</span> <span class="mi">40</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="mf">0.03</span><span class="p">,</span>
              <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">)</span>
  </pre></div>
  </p><p>
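The listing above passes <tt>p_dropout=0.5</tt> to the fully-connected and softmax layers. As a rough illustration of what that parameter means, the following NumPy sketch zeroes each activation independently with probability <tt>p_dropout</tt>. It is a stand-alone sketch rather than the Theano code used by <tt>network3.py</tt>; the function name <tt>dropout</tt> and the fixed random seed are assumptions made for the example.

```python
import numpy as np

def dropout(activations, p_dropout, rng):
    """Zero each activation independently with probability p_dropout (training time only)."""
    # keep_mask is 1 where an activation survives and 0 where it is dropped.
    keep_mask = rng.binomial(n=1, p=1.0 - p_dropout, size=activations.shape)
    return activations * keep_mask

rng = np.random.default_rng(0)       # fixed seed so the example is repeatable
activations = np.ones(1000)          # stand-in for the outputs of a 1,000-neuron layer
dropped = dropout(activations, 0.5, rng)
kept_fraction = dropped.mean()       # close to 0.5 for p_dropout = 0.5
```

With <tt>p_dropout=0.5</tt> roughly half of the $1,000$ activations are removed on each pass, which is one reason a larger layer is reasonable when dropout is used. Implementations also compensate for the missing activations when the full network is evaluated; the exact convention varies and is not shown here.
</p><p>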
<!--
Using this, we obtain an accuracy of $99.60$ percent, which is a
substantial improvement over our earlier results, especially our main
benchmark, the network with $100$ hidden neurons, where we achieved
$99.37$ percent.
-->
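Using this, we obtain an accuracy of $99.60$ percent.
That is a substantial improvement over our earlier results, especially our main benchmark, the network with $100$ hidden neurons, where we achieved $99.37$ percent.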
</p><p>
<!--
There are two changes worth noting.  </p><p>First, I reduced the number of training epochs to $40$: dropout
reduced overfitting, and so we learned faster.</p><p>Second, the fully-connected hidden layers have $1,000$ neurons, not
the $100$ used earlier. Of course, dropout effectively omits many of
the neurons while training, so some expansion is to be expected.  In
fact, I tried experiments with both $300$ and $1,000$ hidden neurons,
and obtained (very slightly) better validation performance with
$1,000$ hidden neurons.
-->
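There are two changes worth noting.
</p><p>First, I reduced the number of training epochs to $40$.
Dropout reduced overfitting, and so we learned faster.
</p><p>Second, the fully-connected hidden layers have $1,000$ neurons, not the $100$ used earlier.
Of course, dropout effectively omits many of the neurons while training, so some expansion is to be expected.
In fact, I tried experiments with both $300$ and $1,000$ hidden neurons, and obtained (very slightly) better validation performance with $1,000$ hidden neurons.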
</p><p>
<!--
<strong>Using an ensemble of networks:</strong> An easy way to improve
performance still further is to create several neural networks, and
then get them to vote to determine the best classification.  Suppose,
for example, that we trained $5$ different neural networks using the
prescription above, with each achieving accuracies near to $99.6$
percent.  Even though the networks would all have similar accuracies,
they might well make different errors, due to the different random
initializations.  It's plausible that taking a vote amongst our $5$
networks might yield a classification better than any individual
network.
-->
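<strong>Using an ensemble of networks:</strong>
An easy way to improve performance still further is to create several neural networks, and then get them to vote to determine the best classification.
Suppose, for example, that we trained $5$ different neural networks using the prescription above, with each achieving accuracies near to $99.6$ percent.
Even though the networks would all have similar accuracies, they might well make different errors, due to the different random initializations.
It's plausible that taking a vote amongst our $5$ networks might yield a classification better than any individual network.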
</p><p>
<!--
This sounds too good to be true, but this kind of ensembling is a
common trick with both neural networks and other machine learning
techniques.  And it does in fact yield further improvements: we end up
with $99.67$ percent accuracy.  In other words, our ensemble of
networks classifies all but $33$ of the $10,000$ test images
correctly.
-->
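This sounds too good to be true.
But this kind of ensembling is a common trick with both neural networks and other machine learning techniques.
And it does in fact yield further improvements: we end up with $99.67$ percent accuracy.
In other words, our ensemble of networks classifies all but $33$ of the $10,000$ test images correctly.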
</p><p>
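The voting step can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the procedure used to produce the $99.67$ percent figure: the function name <tt>majority_vote</tt> is hypothetical, each network is assumed to output a single digit label per image, and ties are broken in favour of the smallest label.

```python
import numpy as np

def majority_vote(predictions):
    """Combine label predictions of shape (n_networks, n_images) into one label per image.

    Ties are resolved in favour of the smallest digit (an arbitrary choice).
    """
    predictions = np.asarray(predictions)
    n_classes = 10
    # counts[c, i] is the number of networks that voted for digit c on image i.
    counts = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return counts.argmax(axis=0)

# Toy example: 5 networks classifying 3 images.
votes = np.array([
    [3, 5, 8],
    [3, 3, 8],
    [5, 5, 9],
    [3, 5, 8],
    [3, 6, 9],
])
ensemble_labels = majority_vote(votes)
```

In the toy example no single network needs to be right on every image; the ensemble label for each image is simply the digit chosen by most of the five networks.
</p><p>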
<!--
The remaining errors in the test set are shown below.  The label in
the top right is the correct classification, according to the MNIST
data, while in the bottom right is the label output by our ensemble of
nets:
-->
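The remaining errors in the test set are shown below.
The label in the top right is the correct classification, according to the MNIST data, while in the bottom right is the label output by our ensemble of nets: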
</p><p>
<center><img src="images/ensemble_errors.png" width="580px"></center>
</p><p>
<!--
It's worth looking through these in detail. The first two digits, a 6
and a 5, are genuine errors by our ensemble.  However, they're also
understandable errors, the kind a human could plausibly make. That 6
really does look a lot like a 0, and the 5 looks a lot like a 3.  The
third image, supposedly an 8, actually looks to me more like a 9.  So
I'm siding with the network ensemble here: I think it's done a better
job than whoever originally drew the digit.  On the other hand, the
fourth image, the 6, really does seem to be classified badly by our
networks.
-->
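It's worth looking through these in detail.
The first two digits, a 6 and a 5, are genuine errors by our ensemble.
However, they're also understandable errors.
They are the kind a human could plausibly make.
That 6 really does look a lot like a 0, and the 5 looks a lot like a 3.
The third image, supposedly an 8, actually looks to me more like a 9.
So I'm siding with the network ensemble here.
I think it's done a better job than whoever originally drew the digit.
On the other hand, the fourth image, the 6, really does seem to be classified badly by our networks.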
</p><p>
<!--
And so on.  In most cases our networks' choices seem at least
plausible, and in some cases they've done a better job classifying
than the original person did writing the digit.  Overall, our networks
offer exceptional performance, especially when you consider that they
correctly classified 9,967 images which aren't shown.  In that
context, the few clear errors here seem quite understandable.  Even a
careful human makes the occasional mistake.  And so I expect that only
an extremely careful and methodical human would do much better.  Our
network is getting near to human performance.
-->
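And so on. In most cases our networks' choices seem at least plausible.
In some cases they've done a better job classifying than the original person did writing the digit.
Overall, our networks offer exceptional performance, especially when you consider that they correctly classified 9,967 images which aren't shown.
In that context, the few clear errors here seem quite understandable.
Even a careful human makes the occasional mistake.
And so I expect that only an extremely careful and methodical human would do much better.
Our network is getting near to human performance.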
</p><p>
<!--
<strong>Why we only applied dropout to the fully-connected layers:</strong> If
you look carefully at the code above, you'll notice that we applied
dropout only to the fully-connected section of the network, not to the
convolutional layers.  In principle we could apply a similar procedure
to the convolutional layers.  But, in fact, there's no need: the
convolutional layers have considerable inbuilt resistance to
overfitting.  The reason is that the shared weights mean that
convolutional filters are forced to learn from across the entire
image.  This makes them less likely to pick up on local idiosyncracies
in the training data.  And so there is less need to apply other
regularizers, such as dropout.
-->
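<strong>Why we only applied dropout to the fully-connected layers:</strong>
If you look carefully at the code above, you'll notice that we applied dropout only to the fully-connected section of the network,
not to the convolutional layers.
In principle we could apply a similar procedure to the convolutional layers.
But, in fact, there's no need.
The convolutional layers have considerable inbuilt resistance to overfitting.
The reason is that the shared weights mean that convolutional filters are forced to learn from across the entire image.
This makes them less likely to pick up on local idiosyncrasies in the training data.
And so there is less need to apply other regularizers, such as dropout.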
</p><p>
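One way to see how strongly weight sharing constrains the convolutional layers is to count their parameters. The sketch below uses the <tt>filter_shape</tt> values from the listings above and assumes one bias per filter. The fully-connected comparison is hypothetical: it imagines a dense layer mapping the $28 \times 28$ input to the same $20 \times 24 \times 24$ outputs that the first convolutional layer produces before pooling.

```python
def conv_params(n_filters, n_input_maps, filter_height, filter_width):
    # Shared weights: one set of filter entries per output feature map,
    # plus one bias per filter (an assumption about the layer's layout).
    return n_filters * n_input_maps * filter_height * filter_width + n_filters

def fully_connected_params(n_in, n_out):
    # One weight per input-output pair, plus one bias per output neuron.
    return n_in * n_out + n_out

first_conv = conv_params(20, 1, 5, 5)      # filter_shape=(20, 1, 5, 5)
second_conv = conv_params(40, 20, 5, 5)    # filter_shape=(40, 20, 5, 5)

# Hypothetical dense layer with the same number of outputs as the first
# convolutional layer before pooling: 20 feature maps of 24 x 24.
dense_equivalent = fully_connected_params(28 * 28, 20 * 24 * 24)
```

Under these assumptions the first convolutional layer has 520 parameters and the second has 20,040, while the hypothetical dense layer would need just over nine million. With so few free parameters, each filter has to fit patterns that recur across the whole image.
</p><p>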
<!--
<strong>Going further:</strong> It's possible to improve performance on MNIST
still further. Rodrigo Benenson has compiled an
<a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">informative
  summary page</a>, showing progress over the years, with links to
papers.  Many of these papers use deep convolutional networks along
lines similar to the networks we've been using.  If you dig through
the papers you'll find many interesting techniques, and you may enjoy
implementing some of them.  If you do so it's wise to start
implementation with a simple network that can be trained quickly,
which will help you more rapidly understand what is going on.
-->
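<strong>Going further:</strong>
It's possible to improve performance on MNIST still further.
Rodrigo Benenson has compiled an
<a href="http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html">informative summary page</a>.
It shows progress over the years, with links to papers.
Many of these papers use deep convolutional networks along lines similar to the networks we've been using.
If you dig through the papers you'll find many interesting techniques.
You may enjoy implementing some of them.
If you do so, it's wise to start implementation with a simple network that can be trained quickly.
That will help you more rapidly understand what is going on.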
</p><p>
<!--
For the most part, I won't try to survey this recent work.  But I
can't resist making one exception.  It's a 2010 paper by Cireșan,
Meier, Gambardella, and
Schmidhuber*<span class="marginnote">
*<a href="http://arxiv.org/abs/1003.0358">Deep, Big,
    Simple Neural Nets Excel on Handwritten Digit Recognition</a>, by Dan
  Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen
  Schmidhuber (2010).</span>.
-->
For the most part, I won't try to survey this recent work.
But I can't resist making one exception.
It's a 2010 paper by Cireșan, Meier, Gambardella, and Schmidhuber*<span class="marginnote">
*<a href="http://arxiv.org/abs/1003.0358">Deep, Big,
    Simple Neural Nets Excel on Handwritten Digit Recognition</a>, by Dan Claudiu Cireșan, Ueli Meier, Luca Maria Gambardella, and Jürgen
  Schmidhuber (2010).</span>.
<!--
What I like about this paper is how simple it
is.  The network is a many-layer neural network, using only
fully-connected layers (no convolutions).  Their most successful
network had hidden layers containing $2,500$, $2,000$, $1,500$,
$1,000$, and $500$ neurons, respectively.  They used ideas similar to
Simard <em>et al</em> to expand their training data.  But apart from
that, they used few other tricks, including no convolutional layers:
it was a plain, vanilla network, of the kind that, with enough
patience, could have been trained in the 1980s (if the MNIST data set
had existed), given enough computing power.  They achieved a
classification accuracy of $99.65$ percent, more or less the same as
ours.  The key was to use a very large, very deep network, and to use
a GPU to speed up training.  This let them train for many epochs.
They also took advantage of their long training times to gradually
decrease the learning rate from $10^{-3}$ to $10^{-6}$.  It's a fun
exercise to try to match these results using an architecture like
theirs.
-->
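What I like about this paper is how simple it is.
The network is a many-layer neural network, using only fully-connected layers (no convolutions).
Their most successful network had hidden layers containing $2,500$, $2,000$, $1,500$, $1,000$, and $500$ neurons, respectively.
They used ideas similar to Simard <em>et al</em> to expand their training data.
But apart from that, they used few other tricks.
In particular, they used no convolutional layers.
It was a plain, vanilla network, of the kind that, with enough patience, could have been trained in the 1980s (if the MNIST data set had existed), given enough computing power.
They achieved a classification accuracy of $99.65$ percent, more or less the same as ours.
The key was to use a very large, very deep network, and to use a GPU to speed up training.
This let them train for many epochs.
They also took advantage of their long training times to gradually decrease the learning rate from $10^{-3}$ to $10^{-6}$.
It's a fun exercise to try to match these results using an architecture like theirs.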
</p><p>
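The text says only that the learning rate was decreased gradually from $10^{-3}$ to $10^{-6}$; it does not give the schedule. As one possible illustration, the sketch below assumes a geometric decay, where the rate shrinks by a constant factor each epoch. The function name <tt>geometric_schedule</tt> and the number of epochs are hypothetical.

```python
def geometric_schedule(eta_start, eta_end, n_epochs):
    """Learning rates that shrink by a constant factor each epoch, from eta_start to eta_end."""
    if n_epochs == 1:
        return [eta_start]
    # Constant per-epoch factor chosen so the last epoch lands on eta_end.
    factor = (eta_end / eta_start) ** (1.0 / (n_epochs - 1))
    return [eta_start * factor ** epoch for epoch in range(n_epochs)]

# With only 4 epochs the rate drops by a factor of 10 each time:
# approximately 1e-3, 1e-4, 1e-5, 1e-6.
rates = geometric_schedule(1e-3, 1e-6, 4)
```

A real run would use many more epochs, so the per-epoch factor would be much closer to 1. Other schedules, such as stepwise or linear decay, would also fit the description in the text.
</p><p>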
<!--
<strong>Why are we able to train?</strong>  We saw in <a href="chap5.html">the
  last chapter</a> that there are fundamental obstructions to training in
deep, many-layer neural networks.  In particular, we saw that the
gradient tends to be quite unstable: as we move from the output layer
to earlier layers the gradient tends to either vanish (the vanishing
gradient problem) or explode (the exploding gradient problem).  Since
the gradient is the signal we use to train, this causes problems.
-->
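<strong>Why are we able to train?</strong>
We saw in <a href="chap5.html">the last chapter</a> that there are fundamental obstructions to training in deep, many-layer neural networks.
In particular, we saw that the gradient tends to be quite unstable.
As we move from the output layer to earlier layers the gradient tends to either vanish (the vanishing gradient problem) or explode (the exploding gradient problem).
Since the gradient is the signal we use to train, this causes problems.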
</p><p>
<!--How have we avoided those results?-->
How have we avoided those results?
</p><p>
<!--
Of course, the answer is that we haven't avoided these results.
Instead, we've done a few things that help us proceed anyway.  In
particular: (1) Using convolutional layers greatly reduces the number
of parameters in those layers, making the learning problem much
easier; (2) Using more powerful regularization techniques (notably
dropout and convolutional layers) to reduce overfitting, which is
otherwise more of a problem in more complex networks; (3) Using
rectified linear units instead of sigmoid neurons, to speed up
training - empirically, often by a factor of $3$-$5$; (4) Using GPUs
and being willing to train for a long period of time.  In particular,
in our final experiments we trained for $40$ epochs using a data set
$5$ times larger than the raw MNIST training data.  Earlier in the
book we mostly trained for $30$ epochs using just the raw training
data.  Combining factors (3) and (4) it's as though we've trained a
factor perhaps $30$ times longer than before.
-->
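Of course, the answer is that we haven't avoided these results.
Instead, we've done a few things that help us proceed anyway.
In particular: (1) using convolutional layers greatly reduces the number of parameters in those layers, making the learning problem much easier;
(2) using more powerful regularization techniques (notably dropout and convolutional layers) reduces overfitting, which is otherwise more of a problem in more complex networks;
(3) using rectified linear units instead of sigmoid neurons speeds up training,
empirically often by a factor of $3$-$5$;
(4) using GPUs and being willing to train for a long period of time. In particular, in our final experiments we trained for $40$ epochs using a data set $5$ times larger than the raw MNIST training data.
Earlier in the book we mostly trained for $30$ epochs using just the raw training data.
Combining factors (3) and (4), it's as though we've trained perhaps $30$ times longer than before.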
</p><p>
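The figure of roughly $30$ can be reproduced with a short calculation. This is one reading of the estimate, not a derivation given in the text: it assumes the effective training time is the ratio of training images processed, multiplied by the empirical speed-up from rectified linear units.

```python
raw_epochs = 30        # earlier chapters: epochs on the raw training data
final_epochs = 40      # final experiments in this chapter
data_expansion = 5     # the expanded set is 5 times the raw MNIST training data

# Ratio of training images processed: 40 epochs on 5x data versus 30 epochs on 1x data.
data_ratio = final_epochs * data_expansion / raw_epochs

# Empirical speed-up range for rectified linear units quoted in the text.
relu_speedup_low, relu_speedup_high = 3, 5
effective_low = data_ratio * relu_speedup_low
effective_high = data_ratio * relu_speedup_high
```

Under these assumptions the data ratio is about $6.7$, and combining it with a speed-up of $3$ to $5$ gives an effective factor between about $20$ and $33$, consistent with the rough estimate of $30$.
</p><p>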
<!--
Your response may be "Is that it? Is that all we had to do to train
deep networks?  What's all the fuss about?"
-->
Your response may be:
"Is that it? Is that all we had to do to train deep networks? What's all the fuss about?"
</p><p>
<!--
Of course, we've used other ideas, too: making use of sufficiently
large data sets (to help avoid overfitting); using the right cost
function (to
<a href="chap3.html#the_cross\-entropy_cost_function">avoid a
  learning slowdown</a>); using
<a href="chap3.html#weight_initialization">good weight initializations</a>
(also to avoid a learning slowdown, due to neuron saturation);
<a href="chap3.html#other_techniques_for_regularization">algorithmically
  expanding the training data</a>.  We discussed these and other ideas in
earlier chapters, and have for the most part been able to reuse these
ideas with little comment in this chapter.
-->
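Of course, we've used other ideas, too:
making use of sufficiently large data sets (to help avoid overfitting);
using the right cost function (to <a href="chap3.html#the_cross\-entropy_cost_function">avoid a learning slowdown</a>);
using <a href="chap3.html#weight_initialization">good weight initializations</a> (also to avoid a learning slowdown, due to neuron saturation);
and <a href="chap3.html#other_techniques_for_regularization">algorithmically expanding the training data</a>.
We discussed these and other ideas in earlier chapters, and have for the most part been able to reuse them with little comment in this chapter.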
</p><p>
<!--
With that said, this really is a rather simple set of ideas.  Simple,
but powerful, when used in concert.  Getting started with deep
learning has turned out to be pretty easy!
-->
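With that said, this really is a rather simple set of ideas.
Simple, but powerful, when used in concert.
Getting started with deep learning has turned out to be pretty easy!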
</p><p>
<!--
<strong>How deep are these networks, anyway?</strong> Counting the
convolutional-pooling layers as single layers, our final architecture
has $4$ hidden layers.  Does such a network really deserve to be
called a <em>deep</em> network?  Of course, $4$ hidden layers is many
more than in the shallow networks we studied earlier.  Most of those
networks only had a single hidden layer, or occasionally $2$ hidden
layers.  On the other hand, as of 2015 state-of-the-art deep networks
sometimes have dozens of hidden layers.  I've occasionally heard
people adopt a deeper-than-thou attitude, holding that if you're not
keeping-up-with-the-Joneses in terms of number of hidden layers, then
you're not really doing deep learning.  I'm not sympathetic to this
attitude, in part because it makes the definition of deep learning
into something which depends upon the result-of-the-moment. The real
breakthrough in deep learning was to realize that it's practical to go
beyond the shallow $1$- and $2$-hidden layer networks that dominated
work until the mid-2000s.  That really was a significant breakthrough,
opening up the exploration of much more expressive models.  But beyond
that, the number of layers is not of primary fundamental interest.
Rather, the use of deeper networks is a tool to use to help achieve
other goals - like better classification accuracies.
-->
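<strong>How deep are these networks, anyway?</strong>
Counting the convolutional-pooling layers as single layers, our final architecture has $4$ hidden layers.
Does such a network really deserve to be called a <em>deep</em> network?
Of course, $4$ hidden layers is many more than in the shallow networks we studied earlier.
Most of those networks had only a single hidden layer, or occasionally $2$ hidden layers.
On the other hand, as of 2015 state-of-the-art deep networks sometimes have dozens of hidden layers.
I've occasionally heard people adopt a deeper-than-thou attitude,
holding that if you're not keeping up with the Joneses in terms of number of hidden layers, then you're not really doing deep learning.
I'm not sympathetic to this attitude, in part because it makes the definition of deep learning into something which depends upon the result of the moment.
The real breakthrough in deep learning was to realize that it's practical to go beyond the shallow $1$- and $2$-hidden-layer networks that dominated work until the mid-2000s.
That really was a significant breakthrough, opening up the exploration of much more expressive models.
But beyond that, the number of layers is not of primary fundamental interest.
Rather, the use of deeper networks is a tool to help achieve other goals, like better classification accuracies.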
</p><p>
<!--
<strong>A word on procedure:</strong> In this section, we've smoothly moved
from single hidden-layer shallow networks to many-layer convolutional
networks.  It all seemed so easy!  We make a change and, for the
most part, we get an improvement.  If you start experimenting, I can
guarantee things won't always be so smooth.  The reason is that I've
presented a cleaned-up narrative, omitting many experiments -
including many failed experiments.  This cleaned-up narrative will
hopefully help you get clear on the basic ideas.  But it also runs the
risk of conveying an incomplete impression.  Getting a good, working
network can involve a lot of trial and error, and occasional
frustration.  In practice, you should expect to engage in quite a bit
of experimentation.  To speed that process up you may find it helpful
to revisit Chapter 3's discussion of
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">how
  to choose a neural network's hyper-parameters</a>, and perhaps also to
look at some of the further reading suggested in that section.
-->
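<strong>A word on procedure:</strong>
In this section, we've smoothly moved from single hidden-layer shallow networks to many-layer convolutional networks.
It all seemed so easy!
We make a change and, for the most part, we get an improvement.
If you start experimenting, I can guarantee things won't always be so smooth.
The reason is that I've presented a cleaned-up narrative, omitting many experiments,
including many failed experiments.
This cleaned-up narrative will hopefully help you get clear on the basic ideas.
But it also runs the risk of conveying an incomplete impression.
Getting a good, working network can involve a lot of trial and error, and occasional frustration.
In practice, you should expect to engage in quite a bit of experimentation.
To speed that process up you may find it helpful to revisit Chapter 3's discussion of
<a href="chap3.html#how_to_choose_a_neural_network's_hyper-parameters">how to choose a neural network's hyper-parameters</a>,
and perhaps also to look at some of the further reading suggested in that section.</p><p>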
<h3><a name="the_code_for_our_convolutional_networks"></a><a href="#the_code_for_our_convolutional_networks"><!-- The code for our convolutional networks -->The code for our convolutional networks</a></h3></p><p><!--Alright, let's take a look at the code for our program, <tt>network3.py</tt>.  Structurally, it's similar to <tt>network2.py</tt>, the program we developed in <a href="chap3.html">Chapter 3</a>, although the details differ, due to the use of Theano.  We'll start by looking at the <tt>FullyConnectedLayer</tt> class, which is similar to the layers studied earlier in the book.  Here's the code (discussion below)*<span class="marginnote">*Note added November 2016: several readers have noted  that in the line initializing <tt>self.w</tt>, I set  <tt>scale=np.sqrt(1.0/n_out)</tt>, when the arguments of Chapter 3  suggest a better initialization may be  <tt>scale=np.sqrt(1.0/n_in)</tt>.  This was simply a mistake on my  part.  In an ideal world I'd rerun all the examples in this chapter  with the correct code. 
Still, I've moved on to other projects, so am  going to let the error go.</span>:-->Alright, let's take a look at the code for our program, <tt>network3.py</tt>.
Structurally, it's similar to <tt>network2.py</tt>, the program we developed in <a href="chap3.html">Chapter 3</a>, although the details differ, due to the use of Theano.
We'll start by looking at the <tt>FullyConnectedLayer</tt> class, which is similar to the layers studied earlier in the book.
Here's the code (discussion below)*<span class="marginnote">*Note added November 2016: several readers have noted that in the line initializing <tt>self.w</tt>, I set <tt>scale=np.sqrt(1.0/n_out)</tt>, when the arguments of Chapter 3 suggest a better initialization may be <tt>scale=np.sqrt(1.0/n_in)</tt>.  This was simply a mistake on my part.  In an ideal world I'd rerun all the examples in this chapter with the correct code.  Still, I've moved on to other projects, so am going to let the error go.</span>:</p><p><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">FullyConnectedLayer</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">sigmoid</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span> <span class="o">=</span> <span class="n">n_in</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">n_out</span> <span class="o">=</span> <span class="n">n_out</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span> <span class="o">=</span> <span class="n">activation_fn</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span> <span class="o">=</span> <span class="n">p_dropout</span>
        <span class="c1"># Initialize weights and biases</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span>
                <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span>
                    <span class="n">loc</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="n">n_out</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">)),</span>
                <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">name</span><span class="o">=</span><span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_out</span><span class="p">,)),</span>
                       <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">name</span><span class="o">=</span><span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inpt</span><span class="p">,</span> <span class="n">inpt_dropout</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">inpt</span> <span class="o">=</span> <span class="n">inpt</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="p">(</span>
            <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span class="p">)</span><span class="o">*</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">y_out</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span> <span class="o">=</span> <span class="n">dropout_layer</span><span class="p">(</span>
            <span class="n">inpt_dropout</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">)),</span> <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="p">(</span>
            <span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>

    <span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
        <span class="s2">&quot;Return the accuracy for the mini-batch.&quot;</span>
        <span class="k">return</span> <span class="n">T</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_out</span><span class="p">))</span>
</pre></div>  </p><p><!--Much of the <tt>__init__</tt> method is self-explanatory, but a few remarks may help clarify the code.  As per usual, we randomly initialize the weights and biases as normal random variables with suitable standard deviations.  The lines doing this look a little forbidding.  However, most of the complication is just loading the weights and biases into what Theano calls shared variables.  This ensures that these variables can be processed on the GPU, if one is available.  We won't get too much into the details of this.  If you're interested, you can dig into the <a href="http://deeplearning.net/software/theano/index.html">Theano  documentation</a>.  Note also that this weight and bias initialization is designed for the sigmoid activation function (as <a href="chap3.html#weight_initialization">discussed earlier</a>).
-->Much of the <tt>__init__</tt> method is self-explanatory, but a few remarks may help clarify the code.
As per usual, we randomly initialize the weights and biases as normal random variables with suitable standard deviations.
The lines doing this look a little forbidding.
However, most of the complication is just loading the weights and biases into what Theano calls shared variables.
This ensures that these variables can be processed on the GPU, if one is available.
We won't get too much into the details of this.
If you're interested, you can dig into the <a href="http://deeplearning.net/software/theano/index.html">Theano documentation</a>.
Note also that this weight and bias initialization is designed for the sigmoid activation function (as <a href="chap3.html#weight_initialization">discussed earlier</a>).<!--Ideally, we'd initialize the weights and biases somewhat differently for activation functions such as the tanh and rectified linear function.  This is discussed further in problems below.  The <tt>__init__</tt> method finishes with <tt>self.params = [self.w, self.b]</tt>.  This is a handy way to bundle up all the learnable parameters associated to the layer.  Later on, the <tt>Network.SGD</tt> method will use <tt>params</tt> attributes to figure out what variables in a <tt>Network</tt> instance can learn.-->
Ideally, we'd initialize the weights and biases somewhat differently for activation functions such as the tanh and rectified linear function.
This is discussed further in problems below.
The <tt>__init__</tt> method finishes with <tt>self.params = [self.w, self.b]</tt>.
This is a handy way to bundle up all the learnable parameters associated to the layer.
Later on, the <tt>Network.SGD</tt> method will use <tt>params</tt> attributes to figure out what variables in a <tt>Network</tt> instance can learn.</p><p><!--The <tt>set_inpt</tt> method is used to set the input to the layer, and to compute the corresponding output.  I use the name <tt>inpt</tt> rather than <tt>input</tt> because <tt>input</tt> is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.  Note that we actually set the input in two separate ways: as <tt>self.inpt</tt> and <tt>self.inpt_dropout</tt>.  This is done because during training we may want to use dropout.  If that's the case then we want to remove a fraction <tt>self.p_dropout</tt> of the neurons.  That's what the function <tt>dropout_layer</tt> in the second-last line of the <tt>set_inpt</tt> method is doing.  
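-->
To see why the choice of <tt>scale</tt> discussed in the marginal note above matters, here's a small experiment in plain Python (no Theano). It's only a sketch under simplifying assumptions that aren't in <tt>network3.py</tt>: every input activation is $1$, the bias is ignored, and the layer sizes <tt>n_in = 784</tt> and <tt>n_out = 100</tt> are chosen just for illustration. The helper name <tt>weighted_input_std</tt> is hypothetical.

```python
import math
import random


def weighted_input_std(n_in, scale, trials=500, seed=0):
    """Empirical standard deviation of z = sum_j w_j * x_j, where every
    input x_j is 1 and each weight w_j is drawn from N(0, scale**2)."""
    rng = random.Random(seed)
    zs = [sum(rng.gauss(0.0, scale) for _ in range(n_in))
          for _ in range(trials)]
    mean = sum(zs) / trials
    return math.sqrt(sum((z - mean) ** 2 for z in zs) / trials)


n_in, n_out = 784, 100   # illustrative layer sizes (an assumption)

# Initialization suggested by the arguments of Chapter 3.
std_n_in = weighted_input_std(n_in, math.sqrt(1.0 / n_in))

# Initialization actually used in the listing above.
std_n_out = weighted_input_std(n_in, math.sqrt(1.0 / n_out))

print(std_n_in, std_n_out)
```

With <tt>scale=sqrt(1.0/n_in)</tt> the weighted input has a standard deviation close to $1$, whatever the layer sizes. With <tt>scale=sqrt(1.0/n_out)</tt> it is close to $\sqrt{784/100} = 2.8$ for these sizes, so the neurons start out somewhat closer to saturation.
</p><p><!--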
So <tt>self.inpt_dropout</tt> and <tt>self.output_dropout</tt> are used during training, while <tt>self.inpt</tt> and <tt>self.output</tt> are used for all other purposes, e.g., evaluating accuracy on the validation and test data.-->The <tt>set_inpt</tt> method is used to set the input to the layer, and to compute the corresponding output.
I use the name <tt>inpt</tt> rather than <tt>input</tt> because <tt>input</tt> is a built-in function in Python, and messing with built-ins tends to cause unpredictable behavior and difficult-to-diagnose bugs.
Note that we actually set the input in two separate ways: as <tt>self.inpt</tt> and <tt>self.inpt_dropout</tt>.
This is done because during training we may want to use dropout.
If that's the case then we want to remove a fraction <tt>self.p_dropout</tt> of the neurons.
That's what the function <tt>dropout_layer</tt> in the second-last line of the <tt>set_inpt</tt> method is doing.
So <tt>self.inpt_dropout</tt> and <tt>self.output_dropout</tt> are used during training, while <tt>self.inpt</tt> and <tt>self.output</tt> are used for all other purposes, e.g., evaluating accuracy on the validation and test data.</p><p><!--The <tt>ConvPoolLayer</tt> and <tt>SoftmaxLayer</tt> class definitions are similar to <tt>FullyConnectedLayer</tt>.  Indeed, they're so close that I won't excerpt the code here.  If you're interested you can look at the full listing for <tt>network3.py</tt>, later in this section.-->The <tt>ConvPoolLayer</tt> and <tt>SoftmaxLayer</tt> class definitions are similar to <tt>FullyConnectedLayer</tt>.
Indeed, they're so close that I won't excerpt the code here.
If you're interested you can look at the full listing for <tt>network3.py</tt>, later in this section.</p><p><!--However, a couple of minor differences of detail are worth mentioning. Most obviously, in both <tt>ConvPoolLayer</tt> and <tt>SoftmaxLayer</tt> we compute the output activations in the way appropriate to that layer type.  Fortunately, Theano makes that easy, providing built-in operations to compute convolutions, max-pooling, and the softmax function.-->However, a couple of minor differences of detail are worth mentioning.
Most obviously, in both <tt>ConvPoolLayer</tt> and <tt>SoftmaxLayer</tt> we compute the output activations in the way appropriate to that layer type.
Fortunately, Theano makes that easy, providing built-in operations to compute convolutions, max-pooling, and the softmax function.</p><p><!--Less obviously, when we <a href="chap3.html#softmax">introduced the  softmax layer</a>, we never discussed how to initialize the weights and biases.  
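-->
Going back to the two code paths in <tt>set_inpt</tt> for a moment, here's a plain-Python sketch of why the non-dropout path multiplies the weighted sum by <tt>(1-self.p_dropout)</tt>. The helper <tt>dropout_mask</tt> is a hypothetical stand-in for <tt>dropout_layer</tt>: it uses Python's <tt>random</tt> module rather than Theano, and the inputs and weights are made-up numbers.

```python
import random


def dropout_mask(values, p_dropout, rng):
    """Zero each value independently with probability p_dropout."""
    return [v if rng.random() >= p_dropout else 0.0 for v in values]


def weighted_sum(inputs, weights):
    return sum(x * w for x, w in zip(inputs, weights))


rng = random.Random(0)
inputs = [0.5, 1.0, 0.25, 0.75]     # made-up activations
weights = [0.2, -0.4, 0.6, 0.1]     # made-up weights
p_dropout = 0.5

# Evaluation path: keep every input, but scale the weighted sum.
z_eval = (1 - p_dropout) * weighted_sum(inputs, weights)

# Training path: drop inputs at random, averaged over many masks.
trials = 20000
z_train_mean = sum(
    weighted_sum(dropout_mask(inputs, p_dropout, rng), weights)
    for _ in range(trials)) / trials

print(z_eval, z_train_mean)
```

Averaged over many random masks, the weighted sum seen during training comes out close to the scaled weighted sum used at evaluation time, so the two paths feed the activation function inputs of comparable size.
</p><p><!--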
Elsewhere we've argued that for sigmoid layers we should initialize the weights using suitably parameterized normal random variables.  But that heuristic argument was specific to sigmoid neurons (and, with some amendment, to tanh neurons).  However, there's no particular reason the argument should apply to softmax layers.  So there's no <em>a priori</em> reason to apply that initialization again. Rather than do that, I shall initialize all the weights and biases to be $0$.  This is a rather <em>ad hoc</em> procedure, but works well enough in practice.-->Less obviously, when we <a href="chap3.html#softmax">introduced the softmax layer</a>, we never discussed how to initialize the weights and biases.
Elsewhere we've argued that for sigmoid layers we should initialize the weights using suitably parameterized normal random variables.
But that heuristic argument was specific to sigmoid neurons (and, with some amendment, to tanh neurons).
There's no particular reason the argument should apply to softmax layers.
So there's no <em>a priori</em> reason to apply that initialization again.
Rather than do that, I shall initialize all the weights and biases to be $0$.
This is a rather <em>ad hoc</em> procedure, but works well enough in practice.</p><p><!--Okay, we've looked at all the layer classes.  What about the <tt>Network</tt> class?  
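-->
As a quick sanity check on what all-zero initialization means for a softmax layer, here's a plain-Python sketch. The layer sizes and the input activations are arbitrary assumptions, and the <tt>softmax</tt> helper is written out by hand rather than taken from Theano.

```python
import math


def softmax(zs):
    """Softmax of a list of weighted inputs, shifted for numerical stability."""
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]


n_in, n_out = 100, 10                             # illustrative sizes
weights = [[0.0] * n_out for _ in range(n_in)]    # all weights start at 0
biases = [0.0] * n_out                            # all biases start at 0
x = [0.3] * n_in                                  # arbitrary input activations

# Weighted input to each output neuron: z_k = sum_j x_j * w_jk + b_k.
zs = [sum(x[j] * weights[j][k] for j in range(n_in)) + biases[k]
      for k in range(n_out)]
probs = softmax(zs)

print(probs)
```

Every weighted input is $0$, so before any training the layer assigns the same probability, $1/10$, to each of the $10$ classes, regardless of the input.
</p><p><!--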
Let's start by looking at the <tt>__init__</tt>method:-->よし、これで層の全種類のクラス定義を確認したことになります。<tt>Network</tt>クラスはどうでしょう？<tt>__init__</tt>関数を見るところから始めましょう。</p><p><div class="highlight"><pre><span></span><span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>      <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layers</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>          <span class="sd">&quot;&quot;&quot;Takes a list of `layers`, describing the network architecture, and</span>  <span class="sd">        a value for the `mini_batch_size` to be used during training</span>  <span class="sd">        by stochastic gradient descent.</span>  <span class="sd">        &quot;&quot;&quot;</span>          <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">layers</span>          <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span> <span class="o">=</span> <span class="n">mini_batch_size</span>          <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="n">param</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">layer</span><span class="o">.</span><span class="n">params</span><span class="p">]</span>          <span class="bp">self</span><span class="o">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span 
class="s2">&quot;x&quot;</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">ivector</span><span class="p">(</span><span class="s2">&quot;y&quot;</span><span class="p">)</span>          <span class="n">init_layer</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>          <span class="n">init_layer</span><span class="o">.</span><span class="n">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">)</span>          <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">)):</span>              <span class="n">prev_layer</span><span class="p">,</span> <span class="n">layer</span>  <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>              <span class="n">layer</span><span class="o">.</span><span class="n">set_inpt</span><span class="p">(</span>                  <span 
class="n">prev_layer</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">prev_layer</span><span class="o">.</span><span class="n">output_dropout</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">output</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">output_dropout</span>  </pre></div>  </p><p><!--Most of this is self-explanatory, or nearly so.  The line<tt>self.params = [param for layer in ...]</tt> bundles up theparameters for each layer into a single list.  As anticipated above,the <tt>Network.SGD</tt> method will use <tt>self.params</tt> to figureout what variables in the <tt>Network</tt> can learn.  The lines<tt>self.x = T.matrix("x")</tt> and <tt>self.y = T.ivector("y")</tt>define Theano symbolic variables named <tt>x</tt> and <tt>y</tt>.  Thesewill be used to represent the input and desired output from thenetwork.
-->Most of this is self-explanatory, or nearly so.
The line <tt>self.params = [param for layer in ...]</tt> bundles up the parameters for each layer into a single list.
As anticipated above, the <tt>Network.SGD</tt> method will use <tt>self.params</tt> to figure out what variables in the <tt>Network</tt> can learn.
The lines <tt>self.x = T.matrix("x")</tt> and <tt>self.y = T.ivector("y")</tt> define Theano symbolic variables named <tt>x</tt> and <tt>y</tt>.
These will be used to represent the input and desired output from the network.</p><p><!--Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables*<span class="marginnote">*The  <a href="http://deeplearning.net/software/theano/index.html">Theano    documentation</a> provides a good introduction to Theano.  And if you  get stuck, you may find it helpful to look at one of the other  tutorials available online.  For instance,  <a href="http://nbviewer.ipython.org/github/craffel/theano-tutorial/blob/master/Theano%20Tutorial.ipynb">this    tutorial</a> covers many basics.</span>.-->Now, this isn't a Theano tutorial, and so we won't get too deeply into what it means that these are symbolic variables*<span class="marginnote">*The <a href="http://deeplearning.net/software/theano/index.html">Theano documentation</a> provides a good introduction to Theano.
And if you get stuck, you may find it helpful to look at one of the other tutorials available online.
For instance, <a href="http://nbviewer.ipython.org/github/craffel/theano-tutorial/blob/master/Theano%20Tutorial.ipynb">this tutorial</a> covers many basics.</span>.<!--But the rough idea is that these represent mathematical variables, <em>not</em> explicit values.  We can do all the usual things one would do with such variables: add, subtract, and multiply them, apply functions, and so on.  Indeed, Theano provides many ways of manipulating such symbolic variables, doing things like convolutions, max-pooling, and so on.  But the big win is the ability to do fast symbolic differentiation, using a very general form of the backpropagation algorithm.  This is extremely useful for applying stochastic gradient descent to a wide variety of network architectures.  In particular, the next few lines of code define symbolic outputs from the network.  
We start by setting theinput to the initial layer, with the line-->しかし簡単に説明しておくと、シンボリック変数とは数学的な変数であり、値を表すものでは<em>ありません</em>。加算、減算、乗算や関数の適用などの操作をシンボリック変数へ施すことができます。他にも、畳み込みやMaxプーリングなどの、シンボリック変数を操作する方法をTheanoは多数提供しています。シンボリック変数を使う大きな利点は、逆伝播のアルゴリズムで必要な微分を高速なシンボリック微分として行える点です。確率的勾配降下法を様々なネットワーク構造に対して実行するにあたって、これはとても強力に感じます。次の数行では、ネットワークの出力を定義しています。その際、最初の層へ入力を設定するところから始まっています。</p><p><div class="highlight"><pre><span></span>        <span class="n">init_layer</span><span class="o">.</span><span class="n">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">)</span>  </pre></div>  </p><p><!--Note that the inputs are set one mini-batch at a time, which is whythe mini-batch size is there.  Note also that we pass the input<tt>self.x</tt> in twice: this is because we may use the network in twodifferent ways (with or without dropout).  The <tt>for</tt> loop thenpropagates the symbolic variable <tt>self.x</tt> forward through thelayers of the <tt>Network</tt>. This allows us to define the final<tt>output</tt> and <tt>output_dropout</tt> attributes, which symbolicallyrepresent the output from the <tt>Network</tt>.-->入力が1つのミニバッチに一度に設定されることに注意してください。入力<tt>self.x</tt>を2つの引数として渡していることにも注意してください。これは、ネットワークを（ドロップアウトの有無で）2つの異なる方法で使うためです。<tt>for</tt>ループでは、<tt>Network</tt>の層間をシンボリック変数<tt>self.x</tt>が順伝播していきます。これにより、最後の<tt>output</tt>と<tt>output_dropout</tt>の中身を定義することができます。これらは<tt>Network</tt>の出力を表現します。</p><p><!--Now that we've understood how a <tt>Network</tt> is initialized, let'slook at how it is trained, using the <tt>SGD</tt> method.  The codelooks lengthy, but its structure is actually rather simple.Explanatory comments after the code.
-->Now that we've understood how a <tt>Network</tt> is initialized, let's look at how it is trained, using the <tt>SGD</tt> method.
The code looks lengthy, but its structure is actually rather simple.
Explanatory comments after the code.</p><p><div class="highlight"><pre><span></span>    <span class="k">def</span> <span class="nf">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span>
            <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Train the network using mini-batch stochastic gradient descent.&quot;&quot;&quot;</span>
        <span class="n">training_x</span><span class="p">,</span> <span class="n">training_y</span> <span class="o">=</span> <span class="n">training_data</span>
        <span class="n">validation_x</span><span class="p">,</span> <span class="n">validation_y</span> <span class="o">=</span> <span class="n">validation_data</span>
        <span class="n">test_x</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">test_data</span>
        <span class="c1"># compute number of minibatches for training, validation and testing</span>
        <span class="n">num_training_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>
        <span class="n">num_validation_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">validation_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>
        <span class="n">num_test_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>
        <span class="c1"># define the (regularized) cost function, symbolic gradients, and updates</span>
        <span class="n">l2_norm_squared</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([(</span><span class="n">layer</span><span class="o">.</span><span class="n">w</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">])</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">cost</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">+</span>\
               <span class="mf">0.5</span><span class="o">*</span><span class="n">lmbda</span><span class="o">*</span><span class="n">l2_norm_squared</span><span class="o">/</span><span class="n">num_training_batches</span>
        <span class="n">grads</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">cost</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">)</span>
        <span class="n">updates</span> <span class="o">=</span> <span class="p">[(</span><span class="n">param</span><span class="p">,</span> <span class="n">param</span><span class="o">-</span><span class="n">eta</span><span class="o">*</span><span class="n">grad</span><span class="p">)</span>
                   <span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">grad</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">,</span> <span class="n">grads</span><span class="p">)]</span>
        <span class="c1"># define functions to train a mini-batch, and to compute the</span>
        <span class="c1"># accuracy in validation and test mini-batches.</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">lscalar</span><span class="p">()</span> <span class="c1"># mini-batch index</span>
        <span class="n">train_mb</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
            <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">cost</span><span class="p">,</span> <span class="n">updates</span><span class="o">=</span><span class="n">updates</span><span class="p">,</span>
            <span class="n">givens</span><span class="o">=</span><span class="p">{</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>
                <span class="n">training_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span 
class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>                  <span class="n">training_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>              <span class="p">})</span>          <span class="n">validate_mb_accuracy</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>              <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">),</span>              <span class="n">givens</span><span class="o">=</span><span class="p">{</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>                  <span class="n">validation_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span 
class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>                  <span class="n">validation_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>              <span class="p">})</span>          <span class="n">test_mb_accuracy</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>              <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">),</span>              <span class="n">givens</span><span class="o">=</span><span class="p">{</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>                  <span class="n">test_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span 
class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>                  <span class="n">test_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>              <span class="p">})</span>          <span class="bp">self</span><span class="o">.</span><span class="n">test_mb_predictions</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>              <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">y_out</span><span class="p">,</span>              <span class="n">givens</span><span class="o">=</span><span class="p">{</span>                  <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>                  <span class="n">test_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span 
class="n">mini_batch_size</span><span class="p">]</span>              <span class="p">})</span>          <span class="c1"># Do the actual training</span>          <span class="n">best_validation_accuracy</span> <span class="o">=</span> <span class="mf">0.0</span>          <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>              <span class="k">for</span> <span class="n">minibatch_index</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_training_batches</span><span class="p">):</span>                  <span class="n">iteration</span> <span class="o">=</span> <span class="n">num_training_batches</span><span class="o">*</span><span class="n">epoch</span><span class="o">+</span><span class="n">minibatch_index</span>                  <span class="k">if</span> <span class="n">iteration</span> <span class="o">%</span> <span class="mi">1000</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>                      <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Training mini-batch number {0}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">iteration</span><span class="p">))</span>                  <span class="n">cost_ij</span> <span class="o">=</span> <span class="n">train_mb</span><span class="p">(</span><span class="n">minibatch_index</span><span class="p">)</span>                  <span class="k">if</span> <span class="p">(</span><span class="n">iteration</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">num_training_batches</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>                      <span class="n">validation_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span>                          <span class="p">[</span><span class="n">validate_mb_accuracy</span><span class="p">(</span><span class="n">j</span><span 
class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_validation_batches</span><span class="p">)])</span>                      <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Epoch {0}: validation accuracy {1:.2%}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>                          <span class="n">epoch</span><span class="p">,</span> <span class="n">validation_accuracy</span><span class="p">))</span>                      <span class="k">if</span> <span class="n">validation_accuracy</span> <span class="o">&gt;=</span> <span class="n">best_validation_accuracy</span><span class="p">:</span>                          <span class="k">print</span><span class="p">(</span><span class="s2">&quot;This is the best validation accuracy to date.&quot;</span><span class="p">)</span>                          <span class="n">best_validation_accuracy</span> <span class="o">=</span> <span class="n">validation_accuracy</span>                          <span class="n">best_iteration</span> <span class="o">=</span> <span class="n">iteration</span>                          <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span>                              <span class="n">test_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span>                                  <span class="p">[</span><span class="n">test_mb_accuracy</span><span class="p">(</span><span class="n">j</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_test_batches</span><span class="p">)])</span>                              <span class="k">print</span><span class="p">(</span><span class="s1">&#39;The corresponding test accuracy is {0:.2%}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>                         
         <span class="n">test_accuracy</span><span class="p">))</span>          <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Finished training network.&quot;</span><span class="p">)</span>          <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Best validation accuracy of {0:.2%} obtained at iteration {1}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>              <span class="n">best_validation_accuracy</span><span class="p">,</span> <span class="n">best_iteration</span><span class="p">))</span>          <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Corresponding test accuracy of {0:.2%}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">test_accuracy</span><span class="p">))</span>  </pre></div>  </p><p><!--The first few lines are straightforward, separating the datasets into$x$ and $y$ components, and computing the number of mini-batches usedin each dataset.  The next few lines are more interesting, and showsome of what makes Theano fun to work with.  Let's explicitly excerptthe lines here:-->The first few lines are straightforward: they separate each dataset into its $x$ and $y$ components and compute the number of mini-batches used in each dataset. The next few lines are more interesting, and show some of what makes Theano fun to work with. Let's excerpt those lines here:</p><p><div class="highlight"><pre><span></span>        <span class="c1"># define the (regularized) cost function, symbolic gradients, and updates</span>          <span class="n">l2_norm_squared</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([(</span><span class="n">layer</span><span class="o">.</span><span class="n">w</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">])</span>          <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span 
class="p">]</span><span class="o">.</span><span class="n">cost</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">+</span>\                 <span class="mf">0.5</span><span class="o">*</span><span class="n">lmbda</span><span class="o">*</span><span class="n">l2_norm_squared</span><span class="o">/</span><span class="n">num_training_batches</span>          <span class="n">grads</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">cost</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">)</span>          <span class="n">updates</span> <span class="o">=</span> <span class="p">[(</span><span class="n">param</span><span class="p">,</span> <span class="n">param</span><span class="o">-</span><span class="n">eta</span><span class="o">*</span><span class="n">grad</span><span class="p">)</span>                     <span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">grad</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">,</span> <span class="n">grads</span><span class="p">)]</span>  </pre></div>  </p><p><!--In these lines we symbolically set up the regularized log-likelihoodcost function, compute the corresponding derivatives in the gradientfunction, as well as the corresponding parameter updates.  Theano letsus achieve all of this in just these few lines.  The only thing hiddenis that computing the <tt>cost</tt> involves a call to the <tt>cost</tt>method for the output layer; that code is elsewhere in<tt>network3.py</tt>.  But that code is short and simple, anyway.  
Withall these things defined, the stage is set to define the<tt>train_mb</tt> function, a Theano symbolic function which uses the<tt>updates</tt> to update the <tt>Network</tt> parameters, given amini-batch index.  Similarly, <tt>validate_mb_accuracy</tt> and<tt>test_mb_accuracy</tt> compute the accuracy of the <tt>Network</tt> onany given mini-batch of validation or test data.  By averaging overthese functions, we will be able to compute accuracies on the entirevalidation and test data sets.-->これらの行では、正規化した対数尤度の誤差関数を用意しています。また、パラメータの更新に応じて、勾配関数の中の対応する微分を計算します。Theanoを使うと数行でこれら全てを実装できます。隠されていて唯一分かりにくいのは、<tt>cost</tt>の計算の時に出力層の<tt>cost</tt>関数を呼ぶことです。このcost関数は<tt>network3.py</tt>内の別の箇所にあります。しかし、そのコードは短くシンプルです。さて、これら全ての設定が終わると、<tt>train_mb</tt>関数を定義する段階へ移ります。このTheanoのシンボリック関数は<tt>updates</tt>を用い、ミニバッチのインデックスをもとに<tt>Network</tt>のパラメータを更新します。同様に、<tt>validate_mb_accuracy</tt>と<tt>test_mb_accuracy</tt>は、検証データやテストデータのミニバッチに基づいて、<tt>Network</tt>の精度を計算します。これらの関数の結果の平均をとって、検証データやテストデータの全体精度を計算できるのです。</p><p><!--The remainder of the <tt>SGD</tt> method is self-explanatory - wesimply iterate over the epochs, repeatedly training the network onmini-batches of training data, and computing the validation and testaccuracies.--><tt>SGD</tt>関数の残りの部分は自明だと思います。単純にエポック数分だけ反復し、訓練データのミニバッチに基づいてネットワークを訓練し、検証データとテストデータの精度を計算します。</p><p><!--Okay, we've now understood the most important pieces of code in<tt>network3.py</tt>.  Let's take a brief look at the entire program.You don't need to read through this in detail, but you may enjoyglancing over it, and perhaps diving down into any pieces that strikeyour fancy.  The best way to really understand it is, of course, bymodifying it, adding extra features, or refactoring anything you thinkcould be done more elegantly.  After the code, there are some problemswhich contain a few starter suggestions for things to do.  
Here's thecode*-->よし、これで<tt>network3.py</tt>の重要な部分は理解したことになります。プログラム全体を見てみましょう。コードを詳細に読み解く必要はありません。きっと、コードを眺めるだけで楽しいはずです。あなたが気になった箇所を深掘りしてみるのも良いと思います。もちろん、コードを深く理解するための一番良い方法は、コードに修正を加えたり、何か特徴を追加したり、もしくはエレガントになるようリファクタリングしてみることです。コードの後ろに、いくつか修正すべき項目を載せています*<!--<span class="marginnote">  *Using Theano on a GPU can be a little tricky.  In  particular, it's easy to make the mistake of pulling data off the  GPU, which can slow things down a lot.  I've tried to avoid this.  With that said, this code can certainly be sped up quite a bit  further with careful optimization of Theano's configuration.  See  the Theano documentation for more details.</span>:--><span class="marginnote">  *Theanoを使ってコードをGPU実行する方法は少しトリッキーです。  特に、GPUからデータを取得するところは間違えやすく、間違えるとかなり低速になってしまいます。  私はこれを避けようと試行錯誤してきました。  このコードではTheanoの最適化設定を注意深く行っているため、かなり高速に動作するはずです。  詳細はTheanoのドキュメントを見て確認してください。  </span></p><p><div class="highlight"><pre><span></span><span class="sd">&quot;&quot;&quot;network3.py</span>  <span class="sd">~~~~~~~~~~~~~~</span>  <span class="sd">A Theano-based program for training and running simple neural</span>  <span class="sd">networks.</span>  <span class="sd">Supports several layer types (fully connected, convolutional, max</span>  <span class="sd">pooling, softmax), and activation functions (sigmoid, tanh, and</span>  <span class="sd">rectified linear units, with more easily added).</span>  <span class="sd">When run on a CPU, this program is much faster than network.py and</span>  <span class="sd">network2.py.  However, unlike network.py and network2.py it can also</span>  <span class="sd">be run on a GPU, which makes it faster still.</span>  <span class="sd">Because the code is based on Theano, the code is different in many</span>  <span class="sd">ways from network.py and network2.py.  However, where possible I have</span>  <span class="sd">tried to maintain consistency with the earlier programs.  
In</span>  <span class="sd">particular, the API is similar to network2.py.  Note that I have</span>  <span class="sd">focused on making the code simple, easily readable, and easily</span>  <span class="sd">modifiable.  It is not optimized, and omits many desirable features.</span>  <span class="sd">This program incorporates ideas from the Theano documentation on</span>  <span class="sd">convolutional neural nets (notably,</span>  <span class="sd">http://deeplearning.net/tutorial/lenet.html ), from Misha Denil&#39;s</span>  <span class="sd">implementation of dropout (https://github.com/mdenil/dropout ), and</span>  <span class="sd">from Chris Olah (http://colah.github.io ).</span>  <span class="sd">Written for Theano 0.6 and 0.7, needs some changes for more recent</span>  <span class="sd">versions of Theano.</span>  <span class="sd">&quot;&quot;&quot;</span>  <span class="c1">#### Libraries</span>  <span class="c1"># Standard library</span>  <span class="kn">import</span> <span class="nn">cPickle</span>  <span class="kn">import</span> <span class="nn">gzip</span>  <span class="c1"># Third-party libraries</span>  <span class="kn">import</span> <span class="nn">numpy</span> <span class="kn">as</span> <span class="nn">np</span>  <span class="kn">import</span> <span class="nn">theano</span>  <span class="kn">import</span> <span class="nn">theano.tensor</span> <span class="kn">as</span> <span class="nn">T</span>  <span class="kn">from</span> <span class="nn">theano.tensor.nnet</span> <span class="kn">import</span> <span class="n">conv</span>  <span class="kn">from</span> <span class="nn">theano.tensor.nnet</span> <span class="kn">import</span> <span class="n">softmax</span>  <span class="kn">from</span> <span class="nn">theano.tensor</span> <span class="kn">import</span> <span class="n">shared_randomstreams</span>  <span class="kn">from</span> <span class="nn">theano.tensor.signal</span> <span class="kn">import</span> <span 
class="n">downsample</span>  <span class="c1"># Activation functions for neurons</span>  <span class="k">def</span> <span class="nf">linear</span><span class="p">(</span><span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="n">z</span>  <span class="k">def</span> <span class="nf">ReLU</span><span class="p">(</span><span class="n">z</span><span class="p">):</span> <span class="k">return</span> <span class="n">T</span><span class="o">.</span><span class="n">maximum</span><span class="p">(</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span>  <span class="kn">from</span> <span class="nn">theano.tensor.nnet</span> <span class="kn">import</span> <span class="n">sigmoid</span>  <span class="kn">from</span> <span class="nn">theano.tensor</span> <span class="kn">import</span> <span class="n">tanh</span>  <span class="c1">#### Constants</span>  <span class="n">GPU</span> <span class="o">=</span> <span class="bp">True</span>  <span class="k">if</span> <span class="n">GPU</span><span class="p">:</span>      <span class="k">print</span> <span class="s2">&quot;Trying to run under a GPU.  
If this is not desired, then modify &quot;</span><span class="o">+</span>\          <span class="s2">&quot;network3.py</span><span class="se">\n</span><span class="s2">to set the GPU flag to False.&quot;</span>      <span class="k">try</span><span class="p">:</span> <span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">device</span> <span class="o">=</span> <span class="s1">&#39;gpu&#39;</span>      <span class="k">except</span><span class="p">:</span> <span class="k">pass</span> <span class="c1"># it&#39;s already set</span>      <span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span> <span class="o">=</span> <span class="s1">&#39;float32&#39;</span>  <span class="k">else</span><span class="p">:</span>      <span class="k">print</span> <span class="s2">&quot;Running with a CPU.  If this is not desired, then modify &quot;</span><span class="o">+</span>\          <span class="s2">&quot;network3.py to set</span><span class="se">\n</span><span class="s2">the GPU flag to True.&quot;</span>  <span class="c1">#### Load the MNIST data</span>  <span class="k">def</span> <span class="nf">load_data_shared</span><span class="p">(</span><span class="n">filename</span><span class="o">=</span><span class="s2">&quot;../data/mnist.pkl.gz&quot;</span><span class="p">):</span>      <span class="n">f</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s1">&#39;rb&#39;</span><span class="p">)</span>      <span class="n">training_data</span><span class="p">,</span> <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span> <span class="o">=</span> <span class="n">cPickle</span><span class="o">.</span><span class="n">load</span><span 
class="p">(</span><span class="n">f</span><span class="p">)</span>      <span class="n">f</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>      <span class="k">def</span> <span class="nf">shared</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>          <span class="sd">&quot;&quot;&quot;Place the data into shared variables.  This allows Theano to copy</span>  <span class="sd">        the data to the GPU, if one is available.</span>  <span class="sd">        &quot;&quot;&quot;</span>          <span class="n">shared_x</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>              <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>          <span class="n">shared_y</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>              <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">data</span><span class="p">[</span><span class="mi">1</span><span class="p">],</span> <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>          
<span class="k">return</span> <span class="n">shared_x</span><span class="p">,</span> <span class="n">T</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">shared_y</span><span class="p">,</span> <span class="s2">&quot;int32&quot;</span><span class="p">)</span>      <span class="k">return</span> <span class="p">[</span><span class="n">shared</span><span class="p">(</span><span class="n">training_data</span><span class="p">),</span> <span class="n">shared</span><span class="p">(</span><span class="n">validation_data</span><span class="p">),</span> <span class="n">shared</span><span class="p">(</span><span class="n">test_data</span><span class="p">)]</span>  <span class="c1">#### Main class used to construct and train networks</span>  <span class="k">class</span> <span class="nc">Network</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>      <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">layers</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>          <span class="sd">&quot;&quot;&quot;Takes a list of `layers`, describing the network architecture, and</span>  <span class="sd">        a value for the `mini_batch_size` to be used during training</span>  <span class="sd">        by stochastic gradient descent.</span>  <span class="sd">        &quot;&quot;&quot;</span>          <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="o">=</span> <span class="n">layers</span>          <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span> <span class="o">=</span> <span class="n">mini_batch_size</span>          <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span 
class="n">param</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span> <span class="k">for</span> <span class="n">param</span> <span class="ow">in</span> <span class="n">layer</span><span class="o">.</span><span class="n">params</span><span class="p">]</span>          <span class="bp">self</span><span class="o">.</span><span class="n">x</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">matrix</span><span class="p">(</span><span class="s2">&quot;x&quot;</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">y</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">ivector</span><span class="p">(</span><span class="s2">&quot;y&quot;</span><span class="p">)</span>          <span class="n">init_layer</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span>          <span class="n">init_layer</span><span class="o">.</span><span class="n">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">)</span>          <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">)):</span>              <span class="n">prev_layer</span><span 
class="p">,</span> <span class="n">layer</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="n">j</span><span class="o">-</span><span class="mi">1</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="n">j</span><span class="p">]</span>
            <span class="n">layer</span><span class="o">.</span><span class="n">set_inpt</span><span class="p">(</span>
                <span class="n">prev_layer</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">prev_layer</span><span class="o">.</span><span class="n">output_dropout</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">output</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">output_dropout</span>

    <span class="k">def</span> <span class="nf">SGD</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">training_data</span><span class="p">,</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">,</span> <span class="n">eta</span><span class="p">,</span>
            <span class="n">validation_data</span><span class="p">,</span> <span class="n">test_data</span><span class="p">,</span> <span class="n">lmbda</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;Train the network using mini-batch stochastic gradient descent.&quot;&quot;&quot;</span>
        <span class="n">training_x</span><span class="p">,</span> <span class="n">training_y</span> <span class="o">=</span> <span class="n">training_data</span>
        <span class="n">validation_x</span><span class="p">,</span> <span class="n">validation_y</span> <span class="o">=</span> <span class="n">validation_data</span>
        <span class="n">test_x</span><span class="p">,</span> <span class="n">test_y</span> <span class="o">=</span> <span class="n">test_data</span>

        <span class="c1"># compute number of minibatches for training, validation and testing</span>
        <span class="n">num_training_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">training_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>
        <span class="n">num_validation_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">validation_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>
        <span class="n">num_test_batches</span> <span class="o">=</span> <span class="n">size</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span><span class="o">/</span><span class="n">mini_batch_size</span>

        <span class="c1"># define the (regularized) cost function, symbolic gradients, and updates</span>
        <span class="n">l2_norm_squared</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([(</span><span class="n">layer</span><span 
class="o">.</span><span class="n">w</span><span class="o">**</span><span class="mi">2</span><span class="p">)</span><span class="o">.</span><span class="n">sum</span><span class="p">()</span> <span class="k">for</span> <span class="n">layer</span> <span class="ow">in</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">])</span>
        <span class="n">cost</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">cost</span><span class="p">(</span><span class="bp">self</span><span class="p">)</span><span class="o">+</span>\
               <span class="mf">0.5</span><span class="o">*</span><span class="n">lmbda</span><span class="o">*</span><span class="n">l2_norm_squared</span><span class="o">/</span><span class="n">num_training_batches</span>
        <span class="n">grads</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">grad</span><span class="p">(</span><span class="n">cost</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">)</span>
        <span class="n">updates</span> <span class="o">=</span> <span class="p">[(</span><span class="n">param</span><span class="p">,</span> <span class="n">param</span><span class="o">-</span><span class="n">eta</span><span class="o">*</span><span class="n">grad</span><span class="p">)</span>
                   <span class="k">for</span> <span class="n">param</span><span class="p">,</span> <span class="n">grad</span> <span class="ow">in</span> <span class="nb">zip</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">params</span><span class="p">,</span> <span class="n">grads</span><span class="p">)]</span>
        <span class="c1"># define functions to train a mini-batch, and to compute the</span>
        <span class="c1"># accuracy in validation and test mini-batches.</span>
        <span class="n">i</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">lscalar</span><span class="p">()</span> <span class="c1"># mini-batch index</span>
        <span class="n">train_mb</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
            <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="n">cost</span><span class="p">,</span> <span class="n">updates</span><span class="o">=</span><span class="n">updates</span><span class="p">,</span>
            <span class="n">givens</span><span class="o">=</span><span class="p">{</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>
                <span class="n">training_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>
                <span class="n">training_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span 
class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>
            <span class="p">})</span>
        <span class="n">validate_mb_accuracy</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
            <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">),</span>
            <span class="n">givens</span><span class="o">=</span><span class="p">{</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>
                <span class="n">validation_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>
                <span class="n">validation_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span 
class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>
            <span class="p">})</span>
        <span class="n">test_mb_accuracy</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
            <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">),</span>
            <span class="n">givens</span><span class="o">=</span><span class="p">{</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>
                <span class="n">test_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">],</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">y</span><span class="p">:</span>
                <span class="n">test_y</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span 
class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>
            <span class="p">})</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">test_mb_predictions</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">function</span><span class="p">(</span>
            <span class="p">[</span><span class="n">i</span><span class="p">],</span> <span class="bp">self</span><span class="o">.</span><span class="n">layers</span><span class="p">[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">y_out</span><span class="p">,</span>
            <span class="n">givens</span><span class="o">=</span><span class="p">{</span>
                <span class="bp">self</span><span class="o">.</span><span class="n">x</span><span class="p">:</span>
                <span class="n">test_x</span><span class="p">[</span><span class="n">i</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">:</span> <span class="p">(</span><span class="n">i</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span><span class="o">*</span><span class="bp">self</span><span class="o">.</span><span class="n">mini_batch_size</span><span class="p">]</span>
            <span class="p">})</span>

        <span class="c1"># Do the actual training</span>
        <span class="n">best_validation_accuracy</span> <span class="o">=</span> <span class="mf">0.0</span>
        <span class="k">for</span> <span class="n">epoch</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">epochs</span><span class="p">):</span>
            <span class="k">for</span> <span class="n">minibatch_index</span> <span 
class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_training_batches</span><span class="p">):</span>
                <span class="n">iteration</span> <span class="o">=</span> <span class="n">num_training_batches</span><span class="o">*</span><span class="n">epoch</span><span class="o">+</span><span class="n">minibatch_index</span>
                <span class="k">if</span> <span class="n">iteration</span> <span class="o">%</span> <span class="mi">1000</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Training mini-batch number {0}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">iteration</span><span class="p">))</span>
                <span class="n">cost_ij</span> <span class="o">=</span> <span class="n">train_mb</span><span class="p">(</span><span class="n">minibatch_index</span><span class="p">)</span>
                <span class="k">if</span> <span class="p">(</span><span class="n">iteration</span><span class="o">+</span><span class="mi">1</span><span class="p">)</span> <span class="o">%</span> <span class="n">num_training_batches</span> <span class="o">==</span> <span class="mi">0</span><span class="p">:</span>
                    <span class="n">validation_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span>
                        <span class="p">[</span><span class="n">validate_mb_accuracy</span><span class="p">(</span><span class="n">j</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_validation_batches</span><span class="p">)])</span>
                    <span class="k">print</span><span 
class="p">(</span><span class="s2">&quot;Epoch {0}: validation accuracy {1:.2%}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                        <span class="n">epoch</span><span class="p">,</span> <span class="n">validation_accuracy</span><span class="p">))</span>
                    <span class="k">if</span> <span class="n">validation_accuracy</span> <span class="o">&gt;=</span> <span class="n">best_validation_accuracy</span><span class="p">:</span>
                        <span class="k">print</span><span class="p">(</span><span class="s2">&quot;This is the best validation accuracy to date.&quot;</span><span class="p">)</span>
                        <span class="n">best_validation_accuracy</span> <span class="o">=</span> <span class="n">validation_accuracy</span>
                        <span class="n">best_iteration</span> <span class="o">=</span> <span class="n">iteration</span>
                        <span class="k">if</span> <span class="n">test_data</span><span class="p">:</span>
                            <span class="n">test_accuracy</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span>
                                <span class="p">[</span><span class="n">test_mb_accuracy</span><span class="p">(</span><span class="n">j</span><span class="p">)</span> <span class="k">for</span> <span class="n">j</span> <span class="ow">in</span> <span class="nb">xrange</span><span class="p">(</span><span class="n">num_test_batches</span><span class="p">)])</span>
                            <span class="k">print</span><span class="p">(</span><span class="s1">&#39;The corresponding test accuracy is {0:.2%}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
                                <span class="n">test_accuracy</span><span class="p">))</span>
        <span 
class="k">print</span><span class="p">(</span><span class="s2">&quot;Finished training network.&quot;</span><span class="p">)</span>
        <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Best validation accuracy of {0:.2%} obtained at iteration {1}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span>
            <span class="n">best_validation_accuracy</span><span class="p">,</span> <span class="n">best_iteration</span><span class="p">))</span>
        <span class="k">print</span><span class="p">(</span><span class="s2">&quot;Corresponding test accuracy of {0:.2%}&quot;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">test_accuracy</span><span class="p">))</span>

<span class="c1">#### Define layer types</span>

<span class="k">class</span> <span class="nc">ConvPoolLayer</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="sd">&quot;&quot;&quot;Used to create a combination of a convolutional and a max-pooling</span>
<span class="sd">    layer.  A more sophisticated implementation would separate the</span>
<span class="sd">    two, but for our purposes we&#39;ll always use them together, and it</span>
<span class="sd">    simplifies the code, so it makes sense to combine them.</span>
<span class="sd">    &quot;&quot;&quot;</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">filter_shape</span><span class="p">,</span> <span class="n">image_shape</span><span class="p">,</span> <span class="n">poolsize</span><span class="o">=</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">2</span><span class="p">),</span>
                 <span class="n">activation_fn</span><span class="o">=</span><span class="n">sigmoid</span><span class="p">):</span>
        <span class="sd">&quot;&quot;&quot;`filter_shape` is a tuple of length 4, whose entries are the number</span>
<span class="sd">        of filters, the number of input feature maps, the filter height, and the</span>
<span class="sd">        filter width.</span>
<span class="sd">        `image_shape` is a tuple of length 4, whose entries are the</span>
<span class="sd">        mini-batch size, the number of input feature maps, the image</span>
<span class="sd">        height, and the image width.</span>
<span class="sd">        `poolsize` is a tuple of length 2, whose entries are the y and</span>
<span class="sd">        x pooling sizes.</span>
<span class="sd">        &quot;&quot;&quot;</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">filter_shape</span> <span class="o">=</span> <span class="n">filter_shape</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">image_shape</span> <span class="o">=</span> <span class="n">image_shape</span>
        <span class="bp">self</span><span class="o">.</span><span 
class="n">poolsize</span> <span class="o">=</span> <span class="n">poolsize</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="o">=</span><span class="n">activation_fn</span>
        <span class="c1"># initialize weights and biases</span>
        <span class="n">n_out</span> <span class="o">=</span> <span class="p">(</span><span class="n">filter_shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">*</span><span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">filter_shape</span><span class="p">[</span><span class="mi">2</span><span class="p">:])</span><span class="o">/</span><span class="n">np</span><span class="o">.</span><span class="n">prod</span><span class="p">(</span><span class="n">poolsize</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span>
                <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="n">n_out</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="n">filter_shape</span><span class="p">),</span>
                <span class="n">dtype</span><span class="o">=</span><span 
class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span>
                <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">filter_shape</span><span class="p">[</span><span class="mi">0</span><span class="p">],)),</span>
                <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">set_inpt</span><span class="p">(</span><span class="bp">self</span><span 
class="p">,</span> <span class="n">inpt</span><span class="p">,</span> <span class="n">inpt_dropout</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">inpt</span> <span class="o">=</span> <span class="n">inpt</span><span class="o">.</span><span class="n">reshape</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">image_shape</span><span class="p">)</span>
        <span class="n">conv_out</span> <span class="o">=</span> <span class="n">conv</span><span class="o">.</span><span class="n">conv2d</span><span class="p">(</span>
            <span class="nb">input</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt</span><span class="p">,</span> <span class="n">filters</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">,</span> <span class="n">filter_shape</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">filter_shape</span><span class="p">,</span>
            <span class="n">image_shape</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">image_shape</span><span class="p">)</span>
        <span class="n">pooled_out</span> <span class="o">=</span> <span class="n">downsample</span><span class="o">.</span><span class="n">max_pool_2d</span><span class="p">(</span>
            <span class="nb">input</span><span class="o">=</span><span class="n">conv_out</span><span class="p">,</span> <span class="n">ds</span><span class="o">=</span><span class="bp">self</span><span class="o">.</span><span class="n">poolsize</span><span class="p">,</span> <span class="n">ignore_border</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span 
class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="p">(</span>
            <span class="n">pooled_out</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="o">.</span><span class="n">dimshuffle</span><span class="p">(</span><span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">,</span> <span class="s1">&#39;x&#39;</span><span class="p">))</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="c1"># no dropout in the convolutional layers</span>

<span class="k">class</span> <span class="nc">FullyConnectedLayer</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">,</span> <span class="n">activation_fn</span><span class="o">=</span><span class="n">sigmoid</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span> <span class="o">=</span> <span class="n">n_in</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">n_out</span> <span class="o">=</span> <span class="n">n_out</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span> <span class="o">=</span> <span class="n">activation_fn</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span> <span class="o">=</span> <span class="n">p_dropout</span>
        <span class="c1"># Initialize weights and biases</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span>
                <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span>
                    <span class="n">loc</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">sqrt</span><span class="p">(</span><span class="mf">1.0</span><span class="o">/</span><span class="n">n_out</span><span class="p">),</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">)),</span>
                <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">name</span><span class="o">=</span><span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>
            <span 
class="n">np</span><span class="o">.</span><span class="n">asarray</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">normal</span><span class="p">(</span><span class="n">loc</span><span class="o">=</span><span class="mf">0.0</span><span class="p">,</span> <span class="n">scale</span><span class="o">=</span><span class="mf">1.0</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="p">(</span><span class="n">n_out</span><span class="p">,)),</span>
                       <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>
            <span class="n">name</span><span class="o">=</span><span class="s1">&#39;b&#39;</span><span class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inpt</span><span class="p">,</span> <span class="n">inpt_dropout</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">inpt</span> <span class="o">=</span> <span class="n">inpt</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span 
class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">))</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="p">(</span>              <span class="p">(</span><span class="mi">1</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span class="p">)</span><span class="o">*</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">y_out</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span> <span class="o">=</span> <span class="n">dropout_layer</span><span class="p">(</span>              <span class="n">inpt_dropout</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">)),</span> <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span 
class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">activation_fn</span><span class="p">(</span>              <span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>      <span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>          <span class="s2">&quot;Return the accuracy for the mini-batch.&quot;</span>          <span class="k">return</span> <span class="n">T</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_out</span><span class="p">))</span>  <span class="k">class</span> <span class="nc">SoftmaxLayer</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>      <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">,</span> <span class="n">p_dropout</span><span class="o">=</span><span class="mf">0.0</span><span class="p">):</span>          <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span> <span class="o">=</span> 
<span class="n">n_in</span>          <span class="bp">self</span><span class="o">.</span><span class="n">n_out</span> <span class="o">=</span> <span class="n">n_out</span>          <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span> <span class="o">=</span> <span class="n">p_dropout</span>          <span class="c1"># Initialize weights and biases</span>          <span class="bp">self</span><span class="o">.</span><span class="n">w</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>              <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">n_in</span><span class="p">,</span> <span class="n">n_out</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>              <span class="n">name</span><span class="o">=</span><span class="s1">&#39;w&#39;</span><span class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">b</span> <span class="o">=</span> <span class="n">theano</span><span class="o">.</span><span class="n">shared</span><span class="p">(</span>              <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="n">n_out</span><span class="p">,),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">),</span>              <span class="n">name</span><span class="o">=</span><span class="s1">&#39;b&#39;</span><span 
class="p">,</span> <span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">params</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">]</span>      <span class="k">def</span> <span class="nf">set_inpt</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">inpt</span><span class="p">,</span> <span class="n">inpt_dropout</span><span class="p">,</span> <span class="n">mini_batch_size</span><span class="p">):</span>          <span class="bp">self</span><span class="o">.</span><span class="n">inpt</span> <span class="o">=</span> <span class="n">inpt</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">))</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">((</span><span class="mi">1</span><span class="o">-</span><span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span class="p">)</span><span class="o">*</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>          <span 
class="bp">self</span><span class="o">.</span><span class="n">y_out</span> <span class="o">=</span> <span class="n">T</span><span class="o">.</span><span class="n">argmax</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output</span><span class="p">,</span> <span class="n">axis</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span> <span class="o">=</span> <span class="n">dropout_layer</span><span class="p">(</span>              <span class="n">inpt_dropout</span><span class="o">.</span><span class="n">reshape</span><span class="p">((</span><span class="n">mini_batch_size</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">n_in</span><span class="p">)),</span> <span class="bp">self</span><span class="o">.</span><span class="n">p_dropout</span><span class="p">)</span>          <span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span> <span class="o">=</span> <span class="n">softmax</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">dot</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">inpt_dropout</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">w</span><span class="p">)</span> <span class="o">+</span> <span class="bp">self</span><span class="o">.</span><span class="n">b</span><span class="p">)</span>      <span class="k">def</span> <span class="nf">cost</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">net</span><span class="p">):</span>          <span class="s2">&quot;Return the log-likelihood cost.&quot;</span>          <span class="k">return</span> <span class="o">-</span><span class="n">T</span><span 
class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">log</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">output_dropout</span><span class="p">)[</span><span class="n">T</span><span class="o">.</span><span class="n">arange</span><span class="p">(</span><span class="n">net</span><span class="o">.</span><span class="n">y</span><span class="o">.</span><span class="n">shape</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span> <span class="n">net</span><span class="o">.</span><span class="n">y</span><span class="p">])</span>      <span class="k">def</span> <span class="nf">accuracy</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>          <span class="s2">&quot;Return the accuracy for the mini-batch.&quot;</span>          <span class="k">return</span> <span class="n">T</span><span class="o">.</span><span class="n">mean</span><span class="p">(</span><span class="n">T</span><span class="o">.</span><span class="n">eq</span><span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="bp">self</span><span class="o">.</span><span class="n">y_out</span><span class="p">))</span>  <span class="c1">#### Miscellanea</span>  <span class="k">def</span> <span class="nf">size</span><span class="p">(</span><span class="n">data</span><span class="p">):</span>      <span class="s2">&quot;Return the size of the dataset `data`.&quot;</span>      <span class="k">return</span> <span class="n">data</span><span class="p">[</span><span class="mi">0</span><span class="p">]</span><span class="o">.</span><span class="n">get_value</span><span class="p">(</span><span class="n">borrow</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span><span class="o">.</span><span class="n">shape</span><span 
class="p">[</span><span class="mi">0</span><span class="p">]</span>  <span class="k">def</span> <span class="nf">dropout_layer</span><span class="p">(</span><span class="n">layer</span><span class="p">,</span> <span class="n">p_dropout</span><span class="p">):</span>      <span class="n">srng</span> <span class="o">=</span> <span class="n">shared_randomstreams</span><span class="o">.</span><span class="n">RandomStreams</span><span class="p">(</span>          <span class="n">np</span><span class="o">.</span><span class="n">random</span><span class="o">.</span><span class="n">RandomState</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span><span class="o">.</span><span class="n">randint</span><span class="p">(</span><span class="mi">999999</span><span class="p">))</span>      <span class="n">mask</span> <span class="o">=</span> <span class="n">srng</span><span class="o">.</span><span class="n">binomial</span><span class="p">(</span><span class="n">n</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span> <span class="n">p</span><span class="o">=</span><span class="mi">1</span><span class="o">-</span><span class="n">p_dropout</span><span class="p">,</span> <span class="n">size</span><span class="o">=</span><span class="n">layer</span><span class="o">.</span><span class="n">shape</span><span class="p">)</span>      <span class="k">return</span> <span class="n">layer</span><span class="o">*</span><span class="n">T</span><span class="o">.</span><span class="n">cast</span><span class="p">(</span><span class="n">mask</span><span class="p">,</span> <span class="n">theano</span><span class="o">.</span><span class="n">config</span><span class="o">.</span><span class="n">floatX</span><span class="p">)</span>  </pre></div></p><p><h4><a name="problems_269956"></a><a href="#problems_269956"><!-- Problems -->Problems</a></h4><ul></p><p>  <li>  <!--  At present, the <tt>SGD</tt> method requires the user to manually  choose 
the number of epochs to train for.  Earlier in the book we  discussed an automated way of selecting the number of epochs to  train for, known as <a href="chap3.html#early_stopping">early    stopping</a>.  Modify <tt>network3.py</tt> to implement early stopping.  -->  At present, the <tt>SGD</tt> method requires the user to manually choose the number of epochs to train for.  Earlier in the book we discussed an automated way of selecting the number of epochs, known as <a href="chap3.html#early_stopping">early stopping</a>.  Modify <tt>network3.py</tt> to implement early stopping.</p><p>  <li>  <!--  Add a <tt>Network</tt> method to return the accuracy on an  arbitrary data set.  -->  Add a <tt>Network</tt> method that returns the accuracy on an arbitrary data set.</p><p>  <li>  <!--  Modify the <tt>SGD</tt> method to allow the learning rate $\eta$  to be a function of the epoch number.  <em>Hint: After working on    this problem for a while, you may find it useful to see the    discussion at    <a href="https://groups.google.com/forum/#!topic/theano-users/NQ9NYLvleGc">this      link</a>.</em>  -->  Modify the <tt>SGD</tt> method to allow the learning rate $\eta$ to be a function of the epoch number.  <em>Hint: After working on this problem for a while, you may find it useful to see the discussion at <a href="https://groups.google.com/forum/#!topic/theano-users/NQ9NYLvleGc">this link</a>.</em></p><p>  <li>  <!--  Earlier in the chapter I described a technique for expanding the  training data by applying (small) rotations, skewing, and  translation.  Modify <tt>network3.py</tt> to incorporate all these  techniques.  <em>Note: Unless you have a tremendous amount of    memory, it is not practical to explicitly generate the entire    expanded data set.  
So you should consider alternate approaches.</em>  -->  Earlier in the chapter I described a technique for expanding the training data by applying (small) rotations, skewing, and translation.  Modify <tt>network3.py</tt> to incorporate all these techniques.  <em>Note: Unless you have a tremendous amount of memory, it is not practical to explicitly generate the entire expanded data set.  So you should consider alternate approaches.</em></p><p>  <li>  <!--Add the ability to load and save networks to <tt>network3.py</tt>.-->  Add the ability to load and save networks to <tt>network3.py</tt>.</p><p>  <li>  <!--  A shortcoming of the current code is that it provides few  diagnostic tools.  Can you think of any diagnostics to add that  would make it easier to understand to what extent a network is  overfitting?  Add them.  -->  A shortcoming of the current code is that it provides few diagnostic tools.  Can you think of any diagnostics to add that would make it easier to understand to what extent a network is overfitting?  Add them.</p><p>  <li>  <!--  We've used the same initialization procedure for rectified  linear units as for sigmoid (and tanh) neurons.  Our  <a href="chap3.html#weight_initialization">argument for that    initialization</a> was specific to the sigmoid function.  Consider a  network made entirely of rectified linear units (including outputs).  Show that rescaling all the weights in the network by a constant  factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$,  where $L$ is the number of layers.  How does this change if the  final layer is a softmax?  What do you think of using the sigmoid  initialization procedure for the rectified linear units?  Can you  think of a better initialization procedure?  <em>Note: This is a    very open-ended problem, not something with a simple    self-contained answer.  
Still, considering the problem will help    you better understand networks containing rectified linear units.</em>  -->  We've used the same initialization procedure for rectified linear units as for sigmoid (and tanh) neurons.  Our <a href="chap3.html#weight_initialization">argument for that initialization</a> was specific to the sigmoid function.  Consider a network made entirely of rectified linear units (including outputs).  Show that rescaling all the weights in the network by a constant factor $c > 0$ simply rescales the outputs by a factor $c^{L-1}$, where $L$ is the number of layers.  How does this change if the final layer is a softmax?  What do you think of using the sigmoid initialization procedure for the rectified linear units?  Can you think of a better initialization procedure?  <em>Note: This is a very open-ended problem, not something with a simple self-contained answer.  Still, considering the problem will help you better understand networks containing rectified linear units.</em></p><p>  <li>  <!--  Our  <a   href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">analysis</a>  of the unstable gradient problem was for sigmoid neurons. How does  the analysis change for networks made up of rectified linear units?  Can you think of a good way of modifying such a network so it  doesn't suffer from the unstable gradient problem?  <em>Note: The    word good in the second part of this makes the problem a research    problem.  It's actually easy to think of ways of making such    modifications.  
But I haven't investigated in enough depth to know    of a really good technique.</em>  -->  Our <a href="chap5.html#what's_causing_the_vanishing_gradient_problem_unstable_gradients_in_deep_neural_nets">analysis</a> of the unstable gradient problem was for sigmoid neurons.  How does the analysis change for networks made up of rectified linear units?  Can you think of a good way of modifying such a network so it doesn't suffer from the unstable gradient problem?  <em>Note: The word "good" in the second part of this makes the problem a research problem.  It's actually easy to think of ways of making such modifications.  But I haven't investigated in enough depth to know of a really good technique.</em></ul></p><p><h3><a name="recent_progress_in_image_recognition"></a><a href="#recent_progress_in_image_recognition"><!-- Recent progress in image recognition -->Recent progress in image recognition</a></h3></p><p><!--In 1998, the year MNIST was introduced, it took weeks to train astate-of-the-art workstation to achieve accuracies substantially worsethan those we can achieve using a GPU and less than an hour oftraining. Thus, MNIST is no longer a problem that pushes the limits ofavailable technique; rather, the speed of training means that it is aproblem good for teaching and learning purposes.  Meanwhile, the focusof research has moved on, and modern work involves much morechallenging image recognition problems.  In this section, I brieflydescribe some recent work on image recognition using neural networks.-->In 1998, the year MNIST was introduced, it took weeks of training on a state-of-the-art workstation to achieve accuracies substantially worse than those we can now achieve using a GPU and less than an hour of training.  Thus, MNIST is no longer a problem that pushes the limits of available technique; rather, the speed of training means that it is a problem good for teaching and learning purposes.  Meanwhile, the focus of research has moved on, and modern work involves much more challenging image recognition problems.  In this section, I briefly describe some recent work on image recognition using neural networks.</p><p><!--The section is different to most of the book.  Through the book I'vefocused on ideas likely to be of lasting interest - ideas such asbackpropagation, regularization, and convolutional networks.  I'vetried to avoid results which are fashionable as I write, but whoselong-term value is unknown. In science, such results are more oftenthan not ephemera which fade and have little lasting impact.  
Giventhis, a skeptic might say: "well, surely the recent progress in imagerecognition is an example of such ephemera?  In another two or threeyears, things will have moved on.  So surely these results are only ofinterest to a few specialists who want to compete at the absolutefrontier?  Why bother discussing it?"-->このセクションは本書の大部分と趣向が異なります。これまで本書では、長く通用するアイデアをテーマとして扱ってきました。例えば、逆伝播、正規化、畳込みネットワークなどです。今から記述していくような、長期的には価値が不明な流行っている知見を避けようとしてきたのです。科学の世界では、流行りというのはすぐに移り変わり、影響力が薄いものです。このことから考えると、懐疑的な人は次のように述べるでしょう。「ええとつまり、近年の画像認識の成果は結局、流行りものですよね？2、3年後には、物事は移り変わっているはずです。したがって最新の結果というものは、最先端で競争する専門家のような限られた人にとってのみ、意味のあるものですね？だとしたら、私たちがなぜ議論する必要があるのですか？」</p><p><!--Such a skeptic is right that some of the finer details of recentpapers will gradually diminish in perceived importance.  With thatsaid, the past few years have seen extraordinary improvements usingdeep nets to attack extremely difficult image recognition tasks.Imagine a historian of science writing about computer vision in theyear 2100.  They will identify the years 2011 to 2015 (and probably afew years beyond) as a time of huge breakthroughs, driven by deepconvolutional nets.  That doesn't mean deep convolutional nets willstill be used in 2100, much less detailed ideas such as dropout,rectified linear units, and so on.  But it does mean that an importanttransition is taking place, right now, in the history of ideas.  It'sa bit like watching the discovery of the atom, or the invention ofantibiotics: invention and discovery on a historic scale.  And sowhile we won't dig down deep into details, it's worth getting someidea of the exciting discoveries currently being made.
-->Such a skeptic is right that some of the finer details of recent papers will gradually diminish in perceived importance.  With that said, the past few years have seen extraordinary improvements using deep nets to attack extremely difficult image recognition tasks.  Imagine a historian of science writing about computer vision in the year 2100.  They will identify the years 2011 to 2015 (and probably a few years beyond) as a time of huge breakthroughs, driven by deep convolutional nets.  That doesn't mean deep convolutional nets will still be used in 2100, much less detailed ideas such as dropout, rectified linear units, and so on.  But it does mean that an important transition is taking place, right now, in the history of ideas.  It's a bit like watching the discovery of the atom, or the invention of antibiotics: invention and discovery on a historic scale.  And so while we won't dig down deep into details, it's worth getting some idea of the exciting discoveries currently being made.</p><p><!--<strong>The 2012 LRMD paper:</strong> Let me start with a 2012paper*<span class="marginnote">*<a href="http://research.google.com/pubs/pub38115.html">Building    high-level features using large scale unsupervised learning</a>, by  Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai  Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).  Note that the  detailed architecture of the network used in the paper differed in  many details from the deep convolutional networks we've been  studying.  Broadly speaking, however, LRMD is based on many similar  ideas.</span>from a group of researchers from Stanford and Google.--><strong>The 2012 LRMD paper:</strong> Let me start with a 2012 paper*<span class="marginnote">*<a href="http://research.google.com/pubs/pub38115.html">Building high-level features using large scale unsupervised learning</a>, by Quoc Le, Marc'Aurelio Ranzato, Rajat Monga, Matthieu Devin, Kai Chen, Greg Corrado, Jeff Dean, and Andrew Ng (2012).  Note that the detailed architecture of the network used in the paper differed in many details from the deep convolutional networks we've been studying.  Broadly speaking, however, LRMD is based on many similar ideas.</span> from a group of researchers from Stanford and Google.  <!--I'llrefer to this paper as LRMD, after the last names of the first fourauthors. LRMD used a neural network to classify images from<a href="http://www.image-net.org">ImageNet</a>, a very challenging imagerecognition problem.  The 2011 ImageNet data that they used included16 million full color images, in 20 thousand categories.  
The imageswere crawled from the open net, and classified by workers fromAmazon's Mechanical Turk service.  Here's a few ImageNetimages*-->この論文を、最初の4人の著者の苗字からLRMDと呼びます。LRMDはニューラルネットワークを用いて<a href="http://www.image-net.org">ImageNet</a>という画像分類問題を解いています。ImageNetは画像認識の難問です。彼らが使用した2011年のImageNetのデータは、2万カテゴリに分類された1600万枚のフルカラー画像でした。これらの画像はウェブ上で集められ、Amazon Mechanical Turkのサービスにより分類されたものです。これがImageNetの画像の例です*<!--<span class="marginnote">*These are from the 2014 dataset, which is somewhat  changed from 2011.  Qualitatively, however, the dataset is extremely  similar.  Details about ImageNet are available in the original  ImageNet paper,  <a href="http://www.image-net.org/papers/imagenet_cvpr09.pdf">ImageNet:    a large-scale hierarchical image database</a>, by Jia Deng, Wei Dong,  Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei (2009).</span>:--><span class="marginnote">  *これらは2014のデータセットのものです。  2011年のデータとは少し異なります。  しかし質的にはほとんど同じです。  ImageNetの詳細は、Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Li Fei-Feiによる2009年のImageNetの論文、  <a href="http://www.image-net.org/papers/imagenet_cvpr09.pdf">ImageNet:    a large-scale hierarchical image database</a>を参照してください。</span>。</p><p><img src="images/imagenet1.jpg" height="120px"><img src="images/imagenet2.jpg" height="120px"><img src="images/imagenet3.jpg" height="120px"><img src="images/imagenet4.jpg" height="120px"></p><p><!--These are, respectively, in the categories for beading plane, brownroot rot fungus, scalded milk, and the common roundworm.  If you'relooking for a challenge, I encourage you to visit ImageNet's list of<a href="http://www.image-net.org/synset?wnid=n03489162">hand tools</a>,which distinguishes between beading planes, block planes, chamferplanes, and about a dozen other types of plane, amongst othercategories.  I don't know about you, but I cannot confidentlydistinguish between all these tool types.  
This is obviously a muchmore challenging image recognition task than MNIST!  LRMD's networkobtained a respectable $15.8$ percent accuracy for correctlyclassifying ImageNet images.  That may not sound impressive, but itwas a huge improvement over the previous best result of $9.3$ percentaccuracy.  That jump suggested that neural networks might offer apowerful approach to very challenging image recognition tasks, such asImageNet.-->これらの画像の分類クラスはそれぞれ、玉縁装飾かんな、キュウリ科の植物の褐色根腐れを引き起こす菌類、加熱された牛乳、線虫です。問題を試したいのであれば、ImageNetの<a href="http://www.image-net.org/synset?wnid=n03489162">手工具</a>のページをお薦めします。上記サイトでは、玉縁装飾かんな、木口用かんな、面取りかんなを始め、10種類程度のかんなの分類があります。あなたがどうかは分からないですが、私はこれらの道具を自信を持って区別できません。これは明らかにMNISTよりも難しい画像認識問題です！LRMDのネットワークはImageNet画像に対して、 $15.8$ %の分類精度を得ています。あまり精度が良くないように聞こえるでしょう。でも、LRMDより以前の最高結果は $9.3$ %だったのです。そこから考えると大きな進展です。この進展は、ImageNetのような難しい画像認識問題に対して、ニューラルネットが強力なアプローチであることを示しています。</p><p><!--<strong>The 2012 KSH paper:</strong> The work of LRMD was followed by a 2012paper of Krizhevsky, Sutskever and Hinton(KSH)*<span class="marginnote">*<a href="http://www.cs.toronto.edu/&#126;fritz/absps/imagenet.pdf">ImageNet    classification with deep convolutional neural networks</a>, by Alex  Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton (2012).</span>.--><strong>2012年のKSH論文：</strong> 2012年にKrizhevsky, Sutskever, Hinton(KSH)がLRMDの研究を追って論文*を出しました<span class="marginnote">*2012年のAlex  Krizhevsky, Ilya Sutskever, Geoffrey E. Hintonによる<a href="http://www.cs.toronto.edu/&#126;fritz/absps/imagenet.pdf">ImageNet    classification with deep convolutional neural networks</a></span>。<!--KSH trained and tested a deep convolutional neural network using arestricted subset of the ImageNet data. The subset they used came froma popular machine learning competition - the ImageNet Large-ScaleVisual Recognition Challenge (ILSVRC).  Using a competition datasetgave them a good way of comparing their approach to other leadingtechniques.  
The ILSVRC-2012 training set contained about 1.2 millionImageNet images, drawn from 1,000 categories.  The validation and testsets contained 50,000 and 150,000 images, respectively, drawn from thesame 1,000 categories.-->KSH trained and tested a deep convolutional neural network using a restricted subset of the ImageNet data.  The subset they used came from a popular machine learning competition, the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC).  Using a competition dataset gave them a good way of comparing their approach to other leading techniques.  The ILSVRC-2012 training set contained about 1.2 million ImageNet images, drawn from 1,000 categories.  The validation and test sets contained 50,000 and 150,000 images, respectively, drawn from the same 1,000 categories.</p><p><!--One difficulty in running the ILSVRC competition is that many ImageNetimages contain multiple objects.  Suppose an image shows a labradorretriever chasing a soccer ball.  The so-called "correct" ImageNetclassification of the image might be as a labrador retriever.  Shouldan algorithm be penalized if it labels the image as a soccer ball?Because of this ambiguity, an algorithm was considered correct if theactual ImageNet classification was among the $5$ classifications thealgorithm considered most likely.  By this top-$5$ criterion, KSH'sdeep convolutional network achieved an accuracy of $84.7$ percent,vastly better than the next-best contest entry, which achieved anaccuracy of $73.8$ percent.  Using the more restrictive metric ofgetting the label exactly right, KSH's network achieved an accuracy of$63.3$ percent.-->One difficulty in running the ILSVRC competition is that many ImageNet images contain multiple objects.  Suppose an image shows a labrador retriever chasing a soccer ball.  The so-called "correct" ImageNet classification of the image might be as a labrador retriever.  Should an algorithm be penalized if it labels the image as a soccer ball?  Because of this ambiguity, an algorithm was considered correct if the actual ImageNet classification was among the $5$ classifications the algorithm considered most likely.  By this top-$5$ criterion, KSH's deep convolutional network achieved an accuracy of $84.7$ percent, vastly better than the next-best contest entry, which achieved an accuracy of $73.8$ percent.  Using the more restrictive metric of getting the label exactly right, KSH's network achieved an accuracy of $63.3$ percent.</p><p><!--It's worth briefly describing KSH's network, since it has inspiredmuch subsequent work.  It's also, as we shall see, closely related tothe networks we trained earlier in this chapter, albeit moreelaborate.  
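-->The top-$5$ criterion described above can be sketched in a few lines of NumPy.  This is only an illustration, not the ILSVRC evaluation code: the function name <tt>top_k_accuracy</tt> and the toy scores and labels are assumptions made up for this example.</p>

```python
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    """Fraction of rows whose true label is among the k highest scores."""
    scores = np.asarray(scores)
    labels = np.asarray(labels)
    # Indices of the k largest scores in each row (their order does not matter).
    top_k = np.argsort(scores, axis=1)[:, -k:]
    # A row counts as correct if its true label appears anywhere in its top k.
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy data (hypothetical): 3 images, 6 classes.
scores = np.array([
    [0.05, 0.10, 0.40, 0.20, 0.15, 0.10],
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],
    [0.02, 0.03, 0.05, 0.10, 0.30, 0.50],
])
labels = np.array([2, 5, 3])

top1 = top_k_accuracy(scores, labels, k=1)  # exact-match accuracy
top5 = top_k_accuracy(scores, labels, k=5)  # top-5 accuracy
```

<p>With <tt>k=1</tt> the same function gives the stricter exact-match metric, so the two figures quoted for KSH's network correspond to two values of <tt>k</tt> applied to the same predictions.</p><p><!--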
KSH used a deep convolutional neural network, trained ontwo GPUs.  They used two GPUs because the particular type of GPU theywere using (an NVIDIA GeForce GTX 580) didn't have enough on-chipmemory to store their entire network.  So they split the network intotwo parts, partitioned across the two GPUs.-->It's worth briefly describing KSH's network, since it has inspired much subsequent work.  It's also, as we shall see, closely related to the networks we trained earlier in this chapter, albeit more elaborate.  KSH used a deep convolutional neural network, trained on two GPUs.  They used two GPUs because the particular type of GPU they were using (an NVIDIA GeForce GTX 580) didn't have enough on-chip memory to store their entire network.  So they split the network into two parts, partitioned across the two GPUs.</p><p><!--The KSH network has $7$ layers of hidden neurons.  The first $5$hidden layers are convolutional layers (some with max-pooling), whilethe next $2$ layers are fully-connected layers.  The output layer is a$1,000$-unit softmax layer, corresponding to the $1,000$ imageclasses. Here's a sketch of the network, taken from the KSHpaper*<span class="marginnote">*Thanks to Ilya Sutskever.</span>.  The details are explainedbelow.  Note that many layers are split into $2$ parts, correspondingto the $2$ GPUs.-->The KSH network has $7$ layers of hidden neurons.  The first $5$ hidden layers are convolutional layers (some with max-pooling), while the next $2$ layers are fully-connected layers.  The output layer is a $1,000$-unit softmax layer, corresponding to the $1,000$ image classes.  Here's a sketch of the network, taken from the KSH paper*<span class="marginnote">*Thanks to Ilya Sutskever.</span>.  The details are explained below.  Note that many layers are split into $2$ parts, corresponding to the $2$ GPUs.</p><p><img src="images/KSH.jpg" width="600px"></p><p><!--The input layer contains $3 \times 224 \times 224$ neurons,representing the RGB values for a $224 \times 224$ image.  Recallthat, as mentioned earlier, ImageNet contains images of varyingresolution.  This poses a problem, since a neural network's inputlayer is usually of a fixed size.  KSH dealt with this by rescalingeach image so the shorter side had length $256$. They then cropped outa $256 \times 256$ area in the center of the rescaled image.  
Finally,KSH extracted random $224 \times 224$ subimages (and horizontalreflections) from the $256 \times 256$ images.  They did this randomcropping as a way of expanding the training data, and thus reducingoverfitting.  This is particularly helpful in a large network such asKSH's.  It was these $224 \times 224$ images which were used as inputsto the network.  In most cases the cropped image still contains themain object from the uncropped image.-->入力層は、 $3 \times 224 \times 224$ ニューロンを含み、 $224 \times 224$ の画像のRGB値を表します。以前述べたように、ImageNetの画像は解像度が異なることを思い出してください。ニューラルネットワークの入力層のサイズは固定なので、このままでは問題が起きます。そこで、KSHは各画像を拡大縮小して、短辺が長さ $256$ となるように調整しています。次に、その画像の中央 $256 \times 256$ の領域を切り取ります。最後に、その $256 \times 256$ 領域の中から、ランダムに $224 \times 224$ の部分画像（水平反転も含む）を抜き出します。このランダムに抜き出す操作により、訓練データを拡張し、過適合を防いでいます。この一連の操作はKSHのような巨大なネットワークの場合、有効です。$224 \times 224$ の画像がネットワークの入力に使われています。大抵の場合、抜き出された画像は目的の物体を含んでいるはずです。</p><p><!--Moving on to the hidden layers in KSH's network, the first hiddenlayer is a convolutional layer, with a max-pooling step.  It useslocal receptive fields of size $11 \times 11$, and a stride length of$4$ pixels.  There are a total of $96$ feature maps.  The feature mapsare split into two groups of $48$ each, with the first $48$ featuremaps residing on one GPU, and the second $48$ feature maps residing onthe other GPU.  The max-pooling in this and later layers is done in $3\times 3$ regions, but the pooling regions are allowed to overlap, andare just $2$ pixels apart.-->KSHの隠れ層の話に移ります。1つ目の隠れ層はMaxプーリング付きの畳み込み層です。この層はサイズが $11 \times 11$ の局所受容野を、ストライド長さ $4$ ピクセルで使います。全体で $96$ の特徴マップとなります。特徴マップは各 $48$ の2グループに分割され、始めの $48$ の特徴マップは片方のGPUに置かれ、後半の $48$ の特徴マップはもう片方のGPUに置かれます。この層含めて後層でも、Maxプーリングは $3 \times 3$ の領域で行われます。しかしプーリング領域は重複が許されており、実際 $2$ ピクセルしか離れていません。</p><p><!--The second hidden layer is also a convolutional layer, with amax-pooling step.  
-->The second hidden layer is also a convolutional layer, with a max-pooling step. It uses $5 \times 5$ local receptive fields, and there's a total of $256$ feature maps, split into $128$ on each GPU. Note that the feature maps only use $48$ input channels, not the full $96$ output from the previous layer (as would usually be the case). This is because any single feature map only uses inputs from the same GPU. In this sense the network departs from the convolutional architecture we described earlier in the chapter, though obviously the basic idea is still the same.</p>
<p>The third, fourth and fifth hidden layers are convolutional layers, but unlike the previous layers, they do not involve max-pooling. Their respective parameters are: (3) $384$ feature maps, with $3 \times 3$ local receptive fields, and $256$ input channels; (4) $384$ feature maps, with $3 \times 3$ local receptive fields, and $192$ input channels; and (5) $256$ feature maps, with $3 \times 3$ local receptive fields, and $192$ input channels. Note that the third layer involves some inter-GPU communication (as depicted in the figure) in order that the feature maps use all $256$ input channels.</p>
<p>The sixth and seventh hidden layers are fully-connected layers, with $4,096$ neurons in each layer.</p>
<p>The output layer is a $1,000$-unit softmax layer.</p>
<p><!--
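--></p>
<p>As a rough check on the layer descriptions above, the following sketch counts the weights implied by the stated feature-map counts, receptive field sizes and input channels. It rests on a few assumptions: biases are ignored, the first layer is taken to see the $3$ RGB channels of the input, and the layer names (<tt>conv1</tt>, <tt>fc7</tt>, and so on) are labels chosen here for convenience. The sixth hidden layer is left out, because its input size depends on the spatial size of the fifth layer's output, which is not stated in the text.</p>

<pre>
```python
# (name, feature maps, receptive field side, input channels per feature map)
# The names are labels chosen for this sketch, not taken from the KSH paper.
conv_layers = [
    ("conv1", 96, 11, 3),     # assumed: the input is an RGB image
    ("conv2", 256, 5, 48),    # each map sees only the 48 maps on its own GPU
    ("conv3", 384, 3, 256),   # uses all 256 maps via inter-GPU communication
    ("conv4", 384, 3, 192),
    ("conv5", 256, 3, 192),
]


def conv_weights(maps, field, channels):
    """Number of shared weights in a convolutional layer, ignoring biases."""
    return maps * field * field * channels


counts = {name: conv_weights(m, f, c) for name, m, f, c in conv_layers}
# Fully-connected layers whose input and output sizes are both given above.
counts["fc7"] = 4096 * 4096      # seventh hidden layer, fed by the sixth
counts["output"] = 4096 * 1000   # softmax layer, fed by the seventh

for name, n in counts.items():
    print(f"{name:7s} {n:12,d}")
print("conv total:", sum(counts[name] for name, *_ in conv_layers))
```
</pre>

<p>Under these assumptions the five convolutional layers together hold about $2.3$ million weights, while the seventh hidden layer alone holds about $16.8$ million. Weight sharing keeps the convolutional layers small, and the fully-connected layers account for most of the network's parameters.</p>
<p><!--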
Instead of usingthe sigmoid or tanh activation functions, KSH use rectified linearunits, which sped up training significantly.  KSH's network hadroughly 60 million learned parameters, and was thus, even with thelarge training set, susceptible to overfitting.  To overcome this,they expanded the training set using the random cropping strategy wediscussed above.  They also further addressed overfitting by using avariant of <a href="chap3.html#regularization">l2 regularization</a>, and<a href="chap3.html#other_techniques_for_regularization">dropout</a>.The network itself was trained using<a href="chap3.html#variations_on_stochastic_gradient_descent">momentum-based</a>mini-batch stochastic gradient descent.-->KSHのネットワークは多くのテクニックを利用しています。シグモイドやtanhを活性化関数に使う代わりに、ReLUを使って訓練を高速化しています。もともと巨大な訓練データセットを使っているとはいえ、KSHのネットワークには約6000万のパラメータがあるので、過適合しやすいです。これを克服するために、上述の通りランダムに抜き出す戦略により、訓練データを拡張したのです。さらに<a href="chap3.html#regularization">L2正規化</a>の一種や、<a href="chap3.html#other_techniques_for_regularization">ドロップアウト</a>を用いて過適合を抑制しています。ネットワークの訓練は、<a href="chap3.html#variations_on_stochastic_gradient_descent">モメンタム</a>を用いたミニバッチ確率的勾配降下法で行います。</p><p><!--That's an overview of many of the core ideas in the KSH paper.  I'veomitted some details, for which you should look at the paper.  You canalso look at Alex Krizhevsky's<a href="https://code.google.com/p/cuda-convnet/">cuda-convnet</a> (andsuccessors), which contains code implementing many of the ideas.
-->That's an overview of many of the core ideas in the KSH paper. I've omitted some details, for which you should look at the paper. You can also look at Alex Krizhevsky's <a href="https://code.google.com/p/cuda-convnet/">cuda-convnet</a> (and successors), which contains code implementing many of the ideas. A Theano-based implementation has also been developed*<span class="marginnote">*<a href="http://arxiv.org/abs/1412.2302">Theano-based large-scale visual recognition with multiple GPUs</a>, by Weiguang Ding, Ruoyan Wang, Fei Mao, and Graham Taylor (2014).</span>, with the code available <a href="https://github.com/uoguelph-mlrg/theano_alexnet">here</a>. The code is recognizably along similar lines to that developed in this chapter, although the use of multiple GPUs complicates things somewhat. The Caffe neural nets framework also includes a version of the KSH network; see their <a href="http://caffe.berkeleyvision.org/model_zoo.html">Model Zoo</a> for details.</p>
<p><!--
--><strong>2014 ILSVRC competition:</strong> Since 2012, rapid progress continues to be made. Consider the 2014 ILSVRC competition. As in 2012, it involved a training set of $1.2$ million images, in $1,000$ categories, and the figure of merit was whether the top $5$ predictions included the correct category. The winning team, based primarily at Google*<span class="marginnote">*<a href="http://arxiv.org/abs/1409.4842">Going deeper with convolutions</a>, by Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich (2014).</span>, used a deep convolutional network with $22$ layers of neurons. They called their network GoogLeNet, as a homage to LeNet-5. GoogLeNet achieved a top-5 accuracy of $93.33$ percent, a giant improvement over the 2013 winner (<a href="http://www.clarifai.com">Clarifai</a>, with $88.3$ percent), and the 2012 winner (KSH, with $84.7$ percent).</p>
<p><!--
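--></p>
<p>The top-$5$ figure of merit described above is easy to state in code. The sketch below is a minimal illustration rather than the official ILSVRC evaluation script; the function name <tt>top5_accuracy</tt> and the toy scores are made up for the example, and ties are broken by class index.</p>

<pre>
```python
def top5_accuracy(scores, labels, k=5):
    """Fraction of examples whose true label is among the k highest scores.

    `scores[i][c]` is the model's score for class c on example i, and
    `labels[i]` is the index of the correct class. Hypothetical helper,
    not the official ILSVRC evaluation code.
    """
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest score to lowest.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Three toy examples over 8 classes (ILSVRC itself has 1,000 classes).
scores = [
    [0.05, 0.30, 0.10, 0.20, 0.15, 0.08, 0.07, 0.05],
    [0.40, 0.02, 0.03, 0.25, 0.10, 0.10, 0.05, 0.05],
    [0.01, 0.02, 0.03, 0.04, 0.10, 0.20, 0.25, 0.35],
]
labels = [4, 1, 7]
print(top5_accuracy(scores, labels))
```
</pre>

<p>In the toy data the second example counts as a miss, because the correct class receives the lowest score of all. The first example counts as a hit even though the correct class is only the third-ranked guess, which shows how forgiving the top-$5$ criterion is.</p>
<p><!--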
To do this, theybuilt a system which lets humans classify ILSVRC images.  As one ofthe authors, Andrej Karpathy, explains in an informative<a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">blog  post</a>, it was a lot of trouble to get the humans up to GoogLeNet'sperformance:-->GoogLeNetの $93.33$ %という精度はどのくらい良い結果なのでしょうか？2014年に研究者がILSVRCに関するサーベイ論文*を書いています<span class="marginnote">*2014年のOlga Russakovsky,  Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma,  Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein,  Alexander C. Berg, Li Fei-Feiによる<a href="http://arxiv.org/abs/1409.0575">ImageNet    large scale visual recognition challenge</a>。</span>。彼らは、ILSVRCに人間が挑むとどのような精度になるか、という問題を提起しました。これを調べるために、彼らは人間がILSVRC画像を分類するためのシステムを作りました。著者の1人であるAndrej Karpathyが有益な<a href="http://karpathy.github.io/2014/09/02/what-i-learned-from-competing-against-a-convnet-on-imagenet/">ブログの投稿</a>を行っています。その内容は、人間がGoogLeNetのパフォーマンスにたどり着くのは難しいというものでした。</p><p>  <!--<blockquote> ...the task of labeling images with 5 out of 1000  categories quickly turned out to be extremely challenging, even for  some friends in the lab who have been working on ILSVRC and its  classes for a while. First we thought we would put it up on [Amazon  Mechanical Turk]. Then we thought we could recruit paid  undergrads. Then I organized a labeling party of intense labeling  effort only among the (expert labelers) in our lab. Then I developed  a modified interface that used GoogLeNet predictions to prune the  number of categories from 1000 to only about 100. It was still too  hard - people kept missing categories and getting up to ranges of  13-15&#37; error rates. In the end I realized that to get anywhere  competitively close to GoogLeNet, it was most efficient if I sat  down and went through the painfully long training process and the  subsequent careful annotation process myself... 
--><blockquote> ...the task of labeling images with 5 out of 1000 categories quickly turned out to be extremely challenging, even for some friends in the lab who have been working on ILSVRC and its classes for a while. First we thought we would put it up on [Amazon Mechanical Turk]. Then we thought we could recruit paid undergrads. Then I organized a labeling party of intense labeling effort only among the (expert labelers) in our lab. Then I developed a modified interface that used GoogLeNet predictions to prune the number of categories from 1000 to only about 100. It was still too hard - people kept missing categories and getting up to ranges of 13-15&#37; error rates. In the end I realized that to get anywhere competitively close to GoogLeNet, it was most efficient if I sat down and went through the painfully long training process and the subsequent careful annotation process myself... The labeling happened at a rate of about 1 per minute, but this decreased over time... Some images are easily recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) can require multiple minutes of concentrated effort. I became very good at identifying breeds of dogs... Based on the sample of images I worked on, the GoogLeNet classification error turned out to be 6.8&#37;... My own error in the end turned out to be 5.1&#37;, approximately 1.7&#37; better.  </blockquote></p>
<p>In other words, an expert human, working painstakingly, was with great effort able to narrowly beat the deep neural network. In fact, Karpathy reports that a second human expert, trained on a smaller sample of images, was only able to attain a $12.0$ percent top-5 error rate, significantly below GoogLeNet's performance. About half the errors were due to the expert "failing to spot and consider the ground truth label as an option".</p>
<p><!--
Indeed, since this work, several teamshave reported systems whose top-5 error rate is actually <em>better</em>than 5.1&#37;.  This has sometimes been reported in the media as thesystems having better-than-human vision.  While the results aregenuinely exciting, there are many caveats that make it misleading tothink of the systems as having better-than-human vision.  The ILSVRCchallenge is in many ways a rather limited problem - a crawl of theopen web is not necessarily representative of images found inapplications!  And, of course, the top-$5$ criterion is quiteartificial.  We are still a long way from solving the problem of imagerecognition or, more broadly, computer vision.  Still, it's extremelyencouraging to see so much progress made on such a challengingproblem, over just a few years.-->この報告は驚異的です。この研究以降、実は幾つかのチームが 5.1&#37; を<em>超える</em>結果を報告しています。これらの結果を受けて、「システムが人間を超える視覚を手に入れた」とメディアで報道がなされました。たしかに結果は本当に素晴らしいものですが、注意することとして、「人間を超える視覚を手に入れた」、というのは誤解です。ILSVRCはとても制約の大きい問題です。ウェブから集めた画像を分類しているので、様々な応用目的で使う画像とは必ずしも一致しません。もちろん評価指標である、上位 $5$ を基準にするというのも非常に恣意的です。画像認識、さらに広く言うとコンピュータビジョンの問題を完全に解いたとはまだとても言えないのです。ただし、難問に対して、たった数年でこの結果が得られたというのは、とても励みになります。</p><p><!--<strong>Other activity:</strong> I've focused on ImageNet, but there's aconsiderable amount of other activity using neural nets to do imagerecognition.  Let me briefly describe a few interesting recentresults, just to give the flavour of some current work.--><strong>他の活動：</strong>これまでImageNetを見てきました。しかし、他にもニューラルネットワークを使用して画像認識する活動があります。興味深い近年の研究結果を簡単に紹介します。</p><p><!--One encouraging practical set of results comes from a team at Google,who applied deep convolutional networks to the problem of recognizingstreet numbers in Google's Street Viewimagery*<span class="marginnote">*<a href="http://arxiv.org/abs/1312.6082">Multi-digit    Number Recognition from Street View Imagery using Deep    Convolutional Neural Networks</a>, by Ian J. 
-->One encouraging practical set of results comes from a team at Google, who applied deep convolutional networks to the problem of recognizing street numbers in Google's Street View imagery*<span class="marginnote">*<a href="http://arxiv.org/abs/1312.6082">Multi-digit Number Recognition from Street View Imagery using Deep Convolutional Neural Networks</a>, by Ian J. Goodfellow, Yaroslav Bulatov, Julian Ibarz, Sacha Arnoud, and Vinay Shet (2013).</span>. In their paper, they report detecting and automatically transcribing nearly 100 million street numbers at an accuracy similar to that of a human operator. The system is fast: their system transcribed all of Street View's images of street numbers in France in less than an hour! They say: "Having this new dataset significantly increased the geocoding quality of Google Maps in several countries especially the ones that did not already have other sources of good geocoding." And they go on to make the broader claim: "We believe with this model we have solved [optical character recognition] for short sequences [of characters] for many applications."</p>
<p><!--
-->I've perhaps given the impression that it's all a parade of encouraging results. Of course, some of the most interesting work reports on fundamental things we don't yet understand. For instance, a 2013 paper*<span class="marginnote">*<a href="http://arxiv.org/abs/1312.6199">Intriguing properties of neural networks</a>, by Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus (2013).</span> showed that deep networks may suffer from what are effectively blind spots. Consider the lines of images below. On the left is an ImageNet image classified correctly by their network. On the right is a slightly perturbed image (the perturbation is in the middle) which is classified <em>incorrectly</em> by the network. The authors found that there are such "adversarial" images for every sample image, not just a few special ones.</p>
<p><img src="images/adversarial.jpg"></p>
<p><!--
-->This is a disturbing result. The paper used a network based on the same code as KSH's network - that is, just the type of network that is being increasingly widely used. While such neural networks compute functions which are, in principle, continuous, results like this suggest that in practice they're likely to compute functions which are very nearly discontinuous. Worse, they'll be discontinuous in ways that violate our intuition about what is reasonable behavior. That's concerning. Furthermore, it's not yet well understood what's causing the discontinuity: is it something about the loss function? The activation functions used? The architecture of the network? Something else? We don't yet know.</p>
<p>Now, these results are not quite as bad as they sound. Although such adversarial images are common, they're also unlikely in practice. As the paper notes:</p>
<p>  <blockquote>  The existence of the adversarial negatives appears to be in contradiction with the network’s ability to achieve high generalization performance. Indeed, if the network can generalize well, how can it be confused by these adversarial negatives, which are indistinguishable from the regular examples? The explanation is that the set of adversarial negatives is of extremely low probability, and thus is never (or rarely) observed in the test set, yet it is dense (much like the rational numbers), and so it is found near virtually every test case.  </blockquote></p>
<p><!--
For example, one recentpaper*<span class="marginnote">*<a href="http://arxiv.org/abs/1412.1897">Deep Neural    Networks are Easily Fooled: High Confidence Predictions for    Unrecognizable Images</a>, by Anh Nguyen, Jason Yosinski, and Jeff  Clune (2014).</span> shows that given a trained network it's possible togenerate images which look to a human like white noise, but which thenetwork classifies as being in a known category with a very highdegree of confidence.  This is another demonstration that we have along way to go in understanding neural networks and their use in imagerecognition.-->このような類の結果が近年発見されるのは、本質的にニューラルネットワークを理解していないことの現れでしょう。もちろん、このような結果の発表されることで、追跡調査が盛んに行われ、研究が進みます。他にも例えば、最近の論文*<span class="marginnote">*2014年のAnh Nguyen, Jason Yosinski, Jeff  Cluneによる<a href="http://arxiv.org/abs/1412.1897">Deep Neural    Networks are Easily Fooled: High Confidence Predictions for    Unrecognizable Images</a></span>では、人間にはホワイトノイズに見えるのに、ネットワークは確信を持って既知のカテゴリに分類するような画像を生成する結果が報告されています。ニューラルネットワークとその画像認識法を理解するまでにはまだまだ時間がかかりそうです。</p><p><!--Despite results like this, the overall picture is encouraging.  We'reseeing rapid progress on extremely difficult benchmarks, likeImageNet.  We're also seeing rapid progress in the solution ofreal-world problems, like recognizing street numbers in StreetView.But while this is encouraging it's not enough just to see improvementson benchmarks, or even real-world applications.  There are fundamentalphenomena which we still understand poorly, such as the existence ofadversarial images.  When such fundamental problems are still beingdiscovered (never mind solved), it is premature to say that we're nearsolving the problem of image recognition.  At the same time suchproblems are an exciting stimulus to further work.
-->Despite results like this, the overall picture is encouraging. We're seeing rapid progress on extremely difficult benchmarks, like ImageNet. We're also seeing rapid progress in the solution of real-world problems, like recognizing street numbers in StreetView. But while this is encouraging it's not enough just to see improvements on benchmarks, or even real-world applications. There are fundamental phenomena which we still understand poorly, such as the existence of adversarial images. When such fundamental problems are still being discovered (never mind solved), it is premature to say that we're near solving the problem of image recognition. At the same time such problems are an exciting stimulus to further work.</p><p></
p><p><h3><a name="other_approaches_to_deep_neural_nets"></a><a href="#other_approaches_to_deep_neural_nets">Other approaches to deep neural nets</a></h3></p>
<p>Through this book, we've concentrated on a single problem: classifying the MNIST digits. It's a juicy problem which forced us to understand many powerful ideas: stochastic gradient descent, backpropagation, convolutional nets, regularization, and more. But it's also a narrow problem. If you read the neural networks literature, you'll run into many ideas we haven't discussed: recurrent neural networks, Boltzmann machines, generative models, transfer learning, reinforcement learning, and so on, on and on $\ldots$ and on! Neural networks is a vast field. However, many important ideas are variations on ideas we've already discussed, and can be understood with a little effort. In this section I provide a glimpse of these as yet unseen vistas. <!--
Ofcourse, many of these links will soon be superseded, and you may wishto search out more recent literature.  That point notwithstanding, Iexpect many of the underlying ideas to be of lasting interest.-->本書の範囲を越えてしまうため、詳しい議論や網羅的な議論は行いません。むしろ、各概念を直感的に理解できるように、これまで学んできたことに関連させて学んでいきます。セクションを通じて、他のソースへのリンクを示しておきます。このリンクを辿れば、さらに学ぶことができます。もちろん、リンクの多くは今後すぐに古びてしまい、最先端の文献を常に探すようになるでしょう。それでも、根底のアイデアは永続的に残るはずです。</p><p><!--<strong>Recurrent neural networks (RNNs):</strong> In the feedforward netswe've been using there is a single input which completely determinesthe activations of all the neurons through the remaining layers.  It'sa very static picture: everything in the network is fixed, with afrozen, crystalline quality to it.  But suppose we allow the elementsin the network to keep changing in a dynamic way.  For instance, thebehaviour of hidden neurons might not just be determined by theactivations in previous hidden layers, but also by the activations atearlier times.  Indeed, a neuron's activation might be determined inpart by its own activation at an earlier time.  That's certainly notwhat happens in a feedforward network.  Or perhaps the activations ofhidden and output neurons won't be determined just by the currentinput to the network, but also by earlier inputs.--><strong>再帰型ニューラルネットワーク (RNN)：</strong>これまで使ってきたフィードフォワードのネットワークでは、後方のニューロンの活性化を決める入力は1つでした。これはとても静的な構造です。ネットワークの全ては固定され、まるで凍結された結晶のようです。しかし、動的に変化し続けるネットワーク要素を想定することもできます。例えば、隠れ層の振る舞いが、1つ前の隠れ層の活性化によってのみ決まるのではなく、さらに以前の活性化にも影響されるような場合も考えられます。実際、あるニューロンの活性化が、そのニューロン自身の以前の活性化により定義されることもありそうです。単なるフィードフォワードネットワークでは、そんなことは起きません。さらに別の場合を考えてみると、隠れ層や出力層の活性化がネットワークの現在の入力だけでなく、もっと以前の入力によっても影響されるパターンも想定できます。</p><p><!--Neural networks with this kind of time-varying behaviour are known as<em>recurrent neural networks</em> or <em>RNNs</em>.  There are manydifferent ways of mathematically formalizing the informal descriptionof recurrent nets given in the last paragraph.  
You can get theflavour of some of these mathematical models by glancing at<a href="http://en.wikipedia.org/wiki/Recurrent_neural_network">the  Wikipedia article on RNNs</a>.  As I write, that page lists no fewerthan 13 different models.  But mathematical details aside, the broadidea is that RNNs are neural networks in which there is some notion ofdynamic change over time.  And, not surprisingly, they're particularlyuseful in analysing data or processes that change over time.  Suchdata and processes arise naturally in problems such as speech ornatural language, for example.-->このような時間経過を考慮に入れたニューラルネットワークは、<em>再帰型ニューラルネットワーク</em>もしくは<em>RNN</em>として知られています。この再帰型ネットワークの数学的な定式化は、色々なやり方があります。<a href="http://en.wikipedia.org/wiki/Recurrent_neural_network">WikipediaのRNNのページ</a>を眺めると数学的モデルの雰囲気が掴めます。この執筆段階では、上記ページには13以上の異なるモデルが掲載されています。数学的詳細は置いておき、RNNを大雑把に表現すると、時間経過による動的変化の概念が含まれるニューラルネットです。このモデルは時間により変化するデータの分析や処理に役立ちます。そのようなデータや処理は、音声や自然言語などの問題に自然と含まれています。</p><p><!--One way RNNs are currently being used is to connect neural networksmore closely to traditional ways of thinking about algorithms, ways ofthinking based on concepts such as Turing machines and (conventional)programming languages.  <a href="http://arxiv.org/abs/1410.4615">A 2014  paper</a> developed an RNN which could take as input acharacter-by-character description of a (very, very simple!) Pythonprogram, and use that description to predict the output.  Informally,the network is learning to "understand" certain Python programs.<a href="http://arxiv.org/abs/1410.5401">A second paper, also from 2014</a>,used RNNs as a starting point to develop what they called a neuralTuring machine (NTM).  This is a universal computer whose entirestructure can be trained using gradient descent.  They trained theirNTM to infer algorithms for several simple problems, such as sortingand copying.
-->One way RNNs are currently being used is to connect neural networks more closely to traditional ways of thinking about algorithms, ways of thinking based on concepts such as Turing machines and (conventional) programming languages. <a href="http://arxiv.org/abs/1410.4615">A 2014 paper</a> developed an RNN which could take as input a character-by-character description of a (very, very simple!) Python program, and use that description to predict the output. Informally, the network is learning to "understand" certain Python programs. <a href="http://arxiv.org/abs/1410.5401">A second paper, also from 2014</a>, used RNNs as a starting point to develop what they called a neural Turing machine (NTM). This is a universal computer whose entire structure can be trained using gradient descent. They trained their NTM to infer algorithms for several simple problems, such as sorting and copying.</p>
<p>As it stands, these are extremely simple toy models. Learning to execute the Python program <tt>print(398345+42598)</tt> doesn't make a network into a full-fledged Python interpreter! It's not clear how much further it will be possible to push the ideas. Still, the results are intriguing. Historically, neural networks have done well at pattern recognition problems where conventional algorithmic approaches have trouble. Vice versa, conventional algorithmic approaches are good at solving problems that neural nets aren't so good at. No-one today implements a web server or a database program using a neural network! It'd be great to develop unified models that integrate the strengths of both neural networks and more traditional approaches to algorithms. RNNs and ideas inspired by RNNs may help us do that.</p>
<p><!--
-->RNNs have also been used in recent years to attack many other problems. They've been particularly useful in speech recognition. Approaches based on RNNs have, for example, <a href="http://arxiv.org/abs/1303.5778">set records for the accuracy of phoneme recognition</a>. They've also been used to develop <a href="http://www.fit.vutbr.cz/&#126;imikolov/rnnlm/thesis.pdf">improved models of the language people use while speaking</a>. Better language models help disambiguate utterances that otherwise sound alike. A good language model will, for example, tell us that "to infinity and beyond" is much more likely than "two infinity and beyond", despite the fact that the phrases sound identical. RNNs have been used to set new records for certain language benchmarks.</p>
<p><!--
-->This work is, incidentally, part of a broader use of deep neural nets of all types, not just RNNs, in speech recognition. For example, an approach based on deep nets has achieved <a href="http://arxiv.org/abs/1309.1501">outstanding results on large vocabulary continuous speech recognition</a>. And another system based on deep nets has been deployed in <a href="http://www.wired.com/2013/02/android-neural-network/">Google's Android operating system</a> (for related technical work, see <a href="http://research.google.com/pubs/VincentVanhoucke.html">Vincent Vanhoucke's 2012-2015 papers</a>).</p>
<p>I've said a little about what RNNs can do, but not so much about how they work. It perhaps won't surprise you to learn that many of the ideas used in feedforward networks can also be used in RNNs. In particular, we can train RNNs using straightforward modifications to gradient descent and backpropagation. Many other ideas used in feedforward nets, ranging from regularization techniques to convolutions to the activation and cost functions used, are also useful in recurrent nets. And so many of the techniques we've developed in the book can be adapted for use with RNNs.</p>
<p><!--
Fortunately, it'spossible to incorporate an idea known as long short-term memory units(LSTMs) into RNNs.  The units were introduced by<a href="http://dx.doi.org/10.1162/neco.1997.9.8.1735">Hochreiter and  Schmidhuber in 1997</a> with the explicit purpose of helping addressthe unstable gradient problem.  LSTMs make it much easier to get goodresults when training RNNs, and many recent papers (including manythat I linked above) make use of LSTMs or related ideas.--><strong>長期短期記憶ユニット（LSTM）：</strong>RNNの難しさの1つは、モデルを訓練するのがとても大変なことです。深層フィードフォワードネットワークよりもさらに訓練が難しいのです。その理由は<a href="chap5.html">5章</a>で議論した勾配の不安定性に起因します。思い出してください。これは、後ろの層に伝播するにつれ、勾配がどんどん小さくなっていくという問題でした。これにより、前方の層ほど学習が遅くなるのです。RNNではこの問題がさらに悪化します。なぜかと言うと、RNNでは勾配は単に前方の層へ伝播するだけではなく、時間経過に従い後方へも伝播するからです。ネットワークが長時間動いている場合、勾配は極めて不安定になり、学習は難しくなるでしょう。幸運なことに、長期短期記憶ユニットとして知られるアイデアをRNNへ取り入れることができます。このユニットは<a href="http://dx.doi.org/10.1162/neco.1997.9.8.1735">1997年にHochreiterとSchmidhuber</a>が、勾配の不安定性に取り組むために考案したものです。LSTMにより、RNNの訓練で良い結果を簡単に得ることができるようになったため、最近の論文の多く（私が上でリンクした論文も含め）では、LSTMもしくは関連するアイデアを利用しています。</p><p><!--<strong>Deep belief nets, generative models, and Boltzmann machines:</strong>Modern interest in deep learning began in 2006, with papers explaininghow to train a type of neural network known as a <em>deep belief  network</em> (DBN)*<span class="marginnote">*See  <a href="http://www.cs.toronto.edu/&#126;hinton/absps/fastnc.pdf">A fast    learning algorithm for deep belief nets</a>, by Geoffrey Hinton,  Simon Osindero, and Yee-Whye Teh (2006), as well as the related work  in  <a href="http://www.sciencemag.org/content/313/5786/504.short">Reducing    the dimensionality of data with neural networks</a>, by Geoffrey  Hinton and Ruslan Salakhutdinov (2006).</span>.
--><strong>Deep belief nets, generative models, and Boltzmann machines:</strong> Modern interest in deep learning began in 2006, with papers explaining how to train a type of neural network known as a <em>deep belief network</em> (DBN)*<span class="marginnote">*See <a href="http://www.cs.toronto.edu/&#126;hinton/absps/fastnc.pdf">A fast learning algorithm for deep belief nets</a>, by Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh (2006), as well as the related work in <a href="http://www.sciencemag.org/content/313/5786/504.short">Reducing the dimensionality of data with neural networks</a>, by Geoffrey Hinton and Ruslan Salakhutdinov (2006).</span>. DBNs were influential for several years, but have since lessened in popularity, while models such as feedforward networks and recurrent neural nets have become fashionable. Despite this, DBNs have several properties that make them interesting.</p><p><!--
-->One reason DBNs are interesting is that they're an example of what's called a <em>generative model</em>. In a feedforward network, we specify the input activations, and they determine the activations of the feature neurons later in the network. A generative model like a DBN can be used in a similar way, but it's also possible to specify the values of some of the feature neurons and then "run the network backward", generating values for the input activations. More concretely, a DBN trained on images of handwritten digits can (potentially, and with some care) also be used to generate images that look like handwritten digits. In other words, the DBN would in some sense be learning to write. In this, a generative model is much like the human brain: not only can it read digits, it can also write them. In Geoffrey Hinton's memorable phrase, <a href="http://www.sciencedirect.com/science/article/pii/S0079612306650346">to recognize shapes, first learn to generate images</a>.</p><p>A second reason DBNs are interesting is that they can do unsupervised and semi-supervised learning. For instance, when trained with image data, DBNs can learn useful features for understanding other images, even if the training images are unlabelled. And the ability to do unsupervised learning is extremely interesting both for fundamental scientific reasons and, if it can be made to work well enough, for practical applications.</p><p><!--
-->Given these attractive features, why have DBNs lessened in popularity as models for deep learning? Part of the reason is that models such as feedforward and recurrent nets have achieved many spectacular results, such as their breakthroughs on image and speech recognition benchmarks. It's not surprising and quite right that there's now lots of attention being paid to these models. There's an unfortunate corollary, however. The marketplace of ideas often functions in a winner-take-all fashion, with nearly all attention going to the current fashion-of-the-moment in any given area. It can become extremely difficult for people to work on momentarily unfashionable ideas, even when those ideas are obviously of real long-term interest. My personal opinion is that DBNs and other generative models likely deserve more attention than they are currently receiving. And I won't be surprised if DBNs or a related model one day surpass the currently fashionable models. For an introduction to DBNs, see <a href="http://www.scholarpedia.org/article/Deep_belief_networks">this overview</a>. I've also found <a href="http://www.cs.toronto.edu/&#126;hinton/absps/guideTR.pdf">this article</a> helpful. It isn't primarily about deep belief nets, <em>per se</em>, but does contain much useful information about restricted Boltzmann machines, which are a key component of DBNs.</p><p><!--
--><strong>Other ideas:</strong> What else is going on in neural networks and deep learning? Well, there's a huge amount of other fascinating work. Active areas of research include using neural networks to do <a href="http://machinelearning.org/archive/icml2008/papers/391.pdf">natural language processing</a> (see <a href="http://arxiv.org/abs/1103.0398">also this informative review paper</a>), <a href="assets/MachineTranslation.pdf">machine translation</a>, as well as perhaps more surprising applications such as <a href="http://yann.lecun.com/exdb/publis/pdf/humphrey-jiis-13.pdf">music informatics</a>. There are, of course, many other areas too. In many cases, having read this book you should be able to begin following recent work, although (of course) you'll need to fill in gaps in presumed background knowledge.</p><p><!--
-->Let me finish this section by mentioning a particularly fun paper. It combines deep convolutional networks with a technique known as reinforcement learning in order to learn to <a href="http://www.cs.toronto.edu/&#126;vmnih/docs/dqn.pdf">play video games well</a> (see also <a href="http://www.nature.com/nature/journal/v518/n7540/abs/nature14236.html">this followup</a>). The idea is to use the convolutional network to simplify the pixel data from the game screen, turning it into a simpler set of features, which can be used to decide which action to take: "go left", "go down", "fire", and so on. What is particularly interesting is that a single network learned to play seven different classic video games pretty well, outperforming human experts on three of the games. Now, this all sounds like a stunt, and there's no doubt the paper was well marketed, with the title "Playing Atari with Deep Reinforcement Learning". But looking past the surface gloss, consider that this system is taking raw pixel data (it doesn't even know the game rules!) and from that data learning to do high-quality decision-making in several very different and very adversarial environments, each with its own complex set of rules. That's pretty neat.</p><p><h3><a name="on_the_future_of_neural_networks"></a><a href="#on_the_future_of_neural_networks">On the future of neural networks</a></h3></p><p><!--
Historically,computers have often been, like the confused student, in the darkabout what their users mean.  But this is changing.  I still remembermy surprise the first time I misspelled a Google search query, only tohave Google say "Did you mean [corrected query]?" and to offer thecorresponding search results.  Google CEO Larry Page<a href="http://googleblog.blogspot.ca/2012/08/building-search-engine-of-future-one.html">once  described the perfect search engine as understanding exactly what  [your queries] mean and giving you back exactly what you want</a>.--><strong>意図で駆動するユーザインタフェース：</strong>気短な教授が、どうすればよいか分かっていない学生に次のように伝える古いジョークがあります。「私が言うことを聞かなくていい。私の<em>意図</em>を読み取ってくれ」と。昔からコンピュータは、ユーザの意図が分からず戸惑っている学生のような存在でした。しかし、現在変化しつつあります。私が初めてGoogle検索でスペルを間違えてしまった時に、Googleが "Did you mean [正しいクエリ]?"と伝えてきて、対応する検索結果を提示したときの驚きをまだ覚えています。GoogleのCEOのLarry Pageは、<a href="http://googleblog.blogspot.ca/2012/08/building-search-engine-of-future-one.html">あなたの検索の意味を正確に理解し、まさに望むものを提供する完璧な検索エンジンについて記述</a>を残しています。</p><p><!--This is a vision of an <em>intention-driven user interface</em>.  Inthis vision, instead of responding to users' literal queries, searchwill use machine learning to take vague user input, discern preciselywhat was meant, and take action on the basis of those insights.-->これは<em>意図駆動のユーザインタフェース</em>の理想像です。この理想像では、ユーザによる文字ベースのクエリの代わりに、ぼんやりとした入力からユーザの意図を機械学習で識別して、その意図に対して機能を提供するのです。</p><p><!--The idea of intention-driven interfaces can be applied far morebroadly than search.  Over the next few decades, thousands ofcompanies will build products which use machine learning to make userinterfaces that can tolerate imprecision, while discerning and actingon the user's true intent.  We're already seeing early examples ofsuch intention-driven interfaces: Apple's Siri; Wolfram Alpha; IBM'sWatson; systems which can<a href="http://arxiv.org/abs/1411.4555">annotate photos and videos</a>; andmuch more.
-->The idea of intention-driven interfaces can be applied far more broadly than search. Over the next few decades, thousands of companies will build products which use machine learning to make user interfaces that can tolerate imprecision, while discerning and acting on the user's true intent. We're already seeing early examples of such intention-driven interfaces: Apple's Siri; Wolfram Alpha; IBM's Watson; systems which can <a href="http://arxiv.org/abs/1411.4555">annotate photos and videos</a>; and much more.</p><p>Most of these products will fail. Inspired user interface design is hard, and I expect many companies will take powerful machine learning technology and use it to build insipid user interfaces. The best machine learning in the world won't help if your user interface concept stinks. But there will be a residue of products which succeed. Over time that will cause a profound change in how we relate to computers. Not so long ago (let's say, 2005) users took it for granted that they needed precision in most interactions with computers. Indeed, computer literacy to a great extent meant internalizing the idea that computers are extremely literal; a single misplaced semi-colon may completely change the nature of an interaction with a computer. But over the next few decades I expect we'll develop many successful intention-driven user interfaces, and that will dramatically change what we expect when interacting with computers.</p><p><!--
--><strong>Machine learning, data science, and the virtuous circle of innovation:</strong> Of course, machine learning isn't just being used to build intention-driven interfaces. Another notable application is in data science, where machine learning is used to find the "known unknowns" hidden in data. This is already a fashionable area, and much has been written about it, so I won't say much. But I do want to mention one consequence of this fashion that is not so often remarked: over the long run it's possible the biggest breakthrough in machine learning won't be any single conceptual breakthrough. Rather, the biggest breakthrough will be that machine learning research becomes profitable, through applications to data science and other areas. If a company can invest 1 dollar in machine learning research and get 1 dollar and 10 cents back reasonably rapidly, then a lot of money will end up in machine learning research. Put another way, machine learning is an engine driving the creation of several major new markets and areas of growth in technology. The result will be large teams of people with deep subject expertise, and with access to extraordinary resources. That will propel machine learning further forward, creating more markets and opportunities, a virtuous circle of innovation.</p><p><strong>The role of neural networks and deep learning:</strong> I've been talking broadly about machine learning as a creator of new opportunities for technology. What will be the specific role of neural networks and deep learning in all this?</p><p><!--
Back in the1980s there was a great deal of excitement and optimism about neuralnetworks, especially after backpropagation became widely known.  Thatexcitement faded, and in the 1990s the machine learning baton passedto other techniques, such as support vector machines.  Today, neuralnetworks are again riding high, setting all sorts of records,defeating all comers on many problems.  But who is to say thattomorrow some new approach won't be developed that sweeps neuralnetworks away again?  Or perhaps progress with neural networks willstagnate, and nothing will immediately arise to take their place?-->この疑問に答えるには、歴史を振り返るのが助けになるかもしれません。1980年代は多くの実験がなされており、ニューラルネットワークに対する楽観的な見方が広がっていました。逆伝播が広く知られるようになってからは、特に楽観的でした。実験が減っていき、1990年代に入って、他のサポートベクターマシンなどのテクニックにバトンを渡しました。今日、ニューラルネットワークは再び、勢いに乗っています。様々な記録を打ち立て、多くの問題に対する他の解決法を負かしました。しかし、明日には別の新しいアプローチが登場して、ニューラルネットワークを再び淘汰する可能性を誰が否定できるでしょうか？もしくは、代替技術がなくても、そのうちニューラルネットワークの進化は停滞するのではないでしょうか？</p><p><!--For this reason, it's much easier to think broadly about the future ofmachine learning than about neural networks specifically.  Part of theproblem is that we understand neural networks so poorly.  Why is itthat neural networks can generalize so well?  How is it that theyavoid overfitting as well as they do, given the very large number ofparameters they learn?  Why is it that stochastic gradient descentworks as well as it does?  How well will neural networks perform asdata sets are scaled?  For instance, if ImageNet was expanded by afactor of $10$, would neural networks' performance improve more orless than other machine learning techniques?  These are all simple,fundamental questions.  And, at present, we understand the answers tothese questions very poorly.  While that's the case, it's difficult tosay what role neural networks will play in the future of machinelearning.
-->For this reason, it's much easier to think broadly about the future of machine learning than about neural networks specifically. Part of the problem is that we understand neural networks so poorly. Why is it that neural networks can generalize so well? How is it that they avoid overfitting as well as they do, given the very large number of parameters they learn? Why is it that stochastic gradient descent works as well as it does? How well will neural networks perform as data sets are scaled? For instance, if ImageNet was expanded by a factor of $10$, would neural networks' performance improve more or less than other machine learning techniques? These are all simple, fundamental questions. And, at present, we understand the answers to these questions very poorly. While that's the case, it's difficult to say what role neural networks will play in the future of machine learning.</p><p>I will make one prediction: I believe deep learning is here to stay. The ability to learn hierarchies of concepts, building up multiple layers of abstraction, seems to be fundamental to making sense of the world. This doesn't mean tomorrow's deep learners won't be radically different than today's. We could see major changes in the constituent units used, in the architectures, or in the learning algorithms. Those changes may be dramatic enough that we no longer think of the resulting systems as neural networks. But they'd still be doing deep learning.</p><p><a name="AI"></a></p><p><!--
--><strong>Will neural networks and deep learning soon lead to artificial intelligence?</strong> In this book we've focused on using neural nets to do specific tasks, such as classifying images. Let's broaden our ambitions, and ask: what about general-purpose thinking computers? Can neural networks and deep learning help us solve the problem of (general) artificial intelligence (AI)? And, if so, given the rapid recent progress of deep learning, can we expect general AI any time soon?</p><p>Addressing these questions comprehensively would take a separate book. Instead, let me offer one observation. It's based on an idea known as <a href="http://en.wikipedia.org/wiki/Conway%27s_law">Conway's law</a>:<blockquote>  Any organization that designs a system... will inevitably produce a design whose structure is a copy of the organization's communication structure.</blockquote>So, for example, Conway's law suggests that the design of a Boeing 747 aircraft will mirror the extended organizational structure of Boeing and its contractors at the time the 747 was designed. Or for a simple, specific example, consider a company building a complex software application. If the application's dashboard is supposed to be integrated with some machine learning algorithm, the person building the dashboard better be talking to the company's machine learning expert. Conway's law is merely that observation, writ large.</p><p><!--
-->Upon first hearing Conway's law, many people respond either "Well, isn't that banal and obvious?" or "Isn't that wrong?" Let me start with the objection that it's wrong. As an instance of this objection, consider the question: where does Boeing's accounting department show up in the design of the 747? What about their janitorial department? Their internal catering? And the answer is that these parts of the organization probably don't show up explicitly anywhere in the 747. So we should understand Conway's law as referring only to those parts of an organization concerned explicitly with design and engineering.</p><p>What about the other objection, that Conway's law is banal and obvious? This may perhaps be true, but I don't think so, for organizations too often act with disregard for Conway's law. Teams building new products are often bloated with legacy hires or, contrariwise, lack a person with some crucial expertise. Think of all the products which have useless complicating features. Or think of all the products which have obvious major deficiencies, such as a terrible user interface. Problems in both classes are often caused by a mismatch between the team that was needed to produce a good product, and the team that was actually assembled. Conway's law may be obvious, but that doesn't mean people don't routinely ignore it.</p><p><!--
Indeed, we'renot even sure what basic questions to be asking.  In others words, atthis point AI is more a problem of science than of engineering.Imagine beginning the design of the 747 without knowing about jetengines or the principles of aerodynamics.  You wouldn't know whatkinds of experts to hire into your organization.  As Wernher von Braunput it, "basic research is what I'm doing when I don't know what I'mdoing".  Is there a version of Conway's law that applies to problemswhich are more science than engineering?-->コンウェイの法則は、私たちが構成要素とその組立て方をよく理解している、システムの設計とエンジニアリングに適用されます。人工知能の開発に直接には適用できません。なぜかと言うと、AIはまだそのような問題ではないからです。どんな構成要素からなるのか知りません。実際、AIに関する基礎的な問いすら分かっていないのです。つまり、現時点でAIは工学の問題というよりは、科学の領域の問題と言えます。ジェットエンジンや空気力学の原理を知らない状態で、747を設計する状況を想像してください。あなたは、どんな種類の専門家を組織に雇えばいいか分からいないでしょう。Wernher von Braunは「何をしているのか分かっていない場合には、基礎研究こそ行うべきである」と述べています。さて、工学ではなく科学の問題に適用できるバージョンのコンウェイの法則というのはあるのでしょうか？</p><p><!--To gain insight into this question, consider the history of medicine.In the early days, medicine was the domain of practitioners like Galenand Hippocrates, who studied the entire body.  But as our knowledgegrew, people were forced to specialize.  We discovered many deep newideas*<span class="marginnote">*My apologies for overloading "deep".  I won't define  "deep ideas" precisely, but loosely I mean the kind of idea which  is the basis for a rich field of enquiry.  
-->To gain insight into this question, consider the history of medicine. In the early days, medicine was the domain of practitioners like Galen and Hippocrates, who studied the entire body. But as our knowledge grew, people were forced to specialize. We discovered many deep new ideas*<span class="marginnote">*My apologies for overloading "deep". I won't define "deep ideas" precisely, but loosely I mean the kind of idea which is the basis for a rich field of enquiry. The backpropagation algorithm and the germ theory of disease are both good examples.</span>: think of things like the germ theory of disease, for instance, or the understanding of how antibodies work, or the understanding that the heart, lungs, veins and arteries form a complete cardiovascular system. Such deep insights formed the basis for subfields such as epidemiology, immunology, and the cluster of inter-linked fields around the cardiovascular system. And so the structure of our knowledge has shaped the social structure of medicine. This is particularly striking in the case of immunology: realizing the immune system exists and is a system worthy of study is an extremely non-trivial insight. So we have an entire field of medicine, with specialists, conferences, even prizes, and so on, organized around something which is not just invisible, it's arguably not a distinct thing at all.</p><p><!--
-->This is a common pattern that has been repeated in many well-established sciences: not just medicine, but physics, mathematics, chemistry, and others. The fields start out monolithic, with just a few deep ideas. Early experts can master all those ideas. But as time passes that monolithic character changes. We discover many deep new ideas, too many for any one person to really master. As a result, the social structure of the field re-organizes and divides around those ideas. Instead of a monolith, we have fields within fields within fields, a complex, recursive, self-referential social structure, whose organization mirrors the connections between our deepest insights. <em>And so the structure of our knowledge shapes the social organization of science. But that social shape in turn constrains and helps determine what we can discover.</em> This is the scientific analogue of Conway's law.</p><p>So what's this got to do with deep learning or AI?</p><p><!--
-->Well, since the early days of AI there have been arguments about it that go, on one side, "Hey, it's not going to be so hard, we've got [super-special weapon] on our side", countered by "[super-special weapon] won't be enough". Deep learning is the latest super-special weapon I've heard used in such arguments*<span class="marginnote">*Interestingly, often not by leading experts in deep learning, who have been quite restrained. See, for example, this <a href="https://www.facebook.com/yann.lecun/posts/10152348155137143">thoughtful post</a> by Yann LeCun. This is a difference from many earlier incarnations of the argument.</span>; earlier versions of the argument used logic, or Prolog (a non-procedural programming language), or expert systems, or whatever the most powerful technique of the day was. The problem with such arguments is that they don't give you any good way of saying just how powerful any given candidate super-special weapon is. Of course, we've just spent a chapter reviewing evidence that deep learning can solve extremely challenging problems. It certainly looks very exciting and promising. But that was also true of systems like Prolog or <a href="http://en.wikipedia.org/wiki/Eurisko">Eurisko</a> or expert systems in their day. And so the mere fact that a set of ideas looks very promising doesn't mean much. How can we tell if deep learning is truly different from these earlier ideas? Is there some way of measuring how powerful and promising a set of ideas is? Conway's law suggests that as a rough and heuristic proxy metric we can evaluate the complexity of the social structure associated to those ideas.</p><p><!--
-->So, there are two questions to ask. First, how powerful a set of ideas are associated to deep learning, according to this metric of social complexity? Second, how powerful a theory will we need, in order to be able to build a general artificial intelligence?</p><p>As to the first question: when we look at deep learning today, it's an exciting and fast-paced but also relatively monolithic field. There are a few deep ideas, and a few main conferences, with substantial overlap between several of the conferences. And there is paper after paper leveraging the same basic set of ideas: using stochastic gradient descent (or a close variation) to optimize a cost function. It's fantastic those ideas are so successful. But what we don't yet see is lots of well-developed subfields, each exploring their own sets of deep ideas, pushing deep learning in many directions. And so, according to the metric of social complexity, deep learning is, if you'll forgive the play on words, still a rather shallow field. It's still possible for one person to master most of the deepest ideas in the field.</p><p><!--
And so Conway's law suggests that to get to such a pointwe will necessarily see the emergence of many interrelatingdisciplines, with a complex and surprising structure mirroring thestructure in our deepest insights.  We don't yet see this rich socialstructure in the use of neural networks and deep learning.  And so, Ibelieve that we are several decades (at least) from using deeplearning to develop general AI.-->On the second question: how complex and powerful a set of ideas will be needed to obtain AI? Of course, the answer to this question is: no-one knows for sure. But in the <a href="sai.html">appendix</a> I examine some of the existing evidence on this question. I conclude that, even rather optimistically, it's going to take many, many deep ideas to build an AI. And so Conway's law suggests that to get to such a point we will necessarily see the emergence of many interrelating disciplines, with a complex and surprising structure mirroring the structure in our deepest insights. We don't yet see this rich social structure in the use of neural networks and deep learning. And so, I believe that we are several decades (at least) from using deep learning to develop general AI.</p><p><!--I've gone to a lot of trouble to construct an argument which istentative, perhaps seems rather obvious, and which has an indefiniteconclusion.  This will no doubt frustrate people who crave certainty.Reading around online, I see many people who loudly assert verydefinite, very strongly held opinions about AI, often on the basis offlimsy reasoning and non-existent evidence.  My frank opinion is this:it's too early to say.  As the old joke goes, if you ask a scientisthow far away some discovery is and they say "10 years" (or more),what they mean is "I've got no idea".  AI, like controlled fusionand a few other technologies, has been 10 years away for 60 plusyears.  On the flipside, what we definitely do have in deep learningis a powerful technique whose limits have not yet been found, and manywide-open fundamental problems.  That's an exciting creativeopportunity.
-->I've gone to a lot of trouble to construct an argument which is tentative, perhaps seems rather obvious, and which has an indefinite conclusion. This will no doubt frustrate people who crave certainty. Reading around online, I see many people who loudly assert very definite, very strongly held opinions about AI, often on the basis of flimsy reasoning and non-existent evidence. My frank opinion is this: it's too early to say. As the old joke goes, if you ask a scientist how far away some discovery is and they say "10 years" (or more), what they mean is "I've got no idea". AI, like controlled fusion and a few other technologies, has been 10 years away for 60 plus years. On the flipside, what we definitely do have in deep learning is a powerful technique whose limits have not yet been found, and many wide-open fundamental problems. That's an exciting creative opportunity.</p><p></div><div class="footer"> <span class="left_footer"> In academic work,  please cite this book as: Michael A. Nielsen, "Neural Networks and  Deep Learning", Determination Press, 2015  <br/>  <br/>  This work is licensed under a <a rel="license"  href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"  style="color: #eee;">Creative Commons Attribution-NonCommercial 3.0  Unported License</a>.  This means you're free to copy, share, and  build on this book, but not to sell it.  If you're interested in  commercial use, please <a  href="mailto:mn@michaelnielsen.org">contact me</a>.  </span>  <span class="right_footer">Last update: Thu Jan 19 06:09:48 2017
<br/>
<br/>
<br/>
<a rel="license" href="http://creativecommons.org/licenses/by-nc/3.0/deed.en_GB"><img alt="Creative Commons Licence" style="border-width:0" src="http://i.creativecommons.org/l/by-nc/3.0/88x31.png" /></a>
</span>
</div>
</body>
</html>