Add files via upload
PeichenHan authored Dec 18, 2022
1 parent 68c6837 commit 3a3b02f
Showing 1 changed file with 72 additions and 0 deletions.
@@ -0,0 +1,72 @@
<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><title>Deep Q-Learning with a simple Atari game</title><style>
* {
font-family: Georgia, Cambria, "Times New Roman", Times, serif;
}
html, body {
margin: 0;
padding: 0;
}
h1 {
font-size: 50px;
margin-bottom: 17px;
color: #333;
}
h2 {
font-size: 24px;
line-height: 1.6;
margin: 30px 0 0 0;
margin-bottom: 18px;
margin-top: 33px;
color: #333;
}
h3 {
font-size: 30px;
margin: 10px 0 20px 0;
color: #333;
}
header {
width: 640px;
margin: auto;
}
section {
width: 640px;
margin: auto;
}
section p {
margin-bottom: 27px;
font-size: 20px;
line-height: 1.6;
color: #333;
}
section img {
max-width: 640px;
}
footer {
padding: 0 20px;
margin: 50px 0;
text-align: center;
font-size: 12px;
}
.aspectRatioPlaceholder {
max-width: auto !important;
max-height: auto !important;
}
.aspectRatioPlaceholder-fill {
padding-bottom: 0 !important;
}
header,
section[data-field=subtitle],
section[data-field=description] {
display: none;
}
</style></head><body><article class="h-entry">
<header>
<h1 class="p-name">Deep Q-Learning with a simple Atari game</h1>
</header>
<section data-field="subtitle" class="p-summary">
Using the DQN method to teach a computer to play a classic Atari game: Space Invaders
</section>
<section data-field="body" class="e-content">
<section name="e079" class="section section--body section--first section--last"><div class="section-divider"><hr class="section-divider"></div><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><h3 name="ba4c" id="ba4c" class="graf graf--h3 graf--leading graf--title">Deep Q-Learning with a simple Atari game</h3><h4 name="aa7b" id="aa7b" class="graf graf--h4 graf-after--h3 graf--subtitle">Using the DQN method to teach a computer to play a classic Atari game: Space Invaders</h4><figure name="74c6" id="74c6" class="graf graf--figure graf-after--h4"><img class="graf-image" data-image-id="1*aBoBJKGUt8wcXopJv2xoMw.png" data-width="597" data-height="396" src="https://cdn-images-1.medium.com/max/800/1*aBoBJKGUt8wcXopJv2xoMw.png"><figcaption class="imageCaption">Space Invaders</figcaption></figure><h3 name="bd3a" id="bd3a" class="graf graf--h3 graf-after--figure">Reinforcement Learning and Games</h3><p name="f16d" id="f16d" class="graf graf--p graf-after--h3">As one of the three major machine learning paradigms, reinforcement learning is distinctive in having a biological and psychological basis, and it builds on control theory and statistics.</p><p name="302b" id="302b" class="graf graf--p graf-after--p">Reinforcement learning treats learning as a process of trial and evaluation. The agent chooses an action, the environment accepts it and generates a reinforcement signal (a reward or a punishment) that is fed back to the agent. After receiving the signal, the agent adjusts its strategy and chooses the next action. This process repeats continuously, and ideally the agent ends up choosing good actions consistently.</p><p name="f7ad" id="f7ad" class="graf graf--p graf-after--p">This trial-and-feedback loop makes games a very natural application of reinforcement learning. 
An important reason is that a game can quickly generate a large amount of naturally labeled (state, action, reward) data, which is high-quality training material for reinforcement learning.</p><figure name="e39f" id="e39f" class="graf graf--figure graf-after--p"><img class="graf-image" data-image-id="1*ESgdHKJuhAoR7HoRxXG0VA.png" data-width="616" data-height="247" src="https://cdn-images-1.medium.com/max/800/1*ESgdHKJuhAoR7HoRxXG0VA.png"><figcaption class="imageCaption">Reinforcement method: Q-Learning</figcaption></figure><h3 name="df7a" id="df7a" class="graf graf--h3 graf-after--figure">Why DQN?</h3><p name="1804" id="1804" class="graf graf--p graf-after--h3">As a game becomes more complex, more information is needed to describe its current situation. The corresponding state space becomes very large and most states are rarely observed, so estimating a Q table takes a long time and is difficult to converge. Moreover, we would also like to estimate Q values for the large number of possible states that are never observed. 
This is the problem DQN solves: neural networks are very good at extracting useful features from structured data, so we can use one to approximate the Q function.</p><p name="564f" id="564f" class="graf graf--p graf-after--p">At the same time, hand-defined states may miss elements that also affect rewards. Using all of the pixel information in the frame as the state avoids this, because the frame contains everything visible in the scene.</p><h3 name="bf58" id="bf58" class="graf graf--h3 graf-after--p">Environment</h3><p name="2357" id="2357" class="graf graf--p graf-after--h3"><strong class="markup--strong markup--p-strong">Game Name :</strong> Space Invaders</p><p name="bce2" id="bce2" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Game Version :</strong> gym.atari.SpaceInvaders-v4 (this version never repeats the previous action; it only executes the action given by the agent)</p><p name="086a" id="086a" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">State : </strong>3 stacked frames of 3-channel pixel data (3 x 210 x 160 x 3)</p><p name="0a61" id="0a61" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Action :</strong> [‘NOOP’, ‘FIRE’, ‘RIGHT’, ‘LEFT’, ‘RIGHTFIRE’, ‘LEFTFIRE’]</p><h3 name="0b27" id="0b27" class="graf graf--h3 graf-after--p">Q-Network</h3><figure name="bbe4" id="bbe4" class="graf graf--figure graf-after--h3"><img class="graf-image" data-image-id="1*rdm51z42TI03_GWsAKxfbA.png" data-width="1275" data-height="452" src="https://cdn-images-1.medium.com/max/800/1*rdm51z42TI03_GWsAKxfbA.png"><figcaption class="imageCaption"><strong class="markup--strong markup--figure-strong">Q-Network model structure</strong></figcaption></figure><p name="01a4" id="01a4" class="graf graf--p graf-after--figure">The main body of this Q-Network consists of three Conv2D convolutional layers and two fully connected layers, joined by a Flatten layer. 
Their details are shown in the figure above, and the relevant code is as follows.</p><figure name="3f58" id="3f58" class="graf graf--figure graf-after--p"><img class="graf-image" data-image-id="1*7BDYskoEI4FYmdX37IDXTQ.png" data-width="670" data-height="246" src="https://cdn-images-1.medium.com/max/800/1*7BDYskoEI4FYmdX37IDXTQ.png"></figure><figure name="5235" id="5235" class="graf graf--figure graf-after--figure"><img class="graf-image" data-image-id="1*Fi8vtNbLWUDohApvmtxRnw.png" data-width="829" data-height="601" src="https://cdn-images-1.medium.com/max/800/1*Fi8vtNbLWUDohApvmtxRnw.png"><figcaption class="imageCaption">Network Summary</figcaption></figure><p name="adbe" id="adbe" class="graf graf--p graf-after--figure">The network takes the pixel frames as input and outputs one Q value per action. It also has enough trainable parameters to capture the various factors that influence the Q value.</p><h3 name="122b" id="122b" class="graf graf--h3 graf-after--p">DQN Agent and Hyperparameters</h3><p name="c0d7" id="c0d7" class="graf graf--p graf-after--h3">The agent in this experiment is built directly with the DQNAgent class from the rl.agents package.</p><pre data-code-block-mode="2" spellcheck="false" data-code-block-lang="python" name="a707" id="a707" class="graf graf--pre graf-after--p graf--preV2"><span class="pre--content"><span class="hljs-comment"># Imports assume keras-rl2 on top of tf.keras; adjust them to your installation.</span><br /><span class="hljs-keyword">from</span> rl.agents <span class="hljs-keyword">import</span> DQNAgent<br /><span class="hljs-keyword">from</span> rl.memory <span class="hljs-keyword">import</span> SequentialMemory<br /><span class="hljs-keyword">from</span> rl.policy <span class="hljs-keyword">import</span> EpsGreedyQPolicy, LinearAnnealedPolicy<br /><span class="hljs-keyword">from</span> tensorflow.keras.optimizers <span class="hljs-keyword">import</span> Adam<br /><br /><span class="hljs-comment"># epsilon, min_epsilon, max_steps, Gamma and learning_rate are module-level hyperparameters (see the list below).</span><br /><span class="hljs-keyword">def</span> <span class="hljs-title function_">build_agent</span>(<span class="hljs-params">model, actions</span>):<br /> <span class="hljs-comment"># Anneal epsilon linearly from value_max to value_min over nb_steps.</span><br /> policy = LinearAnnealedPolicy(EpsGreedyQPolicy(), attr=<span class="hljs-string">&#x27;eps&#x27;</span>, <br /> value_max=epsilon, value_min=min_epsilon,<br /> value_test=<span class="hljs-number">.2</span>, nb_steps=max_steps)<br /> <span class="hljs-comment"># Replay memory; each state is a window of the last 3 frames.</span><br /> memory = SequentialMemory(limit=<span class="hljs-number">1000</span>, window_length=<span class="hljs-number">3</span>)<br /> dqn = DQNAgent(model=model, memory=memory, policy=policy,<br /> 
enable_dueling_network=<span class="hljs-literal">True</span>, dueling_type=<span class="hljs-string">&#x27;avg&#x27;</span>, <br /> nb_actions=actions, nb_steps_warmup=<span class="hljs-number">2000</span>,<br /> gamma=Gamma<br /> )<br /> <span class="hljs-keyword">return</span> dqn<br />dqn = build_agent(model, actions)<br />dqn.<span class="hljs-built_in">compile</span>(Adam(lr=learning_rate))</span></pre><p name="f8c4" id="f8c4" class="graf graf--p graf-after--pre"><strong class="markup--strong markup--p-strong">Policy : </strong>Epsilon-Greedy Policy</p><p name="b651" id="b651" class="graf graf--p graf-after--p">Select a random action with probability epsilon, and the greedy (highest Q value) action with probability 1 - epsilon. In the code above, epsilon is annealed linearly from its maximum to its minimum value over max_steps. Basic but effective.</p><p name="a1fd" id="a1fd" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Memory :</strong> size = 1000, window length = 3</p><p name="2074" id="2074" class="graf graf--p graf-after--p">The hyperparameters related to this model are as follows.</p><p name="6cdb" id="6cdb" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">total_episodes : </strong>10000 (limited by my computer; this is not large, but it is enough to make some progress)</p><p name="c35a" id="c35a" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">total_test_episodes : </strong>10 (test 10 times after training)</p><p name="21f8" id="21f8" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">max_steps : </strong>10000 (maximum number of steps)</p><p name="64ac" id="64ac" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">learning_rate : </strong>0.01 (a lower learning rate reduces the risk of overshooting a minimum, but it also takes longer to converge)</p><p name="3a96" id="3a96" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Gamma :</strong> 0.99 (a high discount factor, so future rewards count almost as much as immediate ones)</p><p name="42e3" 
id="42e3" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Max_epsilon : </strong>1</p><p name="99ad" id="99ad" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Min_epsilon :</strong> 0.1</p><p name="2f36" id="2f36" class="graf graf--p graf--empty graf-after--p"><br></p><h3 name="034d" id="034d" class="graf graf--h3 graf-after--p">Result</h3><p name="258a" id="258a" class="graf graf--p graf-after--h3">Because the number of training steps is small, it is difficult to evaluate this model by its high scores or its stability.</p><p name="cb6b" id="cb6b" class="graf graf--p graf-after--p">At the same time, because this game has six actions, it is difficult to achieve a good score by choosing actions randomly.</p><p name="dd18" id="dd18" class="graf graf--p graf-after--p">So I show that the agent did learn something in this experiment by comparing it with a random action-selection strategy. The results are as follows.</p><figure name="7c1e" id="7c1e" class="graf graf--figure graf-after--p"><img class="graf-image" data-image-id="1*YePM1_lOe1cmrmX-trzsIw.png" data-width="955" data-height="654" src="https://cdn-images-1.medium.com/max/800/1*YePM1_lOe1cmrmX-trzsIw.png"></figure><p name="a291" id="a291" class="graf graf--p graf-after--figure">Most of the time, the trained agent scores better than the random strategy.</p><p name="4003" id="4003" class="graf graf--p graf-after--p">The average scores support this.</p><p name="aa3d" id="aa3d" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">DQN :</strong> 217</p><p name="81cf" id="81cf" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">Random :</strong> 146.5</p><h3 name="74ba" id="74ba" class="graf graf--h3 graf-after--p">Future Work</h3><p name="020f" id="020f" class="graf graf--p graf-after--h3"><strong class="markup--strong markup--p-strong">1 : More training 
steps</strong></p><p name="14ba" id="14ba" class="graf graf--p graf-after--p">Restricted by the experimental conditions, I only performed 10,000 steps of training this time. I believe that more training steps would give better results.</p><p name="27f2" id="27f2" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">2 : More reasonable hyperparameter values</strong></p><p name="302a" id="302a" class="graf graf--p graf-after--p">Because the number of training steps is too small, the effect of the hyperparameters cannot be seen clearly. As the number of training steps increases, the impact of each hyperparameter on the results will become more significant, and the values should be adjusted accordingly.</p><p name="86ea" id="86ea" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">3 : More customized agents and environments</strong></p><p name="03ef" id="03ef" class="graf graf--p graf-after--p">In this experiment, I used the agent and environment from the packages as they are and only changed some hyperparameters, which may be one reason the results are not ideal. Customizing the agent and environment in the next experiment may help. 
</p><h3 name="fe68" id="fe68" class="graf graf--h3 graf-after--p">References</h3><p name="49bc" id="49bc" class="graf graf--p graf-after--h3"><strong class="markup--strong markup--p-strong">[1]:</strong> Deep learning guide <a href="https://zhuanlan.zhihu.com/p/498713060" data-href="https://zhuanlan.zhihu.com/p/498713060" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">https://zhuanlan.zhihu.com/p/498713060</a></p><p name="3b35" id="3b35" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">[2]:</strong> Dense layer in Keras <a href="https://blog.csdn.net/weixin_44551646/article/details/112911215" data-href="https://blog.csdn.net/weixin_44551646/article/details/112911215" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">https://blog.csdn.net/weixin_44551646/article/details/112911215</a></p><p name="b853" id="b853" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">[3]:</strong> Deep Reinforcement Learning for Atari Games Python Tutorial | AI Plays Space Invaders <a href="https://www.youtube.com/watch?v=hCeJeq8U0lo" data-href="https://www.youtube.com/watch?v=hCeJeq8U0lo" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">https://www.youtube.com/watch?v=hCeJeq8U0lo</a></p><p name="20c3" id="20c3" class="graf graf--p graf-after--p"><strong class="markup--strong markup--p-strong">[4]:</strong> About installing gym.Atari on Windows 10 <a href="https://zhuanlan.zhihu.com/p/523895071" data-href="https://zhuanlan.zhihu.com/p/523895071" class="markup--anchor markup--p-anchor" rel="nofollow noopener" target="_blank">https://zhuanlan.zhihu.com/p/523895071</a></p><h3 name="3b5a" id="3b5a" class="graf graf--h3 graf-after--p">MIT License</h3><p name="124b" id="124b" class="graf 
graf--p graf-after--h3">Copyright &lt;2022&gt; Peichen Han</p><p name="9c39" id="9c39" class="graf graf--p graf-after--p">Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:</p><p name="347c" id="347c" class="graf graf--p graf-after--p">The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.</p><p name="2135" id="2135" class="graf graf--p graf-after--p">THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.</p><p name="131f" id="131f" class="graf graf--p graf--empty graf-after--p"><br></p><p name="d071" id="d071" class="graf graf--p graf--empty graf-after--p"><br></p><p name="dd14" id="dd14" class="graf graf--p graf--empty graf-after--p"><br></p><p name="fe48" id="fe48" class="graf graf--p graf--empty graf-after--p graf--trailing"><br></p></div></div></section>
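<section class="section section--body"><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><p class="graf graf--p">As a supplement to the "Why DQN" section: the Q table that DQN replaces is updated one (state, action, reward, next state) sample at a time. The following is a minimal sketch of that tabular update, not code from this experiment. The function name q_learning_update, the step size alpha = 0.1, and the string state labels are assumptions made for illustration; gamma = 0.99 and the six actions come from the sections above.</p>
<pre data-code-block-mode="2" spellcheck="false" data-code-block-lang="python" class="graf graf--pre graf-after--p graf--preV2"><span class="pre--content">from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """Apply one tabular Q-learning update in place and return the TD target."""
    # Best value reachable from the next state (zero if the episode ended).
    best_next = 0.0 if done else max(q_table[next_state])
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q_table[state][action] += alpha * (target - q_table[state][action])
    return target


n_actions = 6  # number of Space Invaders actions listed in the Environment section
# Hypothetical table: unseen states start with all-zero Q values.
q_table = defaultdict(lambda: [0.0] * n_actions)
q_learning_update(q_table, "s0", 1, reward=5.0, next_state="s1")
print(q_table["s0"])
</span></pre>
<p class="graf graf--p graf-after--pre">With a pixel state of shape 3 x 210 x 160 x 3, a table keyed by state would almost never see the same key twice, which is why the network is used to approximate the Q function instead.</p></div></div></section>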
</section>
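<section class="section section--body"><div class="section-content"><div class="section-inner sectionLayout--insetColumn"><p class="graf graf--p">As a supplement to the "DQN Agent and Hyperparameters" section: the sketch below shows, in plain Python, roughly what a linearly annealed epsilon-greedy policy does. It is an illustration under stated assumptions, not the keras-rl implementation. The function names annealed_epsilon and select_action and the example Q values are made up; the defaults mirror Max_epsilon = 1, Min_epsilon = 0.1 and max_steps = 10000 from the hyperparameter list.</p>
<pre data-code-block-mode="2" spellcheck="false" data-code-block-lang="python" class="graf graf--pre graf-after--p graf--preV2"><span class="pre--content">import random


def annealed_epsilon(step, value_max=1.0, value_min=0.1, nb_steps=10000):
    """Linearly decay epsilon from value_max to value_min over nb_steps."""
    slope = (value_max - value_min) / nb_steps
    # After nb_steps the value stays clamped at value_min.
    return max(value_min, value_max - slope * step)


def select_action(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if epsilon > rng.random():
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


q_values = [0.1, 0.7, 0.2, 0.0, 0.4, 0.3]  # made-up Q values for the six actions
for step in (0, 5000, 10000, 20000):
    eps = annealed_epsilon(step)
    print(step, round(eps, 3), select_action(q_values, eps))
</span></pre>
<p class="graf graf--p graf-after--pre">Early in training almost every action is random, which fills the replay memory with varied experience; as epsilon decays, the agent relies more and more on the Q values it has learned.</p></div></div></section>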
<footer><p><a href="https://medium.com/p/1ced227d4ade">View original.</a></p><p>Exported from <a href="https://medium.com">Medium</a> on December 16, 2022.</p></footer></article></body></html>
