Playing Atari Pong With Reinforcement Learning

This is a PyTorch implementation of a deep reinforcement learning algorithm (Deep Q-Learning) that learns to play the Atari Pong game using OpenAI Gym.

Setup

Install the requirements:

pip install -r requirements.txt

Usage

python3 pong/train.py

Description

State

A state in reinforcement learning is the observation that the agent receives from the environment.
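For illustration, here is a minimal sketch of inspecting a state, assuming the classic Gym API and the "Pong-v0" environment id (the exact id used by this repo may differ):

```python
import gym

# Create the Atari Pong environment (environment id is an assumption).
env = gym.make("Pong-v0")

state = env.reset()         # the initial observation (state)
print(state.shape)          # (210, 160, 3): a raw RGB frame for Atari Pong

# Taking a random action returns the next state, the reward, and a done flag.
state, reward, done, info = env.step(env.action_space.sample())
```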

Policy

A policy is the mapping from the perceived states of the environment to the actions to be taken when in those states.
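A common way to realize such a mapping in Deep Q-Learning is an epsilon-greedy policy. The following is a minimal sketch, assuming a PyTorch policy network that maps a preprocessed state tensor to one q-value per action (the function and variable names are illustrative, not necessarily those used in this repo):

```python
import random
import torch

def select_action(policy_net, state, exploration_rate, num_actions):
    """Epsilon-greedy policy: explore with probability exploration_rate,
    otherwise exploit the action with the highest predicted q-value."""
    if random.random() < exploration_rate:
        return random.randrange(num_actions)        # explore: random action
    with torch.no_grad():
        q_values = policy_net(state.unsqueeze(0))   # shape: (1, num_actions)
        return q_values.argmax(dim=1).item()        # exploit: best action
```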

Reward Signal

A reward signal is the goal in reinforcement learning. The agent tries to maximize the total reward in the long run.

Value Function

The reward signal indicates what is good in the immediate sense, whereas the value function measures what is good in the long run. Each state of the environment is assigned a value, which is the total amount of reward the agent can expect to receive starting from that state.
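As an illustration, the value of a state can be written as the expected discounted return. A minimal sketch of computing a discounted return from a sequence of rewards (gamma is the discount factor; this is an illustrative helper, not code from this repo):

```python
def discounted_return(rewards, gamma=0.99):
    """Total discounted reward G = r_0 + gamma*r_1 + gamma^2*r_2 + ...
    The value of a state is the expectation of this quantity when starting
    from that state and following the policy."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```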

Model

A model in reinforcement learning mimics the behavior of the environment.

Deep Q-Learning Training Process

  1. Target Network: A copy of the policy network.
  2. Initialize the Replay Memory: Used for storing the experience tuples SARS' (state, action, reward, next-state).
  3. For each episode:
    1. Reset the environment to get the starting state.
    2. Calculate the exploration rate.
    3. For each time step:
      1. Select an action using exploration or exploitation.
      2. Take the action, get the reward from the environment and move to the next state.
      3. Store SARS' (state, action, reward, next-state) in the replay memory.
      4. Sample a batch of data (SARS') from the replay memory.
      5. Preprocess the sampled batch of states.
      6. Pass the sampled batch of states through the policy network to calculate the q-values.
      7. Calculate the q-values for the next states using the target network.
      8. Calculate: expected q-values = reward + gamma * next-state q-values.
      9. Calculate the loss between the q-values of the policy network and the expected q-values.
      10. Update the weights of the policy network to minimize the loss (a code sketch of steps 4-10 follows this list).
    4. After 'u' episodes, update the weights of the target network using the weights of the policy network.
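A minimal sketch of steps 4-10 of the inner loop (sampling from the replay memory, computing the expected q-values with the target network, and updating the policy network). Tensor names, hyperparameters, and the terminal-state masking are illustrative assumptions, not necessarily what this repo implements:

```python
import random
import torch
import torch.nn.functional as F

def optimize_step(policy_net, target_net, optimizer, replay_memory,
                  batch_size=32, gamma=0.99):
    """One DQN update: sample SARS' transitions, compute the expected
    q-values with the target network, and minimize the loss."""
    if len(replay_memory) < batch_size:
        return

    # replay_memory is assumed to be a list of
    # (state, action, reward, next_state, done) tuples with tensor states.
    batch = random.sample(replay_memory, batch_size)
    states, actions, rewards, next_states, dones = zip(*batch)

    states = torch.stack(states)                      # preprocessed state batch
    actions = torch.tensor(actions).unsqueeze(1)      # shape: (batch_size, 1)
    rewards = torch.tensor(rewards, dtype=torch.float32)
    next_states = torch.stack(next_states)
    dones = torch.tensor(dones, dtype=torch.float32)

    # q-values of the taken actions from the policy network
    q_values = policy_net(states).gather(1, actions).squeeze(1)

    # q-values for the next states from the target network (no gradients)
    with torch.no_grad():
        next_q_values = target_net(next_states).max(dim=1).values

    # expected q-values = reward + gamma * next-state q-values (0 if terminal)
    expected_q_values = rewards + gamma * next_q_values * (1.0 - dones)

    # Loss between the policy-network q-values and the expected q-values
    loss = F.smooth_l1_loss(q_values, expected_q_values)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Every 'u' episodes, the target network is synchronized with the policy network, e.g. with target_net.load_state_dict(policy_net.state_dict()), which keeps the q-learning targets stable between updates.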

Todo List

  • Fixing the local optimum problem.
  • Calculating the moving average of scores.
  • Plotting the scores using TensorBoard.
