This is a PyTorch implementation of a deep reinforcement learning algorithm (deep Q-learning with experience replay and a target network) that learns to play the Atari game Pong using OpenAI Gym.
Install the requirements and start the training:

```bash
pip install -r requirements.txt
python3 pong/train.py
```
A state in reinforcement learning is the observation that the agent receives from the environment.
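For example, in the Gym Pong environment the state is the raw rendered game frame. A minimal sketch (the environment id and `reset()` API follow the classic `gym` interface; newer `gymnasium` releases return `(observation, info)` from `reset()` instead):

```python
import gym

env = gym.make("Pong-v0")  # classic Gym Atari Pong
state = env.reset()        # the state is an RGB frame of the game screen
print(state.shape)         # (210, 160, 3): height, width, color channels
```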
A policy is the mapping from the perceived states of the environment to the actions to be taken when in those states.
A reward signal defines the goal in reinforcement learning. The agent tries to maximize the total reward in the long run.
The reward signal indicates what is good in an immediate sense, whereas the value function measures what is good in the long run. Each state of the environment is assigned a value, which is the total amount of reward the agent can expect to accumulate starting from that state.
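For reference, this is the textbook definition of the state-value function under a policy π with discount factor γ (standard notation, not code from this repo):

```math
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1} \;\middle|\; S_t = s\right]
```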
A model in reinforcement learning mimics the behavior of the environment.
- Target network: a copy of the policy network.
- Initialize the replay memory, used for storing experience tuples SARS' (state, action, reward, next-state); a replay-memory sketch is shown after this list.
- For each episode:
  - Reset the environment to get the starting state.
  - Calculate the exploration rate.
  - For each time step:
    - Select an action using exploration or exploitation (see the epsilon-greedy sketch after this list).
    - Take the action, get the reward from the environment, and move to the next state.
    - Store SARS' (state, action, reward, next-state) in the replay memory.
    - Sample a batch of experiences (SARS') from the replay memory.
    - Preprocess the sampled batch of states.
    - Pass the sampled batch of states through the policy network to calculate the q-values.
    - Calculate the q-values for the next states using the target network.
    - Calculate the expected q-values: expected q-values = reward + gamma * next-state q-values.
    - Calculate the loss between the q-values of the policy network and the expected q-values.
    - Update the weights of the policy network to minimize the loss (the learning-step sketch after this list covers these last five steps).
  - After every 'u' episodes, update the weights of the target network using the weights of the policy network (see the target-update snippet after this list).
- Fixing the local optimum problem.
- Calculating the moving average of scores.
- Plotting the scores using TensorBoard (a logging sketch is shown after this list).
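A minimal replay-memory sketch; the class name, capacity, and tuple layout are illustrative and not necessarily what `pong/train.py` uses:

```python
import random
from collections import deque, namedtuple

# One SARS' experience: (state, action, reward, next_state)
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

class ReplayMemory:
    """Fixed-size buffer that stores experiences and samples random batches."""

    def __init__(self, capacity=100_000):
        self.memory = deque(maxlen=capacity)  # oldest experiences are evicted first

    def push(self, state, action, reward, next_state):
        self.memory.append(Experience(state, action, reward, next_state))

    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)

    def __len__(self):
        return len(self.memory)
```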
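The exploration rate is typically decayed from 1.0 toward a small floor as training progresses. A sketch of decayed epsilon-greedy action selection (the decay schedule and all constants here are illustrative assumptions):

```python
import math
import random

import torch

def exploration_rate(episode, eps_start=1.0, eps_end=0.02, eps_decay=0.001):
    """Exponentially decay epsilon from eps_start toward eps_end."""
    return eps_end + (eps_start - eps_end) * math.exp(-eps_decay * episode)

def select_action(policy_net, state, epsilon, num_actions):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if random.random() < epsilon:
        return random.randrange(num_actions)           # exploration: random action
    with torch.no_grad():
        # `state` is assumed to be a batched tensor of shape (1, ...)
        return policy_net(state).argmax(dim=1).item()  # exploitation: best q-value
```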
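The q-value, expected-q-value, loss, and weight-update steps amount to one optimization step. A minimal sketch assuming `policy_net`, `target_net`, and `optimizer` already exist and the sampled batch has been preprocessed into tensors (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def learning_step(policy_net, target_net, optimizer, batch, gamma=0.99):
    states, actions, rewards, next_states = batch  # preprocessed tensors

    # q-values of the actions actually taken, from the policy network
    q_values = policy_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # q-values of the next states, from the frozen target network
    with torch.no_grad():
        next_q_values = target_net(next_states).max(dim=1).values
        # a full implementation would also zero these out for terminal states

    # expected q-values = reward + gamma * next-state q-values
    expected_q_values = rewards + gamma * next_q_values

    # loss between the policy network's q-values and the expected q-values
    loss = F.mse_loss(q_values, expected_q_values)

    # update the policy network's weights to minimize the loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```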
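Updating the target network after every 'u' episodes is a single state-dict copy in PyTorch (`episode` and `u` stand for the loop variables from the outline above):

```python
# every u episodes, copy the policy network's weights into the target network
if episode % u == 0:
    target_net.load_state_dict(policy_net.state_dict())
```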
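A sketch of the moving average and TensorBoard plotting, using the `SummaryWriter` that ships with PyTorch (the tag names and window size are illustrative):

```python
from collections import deque

from torch.utils.tensorboard import SummaryWriter

def log_scores(scores, window=100):
    """Log per-episode scores and their moving average to TensorBoard."""
    writer = SummaryWriter()       # writes event files under ./runs by default
    recent = deque(maxlen=window)  # sliding window for the moving average
    for episode, score in enumerate(scores):
        recent.append(score)
        writer.add_scalar("score", score, episode)
        writer.add_scalar("score/moving_average", sum(recent) / len(recent), episode)
    writer.close()
```

Run `tensorboard --logdir runs` to view the resulting curves.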