Project Solutions for my Deep Reinforcement Learning Nanodegree at Udacity
- Clone this repository with `git clone git@github.com:squall-1002/deep_rl_nanodegree.git`
- Set up the conda environment `drlnd.yml` with Anaconda: `conda env create -f drlnd.yml`
- Activate the conda environment with `conda activate drl`
- Download the Unity Environment from one of the applicable links below, place the file in `p1_navigation`, and unzip it.
- In order to train your yellow-banana-picking agent, navigate into `p1_navigation` and start your Jupyter Notebook server with `jupyter notebook` within the activated environment.
- Open `navigation_solution.ipynb`, the notebook that guides you through the steps involved to
  - set everything up,
  - train the agent,
  - and evaluate its performance.
You will see a small window pop up that shows the agent quickly navigating through the environment and - hopefully - collecting the right-colored bananas. You will also observe it getting better over time. To track its progress, watch the average score counter in the notebook, which reports the average score the agent achieved across the most recent 100 episodes.
The environment is considered solved when the average score of the most recent 100 episodes surpasses `+13`.
The state of the square environment, which contains purple and yellow bananas, is represented by a real-valued vector of size 37. The agent can act on this environment by moving forward or backward and by turning left or right. These four actions constitute the action space as follows:
- `0`: move forward
- `1`: move backward
- `2`: turn left
- `3`: turn right
Rewards for collecting bananas are as follows:
- yellow banana: `+1`
- purple banana: `-1`
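
If you want to poke at the environment before training, the snippet below sketches a random agent. It assumes the `unityagents` package used throughout this Nanodegree and a local Banana build; the file name is a placeholder you will need to adjust to your OS.

```python
import numpy as np
from unityagents import UnityEnvironment

# Assumed local build path; pick the file that matches your operating system.
env = UnityEnvironment(file_name="p1_navigation/Banana.app")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=False)[brain_name]
state = env_info.vector_observations[0]   # 37-dimensional state vector
score = 0
while True:
    action = np.random.randint(4)         # 0: forward, 1: backward, 2: left, 3: right
    env_info = env.step(action)[brain_name]
    score += env_info.rewards[0]          # +1 for yellow, -1 for purple bananas
    if env_info.local_done[0]:
        break
print("Episode score:", score)
env.close()
```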
You may change the following hyperparameters in the dictionary `hyperparams` that is used to create an `Agent` instance:
- `eps_start`: start probability of choosing a random action when following an epsilon-greedy strategy
- `eps_min`: minimum probability for the epsilon-greedy strategy
- `eps_decay`: decay factor for epsilon, applied every episode
- `learn_rate`: portion of the gradient used to update the parameters of the neural network that approximates the action values during training
- `batch_size`: number of single-step experiences (state, action, reward, next state) that constitute a minibatch for the gradient descent update
- `gamma`: discount factor used within the TD target as part of the parameter update in (Deep) Q-Learning
- `update_interval`: number of steps to perform before updating the target network
- `tau`: interpolation parameter for the target network update
Besides these hyperparameters, you may also change the number of episodes `n_episodes` and the maximum number of steps `max_steps` after which an otherwise unfinished episode is forced to end. Both parameters are used for the `perform_dqn_training` method that trains our agent.
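
For orientation, here is a minimal sketch of how these settings might be wired together. The values are purely illustrative, and the exact `Agent` and `perform_dqn_training` signatures are defined in the notebook, so treat the calls below as assumptions.

```python
# Illustrative values only; Agent and perform_dqn_training come from this
# repository's notebook code, so the calls below are assumptions.
hyperparams = {
    "eps_start": 1.0,       # initial probability of a random action
    "eps_min": 0.01,        # floor for epsilon
    "eps_decay": 0.995,     # multiplicative decay applied every episode
    "learn_rate": 5e-4,     # learning rate for the action-value network
    "batch_size": 64,       # experiences per gradient descent update
    "gamma": 0.99,          # discount factor in the TD target
    "update_interval": 4,   # steps between target network updates
    "tau": 1e-3,            # interpolation factor for the target network update
}

agent = Agent(state_size=37, action_size=4, hyperparams=hyperparams)  # assumed constructor
scores = agent.perform_dqn_training(n_episodes=2000, max_steps=1000)  # assumed call
```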
- Download the environment from one of the links below. You need only select the environment that matches your operating system:
  - Version 1: One (1) Agent
    - Linux: click here
    - Mac OSX: click here
    - Windows (32-bit): click here
    - Windows (64-bit): click here
  - Version 2: Twenty (20) Agents
    - Linux: click here
    - Mac OSX: click here
    - Windows (32-bit): click here
    - Windows (64-bit): click here
- In order to train version 2 with 20 agents, navigate into `p2_continuous_control` and start your Jupyter Notebook server with `jupyter notebook` within the activated environment.
- Open `continuous_control_solution.ipynb`, the notebook that guides you through the steps involved to
  - set everything up,
  - train the agent,
  - and evaluate its performance.
The environment is considered solved when the average of the episode-wise 20-agent mean scores surpasses `+30` for the most recent 100 episodes.
The state of each agent, a double-jointed arm, is represented by a real-valued vector of size 33 corresponding to the position, rotation, velocity, and angular velocities of the two arm Rigidbodies. The agent acts on this environment through a continuous action space of size four, corresponding to the torque applicable to its two joints, with valid values in the interval [-1, 1].
The agent's goal is to reach the sphere that moves around it and to keep its hand at the sphere. For each timestep the agent adheres to this target, it receives a reward of `+0.1`.
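
The snippet below sketches a single random step in the 20-agent version, again assuming the `unityagents` package; the file name is a placeholder for your OS-specific build.

```python
import numpy as np
from unityagents import UnityEnvironment

# Assumed local build path for the 20-agent version; adjust to your OS.
env = UnityEnvironment(file_name="p2_continuous_control/Reacher.app")
brain_name = env.brain_names[0]

env_info = env.reset(train_mode=True)[brain_name]
num_agents = len(env_info.agents)                 # 20 in version 2
states = env_info.vector_observations             # shape (20, 33)

actions = np.clip(np.random.randn(num_agents, 4), -1, 1)  # torques in [-1, 1]
env_info = env.step(actions)[brain_name]
rewards = env_info.rewards                        # one reward per agent, +0.1 while on target
env.close()
```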
You may change the following hyperparameters in the dictionary `hyperparams` that is used to create an `Agent` instance:
- `buffer_size`: maximum number of samples that can be stored in the replay buffer queue
- `batch_size`: number of single-step experiences (state, action, reward, next state) that constitute a minibatch for an agent update
- `update_step`: how many steps to sample before conducting an agent update
- `agent_seed`: random seed used to initialize the neural network parameters and sampling generators
- `env_seed`: random seed to initialize the environment
- `gamma`: discount factor used within the TD target as part of the parameter update
- `tau`: interpolation parameter for the soft target network update
- `lr_actor`: learning rate for the Adam optimizer used to update the network parameters of the actor
- `lr_critic`: learning rate for the Adam optimizer used to update the network parameters of the critic
Besides these hyperparameters, you may also change the number of episodes `n_episodes` and the maximum number of steps `max_steps` after which an otherwise unfinished episode is forced to end. Both parameters are used for the `perform_ddpg_training` method that trains our agent.
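
As with the navigation project, a minimal sketch of how these settings might be passed along is shown below; the values are illustrative, and the `Agent`/`perform_ddpg_training` signatures are assumptions, so defer to the notebook.

```python
# Illustrative values only; Agent and perform_ddpg_training come from this
# repository's notebook code, so the calls below are assumptions.
hyperparams = {
    "buffer_size": int(1e6),   # replay buffer capacity
    "batch_size": 128,         # minibatch size per agent update
    "update_step": 20,         # environment steps between updates
    "agent_seed": 0,           # seed for networks and sampling generators
    "env_seed": 0,             # seed for the Unity environment
    "gamma": 0.99,             # discount factor in the TD target
    "tau": 1e-3,               # soft target update interpolation
    "lr_actor": 1e-4,          # Adam learning rate for the actor
    "lr_critic": 1e-3,         # Adam learning rate for the critic
}

agent = Agent(state_size=33, action_size=4, hyperparams=hyperparams)  # assumed constructor
scores = agent.perform_ddpg_training(n_episodes=300, max_steps=1000)  # assumed call
```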
- Download the environment from one of the links below. You need only select the environment that matches your operating system:
  - Linux: click here
  - Mac OSX: click here
  - Windows (32-bit): click here
  - Windows (64-bit): click here
- Place the file in the `p3_collaborate_and_compete/` folder and unzip it.
- Choose between training or demonstrating the pretrained agents:
  - Training: Open `collaborate_and_compete_solution.ipynb` with Jupyter Notebook; it guides you through the steps involved to
    - set everything up,
    - train the agents,
    - and evaluate their performance.
  - Demonstrating: Open `watch_trained_agent.ipynb` with Jupyter Notebook and execute the cells to see the trained agents successfully playing tennis.
The environment is considered solved when the average score surpasses `+0.5` for the most recent 100 episodes.
In this environment, two agents control rackets to bounce a ball over a net. If an agent hits the ball over the net, it receives a reward of `+0.1`. If an agent lets the ball hit the ground or hits the ball out of bounds, it receives a reward of `-0.01`. Thus, the goal of each agent is to keep the ball in play.
The observation space consists of 8 variables corresponding to the position and velocity of the ball and racket, in the following order:
- racket position x
- racket position y
- racket velocity x
- racket velocity y
- ball position x
- ball position y
- ball velocity x
- ball velocity y
In the adapted environment used here, each agent's observation at a time step contains three stacked states: the current one and the two most recent ones. Stacking three observations of 8 variables thus yields an agent-specific observation of 3 * 8 = 24 values per time step.
Each agent receives its own, local observation. Two continuous actions are available, corresponding to movement toward (or away from) the net and jumping. They are bounded to the interval [-1, 1].
The task is episodic, and in order to solve the environment, your agents must get an average score of +0.5 (over 100 consecutive episodes, after taking the maximum over both agents). Specifically,
- After each episode, we add up the rewards that each agent received (without discounting), to get a score for each agent. This yields 2 (potentially different) scores. We then take the maximum of these 2 scores.
- This yields a single score for each episode.
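
A small sketch of this per-episode scoring and of the moving-average check is shown below; the reward values are made up for illustration.

```python
import numpy as np
from collections import deque

def episode_score(rewards_per_step):
    """rewards_per_step: list of [reward_agent_0, reward_agent_1] per time step."""
    totals = np.sum(np.array(rewards_per_step), axis=0)  # undiscounted sum per agent
    return float(np.max(totals))                         # maximum over both agents

# Moving average over the most recent 100 episode scores decides when the
# environment counts as solved (threshold +0.5). Reward values are made up.
scores_window = deque(maxlen=100)
scores_window.append(episode_score([[0.0, 0.1], [0.1, 0.0], [-0.01, 0.0]]))
solved = len(scores_window) == 100 and np.mean(scores_window) >= 0.5
```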
You may change the following hyperparameters in the code:
- `buffer_size`: maximum number of samples that can be stored in the replay buffer queue
- `batch_size`: number of single-step experiences (state, action, reward, next state) that constitute a minibatch for an agent update
- `n_random_episodes`: number of episodes of random play to prefill the replay buffer
- `n_episodes`: number of episodes to train
- `max_steps`: maximum number of steps to perform before manually interrupting an episode
- `update_step`: how many steps to sample before conducting an agent update
- `solution_threshold`: boundary the score average across `eval_window_length` episodes has to cross to consider the environment solved
- `eval_window_length`: number of most recent episodes to consider for an aggregate metric, e.g. the moving average of the most recent 100 scores
- `num_agents`: how many agents the multi-agent environment comprises
- `agent_seed`: random seed used to initialize the neural network parameters and sampling generators
- `env_seed`: random seed to initialize the environment
- `buffer_seed`: random seed to initialize the replay buffer sampling
- `gamma`: discount factor used within the TD target as part of the parameter update
- `tau`: interpolation parameter for the soft target network update
- `first_hidden_units`: number of hidden units in the first hidden layer of the actor/critic neural networks
- `second_hidden_units`: number of hidden units in the second hidden layer of the actor/critic neural networks
- `lr_actor`: learning rate for the Adam optimizer used to update the network parameters of the actor
- `lr_critic`: learning rate for the Adam optimizer used to update the network parameters of the critic
- `critic_weight_decay`: weight decay to use for the critic network weights
- `add_noise`: whether to add exploration noise or refrain from using it
- `noise_sigma`: sigma to use for the Ornstein-Uhlenbeck process
- `noise_scale_start`: initial scale for the noise
- `noise_scale_min`: minimum noise scale
- `noise_scale_decay`: decay factor for the noise scale, applied each episode
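
To make the noise-related settings more concrete, here is a rough sketch of an Ornstein-Uhlenbeck process whose scale decays each episode. The `OUNoise` class and its `theta` parameter are assumptions for illustration and may differ from the implementation in this repository.

```python
import numpy as np

class OUNoise:
    """Ornstein-Uhlenbeck process; `theta` is an extra assumption not listed above."""
    def __init__(self, size, sigma=0.2, theta=0.15, mu=0.0, seed=0):
        self.mu = mu * np.ones(size)
        self.theta, self.sigma = theta, sigma
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        self.state = self.mu.copy()

    def sample(self):
        dx = self.theta * (self.mu - self.state) + self.sigma * self.rng.standard_normal(len(self.state))
        self.state += dx
        return self.state

noise = OUNoise(size=2, sigma=0.2)       # noise_sigma
scale = 1.0                              # noise_scale_start
for episode in range(3):
    noise.reset()
    # During an episode, scaled noise is added to each (clipped) action:
    action = np.clip(np.random.uniform(-1, 1, 2) + scale * noise.sample(), -1, 1)
    scale = max(0.1, scale * 0.995)      # noise_scale_min, noise_scale_decay
```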