Advantage Actor-Critic methods are close cousins of the Policy Gradient class of algorithms. The difference is that they use two neural networks instead of one: the actor, which is responsible for finding the best action given an observation, and the critic, which is responsible for assessing whether the actor is doing a good job.
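To make that split concrete, here is a minimal sketch of the two networks. It uses PyTorch purely for brevity (an assumption; the repository explores several computation-graph layouts, as listed below, and may be built on a different framework):

```python
# Minimal actor/critic pair -- illustrative only, not the repository's code.
import torch
import torch.nn as nn
from torch.distributions import Categorical

class Actor(nn.Module):
    """Policy network: maps an observation to a distribution over discrete actions."""
    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, n_actions),  # action logits
        )

    def forward(self, obs: torch.Tensor) -> Categorical:
        return Categorical(logits=self.net(obs))

class Critic(nn.Module):
    """Value network: maps an observation to a scalar state-value estimate V(s)."""
    def __init__(self, obs_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)  # shape: (batch,)
```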
The two main goals of this essay were, first, to get a deeper understanding of the theoretical side of the Actor-Critic method and, second, to acquire a practical understanding of its behavior, limitations and requirements. To reach the second goal, I felt it was necessary to implement multiple design and architectural variations commonly found in the literature.
With this in mind, I've focused on the following practical aspects:
- Algorithm type: batch vs online;
- Computation graph: split network vs split network with shared lower layers vs shared network;
- Critic target: Monte-Carlo vs bootstrap estimate (see the sketch after this list);
- Math computation: element-wise vs graph-computed;
- Various data collection strategies.
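The critic-target distinction in particular boils down to how the regression target for V(s_t) is built from a collected trajectory. Below is a small sketch under assumed inputs (`rewards` and `values` are per-step arrays for one finished trajectory, `gamma` is the discount factor); it is illustrative, not the repository's exact code:

```python
import numpy as np

def monte_carlo_targets(rewards: np.ndarray, gamma: float) -> np.ndarray:
    """Full-return target G_t = r_t + gamma * r_{t+1} + ... (no value estimate used)."""
    targets = np.zeros_like(rewards, dtype=np.float64)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

def bootstrap_targets(rewards: np.ndarray, values: np.ndarray, gamma: float) -> np.ndarray:
    """One-step TD target r_t + gamma * V(s_{t+1}), with V = 0 after the terminal state."""
    next_values = np.append(values[1:], 0.0)
    return rewards + gamma * next_values

# In both cases, the advantage passed to the actor loss is target - V(s_t):
#   advantages = targets - values
```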
In parallel, I wrote a second essay, A reflexion on design, architecture and implementation details, where I go further in my study of some aspects of DRL algorithms from a software engineering perspective applied to research, covering questions like:
Do implementation details really matter? Which ones do, when & why?
I've also complemented my reading with the following resources:
- The classic book Reinforcement Learning: An Introduction, 2nd ed., by Sutton & Barto (MIT Press);
- CS 294-112 Deep Reinforcement Learning: lectures on Policy Gradients and Actor-Critic, by Sergey Levine, UC Berkeley;
- OpenAI Spinning Up: Intro to Policy Optimization, by Josh Achiam;
- the Lil'Log blog post Policy Gradient Algorithms, by Lilian Weng, research intern at OpenAI;
- Asynchronous Methods for Deep Reinforcement Learning, by Mnih et al.;
- Deep Reinforcement Learning that Matters, by Henderson et al.;
- TD or not TD: Analyzing the Role of Temporal Differencing in Deep Reinforcement Learning, by Amiranashvili, Dosovitskiy, Koltun & Brox;
- High-Dimensional Continuous Control Using Generalized Advantage Estimation, by Schulman, Moritz, Levine, Jordan & Abbeel.
Download the essays (PDF):
- Deep Reinforcement Learning – Actor-Critic
- A reflexion on design, architecture and implementation details
Watch recorded agent
Note: You can get an explanation of how to use the package with the --help flag.
cd DRLimplementation
python -m ActorCritic --play[Lunar or Cartpole] [--record] [--play_for]=max trajectories (default=10)
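For example, the following should replay and record a CartPole agent (assuming the bracketed flag above expands to --playCartpole):
cd DRLimplementation
python -m ActorCritic --playCartpole --record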
cd DRLimplementation
python -m ActorCritic --trainExperimentSpecification [--rerun] [--renderTraining]
Choose --trainExperimentSpecification from the following:
- CartPole-v0 environment:
  - --trainSplitMC: Train a Batch Actor-Critic agent with Monte Carlo TD target
  - --trainSplitBootstrap: Train a Batch Actor-Critic agent with bootstrap estimate TD target
  - --trainSharedBootstrap: Train a Batch Actor-Critic agent with shared network
  - --trainOnlineSplit: Train an Online Actor-Critic agent with split network
  - --trainOnlineSplit3layer: Train an Online Actor-Critic agent with split network
  - --trainOnlineShared3layer: Train an Online Actor-Critic agent with shared network
  - --trainOnlineSplitTwoInputAdvantage: Train an Online Actor-Critic agent with split two-input Advantage network
- LunarLander-v2 environment:
  - --trainOnlineLunarLander: Train on LunarLander an Online Actor-Critic agent with split two-input Advantage network
  - --trainBatchLunarLander: Train on LunarLander a Batch Actor-Critic agent
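For example, using flags taken verbatim from the list above, the following would train the Batch Actor-Critic agent with a Monte-Carlo target on CartPole-v0 and render the training:
cd DRLimplementation
python -m ActorCritic --trainSplitMC --renderTraining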
cd DRLimplementation
tensorboard --logdir=ActorCritic/graph