Deep Reinforcement Learning


:: Advantage Actor-Critic

Advantage Actor-Critic methods are close cousins of Policy Gradient algorithms. The difference is that they use two neural networks instead of one: the actor, whose responsibility is to find the best action given an observation, and the critic, whose responsibility is to assess whether the actor is doing a good job.
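
To make that division of labour concrete, here is a minimal, framework-free sketch of one update step for a single transition. This is my own illustration, not code from this repository; all values and names are hypothetical. The critic's value estimate turns the observed target into an advantage, which then scales the actor's policy-gradient loss.

import numpy as np

def softmax(logits):
    # Numerically stable softmax over the actor's action logits
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Hypothetical outputs for one observation (illustrative values only)
actor_logits = np.array([0.2, 1.1, -0.5])  # actor: one logit per action
critic_value = 0.7                         # critic: V(s), its assessment of the state

action_taken = 1                           # action sampled from the actor
td_target = 1.0                            # e.g. r + gamma * V(s') or a Monte-Carlo return

# Advantage: how much better the outcome was than the critic expected
advantage = td_target - critic_value

pi = softmax(actor_logits)
actor_loss = -np.log(pi[action_taken]) * advantage  # policy-gradient term, scaled by the advantage
critic_loss = (td_target - critic_value) ** 2       # critic regresses toward its target

print(actor_loss, critic_loss)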

The two main goals of this essay were, first, to get a deeper understanding of the theoretical aspects of Actor-Critic methods and, second, to acquire a practical understanding of their behavior, limitations and requirements. To reach this second goal, I felt it was necessary to implement multiple design and architectural variations commonly found in the literature.

With this in mind, I've focused on the following practical aspects:

  • Algorithm type: batch vs online;
  • Computation graph: split networks vs split networks with shared lower layers vs a single shared network;
  • Critic target: Monte-Carlo vs bootstrap estimate (sketched after this list);
  • Math computation: element-wise vs computed in the graph;
  • Various data collection strategies.
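
For the critic-target aspect in particular, the two options differ only in how the regression target is built from a collected trajectory. The following is a minimal sketch under my own assumptions, not the repository code; the trajectory values are hypothetical.

import numpy as np

def monte_carlo_targets(rewards, gamma=0.99):
    # Full discounted return G_t, computed backward over the whole trajectory
    targets = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        targets[t] = running
    return targets

def bootstrap_targets(rewards, next_state_values, dones, gamma=0.99):
    # One-step target r_t + gamma * V(s_{t+1}), cut off at terminal states
    return rewards + gamma * next_state_values * (1.0 - dones)

# Hypothetical 4-step trajectory
rewards = np.array([1.0, 1.0, 1.0, 0.0])
next_state_values = np.array([0.9, 0.8, 0.5, 0.0])  # critic estimates for the next states
dones = np.array([0.0, 0.0, 0.0, 1.0])

print(monte_carlo_targets(rewards))                          # unbiased but higher-variance target
print(bootstrap_targets(rewards, next_state_values, dones))  # lower variance, but biased through the critic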

In parallel, I have written a second essay, A reflection on design, architecture and implementation details, where I go further in my study of some aspects of DRL algorithms from a software engineering perspective applied to research, covering questions like:

Do implementation details really matter? Which ones do, when and why?

I've also complemented my reading with the following resources:


Download the essay pdf:

Watch recorded agent


The Actor-Critic implementations:

Note: You can get an explanation of how to use the package by using the --help flag

To watch the trained agent

cd DRLimplementation
python -m ActorCritic --play[Lunar or Cartpole] [--record] [--play_for]=max trajectories (default=10) 

To execute the training loop

cd DRLimplementation
python -m ActorCritic --trainExperimentSpecification [--rerun] [--renderTraining] 

Choose --trainExperimentSpecification from the following:

  • CartPole-v0 environment:
    • --trainSplitMC: Train a Batch Actor-Critic agent with a Monte-Carlo TD target
    • --trainSplitBootstrap: Train a Batch Actor-Critic agent with a bootstrap-estimate TD target
    • --trainSharedBootstrap: Train a Batch Actor-Critic agent with a shared network
    • --trainOnlineSplit: Train an Online Actor-Critic agent with a split network
    • --trainOnlineSplit3layer: Train an Online Actor-Critic agent with a 3-layer split network
    • --trainOnlineShared3layer: Train an Online Actor-Critic agent with a 3-layer shared network
    • --trainOnlineSplitTwoInputAdvantage: Train an Online Actor-Critic agent with a split two-input Advantage network
  • LunarLander-v2 environment:
    • --trainOnlineLunarLander: Train an Online Actor-Critic agent with a split two-input Advantage network on LunarLander
    • --trainBatchLunarLander: Train a Batch Actor-Critic agent on LunarLander

To navigate through the computation graph in TensorBoard

cd DRLimplementation
tensorboard --logdir=ActorCritic/graph

Trained agent in action