Training PPO with DQN as a critic with ReLAx
This repository contains an implementation of a PPO+DDQN training loop for discrete control tasks.
PPO needs an estimate of advantages to run its training process. Typically, advantages for PPO are estimated with the GAE-lambda algorithm.
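For reference, a minimal sketch of the standard GAE-lambda estimator is shown below. The function and argument names are illustrative and are not part of this repository's API.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE-lambda advantages for a single rollout.

    `values` must contain one extra entry: the bootstrap value V(s_T)
    for the state following the last transition.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Recursive accumulation: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```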
This notebook explores the possibility of training PPO paired with a DDQN critic.
While PPO is trained in on-policy mode using transitions sampled with its policy network, DQN is trained on off-policy data stored in a replay buffer (which is filled with the training batches sampled by the PPO actor). Theoretically, such a procedure should allow us to train two agents (PPO+DQN and ArgmaxQValue+DQN) on the same samples.
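The sketch below illustrates one way a DQN critic could stand in for GAE-lambda when estimating PPO's advantages: A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a|s) Q(s, a) under the current PPO policy. All network shapes and variable names are illustrative placeholders, not the ReLAx API.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch_size = 8, 4, 32  # placeholder dimensions

# Stand-ins for the DQN critic's Q-network and the PPO actor's policy network.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_advantages(obs, actions):
    """Advantage estimates for PPO taken from the DQN critic's Q-values."""
    with torch.no_grad():
        q = q_net(obs)                                  # [batch, n_actions]
        pi = torch.softmax(policy_net(obs), dim=-1)     # current policy pi(a|s)
        v = (pi * q).sum(dim=-1)                        # V(s) = E_{a~pi} Q(s, a)
        q_taken = q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return q_taken - v                                  # A(s, a) = Q(s, a) - V(s)

# Example call on dummy on-policy data collected by the PPO actor; the same
# transitions would also be pushed into the DQN replay buffer.
obs = torch.randn(batch_size, obs_dim)
actions = torch.randint(0, n_actions, (batch_size,))
advantages = dqn_advantages(obs, actions)               # fed into the PPO surrogate loss
```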
The plot below shows smoothed training runs (evaluated on separate environments) for PPO+DQN and ArgmaxQValue+DQN:
As we can see, PPO+DQN outperforms ArgmaxQValue+DQN over the entire course of training:
Trained policies
PPO
ppo_actor.mp4
DQN