Training PPO with DQN as a critic with ReLAx
This repository contains an implementation of a PPO+DDQN training loop for discrete control tasks.
PPO needs an estimate of advantages to run its training process. Typically, advantages for PPO are estimated with the GAE-lambda algorithm.
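For reference, a minimal sketch of the standard GAE-lambda estimator is shown below. The function and argument names are illustrative and are not part of this repository's API.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Compute GAE-lambda advantages for a single rollout.

    `values` must contain one extra entry: the bootstrap value V(s_T)
    for the state following the last transition.
    """
    T = len(rewards)
    advantages = np.zeros(T, dtype=np.float32)
    gae = 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - float(dones[t])
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] * nonterminal - values[t]
        # Recursive accumulation: A_t = delta_t + gamma * lambda * A_{t+1}
        gae = delta + gamma * lam * nonterminal * gae
        advantages[t] = gae
    return advantages
```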
This notebook explores the possibility of training PPO paired with a DDQN critic.
While PPO is trained in on-policy mode using transitions sampled with its policy network, DQN is trained on off-policy data stored in a replay buffer (which is filled with the training batches sampled by the PPO actor). Theoretically, such a procedure should allow us to train two agents (PPO+DQN and ArgmaxQValue+DQN) on the same samples.
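The sketch below illustrates one way a DQN critic could stand in for GAE-lambda when estimating PPO's advantages: A(s, a) = Q(s, a) - V(s), with V(s) = sum_a pi(a|s) Q(s, a) under the current PPO policy. All network shapes and variable names are illustrative placeholders, not the ReLAx API.

```python
import torch
import torch.nn as nn

obs_dim, n_actions, batch_size = 8, 4, 32  # placeholder dimensions

# Stand-ins for the DQN critic's Q-network and the PPO actor's policy network.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def dqn_advantages(obs, actions):
    """Advantage estimates for PPO taken from the DQN critic's Q-values."""
    with torch.no_grad():
        q = q_net(obs)                                  # [batch, n_actions]
        pi = torch.softmax(policy_net(obs), dim=-1)     # current policy pi(a|s)
        v = (pi * q).sum(dim=-1)                        # V(s) = E_{a~pi} Q(s, a)
        q_taken = q.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    return q_taken - v                                  # A(s, a) = Q(s, a) - V(s)

# Example call on dummy on-policy data collected by the PPO actor; the same
# transitions would also be pushed into the DQN replay buffer.
obs = torch.randn(batch_size, obs_dim)
actions = torch.randint(0, n_actions, (batch_size,))
advantages = dqn_advantages(obs, actions)               # fed into the PPO surrogate loss
```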
The plot below shows smoothed training runs (evaluated on separate environments) for PPO+DQN and ArgmaxQValue+DQN:
As we can see, PPO+DQN outperforms ArgmaxQValue+DQN over the entire course of training:
Trained policies
PPO
ppo_actor.mp4
DQN