Describe the bug
The KL divergence calculation in KLPENPPOLoss is always zero, so its contribution to the loss is always 0.
It seems that the way KLPENPPOLoss calculates the previous and current distributions ends up using the same values/parameters, causing the KL divergence to always be zero.
To Reproduce
Replace ClipPPOLoss with KLPENPPOLoss in the PPO MuJoCo example code and remove the clip_epsilon argument:
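Something like this (a sketch only; keyword names and coefficient values may vary between torchrl versions, and policy_module / value_module stand for whatever actor and value networks the example builds):

from torchrl.objectives import KLPENPPOLoss  # instead of ClipPPOLoss

loss_module = KLPENPPOLoss(
    policy_module,      # the actor (probabilistic policy module) from the example
    value_module,       # the value/critic module from the example
    entropy_coef=1e-4,  # keep the remaining loss settings from the example
    critic_coef=1.0,
    # clip_epsilon is dropped: KLPENPPOLoss penalizes the KL divergence instead of clipping the ratio
)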
Run the example. The code path used is KLPENPPOLoss.forward (line 969).
Note with a debugger that previous_dist and current_dist are always the same. This means that on line 973, kl is always 0.
Expected behavior
The log probs should be calculated with the old and new parameters respectively, so the KL divergence between previous_dist and current_dist is generally non-zero and the penalty contributes to the loss.
Screenshots
N/A
System info
Python 3.11.7, torchrl-nightly.
Additional context
N/A
Reason and Possible fixes
The idea is to use the cached logits, I think, but the call to
log_weight, dist = self._log_weight(tensordict)
runs the fresh params on the tensordict and overwrites the old logits, which causes both distribution calls to return the same values. A hack that works for me is to save the incoming tensordict (in KLPENPPOLoss.forward) before it is cloned, and to use it explicitly in the call to build_dist_from_params, but I'm not sure what the ideal solution is.
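To illustrate the failure mode outside torchrl, here is a minimal, self-contained sketch using plain torch distributions (the dict stands in for the tensordict and the logits values are made up):

import torch
from torch.distributions import Categorical, kl_divergence

# Logits cached in the tensordict during collection, i.e. produced by the old params.
td = {"logits": torch.tensor([[1.0, 0.0, -1.0]])}

# The workaround described above: keep a copy of the old logits before they are overwritten.
old_logits = td["logits"].clone()

# The _log_weight-style step reruns the actor with the current params and writes
# the fresh logits back into the same entry.
td["logits"] = torch.tensor([[0.2, 0.4, 0.1]])

# Buggy behaviour: both distributions are rebuilt from the (now fresh) logits.
previous_dist = Categorical(logits=td["logits"])
current_dist = Categorical(logits=td["logits"])
print(kl_divergence(previous_dist, current_dist))  # tensor([0.])

# Intended behaviour: previous_dist is built from the saved old logits, so the KL is non-zero.
previous_dist = Categorical(logits=old_logits)
print(kl_divergence(previous_dist, current_dist))  # non-zero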
Checklist
[x] I have checked that there is no similar issue in the repo (required)
dennismalmgren changed the title from "[BUG] kl divergence calculation in PPO is always zero" to "[BUG] kl divergence calculation in KLPENPPOLoss is always zero" on Feb 16, 2024
Almost done, but I'm struggling with the non-tensordict input to KLPENPPOLoss, which now needs to accept arbitrary keys for the distribution construction. Will keep you posted on the progress!