Describe the bug
The KL divergence calculation in KLPENPPOLoss is always zero, so its contribution to the loss is always 0.
It seems that the way KLPENPPOLoss calculates the previous and current distributions ends up using the same values/parameters, causing the KL divergence to always be zero.
To Reproduce
Replace ClipPPOLoss with KLPENPPOLoss in the PPO MuJoCo example code and remove the clip_epsilon argument:
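Something like this (a sketch only; keyword names and coefficient values may vary between torchrl versions, and policy_module / value_module stand for whatever actor and value networks the example builds):

from torchrl.objectives import KLPENPPOLoss  # instead of ClipPPOLoss

loss_module = KLPENPPOLoss(
    policy_module,      # the actor (probabilistic policy module) from the example
    value_module,       # the value/critic module from the example
    entropy_coef=1e-4,  # keep the remaining loss settings from the example
    critic_coef=1.0,
    # clip_epsilon is dropped: KLPENPPOLoss penalizes the KL divergence instead of clipping the ratio
)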
Run the example. The code path used is KLPENPPOLoss.forward (line 969).
Note with a debugger that previous_dist and current_dist are always the same. This means that on line 973, kl is always 0.
Expected behavior
The log probs should be calculated with the old and new parameters respectively, so the KL divergence between previous_dist and current_dist is generally non-zero and the penalty contributes to the loss.
Screenshots
N/A
System info
Python 3.11.7, torchrl-nightly.
Additional context
N/A
Reason and Possible fixes
The idea is to use the cached logits, I think, but the call to
log_weight, dist = self._log_weight(tensordict)
runs the fresh params on the tensordict and overwrites the old logits, which causes both distribution calls to return the same values. A hack that works for me is to save the incoming tensordict (in KLPENPPOLoss.forward) before it is cloned, and to use it explicitly in the call to build_dist_from_params, but I'm not sure what the ideal solution is.
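To illustrate the failure mode outside torchrl, here is a minimal, self-contained sketch using plain torch distributions (the dict stands in for the tensordict and the logits values are made up):

import torch
from torch.distributions import Categorical, kl_divergence

# Logits cached in the tensordict during collection, i.e. produced by the old params.
td = {"logits": torch.tensor([[1.0, 0.0, -1.0]])}

# The workaround described above: keep a copy of the old logits before they are overwritten.
old_logits = td["logits"].clone()

# The _log_weight-style step reruns the actor with the current params and writes
# the fresh logits back into the same entry.
td["logits"] = torch.tensor([[0.2, 0.4, 0.1]])

# Buggy behaviour: both distributions are rebuilt from the (now fresh) logits.
previous_dist = Categorical(logits=td["logits"])
current_dist = Categorical(logits=td["logits"])
print(kl_divergence(previous_dist, current_dist))  # tensor([0.])

# Intended behaviour: previous_dist is built from the saved old logits, so the KL is non-zero.
previous_dist = Categorical(logits=old_logits)
print(kl_divergence(previous_dist, current_dist))  # non-zero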
Checklist
[x] I have checked that there is no similar issue in the repo (required)
dennismalmgren changed the title from "[BUG] kl divergence calculation in PPO is always zero" to "[BUG] kl divergence calculation in KLPENPPOLoss is always zero" on Feb 16, 2024
Almost done, but I'm struggling with the non-tensordict input to KLPENPPOLoss, which now needs to accept arbitrary keys for the distribution construction. Will keep you posted on the progress!