
[BUG] kl divergence calculation in KLPENPPOLoss is always zero #1920

Closed
dennismalmgren opened this issue Feb 16, 2024 · 1 comment · Fixed by #1922
Labels
bug Something isn't working

Comments

@dennismalmgren
Contributor

dennismalmgren commented Feb 16, 2024

Describe the bug

The KL divergence calculation in KLPENPPOLoss is always zero, so the KL penalty contributes nothing to the loss.

It seems that KLPENPPOLoss ends up building the previous and current distributions from the same values/parameters, which makes the KL divergence between them always zero.

To Reproduce

Replace ClipPPOLoss with KLPENPPOLoss in the PPO MuJoCo example code and remove the clip_epsilon argument:

    loss_module = KLPENPPOLoss(
        actor_network=actor,
        critic_network=critic,
        #clip_epsilon=cfg.loss.clip_epsilon,
        loss_critic_type=cfg.loss.loss_critic_type,
        entropy_coef=cfg.loss.entropy_coef,
        critic_coef=cfg.loss.critic_coef,
        normalize_advantage=True,
    )

Run the example. The code path used is

            kl = torch.distributions.kl.kl_divergence(previous_dist, current_dist)

(line 969).

Observe with a debugger that previous_dist and current_dist are always identical, which means that on line 973 kl is always 0.

Expected behavior

The previous and current distributions (or their log-probs, in the fallback path) should be computed from the old and the new parameters respectively, so the KL term is generally non-zero.
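
As an illustration with plain torch.distributions (independent of torchrl): if the "previous" distribution is rebuilt from the same freshly computed parameters as the current one, the penalty is identically zero, whereas building it from the cached old parameters gives the intended non-zero KL.

    import torch
    from torch.distributions import Normal
    from torch.distributions.kl import kl_divergence

    loc_old, scale_old = torch.tensor([0.0]), torch.tensor([1.0])  # cached "previous" parameters
    loc_new, scale_new = torch.tensor([0.3]), torch.tensor([1.2])  # freshly computed parameters

    current_dist = Normal(loc_new, scale_new)

    # Buggy pattern: the cached parameters were overwritten by the fresh ones,
    # so both distributions are identical and the KL penalty vanishes.
    previous_dist_buggy = Normal(loc_new, scale_new)
    print(kl_divergence(previous_dist_buggy, current_dist))  # tensor([0.])

    # Intended pattern: build the previous distribution from the old parameters.
    previous_dist = Normal(loc_old, scale_old)
    print(kl_divergence(previous_dist, current_dist))        # non-zero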

Screenshots

N/A

System info

Python 3.11.7, torchrl-nightly.

Additional context

N/A

Reason and Possible fixes

The intent seems to be to use cached logits, but the call to

log_weight, dist = self._log_weight(tensordict)

runs the fresh parameters on the tensordict and overwrites the cached logits, so both distributions end up being built from the same values. A hack that works for me is to save the incoming tensordict (in KLPENPPOLoss.forward) before it is cloned and pass it explicitly to build_dist_from_params, but I'm not sure what the ideal solution is.
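
For concreteness, a rough sketch of that workaround; the structure of KLPENPPOLoss.forward is abbreviated, the variable name tensordict_in is illustrative, and this is the reporter's hack rather than the upstream fix (that landed in #1922):

    # Sketch of the workaround only, not the upstream fix (#1922). The surrounding
    # structure of KLPENPPOLoss.forward is abbreviated; only the relevant lines are shown.
    def forward(self, tensordict):
        tensordict_in = tensordict            # keep the un-cloned input with the cached dist parameters
        tensordict = tensordict.clone(False)  # existing clone at the top of forward
        # ... advantage and critic handling unchanged ...
        log_weight, dist = self._log_weight(tensordict)  # writes fresh params into the clone
        # Build the previous distribution from the original input, whose cached
        # parameters were not overwritten by _log_weight, instead of from the clone:
        previous_dist = self.actor_network.build_dist_from_params(tensordict_in)
        # ... current_dist is built from the fresh parameters as before ...
        kl = torch.distributions.kl.kl_divergence(previous_dist, current_dist)
        # ... rest of the loss computation unchanged ...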

Checklist

  • [x] I have checked that there is no similar issue in the repo (required)
  • [x] I have read the documentation (required)
  • [x] I have provided a minimal working example to reproduce the bug (required)
@dennismalmgren dennismalmgren added the bug Something isn't working label Feb 16, 2024
@dennismalmgren dennismalmgren changed the title [BUG] kl divergence calculation in PPO is always zero [BUG] kl divergence calculation in KLPENPPOLoss is always zero Feb 16, 2024
@vmoens
Contributor

vmoens commented Feb 16, 2024

Almost done, but I'm struggling with the non-tensordict input to KLPENPPOLoss, which now needs to accept arbitrary keys for the distribution construction. Will keep you posted on the progress!

Thanks for reporting the bug :)
