Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dvclive callback: warn instead of fail when logging non-scalars #27608

Merged
merged 2 commits into from
Nov 21, 2023

Conversation

dberenbaum
Copy link
Contributor

What does this PR do?

Fixes #27352 (comment). This will warn instead of fail when trying to log non-scalars as metrics.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@muellerz Could you please take a look?

@dberenbaum
Copy link
Contributor Author

@muellerz This makes the tests pass, but I'm not sure if it's intended that the test here logs the learning rate as a list rather than as a scalar (which will fail under several of the existing loggers, but only with a warning like in this PR).

logs["learning_rate"] = self.lr_scheduler._last_lr

self.lr_scheduler._last_lr is a list. Should a scalar value be extracted like self.lr_scheduler._last_lr[0]? That's the value being tested later as ["learning_rate"][0]:

for i, log in enumerate(logs[:-1]): # Compare learning rate to next epoch's
loss = log["eval_loss"]
just_decreased = False
if loss > best_loss:
bad_epochs += 1
if bad_epochs > patience:
self.assertLess(logs[i + 1]["learning_rate"][0], log["learning_rate"][0])
just_decreased = True
bad_epochs = 0
else:
best_loss = loss
bad_epochs = 0
if not just_decreased:
self.assertEqual(logs[i + 1]["learning_rate"][0], log["learning_rate"][0])

Everywhere else in the codebase, it looks like a scalar is extracted:

def _get_learning_rate(self):
if self.is_deepspeed_enabled:
# with deepspeed's fp16 and dynamic loss scale enabled the optimizer/scheduler steps may
# not run for the first few dozen steps while loss scale is too large, and thus during
# that time `get_last_lr` will fail if called during that warm up stage, so work around it:
try:
last_lr = self.lr_scheduler.get_last_lr()[0]
except AssertionError as e:
if "need to call step" in str(e):
logger.warning("tried to get lr value before scheduler/optimizer started stepping, returning lr=0")
last_lr = 0
else:
raise
else:
if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
last_lr = self.optimizer.param_groups[0]["lr"]
else:
last_lr = self.lr_scheduler.get_last_lr()[0]
if torch.is_tensor(last_lr):
last_lr = last_lr.item()
return last_lr

tensorboard_logs = {"loss": loss, "rate": lr_scheduler.get_last_lr()[-1]}

@muellerzr
Copy link
Contributor

In the future it's @muellerzr @dberenbaum, don't want to be pinging random people :)

Yes, let's go with [0] as the one being extracted/the scalar.

Copy link
Contributor

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the fix! If we can log just the scalar too as part of this that would be great too. Otherwise this PR LG2M. Appreciate the quickfix :)

@dberenbaum
Copy link
Contributor Author

@muellerzr Apologies to you and the other person who was pinged here Zach! Added the change to the test in the last commit. The current test failures look unrelated.

Copy link
Collaborator

@ArthurZucker ArthurZucker left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks 😉

@ArthurZucker ArthurZucker merged commit 8eb9e29 into huggingface:main Nov 21, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants