Add metrics to collect during training #3

Closed
wanchaol opened this issue Jan 23, 2024 · 4 comments · Fixed by #56, #57, #60 or #151

Comments

@wanchaol
Contributor

See https://github.com/pytorch-labs/torchtrain/blob/main/train.py#L87

We should collect the following metrics for each train step:

  1. GPU memory usage
  2. wps
  3. loss
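
A minimal sketch of how the three metrics above could be gathered per train step. This is not the code in torchtrain's `train.py`; `StepMetrics`, `MetricsLogger`, and the `memory_fn` and `clock` hooks are hypothetical names used for illustration, and `wps` is assumed here to mean words (tokens) processed per second of wall-clock time. When PyTorch with CUDA is available, the default hook reads `torch.cuda.max_memory_allocated()`; otherwise it reports 0.

```python
import time
from dataclasses import dataclass
from typing import Callable, List


def default_gpu_memory_bytes() -> int:
    """Peak GPU memory allocated so far, or 0 when PyTorch/CUDA is unavailable."""
    try:
        import torch
    except ImportError:
        return 0
    if not torch.cuda.is_available():
        return 0
    return int(torch.cuda.max_memory_allocated())


@dataclass
class StepMetrics:
    step: int
    loss: float
    wps: float  # assumed meaning: tokens processed per second
    gpu_memory_bytes: int


class MetricsLogger:
    """Hypothetical helper that records loss, wps, and GPU memory per step."""

    def __init__(
        self,
        memory_fn: Callable[[], int] = default_gpu_memory_bytes,
        clock: Callable[[], float] = time.perf_counter,
    ) -> None:
        self._memory_fn = memory_fn
        self._clock = clock
        self._last_time = clock()  # start timing from construction
        self.history: List[StepMetrics] = []

    def log_step(self, step: int, loss: float, num_tokens: int) -> StepMetrics:
        now = self._clock()
        elapsed = now - self._last_time
        self._last_time = now
        # Guard against a zero interval on very coarse clocks.
        wps = num_tokens / elapsed if elapsed > 0 else 0.0
        metrics = StepMetrics(step, float(loss), wps, self._memory_fn())
        self.history.append(metrics)
        return metrics
```

In a training loop, `logger.log_step(step, loss.item(), num_tokens)` would be called once after each optimizer step. Injecting the clock and the memory reader keeps the helper testable on machines without a GPU.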
@wanchaol changed the title from "Add metrics to collective during training" to "Add metrics to collect during training" on Jan 23, 2024
@lessw2020
Contributor

I can work on this one, starting with GPU stats.
I think we also want MFU (model FLOPs utilization).

@lessw2020
Contributor

And also a FLOP counter.
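
For reference, MFU (model FLOPs utilization) is the ratio of the FLOPs a model actually performs per second to the hardware's theoretical peak. A rough sketch under stated assumptions: training FLOPs per token are estimated as 6 × parameter count (a common approximation for dense transformers that ignores attention FLOPs), and the peak throughput is a value the caller supplies for their hardware. The function names are hypothetical, not torchtrain APIs.

```python
def estimate_flops_per_token(num_params: int) -> int:
    """Approximate training FLOPs per token as 6 * N (assumption: dense
    transformer, forward + backward, attention FLOPs ignored)."""
    return 6 * num_params


def compute_mfu(
    tokens_per_second: float,
    num_params: int,
    peak_flops_per_second: float,
    num_gpus: int = 1,
) -> float:
    """Return achieved FLOPs/s divided by aggregate peak FLOPs/s."""
    if peak_flops_per_second <= 0 or num_gpus <= 0:
        raise ValueError("peak_flops_per_second and num_gpus must be positive")
    achieved = tokens_per_second * estimate_flops_per_token(num_params)
    return achieved / (peak_flops_per_second * num_gpus)
```

A measured count from a FLOP counter could replace the 6N estimate where more accuracy is needed; the ratio itself stays the same.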

@gnadathur
Contributor

@lessw2020, @tianyu-l: can we close this issue?

@gnadathur
Contributor

Closing as per conversation with @lessw2020.

jinsun-yoo pushed a commit to jinsun-yoo/torchtitan that referenced this issue Oct 30, 2024