
[new feature] train finishing ETA! #242

Open
stas00 opened this issue Jan 26, 2022 · 2 comments
Labels
Good First Issue Good for newcomers

Comments

@stas00
Contributor

stas00 commented Jan 26, 2022

I have to calculate the ETA for finishing training often enough that I think it should be a feature.

How about we log the ETA along with the elapsed time per iteration?

This is just the current `elapsed_time_per_iteration * (total_iterations - current_iteration) / (60*60*24)` - I think days is the most practical unit.

I don't remember if we have `total_iterations` - perhaps then `(total_samples - consumed_samples)` divided by `bs`.

But given that we support Curriculum Learning it should be `(total_tokens - consumed_tokens)` divided by `seqlen * bs`.
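
To make the arithmetic concrete, here is a minimal sketch of both variants (the counter names `total_iterations`, `total_tokens`, `consumed_tokens`, `seqlen` and `bs` are placeholders - the actual variables in the training loop may be named differently):

```python
# Minimal sketch of the ETA arithmetic. The counter names below are
# placeholders; the real variables in the training loop may differ.

SECONDS_PER_DAY = 60 * 60 * 24

def eta_days_from_iterations(elapsed_per_iter_s, total_iterations, current_iteration):
    """ETA in days based on remaining iterations."""
    remaining_iters = total_iterations - current_iteration
    return elapsed_per_iter_s * remaining_iters / SECONDS_PER_DAY

def eta_days_from_tokens(elapsed_per_iter_s, total_tokens, consumed_tokens, seqlen, bs):
    """ETA in days based on remaining tokens (works with Curriculum Learning)."""
    remaining_iters = (total_tokens - consumed_tokens) / (seqlen * bs)
    return elapsed_per_iter_s * remaining_iters / SECONDS_PER_DAY

# e.g. with the numbers from the log excerpt below (~3.69s/iter, ~287k iters left):
print(f"{eta_days_from_iterations(3.6941, 391390, 104000):.1f} days")  # -> 12.3 days
```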

Could also keep a running average for, say, the last 10-100 iterations or something of the sort. Note that the first iteration on resume is usually much slower than the rest. Also, during BS rampup things are quite inefficient, so history is not a good indicator of the future. But even estimating based on the last iteration alone is fine, since that number is usually very steady throughout a single run. It will, however, change from run to run, since JZ is inconsistent.
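
A running average could be as simple as a sliding window over the per-iteration timings; here is a possible sketch (the window size and the skip-the-first-iteration-after-resume behaviour are just the suggestions from above, not anything the code currently does):

```python
from collections import deque

class IterTimeTracker:
    """Sliding-window average of elapsed time per iteration."""

    def __init__(self, window=100, skip_first=1):
        self.times = deque(maxlen=window)  # keep only the last `window` timings
        self.skip_first = skip_first       # first iteration after resume is much slower

    def update(self, elapsed_s):
        if self.skip_first > 0:
            self.skip_first -= 1
            return
        self.times.append(elapsed_s)

    def avg(self):
        return sum(self.times) / len(self.times) if self.times else None
```

With `window=1` this degenerates to "use the last iteration only", which, as noted, is usually good enough within a single run.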

Could also log it separately only once every, say, 50 or 100 iterations, since logging it at every iteration could be too noisy. Not sure.

Those are just different ideas; any of them would be better than the current manual calculation.

Hope someone will be interested in experimenting and making a PR.

Thank you!

p.s. in case you haven't seen the current log, it looks like this (the token counters aren't logged here and only go to TB, but they are there).

 iteration   104000/  391390 | consumed samples:     48168384 | consumed tokens:  98648850432 | elapsed time per iteration (ms): 3694.1 | learning rate: 1.000E-05 | global batch size:   512 | lm loss: 2.889725E+00 | loss scale: 262144.0 | grad norm: 57850.024 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.139 | TFLOPs: 21.44 |
 iteration   104200/  391390 | consumed samples:     48270784 | consumed tokens:  98858565632 | elapsed time per iteration (ms): 3693.5 | learning rate: 1.000E-05 | global batch size:   512 | lm loss: 2.891522E+00 | loss scale: 262144.0 | grad norm: 56371.017 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.139 | TFLOPs: 21.44 |
[2022-01-26 06:16:06,643] [INFO] [stage_1_and_2.py:1648:step] [deepscale] OVERFLOW! Rank 0 Skipping step. Attempted loss scale: 262144.0, reducing to 131072.0
 iteration   104400/  391390 | consumed samples:     48373184 | consumed tokens:  99068280832 | elapsed time per iteration (ms): 3690.9 | learning rate: 1.000E-05 | global batch size:   512 | lm loss: 2.889314E+00 | loss scale: 131072.0 | grad norm: 28946.559 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.139 | TFLOPs: 21.46 |
 iteration   104600/  391390 | consumed samples:     48475584 | consumed tokens:  99277996032 | elapsed time per iteration (ms): 3700.0 | learning rate: 1.000E-05 | global batch size:   512 | lm loss: 2.888646E+00 | loss scale: 131072.0 | grad norm: 28253.955 | num zeros: 0.0 | number of skipped iterations:   0 | number of nan iterations:   0 | samples per second: 0.138 | TFLOPs: 21.41
stas00 added the Good First Issue label on Jan 26, 2022
@mayank31398
Collaborator

I want to work on this

@mayank31398
Collaborator

mayank31398 commented Jul 15, 2022

From what I see, this has a lot of intricacies: `ramp_up_batch_size`, `curriculum_learning` (changing sequence length),
so my suggestion is to compute it as `time_per_iter * (total_tokens - consumed_tokens) / 86400`.

However, I am not sure how to get the total number of tokens.

Also, I see an argument `--train-tokens`, but I don't think it is being used anywhere. (I could be wrong.)
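
If `--train-tokens` does turn out to carry the total token budget (which, as said, is not certain), the computation could look roughly like the sketch below; note it converts the remaining tokens into iterations via the current tokens-per-iteration, which with batch-size rampup and curriculum learning is only an approximation:

```python
SECONDS_PER_DAY = 86400

def eta_days(time_per_iter_s, total_tokens, consumed_tokens, tokens_per_iter):
    """Rough ETA in days from the remaining token budget.

    tokens_per_iter (current seqlen * global batch size) changes during
    batch-size rampup and curriculum learning, so this is only an estimate.
    """
    remaining_iters = max(total_tokens - consumed_tokens, 0) / tokens_per_iter
    return time_per_iter_s * remaining_iters / SECONDS_PER_DAY
```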
