When to stop training #136

Open

dkokron opened this issue Feb 9, 2025 · 3 comments

Comments

@dkokron

dkokron commented Feb 9, 2025

From issue #80:

"and we trained with 300k steps batch 32 each, which corresponds to about ~180 epochs"

What metric did you monitor to know when to stop training?
Did you ever redo Figure 3 from https://arxiv.org/pdf/2312.15796 for a model trained for fewer than 300k steps?

@andrewlkd
Collaborator

"and we trained with 300k steps batch 32 each, which corresponds to about ~180 epochs"

Just to clarify, the 300k-step figure is with respect to GraphCast. As Table D1 in https://arxiv.org/pdf/2312.15796 shows, GenCast is trained for 2 million steps at 1deg and then a further 64k steps at 0.25deg. At batch size 32, this corresponds to ~1200 epochs and ~38 epochs respectively.
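
For reference, those epoch counts follow directly from steps × batch size ÷ dataset size. A quick sketch of the arithmetic in Python, assuming the ~54k-sample training set mentioned later in this thread:

def approx_epochs(total_steps: int, batch_size: int, dataset_size: int) -> float:
    # Epochs = total samples drawn during training / samples in the dataset.
    return total_steps * batch_size / dataset_size

DATASET_SIZE = 54_000  # approximate training set size (assumption taken from this thread)

print(approx_epochs(2_000_000, 32, DATASET_SIZE))  # ~1185, i.e. roughly 1200 epochs at 1deg
print(approx_epochs(64_000, 32, DATASET_SIZE))     # ~38 epochs at 0.25deg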

What metric did you monitor to know when to stop training?

The training steps were chosen empirically from sweeps.

Did you ever redo figure 3 from https://arxiv.org/pdf/2312.15796 for a model having less than 300K steps?

Unfortunately not.

@dkokron
Author

dkokron commented Feb 9, 2025

I took the 2M value in Table D1 to be the "decay_steps" value in the AdamW configuration. Am I correct in understanding that you used 2M for "decay_steps" and also ran 2M sampling steps, each drawing a batch of 32 samples from the 54K total samples in the dataset?

import optax

# Linear warmup to the peak learning rate, then cosine decay back to zero.
lr_schedule = optax.schedules.warmup_cosine_decay_schedule(
    init_value=0.0,         # learning rate at step 0
    peak_value=0.003,       # learning rate reached at the end of warmup
    warmup_steps=1_000,
    decay_steps=2_000_000,  # total schedule length, including warmup
    end_value=0.0,
    exponent=0.1,
)
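
For completeness, a minimal sketch (continuing from the snippet above, not taken from the paper) of how such a schedule would typically be passed to AdamW in optax; the weight_decay value is a placeholder:

# Hypothetical wiring of the schedule into AdamW; weight_decay is a
# placeholder, not a GenCast hyperparameter.
optimizer = optax.adamw(learning_rate=lr_schedule, weight_decay=0.1)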

@andrewlkd
Collaborator

As stated in the table, 2M is the "Total train steps". Since there are 1k warmup steps, that leaves (2M - 1k) steps of cosine decay.

Since the warmup/decay schedule only affects the learning rate, your reading is otherwise correct: there are 2M sampling steps, each drawing a batch of 32 samples from the 54K total samples in the dataset.
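
For concreteness, a minimal sketch checking that reading of the schedule, using the hyperparameters from the snippet above (optax treats decay_steps as the total schedule length, so the cosine phase spans the remaining 2M - 1k steps):

import optax

schedule = optax.schedules.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=0.003,
    warmup_steps=1_000, decay_steps=2_000_000,
    end_value=0.0, exponent=0.1,
)

print(schedule(0))          # 0.0    -> start of the 1k-step linear warmup
print(schedule(1_000))      # 0.003  -> peak, cosine decay begins
print(schedule(2_000_000))  # 0.0    -> end of the 2M-step schedule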

Hope this helps

-- Andrew
