
Raise error if max duration is in epochs and dataloader is infinite #1942

Merged: dakinggg merged 4 commits into mosaicml:dev from max_dur_no_len on Feb 4, 2023

Conversation

@dakinggg (Contributor) commented Feb 3, 2023

What does this PR do?

Raises an error if the train dataloader is infinite and max duration is specified in epochs.

UX:

In [11]: trainer.fit()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 trainer.fit()

File ~/github/composer/composer/trainer/trainer.py:1693, in Trainer.fit(self, train_dataloader, train_dataloader_label, train_subset_num_batches, duration, reset_time, schedulers, scale_schedule_ratio, step_schedulers_every_batch, eval_dataloader, eval_subset_num_batches, eval_interval, grad_accum, device_train_microbatch_size, precision)
   1690     _raise_missing_argument_exception('max_duration')
   1692 if self.state.dataloader_len is None and self.state.max_duration.unit == TimeUnit.EPOCH:
-> 1693     raise ValueError(
   1694         ('max_duration cannot be specified in epochs when using an infinite dataloader. Please either '
   1695          'provide a dataloader with a length, specify max_duration in batches, samples, or tokens, or provide '
   1696          'train_subset_num_batches.'))
   1698 if self.state.max_duration <= self.state.timestamp.get(self.state.max_duration.unit) and not reset_time:
   1699     raise ValueError(
   1700         (f'The max_duration ({self.state.max_duration}) is less than or equal to the elapsed training duration '
   1701          f'({self.state.timestamp.get(self.state.max_duration.unit)}). No training would occur. '
   1702          'Please provide the `duration` or specify `reset_time=True` in Trainer.fit().'))

ValueError: max_duration cannot be specified in epochs when using an infinite dataloader. Please either provide a dataloader with a length, specify max_duration in batches, samples, or tokens, or provide train_subset_num_batches.
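For context, a minimal reproduction sketch (not from the PR; the dataset and model here are hypothetical stand-ins, and the duration strings use Composer's usual 'ep'/'ba'/'sp' units):

import torch
from torch.utils.data import DataLoader, IterableDataset

from composer import Trainer

class InfiniteDataset(IterableDataset):
    # No __len__, so the resulting dataloader has no length and
    # Composer treats it as infinite.
    def __iter__(self):
        while True:
            yield torch.randn(8), torch.randint(0, 2, (1,))

loader = DataLoader(InfiniteDataset(), batch_size=16)

# `my_model` is a hypothetical ComposerModel.
trainer = Trainer(model=my_model, train_dataloader=loader, max_duration='1ep')
trainer.fit()  # raises the ValueError shown above

# Any of the fixes suggested in the error message avoids it, e.g.:
#   Trainer(..., max_duration='100ba')            # batches instead of epochs
#   Trainer(..., max_duration='10000sp')          # samples
#   Trainer(..., train_subset_num_batches=100)    # gives the epoch a finite length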

What issue(s) does this change relate to?

Closes CO-1738

Before submitting

  • Have you read the contributor guidelines?
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg changed the title from "Max dur no len" to "Raise error if max duration is in epochs and dataloader is infinite" on Feb 3, 2023
@mvpatel2000 (Contributor) left a comment


LGTM. Maybe extend the test to ensure an infinite dataloader works in the other cases described in the error message, by parameterizing the test.
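A hedged sketch of what that parameterized test might look like (the `dummy_model` and `infinite_dataloader` fixtures are hypothetical, not from this PR):

import pytest
from composer import Trainer

@pytest.mark.parametrize('max_duration', ['10ba', '100sp'])
def test_infinite_dataloader_non_epoch_units(max_duration, dummy_model, infinite_dataloader):
    # Non-epoch units should train without error even when the dataloader has no length.
    trainer = Trainer(model=dummy_model,
                      train_dataloader=infinite_dataloader,
                      max_duration=max_duration)
    trainer.fit()

def test_infinite_dataloader_epochs_raises(dummy_model, infinite_dataloader):
    # Epoch units should fail fast with the new ValueError.
    trainer = Trainer(model=dummy_model,
                      train_dataloader=infinite_dataloader,
                      max_duration='1ep')
    with pytest.raises(ValueError, match='max_duration cannot be specified in epochs'):
        trainer.fit()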

@eracah (Contributor) left a comment


LGTM

@dakinggg (Contributor, Author) commented Feb 4, 2023

Good call, added some more tests.

@dakinggg enabled auto-merge (squash) on February 4, 2023 02:06
@dakinggg merged commit bb856ad into mosaicml:dev on Feb 4, 2023
@dakinggg deleted the max_dur_no_len branch on September 9, 2023