
Raise error if max duration is in epochs and dataloader is infinite #1942

Merged: dakinggg merged 4 commits into mosaicml:dev from max_dur_no_len on Feb 4, 2023

Conversation

@dakinggg (Contributor) commented Feb 3, 2023

What does this PR do?

Raises an error if the train dataloader is infinite and max duration is specified in epochs.

UX:

In [11]: trainer.fit()
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[11], line 1
----> 1 trainer.fit()

File ~/github/composer/composer/trainer/trainer.py:1693, in Trainer.fit(self, train_dataloader, train_dataloader_label, train_subset_num_batches, duration, reset_time, schedulers, scale_schedule_ratio, step_schedulers_every_batch, eval_dataloader, eval_subset_num_batches, eval_interval, grad_accum, device_train_microbatch_size, precision)
   1690     _raise_missing_argument_exception('max_duration')
   1692 if self.state.dataloader_len is None and self.state.max_duration.unit == TimeUnit.EPOCH:
-> 1693     raise ValueError(
   1694         ('max_duration cannot be specified in epochs when using an infinite dataloader. Please either '
   1695          'provide a dataloader with a length, specify max_duration in batches, samples, or tokens, or provide '
   1696          'train_subset_num_batches.'))
   1698 if self.state.max_duration <= self.state.timestamp.get(self.state.max_duration.unit) and not reset_time:
   1699     raise ValueError(
   1700         (f'The max_duration ({self.state.max_duration}) is less than or equal to the elapsed training duration '
   1701          f'({self.state.timestamp.get(self.state.max_duration.unit)}). No training would occur. '
   1702          'Please provide the `duration` or specify `reset_time=True` in Trainer.fit().'))

ValueError: max_duration cannot be specified in epochs when using an infinite dataloader. Please either provide a dataloader with a length, specify max_duration in batches, samples, or tokens, or provide train_subset_num_batches.
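For context, a minimal reproduction sketch (not from the PR; the dataset and model here are hypothetical stand-ins, and the duration strings use Composer's usual 'ep'/'ba'/'sp' units):

import torch
from torch.utils.data import DataLoader, IterableDataset

from composer import Trainer

class InfiniteDataset(IterableDataset):
    # No __len__, so the resulting dataloader has no length and
    # Composer treats it as infinite.
    def __iter__(self):
        while True:
            yield torch.randn(8), torch.randint(0, 2, (1,))

loader = DataLoader(InfiniteDataset(), batch_size=16)

# `my_model` is a hypothetical ComposerModel.
trainer = Trainer(model=my_model, train_dataloader=loader, max_duration='1ep')
trainer.fit()  # raises the ValueError shown above

# Any of the fixes suggested in the error message avoids it, e.g.:
#   Trainer(..., max_duration='100ba')            # batches instead of epochs
#   Trainer(..., max_duration='10000sp')          # samples
#   Trainer(..., train_subset_num_batches=100)    # gives the epoch a finite length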

What issue(s) does this change relate to?

Closes CO-1738

Before submitting

  • Have you read the contributor guidelines?
  • Was this change discussed/approved in a GitHub issue first? It is much more likely to be merged if so.
  • Did you update any related docs and document your change?
  • Did you update any related tests and add any new tests related to your change? (see testing)
  • Did you run the tests locally to make sure they pass?
  • Did you run pre-commit on your change? (see the pre-commit section of prerequisites)

@dakinggg changed the title from "Max dur no len" to "Raise error if max duration is in epochs and dataloader is infinite" on Feb 3, 2023
@mvpatel2000 (Contributor) left a comment


LGTM. Maybe extend the test to ensure an infinite dataloader works in the other cases described in the error message, by parameterizing the test.
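A hedged sketch of what that parameterized test might look like (the `dummy_model` and `infinite_dataloader` fixtures are hypothetical, not from this PR):

import pytest
from composer import Trainer

@pytest.mark.parametrize('max_duration', ['10ba', '100sp'])
def test_infinite_dataloader_non_epoch_units(max_duration, dummy_model, infinite_dataloader):
    # Non-epoch units should train without error even when the dataloader has no length.
    trainer = Trainer(model=dummy_model,
                      train_dataloader=infinite_dataloader,
                      max_duration=max_duration)
    trainer.fit()

def test_infinite_dataloader_epochs_raises(dummy_model, infinite_dataloader):
    # Epoch units should fail fast with the new ValueError.
    trainer = Trainer(model=dummy_model,
                      train_dataloader=infinite_dataloader,
                      max_duration='1ep')
    with pytest.raises(ValueError, match='max_duration cannot be specified in epochs'):
        trainer.fit()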

@eracah (Contributor) left a comment


LGTM

@dakinggg (Contributor, Author) commented Feb 4, 2023

Good call, added some more tests.

@dakinggg enabled auto-merge (squash) on February 4, 2023 02:06
@dakinggg merged commit bb856ad into mosaicml:dev on Feb 4, 2023
@dakinggg deleted the max_dur_no_len branch on September 9, 2023