lr scheduler defined in config cannot be overwritten by lr scheduler defined in code and passed to deepspeed.initialize [BUG] #5726

Closed
xiyang-aads-lilly opened this issue Jul 5, 2024 · 3 comments
Labels: bug, training

xiyang-aads-lilly commented Jul 5, 2024

Describe the issue
Note: this is not a bug but an inconsistent design.

In DeepSpeed's current implementation of the learning rate (LR) scheduler configuration, the LR scheduler is always initialized from the configuration file if it is defined there, regardless of whether a scheduler is passed programmatically. This is inconsistent with the optimizer, which can be overridden programmatically even when it is defined in the configuration file.
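
For illustration, here is a minimal sketch of the inconsistency (the model, optimizer, scheduler, and config values are placeholders, not a real training setup):

    import torch
    import deepspeed

    model = torch.nn.Linear(10, 10)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    client_scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=1000)

    ds_config = {
        "train_batch_size": 8,
        # Because this section exists, the client scheduler below is ignored.
        "scheduler": {
            "type": "WarmupLR",
            "params": {"warmup_num_steps": 100},
        },
    }

    engine, ds_optimizer, _, ds_scheduler = deepspeed.initialize(
        model=model,
        optimizer=optimizer,            # overrides any config-defined optimizer
        lr_scheduler=client_scheduler,  # silently loses to the config "scheduler"
        config=ds_config,
    )
    # ds_scheduler is the WarmupLR built from the config, not CosineAnnealingLR.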

Here is the current implementation of the LR scheduler configuration:

    def _configure_lr_scheduler(self, client_lr_scheduler):
        # First check for scheduler in json configuration
        lr_scheduler = self._scheduler_from_config(self.optimizer)
        if lr_scheduler:
            log_dist(f"DeepSpeed using configured LR scheduler = {self.scheduler_name()}", ranks=[0])
            self.lr_scheduler = lr_scheduler
        else:
            if isinstance(client_lr_scheduler, Callable):
                log_dist('DeepSpeed using client callable to create LR scheduler', ranks=[0])
                self.lr_scheduler = client_lr_scheduler(self.basic_optimizer)
            else:
                log_dist('DeepSpeed using client LR scheduler', ranks=[0])
                self.lr_scheduler = client_lr_scheduler

        log_dist(f'DeepSpeed LR Scheduler = {self.lr_scheduler}', ranks=[0])

Expected behavior
We should be able to override the LR scheduler defined in the config. Ideally, something like:

    def _configure_lr_scheduler(self, client_lr_scheduler):
        # Prefer a client-provided scheduler; fall back to the json configuration
        if client_lr_scheduler:
            if isinstance(client_lr_scheduler, Callable):
                log_dist('DeepSpeed using client callable to create LR scheduler', ranks=[0])
                self.lr_scheduler = client_lr_scheduler(self.basic_optimizer)
            else:
                log_dist('DeepSpeed using client LR scheduler', ranks=[0])
                self.lr_scheduler = client_lr_scheduler
        else:
            lr_scheduler = self._scheduler_from_config(self.optimizer)
            log_dist(f"DeepSpeed using configured LR scheduler = {self.scheduler_name()}", ranks=[0])
            self.lr_scheduler = lr_scheduler

        log_dist(f'DeepSpeed LR Scheduler = {self.lr_scheduler}', ranks=[0])
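
With this precedence, the deepspeed.initialize call sketched earlier would return the client's CosineAnnealingLR instead of the config's WarmupLR, mirroring how the optimizer argument already behaves.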

Why does the current design enforce the initialization of the LR scheduler from the configuration file if it is defined there, while allowing the optimizer to be overwritten programmatically?

xiyang-aads-lilly added the bug and training labels Jul 5, 2024

tjruwase commented Aug 3, 2024

@xiyang-aads-lilly, this is a good catch. Your proposed solution looks reasonable. Are you able to provide a PR? Thanks!

xiyang-aads-lilly (Author) commented

> @xiyang-aads-lilly, this is a good catch. Your proposed solution looks reasonable. Are you able to provide a PR? Thanks!

Sure. I will create PR next week.

github-merge-queue bot pushed a commit that referenced this issue Aug 14, 2024
This PR is based on #5726.

The current LR scheduler initialization always prioritizes the config over a scheduler defined manually in code. The optimizer initialization, however, prioritizes a manually defined optimizer over the config. This PR makes the initialization behavior consistent for both the optimizer and the LR scheduler: if an LR scheduler is defined in code, it overwrites the config.

---------

Co-authored-by: Olatunji Ruwase <olruwase@microsoft.com>

tohtana commented Sep 12, 2024

Closing as this issue was resolved by #5846. Please feel free to reopen when needed.

tohtana closed this as completed Sep 12, 2024