[Feature Request] Support deepspeed integration #627

Open
nijkah opened this issue Oct 18, 2022 · 3 comments
nijkah commented Oct 18, 2022

Describe the feature

Motivation
DeepSpeed has become a fundamental framework for training and running inference with large-scale and foundation models.
We are developing a deepspeed integration for mmengine that adds a deepspeed-specific runner and optim_wrapper.

Does MMEngine plan to support deepspeed?
If so, we can contribute our implementation to MMEngine :)

Please share any guidance, plans, or opinions on this. :)

C1rN09 commented Oct 18, 2022

Hi @nijkah, we welcome any kind of contribution, and deepspeed integration is definitely something we want!
However, could you clarify what you mean by "deepspeed-specified runner and optim_wrapper"? If you plan to write a new runner that only serves deepspeed models, that does not seem quite reasonable and we might need more discussion on it ^_^

C1rN09 commented Nov 2, 2022

Hi @nijkah, have you made any progress on the deepspeed integration? We hope to discuss it before you post a PR, because it will probably not be a small or easy one. If you have any ideas, problems, or progress to share, we are always open to discussion, either in this issue or on our discussion board.

nijkah commented Nov 4, 2022

Hi @C1rN09. Our integration work is almost done, although several design choices are still open.

Our current implementation supports:

  1. ZeRO-1, ZeRO-2, and ZeRO-3
  2. Saving a monolithic weight file (DeepSpeed saves its model weights and optimizer state in separate files, and the number of saved files is multiplied by the world_size; a consolidation sketch follows these lists)

It does not yet support:

  1. FP16 (there is a way to support it, but the solution is quite messy)
  2. Mixture of Experts
  3. Pipeline parallelism (it requires logic to sequentialize MM models)
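
For context on item 2, DeepSpeed ships a zero_to_fp32 utility that can consolidate the per-rank ZeRO shards offline; a minimal sketch of that path (the work_dir paths are placeholders, and this is separate from our in-runner saving logic):

        import torch
        from deepspeed.utils.zero_to_fp32 import \
            get_fp32_state_dict_from_zero_checkpoint

        # Gather the sharded ZeRO checkpoint (one file per rank) into a
        # single fp32 state dict, then save it as a plain PyTorch file.
        state_dict = get_fp32_state_dict_from_zero_checkpoint(
            'work_dir/checkpoints')
        torch.save(state_dict, 'work_dir/model_fp32.pth')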

There are several reasons why we wrote a new deepspeed-dedicated runner.
Although we follow most of mmengine's Runner logic, some modifications are needed to support deepspeed.

The main logic of DeepSpeedRunner looks like this:

        self.model = self.build_model(model)
        self.optim_wrapper = self.build_optim_wrapper(optim_wrapper)
        with open(cfg.deepspeed_config) as f:
            ds_config = json.load(f)
        # deepspeed.initialize returns a 4-tuple
        # (engine, optimizer, dataloader, lr_scheduler);
        # only the first two are used here.
        self.model, optimizer, _, _ = deepspeed.initialize(
            model=self.model,
            optimizer=self.optim_wrapper.optimizer,
            model_parameters=self.model.parameters(),
            config=ds_config)
        self.optim_wrapper.optimizer = optimizer
        self.inject_base_model_methods()

First, the order of the setup logic has to change when using deepspeed. There was a similar modification in your FSDP PR, so this concern may go away in the future.
Second, to use deepspeed it seems better to rely on DeepSpeedEngine's internal optimizer logic, which means we have to pass the optimizer to deepspeed.initialize or DeepSpeedEngine.

Moreover, DeepSpeedEngine requires users to update parameters through engine.step(), which wraps optimizer.step and the related logic. This is what led us to write a new DeepSpeedOptimWrapper class, sketched below.
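
A rough sketch of what that wrapper could look like, assuming mmengine's OptimWrapper interface (the engine wiring here is illustrative, not our final implementation):

        from mmengine.optim import OptimWrapper

        class DeepSpeedOptimWrapper(OptimWrapper):
            """Delegate backward/step to a DeepSpeedEngine (sketch)."""

            def __init__(self, optimizer, engine):
                super().__init__(optimizer=optimizer)
                # `engine` is the DeepSpeedEngine from deepspeed.initialize.
                self.engine = engine

            def backward(self, loss, **kwargs):
                # DeepSpeed handles loss scaling and gradient accumulation
                # internally, so delegate instead of calling loss.backward().
                self.engine.backward(loss, **kwargs)

            def step(self, **kwargs):
                # engine.step() runs optimizer.step(), gradient clipping and
                # zero_grad as configured in the DeepSpeed config.
                self.engine.step()

            def zero_grad(self, **kwargs):
                # No-op: engine.step() already zeroes the gradients.
                pass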

I think it is better to share our prototype code when it is ready rather than explain everything in writing.
We can share a link to the repo containing the code before posting the PR.
