
[HELP] Zero-3 on partial model to fix the input/output constant constraint #6642

Closed
BoyeGuillaume opened this issue Oct 19, 2024 · 9 comments

Comments

@BoyeGuillaume

Hello,

We have a multimodal model that is composed of multiple small "embedding" models followed by a large LLM. Because of the scale of the training, we need a multi-node setup, and we would like to use ZeRO-3 to reduce the memory footprint of the optimizer state.

Because of this, the input/output of the model may vary in size (and the same can be said for the model architecture). This prohibits us from using ZeRO-3 altogether. Do you know if there is a way to start using ZeRO-3 in the middle of the model (i.e., at the boundary between the LLM and the embedded inputs)? Note that we still need gradients to be backpropagated to those embedders, so we cannot simply treat the embeddings as the input.

I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

Do you have any ideas or suggestions on how we could achieve this (if possible)?

Thanks for your time and help!

@tjruwase
Contributor

@BoyeGuillaume, thanks for your question. I think getting more details would be helpful to understand your specific need.

Yes, I think that using a ZeRO-3 LLM in the middle of your model should work to forward the loss and backpropagate gradients, as below:

input -> SLM -> embed -> ZeRO-3_LLM -> loss
SLM <- grad <- ZeRO-3_LLM <- grad

Can you please try that and share any issues?
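
For concreteness, here is a minimal, untested sketch of that layout, assuming an HF-style causal LM that accepts inputs_embeds. SmallEmbedder, BigLLM, the config values, and dataloader are placeholders, and mixed precision and gradient averaging for the embedder would need extra care:

```python
import os
import torch
import deepspeed
from torch.nn.parallel import DistributedDataParallel as DDP

deepspeed.init_distributed()
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# The small embedder stays outside ZeRO-3: plain DDP plus its own optimizer.
embedder = DDP(SmallEmbedder().to(local_rank), device_ids=[local_rank])  # SmallEmbedder is a placeholder
emb_optimizer = torch.optim.AdamW(embedder.parameters(), lr=1e-4)

# Only the LLM is handed to DeepSpeed, so only its parameters are ZeRO-3 partitioned.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
llm_engine, _, _, _ = deepspeed.initialize(model=BigLLM(), config=ds_config)  # BigLLM is a placeholder

for batch in dataloader:                                    # dataloader is assumed to exist
    embeddings = embedder(batch["inputs"])                  # gradients will flow back through this tensor
    outputs = llm_engine(inputs_embeds=embeddings, labels=batch["labels"])
    llm_engine.backward(outputs.loss)                       # autograd also reaches the embedder
    llm_engine.step()                                       # updates the partitioned LLM parameters
    emb_optimizer.step()                                    # updates the embedder separately
    emb_optimizer.zero_grad()
```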

In case you are unaware, HF multimodal IDEFICS-80B was trained with ZeRO-3. The following links might be useful.

  1. https://x.com/StasBekman/status/1694004904761987249
  2. https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/30

> I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

You are correct that deepspeed.utils.set_z3_leaf_modules is irrelevant for this case.

@BoyeGuillaume
Author

Thank you for your help; I'll check whether this fixes the issue.

@BoyeGuillaume
Author

I may be wrong, but it seems that in the case of IDEFICS-80B the image projection and the text embeddings are considered inputs of the network (by that I mean that you cannot train the SLM, as there is no gradient past the embedding).

@BoyeGuillaume
Author

My question is whether it would be possible to apply ZeRO-3 optimization to only a portion of the "model" (i.e., the LLM part) that is always the same.

@tjruwase
Contributor

Thanks for the clarification of your scenario. Yes, ZeRO-3 can be applied to only the LLM portion of the model. In our RLHF work, the actor, critic, reward, and reference models are configured with different ZeRO-* optimizations, as in this example script.
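
Roughly, the pattern in that script looks like the following (a sketch with hypothetical models and illustrative configs, not the actual DeepSpeed-Chat code):

```python
import deepspeed

# Each model gets its own engine and its own ZeRO stage.
actor_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},   # large, trained model: fully partitioned
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
reward_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 0},   # smaller, frozen model: no partitioning
}

# actor_model and reward_model are placeholders for already-constructed modules.
actor_engine, _, _, _ = deepspeed.initialize(model=actor_model, config=actor_config)
reward_engine, _, _, _ = deepspeed.initialize(model=reward_model, config=reward_config)
```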

I am curious whether the above RLHF example matches your scenario. Can you share your scenario code or pseudo-code, so we can discuss more concretely?

@BoyeGuillaume
Author

Thanks, I'll check this out.

Concerning our architecture, our entire model fits within a single PyTorch Module that consists of the LLM and all of the models for embedding the modalities. We then use the HuggingFace Trainer (a modified version, as we have additional masking to do) and launch the entire pipeline with PyTorch (for distributed training). The HuggingFace Trainer takes the DeepSpeed configuration directly.
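
For illustration, the structure is roughly the following (a sketch only; module, dataset, and file names are hypothetical, and the LLM is assumed to return HF-style outputs):

```python
import torch
import torch.nn as nn
from transformers import Trainer, TrainingArguments

class MultimodalModel(nn.Module):
    """One top-level module holding the small embedders and the large LLM."""
    def __init__(self, embedders: nn.ModuleDict, llm: nn.Module):
        super().__init__()
        self.embedders = embedders   # small per-modality embedding models
        self.llm = llm               # the large language model

    def forward(self, modalities, labels=None):
        # Embed each modality and concatenate along the sequence dimension.
        embeds = torch.cat([self.embedders[name](x) for name, x in modalities.items()], dim=1)
        return self.llm(inputs_embeds=embeds, labels=labels)

# The (customized) HF Trainer receives the DeepSpeed config directly.
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
trainer = Trainer(model=MultimodalModel(embedders, llm), args=args, train_dataset=train_ds)
trainer.train()
```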

@BoyeGuillaume
Author

It seems to be doable; however, we will probably end up getting rid of the HuggingFace Trainer (as it seems to be doing a lot of dirty things in the background 🙃).

Thanks for the help

@tjruwase
Contributor

> Concerning our architecture, our entire model fits within a single PyTorch Module that consists of the LLM and all of the models for embedding the modalities.

In that case, another option could be to use the stage3_param_persistence_threshold configuration to restrict ZeRO-3 optimization to only the large parameters of the model.
https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
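
For example, something along these lines (a sketch; the threshold value is purely illustrative, not a recommendation):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Parameters smaller than this threshold (measured in parameter elements per the docs above)
        # are kept persistent rather than partitioned, so ZeRO-3 partitioning effectively
        # applies only to the large LLM weight matrices.
        "stage3_param_persistence_threshold": 1_000_000,
    },
}
```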

You can examine the following lines in your log to observe the effectiveness of this approach:
https://github.com/microsoft/DeepSpeed/blob/6e6563d3c8d7527713cc48d4a3adce51f22e83a2/deepspeed/runtime/zero/parameter_offload.py#L253-L255

@tohtana
Contributor

tohtana commented Nov 8, 2024

Closing as the issue has been resolved. Please feel free to reopen if needed.

@tohtana tohtana closed this as completed Nov 8, 2024