[HELP] Zero-3 on partial model to fix the input/output constant constraint #6642
Comments
@BoyeGuillaume, thanks for your question. More details would help us understand your specific need. Yes, I think that using a ZeRO-3 LLM in the middle of your model should work to forward the loss and backpropagate gradients, as below:

input -> SLM -> embed -> ZeRO-3_LLM -> loss
SLM <- grad <- ZeRO-3_LLM <- grad

Can you please try that and share any issues? In case you are unaware, the HF multimodal IDEFICS-80B was trained with ZeRO-3. The following links might be useful.
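The forward/backward chain above can be sketched in plain PyTorch. The tiny `nn.Linear` modules here are hypothetical stand-ins for the SLM and the (ZeRO-3-partitioned) LLM; the point is only that gradients flow back through the embedding boundary into the SLM:

```python
import torch
import torch.nn as nn

# Hypothetical stand-ins: a small trainable embedder (SLM) feeding a large LLM.
slm = nn.Linear(8, 16)   # SLM: produces the embeddings
llm = nn.Linear(16, 4)   # placeholder for the ZeRO-3-wrapped LLM

x = torch.randn(2, 8)
embed = slm(x)           # input -> SLM -> embed
logits = llm(embed)      # embed -> ZeRO-3_LLM
loss = logits.sum()      # -> loss
loss.backward()          # SLM <- grad <- ZeRO-3_LLM

# The SLM receives gradients, so it is trainable, not a fixed input.
assert slm.weight.grad is not None
```

Under ZeRO-3 the `llm` module would be wrapped by `deepspeed.initialize`, but autograd still carries gradients from the wrapped module's input back into the unwrapped SLM.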
You are correct that
Thank you for your help, I'll check if this can fix the issue.
I may be wrong, but it seems that in the case of IDEFICS-80B the image projection and the text embeddings are considered inputs of the network (by that I mean that you cannot train the SLM, as there is no gradient past the embedding).
My question is whether it would be possible to apply ZeRO-3 optimization to only a portion of the "model" (i.e., the LLM part) that is always the same.
Thanks for the clarification of your scenario. Yes, ZeRO-3 can be applied to only the LLM portion of the model. In our RLHF work, the actor, critic, reward, and reference models are configured with different ZeRO-* optimizations, as in this example script. I am curious whether the above RLHF example matches your scenario. Can you share your scenario code or pseudo-code, so we can discuss more concretely?
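As a rough illustration of the per-model configuration idea: each model can be passed its own `ds_config` to a separate `deepspeed.initialize` call. The field names below follow the standard DeepSpeed config schema, but the stage choices are illustrative assumptions, not the DeepSpeed-Chat defaults:

```python
# Hypothetical per-model DeepSpeed configs: only the large LLM uses ZeRO-3,
# while the small embedders skip partitioning entirely.
llm_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 3},  # partition LLM params/grads/optimizer state
}

embedder_config = {
    "train_micro_batch_size_per_gpu": 4,
    "zero_optimization": {"stage": 0},  # small embedders: no partitioning needed
}

# Each model would then be wrapped separately, e.g.:
#   llm_engine, _, _, _ = deepspeed.initialize(model=llm, config=llm_config)
#   slm_engine, _, _, _ = deepspeed.initialize(model=slm, config=embedder_config)
```

With separate engines, the loss computed through the ZeRO-3 engine still backpropagates into the embedder engines, so both can be trained jointly.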
Thanks, I'll check this out. Concerning our architecture, our entire model fits within a single
OK, it seems to be doable; however, we will probably end up getting rid of the Hugging Face trainer (as it seems to be doing a lot of dirty things in the background 🙃). Thanks for the help!
In that case, another option could be to use
You can examine the following line in your log to observe the effectiveness of this approach.
Closing as the issue has been resolved. Please feel free to reopen if needed. |
Hello,
We have a multimodal model composed of multiple small "embedding" models followed by a large LLM. Because of the scale of the training, we need a multinode setup, and we would like to use ZeRO-3 to reduce the memory footprint of the optimizer state.
Because of this, the input/output of the model may vary in size (and the same can be said for the model architecture). This prohibits us from using ZeRO-3 altogether. Do you know if there is a way to start using ZeRO-3 in the middle of the model (i.e., at the boundary between the LLM and the embedded inputs)? Note that we still need gradients to be backpropagated to those embedders, so we cannot simply treat the embeddings as the input.
I know of the
deepspeed.utils.set_z3_leaf_modules
method introduced in #4966; however, it doesn't fit our use case. Do you have any ideas or suggestions on how we could achieve this (if possible)?
Thanks for your time and help!