
[HELP] Zero-3 on partial model to fix the input/output constant constraint #6642

Closed
BoyeGuillaume opened this issue Oct 19, 2024 · 9 comments

Comments

@BoyeGuillaume

Hello,

We have a multimodal model that is composed of multiple small "embedding" models followed by a large LLM. Because of the scale of the training, we need a multi-node setup, and we would like to use ZeRO-3 to reduce the memory footprint of the optimizer state.

Because of this, the input/output of the model may vary in size (and the same can be said for the model architecture). This prohibits us from using ZeRO-3 altogether. Do you know if there is a way to start using ZeRO-3 in the middle of the model (i.e., at the boundary between the LLM and the embedded inputs)? Note that we still need gradients to be backpropagated to those embedders, so we cannot simply treat the embeddings as the input.

I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

Do you have any ideas or suggestions on how we could achieve this (if possible)?

Thanks for your time and help!

@tjruwase
Contributor

@BoyeGuillaume, thanks for your question. I think getting more details would be helpful to understand your specific need.

Yes, I think that using a ZeRO-3 LLM in the middle of your model should work to forward the loss and backpropagate gradients, as below:

input -> SLM -> embed -> ZeRO-3_LLM -> loss
SLM <- grad <- ZeRO-3_LLM <- grad

Can you please try that and share any issues?
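
For concreteness, here is a minimal, untested sketch of that layout, assuming an HF-style causal LM that accepts inputs_embeds. SmallEmbedder, BigLLM, the config values, and dataloader are placeholders, and mixed precision and gradient averaging for the embedder would need extra care:

```python
import os
import torch
import deepspeed
from torch.nn.parallel import DistributedDataParallel as DDP

deepspeed.init_distributed()
local_rank = int(os.environ.get("LOCAL_RANK", 0))

# The small embedder stays outside ZeRO-3: plain DDP plus its own optimizer.
embedder = DDP(SmallEmbedder().to(local_rank), device_ids=[local_rank])  # SmallEmbedder is a placeholder
emb_optimizer = torch.optim.AdamW(embedder.parameters(), lr=1e-4)

# Only the LLM is handed to DeepSpeed, so only its parameters are ZeRO-3 partitioned.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
llm_engine, _, _, _ = deepspeed.initialize(model=BigLLM(), config=ds_config)  # BigLLM is a placeholder

for batch in dataloader:                                    # dataloader is assumed to exist
    embeddings = embedder(batch["inputs"])                  # gradients will flow back through this tensor
    outputs = llm_engine(inputs_embeds=embeddings, labels=batch["labels"])
    llm_engine.backward(outputs.loss)                       # autograd also reaches the embedder
    llm_engine.step()                                       # updates the partitioned LLM parameters
    emb_optimizer.step()                                    # updates the embedder separately
    emb_optimizer.zero_grad()
```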

In case you are unaware, HF multimodal IDEFICS-80B was trained with ZeRO-3. The following links might be useful.

  1. https://x.com/StasBekman/status/1694004904761987249
  2. https://huggingface.co/HuggingFaceM4/idefics2-8b/discussions/30

> I know of the deepspeed.utils.set_z3_leaf_modules method introduced in #4966; however, it doesn't fit our use case.

You are correct that deepspeed.utils.set_z3_leaf_modules is irrelevant for this case.

@BoyeGuillaume
Author

Thank you for your help; I'll check whether this fixes the issue.

@BoyeGuillaume
Author

I may be wrong, but it seems that in the case of IDEFICS-80B the image projection and the text embeddings are considered inputs of the network (by that I mean that you cannot train the SLM, as there is no gradient past the embedding).

@BoyeGuillaume
Author

My question is whether it would be possible to apply ZeRO-3 optimization to only a portion of the "model" (i.e., the LLM part) that is always the same.

@tjruwase
Contributor

Thanks for the clarification of your scenario. Yes, ZeRO-3 can be applied to only the LLM portion of the model. In our RLHF work, the actor, critic, reward, and reference models are configured with different ZeRO-* optimizations, as in this example script.
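
Roughly, the pattern in that script looks like the following (a sketch with hypothetical models and illustrative configs, not the actual DeepSpeed-Chat code):

```python
import deepspeed

# Each model gets its own engine and its own ZeRO stage.
actor_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 3},   # large, trained model: fully partitioned
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}
reward_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 0},   # smaller, frozen model: no partitioning
}

# actor_model and reward_model are placeholders for already-constructed modules.
actor_engine, _, _, _ = deepspeed.initialize(model=actor_model, config=actor_config)
reward_engine, _, _, _ = deepspeed.initialize(model=reward_model, config=reward_config)
```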

I am curious whether the above RLHF example matches your scenario. Can you share your scenario code or pseudo-code, so we can discuss more concretely?

@BoyeGuillaume
Author

Thanks, I'll check this out.

Concerning our architecture, our entire model fits within a single PyTorch Module that consists of the LLM and all of the models for embedding the modalities. We then use the HuggingFace Trainer (a modified version, as we have additional masking to do) and launch the entire pipeline with PyTorch (for distributed training). The HuggingFace Trainer takes the DeepSpeed configuration directly.
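
For illustration, the structure is roughly the following (a sketch only; module, dataset, and file names are hypothetical, and the LLM is assumed to return HF-style outputs):

```python
import torch
import torch.nn as nn
from transformers import Trainer, TrainingArguments

class MultimodalModel(nn.Module):
    """One top-level module holding the small embedders and the large LLM."""
    def __init__(self, embedders: nn.ModuleDict, llm: nn.Module):
        super().__init__()
        self.embedders = embedders   # small per-modality embedding models
        self.llm = llm               # the large language model

    def forward(self, modalities, labels=None):
        # Embed each modality and concatenate along the sequence dimension.
        embeds = torch.cat([self.embedders[name](x) for name, x in modalities.items()], dim=1)
        return self.llm(inputs_embeds=embeds, labels=labels)

# The (customized) HF Trainer receives the DeepSpeed config directly.
args = TrainingArguments(output_dir="out", deepspeed="ds_config.json")
trainer = Trainer(model=MultimodalModel(embedders, llm), args=args, train_dataset=train_ds)
trainer.train()
```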

@BoyeGuillaume
Author

It seems to be doable; however, we will probably end up getting rid of the HuggingFace Trainer (as it seems to be doing a lot of dirty things in the background 🙃).

Thanks for the help

@tjruwase
Contributor

> Concerning our architecture, our entire model fits within a single PyTorch Module that consists of the LLM and all of the models for embedding the modalities.

In that case, another option could be to use the stage3_param_persistence_threshold configuration to restrict ZeRO-3 optimization to only the large parameters of the model.
https://www.deepspeed.ai/docs/config-json/#zero-optimizations-for-fp16-training
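
For example, something along these lines (a sketch; the threshold value is purely illustrative, not a recommendation):

```python
ds_config = {
    "zero_optimization": {
        "stage": 3,
        # Parameters smaller than this threshold (measured in parameter elements per the docs above)
        # are kept persistent rather than partitioned, so ZeRO-3 partitioning effectively
        # applies only to the large LLM weight matrices.
        "stage3_param_persistence_threshold": 1_000_000,
    },
}
```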

You can examine the following lines in your log to observe the effectiveness of this approach:
https://github.com/microsoft/DeepSpeed/blob/6e6563d3c8d7527713cc48d4a3adce51f22e83a2/deepspeed/runtime/zero/parameter_offload.py#L253-L255

@tohtana
Contributor

tohtana commented Nov 8, 2024

Closing as the issue has been resolved. Please feel free to reopen if needed.

@tohtana tohtana closed this as completed Nov 8, 2024