
[REQUEST] torch equivalent api model.no_sync() #1902

Closed
tangzhy opened this issue Apr 20, 2022 · 8 comments · Fixed by #6675
Labels
enhancement New feature or request

Comments

@tangzhy

tangzhy commented Apr 20, 2022

Hi, I'm adding DeepSpeed to my distributed model training framework.

When using native PyTorch APIs, everything is fine: for distributed training I wrap the model in an nn.parallel.DistributedDataParallel object and use the model.no_sync() API to avoid unnecessary gradient synchronization.

I cannot find an equivalent API in DeepSpeed. Can you offer some help?

tangzhy added the enhancement (New feature or request) label Apr 20, 2022
@jeffra
Collaborator

jeffra commented Apr 20, 2022

Out of curiosity, what sort of things do you want to do while sync is turned off? Is it gradient accumulation? If so, the DeepSpeed engine already makes sure not to sync/communicate gradients between gradient accumulation boundaries.
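
(For context, a minimal sketch of that flow, assuming the standard deepspeed.initialize / engine.backward / engine.step API; the config values, model, and data_loader below are hypothetical placeholders. With gradient_accumulation_steps set, the engine only all-reduces gradients on the micro-step that completes an accumulation boundary.)

    import deepspeed

    # `model` is an ordinary torch.nn.Module whose forward returns a loss.
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config={
            "train_micro_batch_size_per_gpu": 32,
            "gradient_accumulation_steps": 8,
            "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        },
    )

    for batch in data_loader:
        loss = engine(batch)   # forward through the wrapped module
        engine.backward(loss)  # no gradient all-reduce on non-boundary micro-steps
        engine.step()          # optimizer step + all-reduce once per accumulation boundary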

@tangzhy
Author

tangzhy commented Apr 20, 2022

@jeffra In contrastive learning, which is very popular in the research community right now, we usually rely on a large batch size to compute the NCE softmax loss.

Even with the remarkable GPU RAM savings from DeepSpeed, I still have to figure out how to scale the batch size from a small number like 32 to something very large like 2048, or even more.

To do this, I break the inputs into chunks and compute the gradients chunk by chunk, from the final logits back to the backbone model.

With PyTorch's model.no_sync, I can accumulate the gradients locally and only perform the global sync on the last chunk (see the sketch below).

More generally, I think DeepSpeed should consider exposing this flexibility to users, since this scenario is common for very large models, which are exactly DeepSpeed's main target.
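
(A minimal sketch of that native DDP pattern, with ddp_model, chunks, compute_chunk_loss, and optimizer as hypothetical placeholders: skip the gradient all-reduce on every chunk except the last one.)

    from contextlib import nullcontext

    # `ddp_model` is a torch.nn.parallel.DistributedDataParallel instance and
    # `chunks` holds the input chunks that make up one large logical batch.
    for i, chunk in enumerate(chunks):
        is_last = (i == len(chunks) - 1)
        # no_sync() suppresses the gradient all-reduce, so gradients accumulate
        # locally in param.grad until the last chunk triggers the real sync.
        with nullcontext() if is_last else ddp_model.no_sync():
            loss = compute_chunk_loss(ddp_model, chunk)  # hypothetical helper
            loss.backward()
    optimizer.step()
    optimizer.zero_grad()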

@jeffra
Collaborator

jeffra commented Apr 20, 2022

Can you show a small snippet of code showing how you're doing this in native PyTorch? This sounds very much like gradient accumulation.

Does gradient_accumulation_steps do what you want? See: https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters

Having DeepSpeed support a PyTorch-style no-sync API sounds beneficial, but I'm curious whether our existing mechanism works for you for now.
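
(For reference, a hypothetical config showing the batch-size-related keys from that page; DeepSpeed requires train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size.)

    ds_config = {
        "train_batch_size": 2048,               # global effective batch size
        "train_micro_batch_size_per_gpu": 32,   # what actually fits in GPU memory
        "gradient_accumulation_steps": 8,       # 2048 = 32 * 8 * (8 GPUs)
    }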

@tangzhy
Author

tangzhy commented Apr 20, 2022

I'm fairly sure that gradient_accumulation_steps cannot support my purpose, because gradient accumulation only accumulates gradients across independent micro-batches of training examples.

For the contrastive NCE softmax loss, I have to break the inputs into chunks and accumulate the gradients at the chunk level, within a single loss computation.

Here's the code:

        # Pass 1: construct the gradient cache (forward all chunks without grad,
        # gather the representations across ranks, compute the contrastive loss
        # and the gradients w.r.t. the cached representations)
        chunked_inputs = self.split_tensor_dict(inputs)
        for c in chunked_inputs:
            c['output_hidden_states'] = True
        cls_hiddens, rnd_states = self.gc.forward_no_grad(self.model.lm, chunked_inputs)
        if self.args.local_rank > -1:
            cls_hiddens = self.gather_tensors(cls_hiddens.contiguous())[0]
        grad_cache, total_loss = self.gc.build_cache(cls_hiddens)
        grad_cache = grad_cache[0]
        if self.args.local_rank > -1:
            total_loss = total_loss / dist.get_world_size()

        inputs['labels'] = labels
        chunked_inputs = self.split_tensor_dict(inputs)

        # Pass 2: recompute each chunk with grad enabled and backprop the
        # cached representation gradients through the backbone
        for local_chunk_id, chunk in enumerate(chunked_inputs):
            device_offset = max(0, self.args.local_rank) * self.args.per_device_train_batch_size * 2
            local_offset = local_chunk_id * self.args.cache_chunk_size
            chunk_offset = device_offset + local_offset
            with rnd_states[local_chunk_id]:
                if self.use_amp:
                    with autocast():
                        lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)
                else:
                    lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)

            if self.args.gradient_accumulation_steps > 1:
                raise ValueError

            # Skip the gradient all-reduce on every chunk except the last one
            ddp_no_sync = self.args.local_rank > -1 and (local_chunk_id + 1 < len(chunked_inputs))
            with model.no_sync() if ddp_no_sync else nullcontext():
                if self.use_amp:
                    (self.scaler.scale(lm_loss) + surrogate).backward()
                elif self.use_apex:
                    raise ValueError
                elif self.deepspeed:
                    raise ValueError
                else:
                    (lm_loss + surrogate).backward()
            total_loss += lm_loss

Without model.no_sync, the code would sync gradients for every chunk and dramatically increase the backward time. I only want to sync the gradients once, when hitting the last chunk :)

@tangzhy
Author

tangzhy commented Apr 21, 2022

@jeffra I wonder whether you have a plan to add this feature? If so, is there a rough time estimate?

@gzerveas

To provide some context and an overview, @tangzhy is referring to the gradient caching technique implemented here: https://github.com/luyug/GradCache (the link to the paper is in the README). You can basically see it as "gradient accumulation for contrastive learning". The reason vanilla gradient accumulation cannot be used directly is that (e.g. on 1 GPU) computing the contrastive loss requires all samples across one batch (to be used as negatives), while gradient accumulation only allows us to use the subset of samples inside the micro-batch. In the case of distributed training, we'd like to use all samples across all batches on all GPUs, in which case model.no_sync would be useful during the backward pass (the code and paper I linked make this clear).
I guess the question is: does a model wrapped by DeepSpeed sync gradients by default when executing .backward(), and if so, is there a way to prevent this? Thank you!
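
(For anyone landing here, a rough single-GPU sketch of the two-pass gradient-cache idea described above; encoder, batch, and the similarity/loss are placeholders, and the real GradCache implementation additionally saves and restores RNG state so dropout matches between the two passes.)

    import torch
    import torch.nn.functional as F

    def grad_cache_step(encoder, batch, chunk_size):
        chunks = batch.split(chunk_size)

        # Pass 1: encode every chunk without building a graph, to obtain the
        # full batch of representations cheaply.
        with torch.no_grad():
            reps = torch.cat([encoder(c) for c in chunks])

        # Compute the contrastive loss over the *full* batch and cache dLoss/dReps.
        reps = reps.detach().requires_grad_()
        logits = reps @ reps.t()                                  # toy similarity; real code
        labels = torch.arange(reps.size(0), device=reps.device)   # uses query/passage encoders
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        rep_grads = reps.grad.split(chunk_size)

        # Pass 2: re-encode each chunk with grad enabled and backprop the cached
        # representation gradients through the encoder. Under DDP, every chunk
        # except the last would sit inside model.no_sync().
        for c, g in zip(chunks, rep_grads):
            encoder(c).backward(gradient=g)
        return loss.detach()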

@tangzhy
Author

tangzhy commented Aug 22, 2022

@gzerveas Yes, thanks for the clarification! @jeffra DeepSpeed is critical for us to employ billion-scale models in contrastive learning. Looking forward to your thoughts :)

@memray

memray commented Sep 8, 2023

@tangzhy have you figured out how to use DeepSpeed for GradCache?
Thanks!

github-merge-queue bot pushed a commit that referenced this issue Nov 14, 2024
Fix #1902

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>