
[REQUEST] torch equivalent api model.no_sync() #1902

Closed
tangzhy opened this issue Apr 20, 2022 · 8 comments · Fixed by #6675
Labels
enhancement New feature or request

Comments

@tangzhy

tangzhy commented Apr 20, 2022

Hi, I'm adding DeepSpeed to my distributed model training framework.

When using native PyTorch APIs, everything is fine: for distributed training I wrap the model in an nn.parallel.DistributedDataParallel object and use the model.no_sync() API to avoid unnecessary gradient synchronization.

I cannot find an equivalent API in DeepSpeed. Can you offer some help?

tangzhy added the enhancement (New feature or request) label Apr 20, 2022
@jeffra
Collaborator

jeffra commented Apr 20, 2022

Out of curiosity, what sort of things do you want to do while sync is turned off? Is it gradient accumulation? If so, the DeepSpeed engine already makes sure not to sync/communicate gradients between gradient accumulation boundaries.
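
(For context, a minimal sketch of that flow, assuming the standard deepspeed.initialize / engine.backward / engine.step API; the config values, model, and data_loader below are hypothetical placeholders. With gradient_accumulation_steps set, the engine only all-reduces gradients on the micro-step that completes an accumulation boundary.)

    import deepspeed

    # `model` is an ordinary torch.nn.Module whose forward returns a loss.
    engine, _, _, _ = deepspeed.initialize(
        model=model,
        model_parameters=model.parameters(),
        config={
            "train_micro_batch_size_per_gpu": 32,
            "gradient_accumulation_steps": 8,
            "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
        },
    )

    for batch in data_loader:
        loss = engine(batch)   # forward through the wrapped module
        engine.backward(loss)  # no gradient all-reduce on non-boundary micro-steps
        engine.step()          # optimizer step + all-reduce once per accumulation boundary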

@tangzhy
Author

tangzhy commented Apr 20, 2022

@jeffra In contrastive learning, which is very popular in the research community right now, we usually rely on a large batch size to compute the NCE softmax loss.

Even with the remarkable GPU RAM savings from DeepSpeed, I still have to figure out how to scale the batch size from a small number like 32 to something very large like 2048, or even more.

To do this, I break the inputs into chunks and compute the gradients chunk by chunk, from the final logits back to the backbone model.

With PyTorch's model.no_sync, I can accumulate the gradients locally and only perform the global sync on the last chunk (see the sketch below).

More generally, I think DeepSpeed should consider exposing this flexibility to users, since this scenario is common for very large models, which are exactly DeepSpeed's main target.
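
(A minimal sketch of that native DDP pattern, with ddp_model, chunks, compute_chunk_loss, and optimizer as hypothetical placeholders: skip the gradient all-reduce on every chunk except the last one.)

    from contextlib import nullcontext

    # `ddp_model` is a torch.nn.parallel.DistributedDataParallel instance and
    # `chunks` holds the input chunks that make up one large logical batch.
    for i, chunk in enumerate(chunks):
        is_last = (i == len(chunks) - 1)
        # no_sync() suppresses the gradient all-reduce, so gradients accumulate
        # locally in param.grad until the last chunk triggers the real sync.
        with nullcontext() if is_last else ddp_model.no_sync():
            loss = compute_chunk_loss(ddp_model, chunk)  # hypothetical helper
            loss.backward()
    optimizer.step()
    optimizer.zero_grad()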

@jeffra
Collaborator

jeffra commented Apr 20, 2022

Can you show a small snippet of code showing how you're doing this in native PyTorch? This sounds very much like gradient accumulation.

Does gradient_accumulation_steps do what you want? See: https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters

Having DeepSpeed support a PyTorch-style no-sync API sounds beneficial, but I'm curious whether our existing mechanism works for you for now.
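
(For reference, a hypothetical config showing the batch-size-related keys from that page; DeepSpeed requires train_batch_size == train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size.)

    ds_config = {
        "train_batch_size": 2048,               # global effective batch size
        "train_micro_batch_size_per_gpu": 32,   # what actually fits in GPU memory
        "gradient_accumulation_steps": 8,       # 2048 = 32 * 8 * (8 GPUs)
    }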

@tangzhy
Author

tangzhy commented Apr 20, 2022

I'm fairly sure that gradient_accumulation_steps cannot support my purpose, because gradient accumulation only accumulates gradients across independent micro-batches of training examples.

For the contrastive NCE softmax loss, I have to break the inputs into chunks and accumulate the gradients at the chunk level, within a single loss computation.

Here's the code:

        # Pass 1: construct the gradient cache (forward all chunks without grad,
        # gather the representations across ranks, compute the contrastive loss
        # and the gradients w.r.t. the cached representations)
        chunked_inputs = self.split_tensor_dict(inputs)
        for c in chunked_inputs:
            c['output_hidden_states'] = True
        cls_hiddens, rnd_states = self.gc.forward_no_grad(self.model.lm, chunked_inputs)
        if self.args.local_rank > -1:
            cls_hiddens = self.gather_tensors(cls_hiddens.contiguous())[0]
        grad_cache, total_loss = self.gc.build_cache(cls_hiddens)
        grad_cache = grad_cache[0]
        if self.args.local_rank > -1:
            total_loss = total_loss / dist.get_world_size()

        inputs['labels'] = labels
        chunked_inputs = self.split_tensor_dict(inputs)

        # Pass 2: recompute each chunk with grad enabled and backprop the
        # cached representation gradients through the backbone
        for local_chunk_id, chunk in enumerate(chunked_inputs):
            device_offset = max(0, self.args.local_rank) * self.args.per_device_train_batch_size * 2
            local_offset = local_chunk_id * self.args.cache_chunk_size
            chunk_offset = device_offset + local_offset
            with rnd_states[local_chunk_id]:
                if self.use_amp:
                    with autocast():
                        lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)
                else:
                    lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)

            if self.args.gradient_accumulation_steps > 1:
                raise ValueError

            # Skip the gradient all-reduce on every chunk except the last one
            ddp_no_sync = self.args.local_rank > -1 and (local_chunk_id + 1 < len(chunked_inputs))
            with model.no_sync() if ddp_no_sync else nullcontext():
                if self.use_amp:
                    (self.scaler.scale(lm_loss) + surrogate).backward()
                elif self.use_apex:
                    raise ValueError
                elif self.deepspeed:
                    raise ValueError
                else:
                    (lm_loss + surrogate).backward()
            total_loss += lm_loss

Without model.no_sync, the code would sync gradients for every chunk and dramatically increase the backward time. I only want to sync the gradients once, when hitting the last chunk :)

@tangzhy
Author

tangzhy commented Apr 21, 2022

@jeffra I wonder whether you have a plan to add this feature? If so, is there a rough time estimate?

@gzerveas

To provide some context and an overview, @tangzhy is referring to the gradient caching technique implemented here: https://github.com/luyug/GradCache (the link to the paper is in the README). You can basically see it as "gradient accumulation for contrastive learning". The reason vanilla gradient accumulation cannot be used directly is that (e.g. on 1 GPU) computing the contrastive loss requires all samples across one batch (to be used as negatives), while gradient accumulation only allows us to use the subset of samples inside the micro-batch. In the case of distributed training, we'd like to use all samples across all batches on all GPUs, in which case model.no_sync would be useful during the backward pass (the code and paper I linked make this clear).
I guess the question is: does a model wrapped by DeepSpeed sync gradients by default when executing .backward(), and if so, is there a way to prevent this? Thank you!
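
(For anyone landing here, a rough single-GPU sketch of the two-pass gradient-cache idea described above; encoder, batch, and the similarity/loss are placeholders, and the real GradCache implementation additionally saves and restores RNG state so dropout matches between the two passes.)

    import torch
    import torch.nn.functional as F

    def grad_cache_step(encoder, batch, chunk_size):
        chunks = batch.split(chunk_size)

        # Pass 1: encode every chunk without building a graph, to obtain the
        # full batch of representations cheaply.
        with torch.no_grad():
            reps = torch.cat([encoder(c) for c in chunks])

        # Compute the contrastive loss over the *full* batch and cache dLoss/dReps.
        reps = reps.detach().requires_grad_()
        logits = reps @ reps.t()                                  # toy similarity; real code
        labels = torch.arange(reps.size(0), device=reps.device)   # uses query/passage encoders
        loss = F.cross_entropy(logits, labels)
        loss.backward()
        rep_grads = reps.grad.split(chunk_size)

        # Pass 2: re-encode each chunk with grad enabled and backprop the cached
        # representation gradients through the encoder. Under DDP, every chunk
        # except the last would sit inside model.no_sync().
        for c, g in zip(chunks, rep_grads):
            encoder(c).backward(gradient=g)
        return loss.detach()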

@tangzhy
Author

tangzhy commented Aug 22, 2022

@gzerveas Yes, thanks for the clarification! @jeffra DeepSpeed is critical for us to employ billion-scale models in contrastive learning. Looking forward to your thoughts :)

@memray

memray commented Sep 8, 2023

@tangzhy have you figured out how to use DeepSpeed for GradCache?
Thanks!

github-merge-queue bot pushed a commit that referenced this issue Nov 14, 2024
Fix #1902

---------

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>