[REQUEST] torch equivalent api model.no_sync() #1902
Comments
Out of curiosity, what sort of things do you want to do when sync is turned off? Is it gradient accumulation? If so, the DeepSpeed engine already makes sure not to sync/communicate between gradient accumulation boundaries.
@jeffra In contrastive learning, which is currently very popular in the research community, we usually rely on a large batch size to compute the NCE softmax loss. Despite the remarkable GPU RAM savings from DeepSpeed, I still have to figure out how to scale the batch size from a small number like 32 to something very large like 2048 or more. To do this, I have to break the inputs into chunks and compute the gradients chunk by chunk, from the final logits back to the backbone model. With PyTorch's model.no_sync, I can first compute the gradients locally and only perform the global sync on the last chunk. For more general use, I think DeepSpeed should consider exposing this flexibility to users, since this scenario is common for very large models, which are also DeepSpeed's main target.
Can you show a small snippet of code showing how you're doing this in native PyTorch? This sounds very much like gradient accumulation. Does gradient_accumulation_steps do what you want? See: https://www.deepspeed.ai/docs/config-json/#batch-size-related-parameters Having DeepSpeed support a similar PyTorch-style no-sync control sounds beneficial, but I'm curious whether our existing method works for you for now.
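For reference, DeepSpeed's built-in accumulation is driven entirely by the config, and gradient communication only happens at accumulation boundaries. A minimal sketch of that setup (the toy model, the Adam settings, and the 8-GPU batch-size arithmetic are illustrative assumptions, not part of this thread):

```python
import torch
import deepspeed

# Toy model; stands in for the real backbone.
model = torch.nn.Linear(128, 2)

ds_config = {
    "train_batch_size": 2048,              # global effective batch
    "train_micro_batch_size_per_gpu": 32,  # per-GPU micro-batch
    "gradient_accumulation_steps": 8,      # assumes 8 GPUs: 32 * 8 * 8 = 2048
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
}

engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

for step in range(100):
    x = torch.randn(32, 128, device=engine.device)
    y = torch.randint(0, 2, (32,), device=engine.device)
    loss = torch.nn.functional.cross_entropy(engine(x), y)
    engine.backward(loss)   # gradients are communicated only at accumulation boundaries
    engine.step()           # optimizer steps every `gradient_accumulation_steps` micro-batches
```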
I'm sure that gradient_accumulation_steps cannot support my purpose, because gradient accumulation only accumulates gradients at the training-example level. For the contrastive NCE softmax loss, I have to break the inputs into chunks and accumulate the gradients at the chunk level. Here's the code:

```python
# Construct the gradient cache
chunked_inputs = self.split_tensor_dict(inputs)
for c in chunked_inputs:
    c['output_hidden_states'] = True
cls_hiddens, rnd_states = self.gc.forward_no_grad(self.model.lm, chunked_inputs)
if self.args.local_rank > -1:
    cls_hiddens = self.gather_tensors(cls_hiddens.contiguous())[0]
grad_cache, total_loss = self.gc.build_cache(cls_hiddens)
grad_cache = grad_cache[0]
if self.args.local_rank > -1:
    total_loss = total_loss / dist.get_world_size()

inputs['labels'] = labels
chunked_inputs = self.split_tensor_dict(inputs)

# Compute the full loss with cached gradients
for local_chunk_id, chunk in enumerate(chunked_inputs):
    device_offset = max(0, self.args.local_rank) * self.args.per_device_train_batch_size * 2
    local_offset = local_chunk_id * self.args.cache_chunk_size
    chunk_offset = device_offset + local_offset
    with rnd_states[local_chunk_id]:
        if self.use_amp:
            with autocast():
                lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)
        else:
            lm_loss, surrogate = self.compute_loss(model, chunk, grad_cache, chunk_offset)

    if self.args.gradient_accumulation_steps > 1:
        raise ValueError

    ddp_no_sync = self.args.local_rank > -1 and (local_chunk_id + 1 < len(chunked_inputs))
    with model.no_sync() if ddp_no_sync else nullcontext():
        if self.use_amp:
            (self.scaler.scale(lm_loss) + surrogate).backward()
        elif self.use_apex:
            raise ValueError
        elif self.deepspeed:
            raise ValueError
        else:
            (lm_loss + surrogate).backward()

    total_loss += lm_loss
```

Without model.no_sync(), I could not accumulate the chunk-level gradients locally and only sync on the last chunk.
@jeffra I wonder if you have a plan to add this feature? If so, is there an expected timeline?
To provide some context and an overview, @tangzhy is referring to the gradient caching technique, implemented here: https://github.com/luyug/GradCache (link to the paper is in the README). You can basically see it as "gradient accumulation for contrastive learning". The reason vanilla gradient accumulation cannot be used directly is that (e.g. on 1 GPU) computing the contrastive loss requires all samples in the batch (to be used as negatives), while gradient accumulation only lets us use the subset of samples inside the micro-batch. In the distributed case, we would like to use all samples across all batches on all GPUs, in which case a no_sync-style control is needed so that gradients can accumulate locally across chunks before the final synchronization.
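For readers unfamiliar with the technique, here is a minimal sketch of the two-pass gradient-cache idea in plain PyTorch with DDP. It is not the GradCache library's actual API; cross-GPU gathering of negatives and the dropout/RNG-state replay (the rnd_states in the snippet above) are omitted, and the encoder/loss shapes are assumptions for illustration:

```python
import contextlib
import torch
import torch.nn.functional as F

def gradcache_step(ddp_encoder, query_chunks, key_chunks, temperature=0.05):
    """One gradient-cache step. query_chunks / key_chunks are lists of input
    tensors that together form one large contrastive batch on this rank."""
    sizes_q = [c.size(0) for c in query_chunks]
    sizes_k = [c.size(0) for c in key_chunks]

    # Pass 1: encode all chunks without building a graph, so the full batch fits in memory.
    with torch.no_grad():
        q = torch.cat([ddp_encoder(c) for c in query_chunks])
        k = torch.cat([ddp_encoder(c) for c in key_chunks])

    # NCE loss over the full local batch; d(loss)/d(representation) is the "gradient cache".
    q, k = q.requires_grad_(), k.requires_grad_()
    logits = q @ k.t() / temperature
    labels = torch.arange(q.size(0), device=logits.device)
    loss = F.cross_entropy(logits, labels)
    loss.backward()
    q_cache, k_cache = q.grad.split(sizes_q), k.grad.split(sizes_k)

    # Pass 2: re-encode each chunk with a graph and inject the cached gradient.
    # DDP must not all-reduce until the very last backward, hence no_sync().
    chunks = list(zip(query_chunks, q_cache)) + list(zip(key_chunks, k_cache))
    for i, (chunk, cached_grad) in enumerate(chunks):
        is_last = i == len(chunks) - 1
        ctx = contextlib.nullcontext() if is_last else ddp_encoder.no_sync()
        with ctx:
            ddp_encoder(chunk).backward(gradient=cached_grad)
    return loss.detach()
```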
@tangzhy have you figured out how to use DeepSpeed for GradCache? |
Fix #1902

Co-authored-by: Logan Adams <114770087+loadams@users.noreply.github.com>
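A purely hypothetical usage sketch, assuming the linked fix exposes a DDP-style no_sync() context manager on the DeepSpeed engine (the method name, its semantics, and its interaction with ZeRO stages should be checked against the merged PR):

```python
# Hypothetical sketch only: assumes the fix adds `engine.no_sync()` to the DeepSpeed
# engine, mirroring torch's DistributedDataParallel.no_sync(). `compute_chunk_loss`
# is a placeholder helper, not a DeepSpeed API.
for i, chunk in enumerate(chunks):
    loss = compute_chunk_loss(engine, chunk)
    if i + 1 < len(chunks):
        with engine.no_sync():      # accumulate locally, skip gradient communication
            engine.backward(loss)
    else:
        engine.backward(loss)       # last chunk: reduce/communicate gradients
engine.step()
```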
Hi, I'm adding DeepSpeed to my distributed model training framework.
When using PyTorch's native APIs, everything is fine: for distributed training I would wrap the model in nn.parallel.DistributedDataParallel and use the model.no_sync() API to avoid unnecessary sync ops.
I cannot find the equivalent API in DeepSpeed. Can you offer some help?
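For context, the PyTorch pattern being referred to looks roughly like this (a minimal sketch; it assumes the process group has already been initialized, e.g. via torchrun, and uses a toy linear model):

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed has already been initialized (e.g. via torchrun).
model = DDP(torch.nn.Linear(128, 2).cuda())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

micro_batches = [torch.randn(32, 128, device='cuda') for _ in range(4)]

optimizer.zero_grad()
# Skip the gradient all-reduce for every micro-batch except the last.
for x in micro_batches[:-1]:
    with model.no_sync():
        model(x).sum().backward()          # gradients accumulate locally
model(micro_batches[-1]).sum().backward()  # this backward triggers the all-reduce
optimizer.step()
```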