Adding support for local SGD. #1350
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for all your work on this! Looks excellent to me. cc @sgugger if you see any nits in here I may have missed
Thank you @muellerzr. I am actually having some second thoughts given our recent conversation: this does change the way we would treat the per-device batch size.
Thanks for your PR! It looks simple enough in terms of API for a first integration; I just left some small comments around the doc and code styling.
Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help?. arXiv preprint arXiv:1606.07365.
Put a link here (Markdown syntax is supported)
Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations. 2019.
Here too
]:
    raise NotImplementedError("LocalSGD is supported only for CPUs and GPUs (no DeepSpeed or MegatronLM)")
self.enabled = enabled and accelerator.distributed_type != DistributedType.NO
self.step_qty = 0
Please use a more descriptive variable name. It's also more of a num_steps than a step quantity IMO.
qty = float(dist.get_world_size())
for prm in self.model.parameters():
    dist.all_reduce(prm.data, op=torch.distributed.ReduceOp.SUM)
    prm.data /= qty
Use num_steps and param here.
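For reference, here is a minimal standalone sketch of this synchronization step with more descriptive names and comments. The function name and the name world_size are illustrative choices, not the PR's final code; it assumes an initialized torch.distributed process group.

```python
import torch
import torch.distributed as dist


def average_model_parameters(model: torch.nn.Module) -> None:
    """Average the model parameters across all processes (the local SGD sync step)."""
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        # Sum each parameter tensor across processes in place...
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        # ...then divide by the number of processes to obtain the average.
        param.data /= world_size
```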
Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help?. arXiv preprint arXiv:1606.07365.

The term Local SGD was used in, e.g., the following paper:

Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations. 2019.
Please put links here to the relevant papers.
# Using Local SGD with 🤗 Accelerate

Local SGD is a technique for distributed training where the model weights and/or gradients are not synchronized every step. This improves communication efficiency and can lead to a substantial training speed-up, especially when a computer lacks a faster interconnect such as NVLink. Unlike gradient accumulation (where improving communication efficiency requires increasing the effective batch size), Local SGD does not require changing the batch size or the learning rate / schedule. However, if necessary, Local SGD can be combined with gradient accumulation as well.
This is a bit misleading. In regular distributed training the model weights are never synchronized. Only the gradients are (the model is the same at first and then gets the same updates on all processes). I would say that:
- gradients are not synchronized at every step
- thus the model on every process gets different weights
- after a given number of steps, the different model weights are averaged across processes (see the sketch below)
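To make that pattern concrete, here is a conceptual sketch in plain PyTorch DDP, not the API added by this PR. The names train_one_epoch and LOCAL_SGD_STEPS are illustrative; it assumes an initialized process group and a DistributedDataParallel-wrapped model.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

LOCAL_SGD_STEPS = 8  # how often to average the weights across processes (illustrative)


def train_one_epoch(model: DDP, optimizer, dataloader, loss_fn):
    for step, (inputs, targets) in enumerate(dataloader, start=1):
        # Each process computes its own gradients; model.no_sync() skips the
        # per-step gradient all-reduce that regular DDP would perform.
        with model.no_sync():
            loss = loss_fn(model(inputs), targets)
            loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Because gradients were not synchronized, the replicas drift apart.
        # Every LOCAL_SGD_STEPS steps, average the weights to reconcile them.
        if step % LOCAL_SGD_STEPS == 0:
            world_size = float(dist.get_world_size())
            for param in model.parameters():
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
                param.data /= world_size
```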
Thank you for the feedback. I will make the changes, and I want to re-run the tests too. Actually, it looks like the numbers for gradient accumulation are so weird because the number of steps per epoch (for historical reasons) is divided by the number of accumulation steps (which wasn't noticed until recently), and I suspect this is not right. Also, as noted by @muellerzr, it is clearly being trained using less data, but the reason IMHO isn't the drift in the batch size, but simply the reduction of the training set.
Hi,
As discussed with @muellerzr, I am adding non-breaking functionality to support local SGD, which enables efficient multi-GPU training when no fast interconnect is available. It is a more stable and reliable alternative to gradient accumulation (they can also be used jointly when needed). I ran rather extensive tests to compare both. Unfortunately, even with all the proper code, gradient accumulation is quite finicky and you do not want to use it as a speed-up approach.
Here is a link to the summary of results and benchmarking code.
It is not clear what kind of unit test can cover this kind of functionality. The key ingredient (disabling gradient synchronization) is already being tested.
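For reference, a sketch of the intended usage, following the docs added in this PR; argument names such as local_sgd_steps may still change during review, and the model, optimizer, dataloader, scheduler, and loss_function objects are assumed to be defined as in a regular Accelerate training script.

```python
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

accelerator = Accelerator()
model, optimizer, dataloader, scheduler = accelerator.prepare(model, optimizer, dataloader, scheduler)

# Instead of all-reducing gradients on every step, average the model weights
# across processes every `local_sgd_steps` optimizer steps.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for batch in dataloader:
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        local_sgd.step()  # counts steps and triggers the periodic weight averaging
```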