Adding support for local SGD. #1350
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.
Thanks for all your work on this! Looks excellent to me. cc @sgugger if you see any nits in here I may have missed
Thank you @muellerzr. I am actually having some second thoughts given our recent conversation: this does change the way we would treat the per-device batch size.
Thanks for your PR! It looks simple enough in terms of API for a first integration; I just left some small comments around the doc and code styling.
Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help?. arXiv preprint arXiv:1606.07365.
Put a link here (Markdown syntax is supported)
Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations. 2019.
Here too
]:
    raise NotImplementedError("LocalSGD is supported only for CPUs and GPUs (no DeepSpeed or MegatronLM)")
self.enabled = enabled and accelerator.distributed_type != DistributedType.NO
self.step_qty = 0
Please use a more descriptive variable name. It's also more of a num_steps than a step quantity IMO.
qty = float(dist.get_world_size())
for prm in self.model.parameters():
    dist.all_reduce(prm.data, op=torch.distributed.ReduceOp.SUM)
    prm.data /= qty
Use num_steps and param here.
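For reference, here is a minimal standalone sketch of this synchronization step with more descriptive names and comments. The function name and the name world_size are illustrative choices, not the PR's final code; it assumes an initialized torch.distributed process group.

```python
import torch
import torch.distributed as dist


def average_model_parameters(model: torch.nn.Module) -> None:
    """Average the model parameters across all processes (the local SGD sync step)."""
    world_size = float(dist.get_world_size())
    for param in model.parameters():
        # Sum each parameter tensor across processes in place...
        dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
        # ...then divide by the number of processes to obtain the average.
        param.data /= world_size
```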
Zhang, J., De Sa, C., Mitliagkas, I., & Ré, C. (2016). Parallel SGD: When does averaging help?. arXiv preprint arXiv:1606.07365.

The term Local SGD was used in, e.g., the following paper:

Stich, Sebastian Urban. "Local SGD Converges Fast and Communicates Little." ICLR 2019 - International Conference on Learning Representations. 2019.
Please put links here to the relevant papers.
# Using Local SGD with 🤗 Accelerate

Local SGD is a technique for distributed training where the model weights and/or gradients are not synchronized every step. This improves communication efficiency and can lead to a substantial training speed-up, especially when a computer lacks a faster interconnect such as NVLink. Unlike gradient accumulation (where improving communication efficiency requires increasing the effective batch size), Local SGD does not require changing the batch size or the learning rate / schedule. However, if necessary, Local SGD can be combined with gradient accumulation as well.
This is a bit misleading. In regular distributed training the model weights are never synchronized. Only the gradients are (the model is the same at first and then gets the same updates on all processes). I would say that:
- gradients are not synchronized at every step
- thus the model on every process gets different weights
- after a given number of steps, the different model weights are averaged across processes (see the sketch below)
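To make that pattern concrete, here is a conceptual sketch in plain PyTorch DDP, not the API added by this PR. The names train_one_epoch and LOCAL_SGD_STEPS are illustrative; it assumes an initialized process group and a DistributedDataParallel-wrapped model.

```python
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

LOCAL_SGD_STEPS = 8  # how often to average the weights across processes (illustrative)


def train_one_epoch(model: DDP, optimizer, dataloader, loss_fn):
    for step, (inputs, targets) in enumerate(dataloader, start=1):
        # Each process computes its own gradients; model.no_sync() skips the
        # per-step gradient all-reduce that regular DDP would perform.
        with model.no_sync():
            loss = loss_fn(model(inputs), targets)
            loss.backward()
        optimizer.step()
        optimizer.zero_grad()

        # Because gradients were not synchronized, the replicas drift apart.
        # Every LOCAL_SGD_STEPS steps, average the weights to reconcile them.
        if step % LOCAL_SGD_STEPS == 0:
            world_size = float(dist.get_world_size())
            for param in model.parameters():
                dist.all_reduce(param.data, op=dist.ReduceOp.SUM)
                param.data /= world_size
```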
Thank you for the feedback. I will make the changes, and I want to re-run the tests too. Actually, it looks like the numbers for gradient accumulation are so weird because the number of steps per epoch (for historical reasons) is divided by the number of accumulation steps (which wasn't noticed until recently), and I suspect this is not right. Also, as noted by @muellerzr, it is clearly being trained using less data, but the reason IMHO isn't the drift in the batch size, but simply the reduction of the training set.
Hi,
As discussed with @muellerzr, I am adding non-breaking functionality to support local SGD, which enables efficient multi-GPU training when no fast interconnect is available. It is a more stable and reliable alternative to gradient accumulation (they can also be used jointly when needed). I ran rather extensive tests to compare both. Unfortunately, even with all the proper code, gradient accumulation is quite finicky and you do not want to use it as a speed-up approach.
Here is a link to the summary of results and benchmarking code.
It is not clear what kind of unit test can cover this kind of functionality. The key ingredient (disabling gradient synchronization) is already being tested.
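For reference, a sketch of the intended usage, following the docs added in this PR; argument names such as local_sgd_steps may still change during review, and the model, optimizer, dataloader, scheduler, and loss_function objects are assumed to be defined as in a regular Accelerate training script.

```python
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD

accelerator = Accelerator()
model, optimizer, dataloader, scheduler = accelerator.prepare(model, optimizer, dataloader, scheduler)

# Instead of all-reducing gradients on every step, average the model weights
# across processes every `local_sgd_steps` optimizer steps.
with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for batch in dataloader:
        inputs, targets = batch
        outputs = model(inputs)
        loss = loss_function(outputs, targets)
        accelerator.backward(loss)
        optimizer.step()
        scheduler.step()
        optimizer.zero_grad()
        local_sgd.step()  # counts steps and triggers the periodic weight averaging
```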