Adding support for local SGD. #1378
Conversation
Thanks for the PR! I've made a few suggestions to simplify the code, and one request re: attribution. The suggestions can be done here, or I can do it in a follow-up PR, whichever is easier. Thanks again!
@@ -34,6 +34,7 @@
EXCLUDE_EXAMPLES = [
    "cross_validation.py",
    "gradient_accumulation.py",
    "local_sgd.py",
While this is accurate, to test your script we should also include a test similar to the one in tests/test_examples.py (lines 215 to 217 at daf76a7):

def test_gradient_accumulation(self):
    testargs = ["examples/by_feature/gradient_accumulation.py"]
    run_command(self._launch_args + testargs)
Regarding this one, I am a bit unsure how this is invoked. I added a file examples/by_feature/local_sgd.py and can add a similar function test_local_sgd(self). A couple of questions, though:
- will it be called automatically?
- which testargs will be used?
It gets called when the example tests are run through our CI. testargs are whatever arguments are needed to make your script run with the feature enabled. For instance, the test above reads as:
accelerate launch examples/by_feature/gradient_accumulation.py
(accelerate launch comes from self._launch_args.)
Yes, add a similar test_local_sgd() function which does this, as we make sure all of our example tests get run. To see when they are fully invoked, see the Makefile under test_examples.
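A sketch of what that test could look like, mirroring the gradient-accumulation test quoted above (run_command and self._launch_args come from the existing harness in tests/test_examples.py):

def test_local_sgd(self):
    testargs = ["examples/by_feature/local_sgd.py"]
    run_command(self._launch_args + testargs)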
Hi @muellerzr, when running the tests (make test) I see the following (note the --ignore part):
python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py
So the example tests need to be run explicitly? Is this what you do yourself? Asking b/c make test ignores the examples.
You need test_examples, which is further down from where you were looking: https://github.com/huggingface/accelerate/blob/main/Makefile#L45-L46
Yes, the core tests ignore the examples, as they often require GPUs or take a long time.
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
@searchivarius any particular reason for closing/reopening later rather than modifying on the PR branch and having a trail to follow back through on the PR comments?
@muellerzr I just wanted to squash commits. If this is not an issue, I can reopen the PR & make additional commits, just let me know.
Not an issue at all, we squash all commits upon merge anyways.
PS: @muellerzr not quite ready yet, I am still making changes and testing.
@muellerzr it should be ready; I am not sure why the quality check fails.
@searchivarius you should try doing ...
@muellerzr it doesn't make a difference.
@searchivarius perhaps uninstall then reinstall everything:
pip uninstall black flake8 hf-doc-builder -y
pip install -e .[quality]
Note that you'll also need ...
Hi @muellerzr, this didn't help. Moreover, BTW, regarding urllib3: ...
Perfect, that's the right version of urllib3. For the other deps, try installing these versions:
pip install black==23.1 ruff==0.0.241 hf-doc-builder==0.4.0
@muellerzr thank you, now the file is reformatted properly. There is a chance that the versions were not the issue and I was rather working on the wrong branch. There's no way to know for sure though (b/c I deleted and re-cloned the local repo). That said, it seems to be fine now.
LGTM! Thanks for all the back and forth, and hard work on this! 🤗
cc @sgugger for final review
Just have a couple of nits on the copyright years, but LGTM otherwise! Thanks!
@muellerzr @sgugger NP, and thank you for the guidance. Do not forget to remove the division by the number of steps in examples like this one: ...
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Hi,
As discussed with @muellerzr, I am adding non-breaking functionality to support local SGD, which enables efficient multi-GPU training when no fast interconnect is available. A usage sketch follows below.
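Here is a minimal, hypothetical usage sketch of the context-manager API discussed in this PR (the class name, module path, and arguments are assumptions; check the merged code for the exact interface):

import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator
from accelerate.local_sgd import LocalSGD  # assumed module path

# Toy model and data so the sketch runs end to end.
model = torch.nn.Linear(10, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
dataset = TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,)))
dataloader = DataLoader(dataset, batch_size=8)
loss_fn = torch.nn.CrossEntropyLoss()

accelerator = Accelerator()
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

with LocalSGD(accelerator=accelerator, model=model, local_sgd_steps=8, enabled=True) as local_sgd:
    for inputs, targets in dataloader:
        loss = loss_fn(model(inputs), targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
        # Assumed to count steps and average parameters across processes
        # every `local_sgd_steps` optimizer steps.
        local_sgd.step()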
This replaces a previous pull request.
I incorporated the recommendation to change the per-device batch size (in the separate benchmark), and I also removed the erroneous code (borrowed from the examples) that reduced the size of the dataset, as well as addressing the code-related recommendations, and retested everything. There is still some unexplained behavior, but the results look quite reasonable now:
Here is a link to the summary of results and benchmarking code.
Importantly, @sgugger @muellerzr, it is possible that every no-trainer example has this line that reduces the dataset size, but I think it's a bug. Please double-check!
It is not clear what kind of unit test can test this kind of functionality. The key ingredient (disabling gradient synchronization) is being tested already.
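For reference, a simplified sketch of that key mechanic, i.e., skipping the DDP gradient all-reduce and periodically averaging parameters instead (an illustration under assumed semantics of Accelerator.no_sync and Accelerator.reduce, not the PR's exact code):

import torch

def local_sgd_training_step(accelerator, model, optimizer, batch, step, local_sgd_steps=8):
    # Keep gradients local to each process: no all-reduce during backward.
    with accelerator.no_sync(model):
        loss = model(**batch).loss
        accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()
    # Every `local_sgd_steps` steps, average the model parameters across processes.
    if (step + 1) % local_sgd_steps == 0:
        with torch.no_grad():
            for param in model.parameters():
                param.data = accelerator.reduce(param.data, reduction="mean")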