Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding support for local SGD. #1378

Merged
merged 9 commits into from
May 9, 2023
Merged

Conversation

searchivarius
Copy link
Contributor

@searchivarius searchivarius commented May 1, 2023

Hi,

As discussed with @muellerzr , I am adding a non-breaking functionality to support local SGD, which enables efficient multi-GPU training in the case when no fast interconnect is possible.

This replaces a previous pull request

I incorporated recommendation to change the per device batch size (in the separate benchmark) and I also removed the erroneous code (borrowed from the examples) that reduced the size of the dataset (as well as code-related recommendations) and retested everything. There is still some unexplained behavior, but it looks quite reasonable now:

Here is a link to the summary of results and benchmarking code.

Importantly @sgugger @muellerzr I think it is possibly every no-trainer example has this line that reduces the dataset size, but I think it's a bug. Please double check!

It is not clear what kind of unit test can test this kind of functionality. The key ingredient (disabling gradient synchronization is being tested already).

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 1, 2023

The documentation is not available anymore as the PR was closed or merged.

Copy link
Collaborator

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I've made a few suggestions to simply the code, and one request re; attribution. The suggestions can be done here or I can do it in a follow up PR, whichever is easier. Thanks again!

@@ -34,6 +34,7 @@
EXCLUDE_EXAMPLES = [
"cross_validation.py",
"gradient_accumulation.py",
"local_sgd.py",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this is accurate, to test your script we should also include a test similar to

def test_gradient_accumulation(self):
testargs = ["examples/by_feature/gradient_accumulation.py"]
run_command(self._launch_args + testargs)
, to ensure your script runs fine. That would be all the tests that it would need I believe :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding this one, I am a bit unsure how this is invoked. I add a file examples/by_feature/local_sgd.py, I can add a similar function test_local_sgd(self). A couple of Qs though:

  1. will it be called automatically?
  2. which testargs will be used?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It gets called when the example tests are through our CI. testargs are whatever need to be used to make your script run with the feature enabled. For instance this reads as:

accelerate launch examples/by_feature/gradient_accumulation.py

(accelerate launch comes from self._launch_args`)

Yes, add a similar test_local_sgd() file which does this, as we make sure all of our example tests get run. to see when they are fully evoked, see the Makefile under test_examples

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @muellerzr running the tests (make test) and see the following (note the ignore part):

python -m pytest -s -v ./tests/ --ignore=./tests/test_examples.py

so tests need to be run explicitly? is this what you do yourself? asking b/c make test ignores examples.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need test_examples, which was further down from where you looked at... https://github.com/huggingface/accelerate/blob/main/Makefile#L45-L46

Yes, the core tests ignore the examples as they often require GPUs or take a long time..

src/accelerate/local_sgd.py Outdated Show resolved Hide resolved
src/accelerate/local_sgd.py Outdated Show resolved Hide resolved
src/accelerate/local_sgd.py Outdated Show resolved Hide resolved
searchivarius and others added 3 commits May 2, 2023 10:49
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
Co-authored-by: Zachary Mueller <muellerzr@gmail.com>
@muellerzr
Copy link
Collaborator

@searchivarius any particular reason for closing/reopening later rather than modifying on the PR branch and having a trail to follow back through on the PR comments?

@searchivarius
Copy link
Contributor Author

@muellerzr I just wanted to squash commits. If this is not an issue, I can reopen the PR & make additional commits, just let me know.

@muellerzr
Copy link
Collaborator

Not an issue at all, we squash all commits upon merge anyways

@searchivarius searchivarius reopened this May 2, 2023
@searchivarius
Copy link
Contributor Author

PS: @muellerzr not quite ready yet, I am still making changes and testing.

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented May 2, 2023

The documentation is not available anymore as the PR was closed or merged.

@searchivarius
Copy link
Contributor Author

@muellerzr it should be ready, I am not sure why the quality check fails, make quality or 'make style` do not suggest any changes when run locally. Perhaps, there's some version mismatch.

@muellerzr
Copy link
Collaborator

@searchivarius you should try doing pip install -e .[quality], as we pin what versions we use which could be part of it? :)

@searchivarius
Copy link
Contributor Author

@muellerzr it doesn't make a difference.

@muellerzr
Copy link
Collaborator

@searchivarius perhaps uninstall then reinstall everything in the docs dep? Otherwise I can fork + put a PR in to fix it. Aka:

pip uninstall black flake8 hf-doc-builder -y
pip install -e .[quality]

Note that you'll also need urllib3 to be < 2.0.0

@searchivarius
Copy link
Contributor Author

Hi @muellerzr this didn't help: moreover, make style and make quality do not reformat the code. I suspect it's something about the version of black and ruff that is being installed. Do you know which versions do you use?

BTW, regarding urllib3:

>>> urllib3.__version__
'1.26.14'

@muellerzr
Copy link
Collaborator

Perfect, that's the right version of urllib3.

For the other deps, try installing these versions:

pip install black==23.1 ruff==0.0.241 hf-doc-builder==0.4.0

@searchivarius
Copy link
Contributor Author

searchivarius commented May 8, 2023

@muellerzr thank you, now the file is reformatted properly. There is a chance that versions were not an issue, but I was rather working on a wrong branch. There's no way to know this for sure though (b/c I deleted and re-cloned the local repo). That said, it seems to be fine now.

Copy link
Collaborator

@muellerzr muellerzr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LG2M! Thanks for all the back and forth, and hard work on this! 🤗

cc @sgugger for final review

Copy link
Collaborator

@sgugger sgugger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just have a couple of nits in the Copyright years, but LGTM otherwise! Thanks!

docs/source/usage_guides/local_sgd.mdx Outdated Show resolved Hide resolved
examples/by_feature/local_sgd.py Outdated Show resolved Hide resolved
src/accelerate/local_sgd.py Outdated Show resolved Hide resolved
@searchivarius
Copy link
Contributor Author

@muellerzr @sgugger NP and thank you for the guidance. Do not forget to remove the division by the number of steps in examples like this one:
https://github.com/huggingface/transformers/blob/main/examples/pytorch/question-answering/run_qa_no_trainer.py#L758

searchivarius and others added 3 commits May 9, 2023 09:35
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@sgugger sgugger merged commit da39665 into huggingface:main May 9, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants