
Fix beam search when using model parallel #24969

Merged

Conversation

Contributor

@pfldy2850 pfldy2850 commented Jul 21, 2023

What does this PR do?

This PR fixes a crash when running beam search on multiple GPUs. A similar issue was observed and fixed for T5 (#11717) and LLaMA (#24224).
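
For context, the crash comes from beam search reordering the cached key/value states with a beam_idx tensor that can sit on a different GPU than a given layer's cache once the model is sharded across devices. Below is a minimal sketch of the per-model pattern, following the earlier T5 and LLaMA fixes; the function body is illustrative, not the exact merged diff for every model:

```python
# Illustrative sketch of the fix pattern (mirrors the T5 #11717 and
# LLaMA #24224 changes); not the exact merged diff for every model.
def _reorder_cache(past_key_values, beam_idx):
    reordered_past = ()
    for layer_past in past_key_values:
        # Under model parallelism, each layer's cache may live on a
        # different GPU, so move beam_idx to that device before indexing.
        reordered_past += (
            tuple(
                past_state.index_select(0, beam_idx.to(past_state.device))
                for past_state in layer_past
            ),
        )
    return reordered_past
```

The key change is beam_idx.to(past_state.device): previously the index tensor stayed on a single device, which raised a device-mismatch error during multi-GPU beam search.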

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.
@ArthurZucker @younesbelkada

ArthurZucker
ArthurZucker previously approved these changes Jul 21, 2023
Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for the fix!
This seems to apply to a lot of models; it would be good to fix them all in one go!

@ArthurZucker ArthurZucker requested a review from amyeroberts July 21, 2023 06:59
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@pfldy2850
Contributor Author

I agree with you!
Which do you think is better: fixing the rest of the models in this PR, or creating a new PR for them?

I am willing to do further work based on your comments.

@ArthurZucker
Collaborator

I think it will be easier to have everything in one PR, given how small and repetitive the change is!

@pfldy2850 pfldy2850 changed the title Fix GPTNeoX beam search when using parallelize Fix beam search when using model parallel Jul 21, 2023
@pfldy2850
Contributor Author

@ArthurZucker

I have added fix commits for every model that required correction, and I have also updated the cookiecutter template.

I have also updated the PR title and description to match the broader scope.

Since there were numerous models to correct, some spots may have been overlooked, so I would appreciate another review of the changes. Thank you!

@ArthurZucker ArthurZucker dismissed their stale review July 21, 2023 10:00

Code updated

@ArthurZucker
Collaborator

There is no device in TensorFlow; let's limit the changes to PyTorch!

@pfldy2850 pfldy2850 force-pushed the fix-gpt-neox-parallelize-beam branch from 68a41f5 to 9e978c4 Compare July 21, 2023 10:08
@pfldy2850
Contributor Author

Oh! Thank you for catching the mistake.
As you suggested, I have dropped the modifications to the TF base model.

Collaborator

@amyeroberts amyeroberts left a comment

LGTM - thanks for fixing this!

cc @gante for reference

@pfldy2850
Contributor Author

pfldy2850 commented Aug 1, 2023

@ArthurZucker

Is there anything else you'd like me to fix?
I'd like to work from the main branch once this is merged.

@ArthurZucker
Collaborator

Not at all, sorry! Let me have a final look and I'll merge this!

Collaborator

@ArthurZucker ArthurZucker left a comment

LGTM! My last comment: let's add a slow test for our nightly multi-GPU CI to make sure this is covered (beam search on multiple GPUs)! It would go in test_modeling_common.py!

@pfldy2850
Contributor Author

Hmm, by the way, it seems there's already a test in the file you mentioned that covers beam search on multiple GPUs.

https://github.com/pfldy2850/transformers/blob/e07126aac6840568b0db0b369d199f3a0cefa28f/tests/test_modeling_common.py#L2468-L2494

Why wasn't this test run before this issue came up?

@ArthurZucker
Collaborator

You are right! It might be test_model_parallel; it is set to False by default.

@pfldy2850
Contributor Author

@ArthurZucker

What do you think about setting test_model_parallel=True in the existing modeling test files instead of creating a new test?
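
For illustration, opting a model in would be a one-line change on its tester class; the class name and import path below are hypothetical, just to show the shape of the change:

```python
import unittest

# Import path depends on the test-suite layout; hypothetical here.
from tests.test_modeling_common import ModelTesterMixin


class GPTNeoXModelTest(ModelTesterMixin, unittest.TestCase):
    # ModelTesterMixin defaults this flag to False, which is why the
    # multi-GPU beam search path was never exercised.
    test_model_parallel = True
```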

@ArthurZucker
Collaborator

Great idea! The problem is that this might also trigger other tests, and there might be a reason we don't run them (maybe too slow, or the model isn't used enough to need these tests). Pinging @amyeroberts for a final answer 🤗

@amyeroberts
Collaborator

OK, I think this is hitting on some areas of our testing suite which are non-obvious and/or need updating.

AFAICT, there are only two models which have test_model_parallel=True: GPT2 and T5. The tests which check this flag both use a deprecated method, model.parallelize, so this flag controls testing of backwards-compatibility features.

We have another test for parallelism which doesn't check test_model_parallel: test_model_parallelism, which is an accelerate test and checks whether the model has _no_split_modules implemented.
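
For reference, _no_split_modules is a class attribute on each pretrained-model class listing the module classes that accelerate must keep on a single device when sharding with device_map="auto". An illustrative example (check the exact value against the model's source):

```python
from transformers import PreTrainedModel


class GPTNeoXPreTrainedModel(PreTrainedModel):
    # Modules named here are never split across devices by accelerate's
    # device_map="auto"; a value of None means auto-sharding is unsupported.
    _no_split_modules = ["GPTNeoXLayer"]
```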

In addition, generate-specific tests should be added to GenerationTesterMixin rather than ModelTesterMixin, as only models with .generate methods should be tested.

What I would suggest (a rough sketch follows this list):

  • Move test_model_parallel_beam_search to GenerationTesterMixin.
  • Update the modeling tests so that each of the models updated in this PR has all_generative_model_classes added as an attribute to its model tester.
  • Update test_model_parallel_beam_search to check whether the model class has _no_split_modules implemented. If not, skip; if it does, load using device_map="auto" instead of checking test_model_parallel.
  • Make sure to mark the test as requiring multiple GPUs.
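
A rough sketch of what the resulting test could look like under these suggestions; helper names such as _get_input_ids_and_config follow the mixin's existing conventions and are assumptions here, not the exact merged code:

```python
import tempfile

from transformers.testing_utils import require_torch_multi_gpu


class GenerationTesterMixin:
    # ... existing generation tests ...

    @require_torch_multi_gpu
    def test_model_parallel_beam_search(self):
        for model_class in self.all_generative_model_classes:
            # Skip models accelerate cannot shard across devices.
            if model_class._no_split_modules is None:
                continue

            config, input_ids, attention_mask, *_ = self._get_input_ids_and_config()
            model = model_class(config).eval()

            # Save, then reload sharded across all visible GPUs.
            with tempfile.TemporaryDirectory() as tmp_dir:
                model.cpu().save_pretrained(tmp_dir)
                new_model = model_class.from_pretrained(tmp_dir, device_map="auto")

                # num_beams > 1 triggers the cache reordering that used
                # to crash with a device-mismatch error.
                new_model.generate(
                    input_ids,
                    attention_mask=attention_mask,
                    max_new_tokens=10,
                    num_beams=2,
                )
```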

@pfldy2850 pfldy2850 force-pushed the fix-gpt-neox-parallelize-beam branch from 6d0a7d7 to d6c50e2 Compare August 23, 2023 17:11
@pfldy2850
Contributor Author

pfldy2850 commented Aug 23, 2023

@amyeroberts @ArthurZucker

Sorry for the delay; I've been busy with work.

I've made the fixes as you suggested.

  1. The test_model_parallel_beam_search function has been moved from ModelTesterMixin to GenerationTesterMixin.
  2. Skip logic has been added based on whether _no_split_modules is implemented.
  3. Models are parallelized across multiple devices using device_map="auto".
  4. Errors surfaced by the new test have been fixed.

Collaborator

@amyeroberts amyeroberts left a comment

Thanks for iterating and working on updating & improving the docs!

Changes look good to me; just one comment on the default values in the test. Once that's done and @ArthurZucker has approved, we can merge :)

@dev-cotyledon

dev-cotyledon commented Sep 4, 2023

Is there any update?
I need this feature for my academic project.

@amyeroberts
Collaborator

Friendly ping @ArthurZucker :)

@amyeroberts
Collaborator

@dev-cotyledon If you need this for your project, it's possible to work from this branch until it's been merged into main:

  • Clone the repo: git clone git@github.com:huggingface/transformers.git
  • Create a local environment, e.g. python -m venv my-env
  • Install the package from source in dev mode: cd transformers && pip install -e .
  • Add the fork as a remote: git remote add pfldy2850 https://github.com/pfldy2850/transformers.git
  • Fetch this branch: git fetch pfldy2850 fix-gpt-neox-parallelize-beam
  • Check out this branch: git checkout fix-gpt-neox-parallelize-beam

Your environment will now be running the version of transformers from this branch.

@dev-cotyledon

How kind of you! Thanks a lot sir 🙏

@pfldy2850 pfldy2850 force-pushed the fix-gpt-neox-parallelize-beam branch from d98a286 to d9f7c55 Compare September 7, 2023 05:36
@pfldy2850 pfldy2850 force-pushed the fix-gpt-neox-parallelize-beam branch from d9f7c55 to 88089af Compare September 7, 2023 05:41
@pfldy2850
Contributor Author

I had to change the commit author, so I amended the commits and force-pushed.

@amyeroberts
Collaborator

@ArthurZucker Are you happy for us to merge?

@ArthurZucker
Collaborator

Yep! Sorry, I must have missed the ping 😢

Collaborator

@ArthurZucker ArthurZucker left a comment

Thanks for the contribution! (I'm just running the slow tests to make sure we don't have a bad surprise.)

@ArthurZucker ArthurZucker merged commit 8881f38 into huggingface:main Sep 14, 2023
@ArthurZucker
Collaborator

Thanks for the fix @pfldy2850, tests are all green so merged the PR 😉

parambharat pushed a commit to parambharat/transformers that referenced this pull request Sep 26, 2023
* Fix GPTNeoX beam search when using parallelize

* Fix beam search idx device when using model parallel

* remove onnx related stuff

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

* fix: move test_beam_search_on_multi_gpu to GenerationTesterMixin

* fix: add right item to _no_split_modules of MegaPreTrainedModel

* fix: add num_beams within parallelized beam_search test

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>

---------

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>