
[SPMD] Expose apply_backward_optimization_barrier #7477

Merged

merged 2 commits into master from alanwaketan/spmd on Jun 26, 2024

Conversation

alanwaketan (Collaborator)

Summary:
This PR exposes apply_backward_optimization_barrier in the torch_xla.distributed.spmd namespace.

Test Plan:
N.A.
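
For context, the call pattern exercised downstream (see the run_clm.py traceback later in this thread) is roughly the sketch below. It assumes a LLaMA-style Hugging Face model whose decoder layers live under model.model.layers; this is an illustration, not part of the PR.

```python
import torch_xla.distributed.spmd as xs


def add_backward_barriers(model):
    """Apply a backward optimization barrier to every decoder layer.

    `model` is assumed to be a LLaMA-style Hugging Face model whose decoder
    layers live under model.model.layers, as in the run_clm.py traceback below.
    """
    for layer in model.model.layers:
        # With this PR the helper is reachable from torch_xla.distributed.spmd.
        xs.apply_backward_optimization_barrier(layer)
```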

alanwaketan (Collaborator, Author)

Thanks, Jack!

alanwaketan merged commit 6894a08 into master on Jun 26, 2024
23 checks passed
alanwaketan deleted the alanwaketan/spmd branch on June 26, 2024 01:15
bhavya01 (Collaborator) commented on Jul 17, 2024

@alanwaketan Should we also merge this change into the 2.4 release? I saw a test failure with the latest 2.4 wheel:

```
[2024-07-17, 16:47:21 UTC] {logging_mixin.py:150} WARNING - Traceback (most recent call last):
  File "/home/ml-auto-solutions/transformers/examples/pytorch/language-modeling/run_clm.py", line 873, in <module>
    main()
  File "/home/ml-auto-solutions/transformers/examples/pytorch/language-modeling/run_clm.py", line 644, in main
[2024-07-17, 16:47:21 UTC] {logging_mixin.py:150} WARNING -     xs.apply_backward_optimization_barrier(model.model.layers[i])
AttributeError: module 'torch_xla.distributed.spmd' has no attribute 'apply_backward_optimization_barrier'
```

JackCaoG (Collaborator)

Which model failed?

bhavya01 (Collaborator) commented on Jul 17, 2024

llama2-train-spmd

Still looking into why this passed in the previous test run.

bhavya01 (Collaborator)

I realized that this is failing because of pytorch-tpu/transformers@ccf5b15

I replaced torch_xla.experimental.xla_sharding with torch_xla.distributed.spmd in the test, but the 2.4 release doesn't expose apply_backward_optimization_barrier through the latter. The test passes locally with the fix.
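
For wheels that predate a backport of this PR, one possible workaround in the test would be a small import shim. This is a sketch of that idea, not the fix that was actually applied; it assumes the 2.4 release still exports the helper from the older torch_xla.experimental.xla_sharding module, as the test relied on before pytorch-tpu/transformers@ccf5b15.

```python
# Prefer the torch_xla.distributed.spmd export added by this PR; fall back to
# the older experimental module on wheels (e.g. the 2.4 release) that lack it.
try:
    from torch_xla.distributed.spmd import apply_backward_optimization_barrier
except ImportError:
    from torch_xla.experimental.xla_sharding import apply_backward_optimization_barrier
```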

alanwaketan (Collaborator, Author)

Sure, can I still backport things? I might have 2-3 PRs that still need to be backported.
