
bf16xint16_gemm operator: add --transpose option #2466

Closed

Conversation

davidberard98
Contributor

`--transpose` will make this benchmark test an int16 x bf16 mm instead of a bf16 x int16 mm.

This matters for H100 because the wgmma instruction can take register operands only on the LHS; since the int16 operand has to be converted to bf16 in registers before the mm, int16 x bf16 is probably the easier one to support efficiently.
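
For context, here is a minimal PyTorch-level sketch of what the two benchmark modes compute; this is not the actual TritonBench operator code, and the function names and shapes are illustrative assumptions. The point is which operand carries the int16 -> bf16 conversion:

```python
import torch

# A minimal sketch, NOT the actual operator code: function names and
# shapes here are illustrative assumptions about the two benchmark modes.

def bf16xint16_mm(a_bf16: torch.Tensor, b_int16: torch.Tensor) -> torch.Tensor:
    # Default mode: bf16 (LHS) x int16 (RHS). The int16 operand is upcast
    # to bf16 before the matmul.
    return torch.mm(a_bf16, b_int16.to(torch.bfloat16))

def int16xbf16_mm(a_int16: torch.Tensor, b_bf16: torch.Tensor) -> torch.Tensor:
    # --transpose mode: int16 (LHS) x bf16 (RHS). On H100, wgmma can read
    # its LHS operand from registers, so a fused Triton kernel can do the
    # int16 -> bf16 conversion in registers right before the mma.
    return torch.mm(a_int16.to(torch.bfloat16), b_bf16)

if __name__ == "__main__":
    device = "cuda" if torch.cuda.is_available() else "cpu"
    m, k, n = 256, 512, 128
    a = torch.randn(m, k, dtype=torch.bfloat16, device=device)
    b = torch.randint(-4, 4, (k, n), dtype=torch.int16, device=device)
    # The two modes compute transposes of each other: (A @ B).T == B.T @ A.T
    out = bf16xint16_mm(a, b)
    out_t = int16xbf16_mm(b.T.contiguous(), a.T.contiguous())
    torch.testing.assert_close(out, out_t.T)
```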

@facebook-github-bot
Contributor

@davidberard98 has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Summary:
`--transpose` will make this benchmark test an int16 x bf16 mm instead of a bf16 x int16 mm.

This matters for H100 because the wgmma instruction can take register operands only on the LHS; since the int16 operand has to be converted to bf16 in registers before the mm, int16 x bf16 is probably the easier one to support efficiently.

Pull Request resolved: pytorch#2466

Test Plan:
In OSS: ran `python run_benchmark.py triton --op bf16xint16_gemm --transpose`

Internally, ran `buck2 run mode/opt //pytorch/benchmark:triton -- --op bf16xint16_gemm --transpose`

Internally, we run into the issue fixed by triton-lang/triton#4695, but otherwise both commands run.

Differential Revision: D63294109

Pulled By: davidberard98
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D63294109

@facebook-github-bot
Contributor

@davidberard98 merged this pull request in 0ab0e47.
