
Add support for building CUDA extension on Windows #396

Merged (7 commits) Jun 18, 2024

Conversation

@gau-nernst (Collaborator) commented Jun 18, 2024

Continue work from #305

To make sure the FP6-LLM kernel is compiled correctly, run:

python benchmarks/benchmark_fp6_llm.py

Expected outputs up to m=8 (run on a 4070 Ti SUPER, Windows 11, PyTorch 2.3, MSVC 14.38.33130, CUDA Toolkit 12.3):

| m | k | n | fp6_latency (ms) | fp16_latency (ms) | speedup (d/s) | correct |
|---|---|---|---|---|---|---|
| 1 | 8192 | 8192 | 0.0387494 | 0.223442 | 5.76635 | 1 |
| 1 | 10240 | 8192 | 0.113143 | 0.288529 | 2.55014 | 1 |
| 1 | 57344 | 8192 | 0.598838 | 1.53449 | 2.56244 | 1 |
| 1 | 8192 | 28672 | 0.289421 | 0.789677 | 2.72847 | 1 |
| 2 | 8192 | 8192 | 0.0430889 | 0.226758 | 5.26255 | 1 |
| 2 | 10240 | 8192 | 0.107766 | 0.280107 | 2.59921 | 1 |
| 2 | 57344 | 8192 | 0.569473 | 1.52654 | 2.68062 | 1 |
| 2 | 8192 | 28672 | 0.29369 | 0.759794 | 2.58706 | 1 |
| 4 | 8192 | 8192 | 0.0469865 | 0.228517 | 4.86347 | 1 |
| 4 | 10240 | 8192 | 0.108724 | 0.280926 | 2.58386 | 1 |
| 4 | 57344 | 8192 | 0.567743 | 1.53572 | 2.70496 | 1 |
| 4 | 8192 | 28672 | 0.300316 | 0.763842 | 2.54346 | 1 |
| 8 | 8192 | 8192 | 0.0527722 | 0.228795 | 4.33551 | 1 |
| 8 | 10240 | 8192 | 0.110667 | 0.280704 | 2.53649 | 1 |
| 8 | 57344 | 8192 | 0.570488 | 1.52786 | 2.67815 | 1 |
| 8 | 8192 | 28672 | 0.310384 | 0.776489 | 2.50171 | 1 |
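For reference, the speedup column in these tables is just the ratio of FP16 latency to FP6 latency (higher is better). A minimal sketch of that relationship; the `speedup` helper name is illustrative and not part of the benchmark script:

```python
def speedup(fp16_latency_ms: float, fp6_latency_ms: float) -> float:
    """Speedup of the FP6-LLM kernel over the FP16 baseline (higher is better)."""
    return fp16_latency_ms / fp6_latency_ms

# First row of the Windows table: m=1, k=8192, n=8192
# The result agrees with the table's speedup column (~5.766).
windows_row1 = speedup(0.223442, 0.0387494)
print(f"{windows_row1:.5f}")
```

The "correct" column is a separate check that the FP6 kernel's output matches the FP16 reference, so a row is only meaningful when it reads 1.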

The speedup is slightly worse than on Ubuntu for k=8192, n=8192. To confirm there is no regression on Ubuntu, these are the outputs from the same machine running Ubuntu 22.04:

| m | k | n | fp6_latency (ms) | fp16_latency (ms) | speedup (d/s) | correct |
|---|---|---|---|---|---|---|
| 1 | 8192 | 8192 | 0.0257488 | 0.216757 | 8.41815 | 1 |
| 1 | 10240 | 8192 | 0.105257 | 0.267286 | 2.53936 | 1 |
| 1 | 57344 | 8192 | 0.597281 | 1.58615 | 2.65562 | 1 |
| 1 | 8192 | 28672 | 0.286471 | 0.753127 | 2.62898 | 1 |
| 2 | 8192 | 8192 | 0.0290186 | 0.222586 | 7.67045 | 1 |
| 2 | 10240 | 8192 | 0.105451 | 0.27609 | 2.61818 | 1 |
| 2 | 57344 | 8192 | 0.560466 | 1.51298 | 2.69951 | 1 |
| 2 | 8192 | 28672 | 0.290177 | 0.740288 | 2.55116 | 1 |
| 4 | 8192 | 8192 | 0.0364796 | 0.226285 | 6.20305 | 1 |
| 4 | 10240 | 8192 | 0.107137 | 0.27615 | 2.57755 | 1 |
| 4 | 57344 | 8192 | 0.605757 | 1.60981 | 2.65751 | 1 |
| 4 | 8192 | 28672 | 0.319282 | 0.758238 | 2.37483 | 1 |
| 8 | 8192 | 8192 | 0.0521174 | 0.225405 | 4.32494 | 1 |
| 8 | 10240 | 8192 | 0.110171 | 0.279006 | 2.53248 | 1 |
| 8 | 57344 | 8192 | 0.569247 | 1.52264 | 2.67484 | 1 |
| 8 | 8192 | 28672 | 0.308445 | 0.7467 | 2.42085 | 1 |

There is no significant difference from the previous results in #223.

pytorch-bot bot commented Jun 18, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/396

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 02fd6ad with merge base f5b6ec9:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 18, 2024
@gau-nernst gau-nernst marked this pull request as ready for review June 18, 2024 14:53
@msaroufim msaroufim merged commit d0af941 into pytorch:main Jun 18, 2024
13 checks passed
@gau-nernst gau-nernst deleted the windows branch June 18, 2024 21:01
dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* Enable FP6-LLM kernel build on Windows

* fix benchmark script

* update setup.py

* update

* fix indent

* add -t=0 for linux

---------

Co-authored-by: Matthew Douglas <38992547+matthewdouglas@users.noreply.github.com>