[MRG] relax the FP8 CUDA arch limitation to SM89 #549

leeeizhang · 2024-08-21T08:39:14Z

closes: #548

NVIDIA Ada Lovelace GPUs (e.g., RTX 4090, L20, L40), which are SM89 devices, also support FP8 MMA. This PR therefore relaxes the CUDA architecture limitation to enable FP8 training on a broader range of devices.

The CUDA 12.0 announcement also states that it supports the Ada Lovelace architecture:
'CUDA 12.0 exposes programmable functionality for many features of the NVIDIA Hopper and NVIDIA Ada Lovelace architectures: ...32x Ultra xMMA (including FP8 and FP16)'

https://developer.nvidia.com/blog/cuda-toolkit-12-0-released-for-general-availability/

https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html

https://github.com/NVIDIA/cutlass/blob/c4e3e122e266644c61b4af33d0cc09f4c391a64b/include/cutlass/arch/mma_sm89.h#L57

After relaxing the CUDA architecture limitation for FP8, my environment with 4 x L40 GPUs (SM89) successfully trains Llama in float8 precision.
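
For illustration, the relaxed check can be sketched roughly as below. This is a minimal sketch, not the actual code in torchtitan/float8.py: the names `MIN_FP8_CAPABILITY`, `supports_fp8`, and `current_device_capability` are hypothetical, and the only PyTorch calls assumed are `torch.cuda.is_available()` and `torch.cuda.get_device_capability()`.

```python
from typing import Optional, Tuple

# Assumed threshold: SM89 (Ada Lovelace), per the description above.
MIN_FP8_CAPABILITY: Tuple[int, int] = (8, 9)


def supports_fp8(capability: Optional[Tuple[int, int]]) -> bool:
    """Return True if a (major, minor) compute capability is SM89 or later."""
    if capability is None:  # no CUDA device available
        return False
    # Tuple comparison is lexicographic: (9, 0) >= (8, 9), but (8, 6) is not.
    return tuple(capability) >= MIN_FP8_CAPABILITY


def current_device_capability() -> Optional[Tuple[int, int]]:
    """Query the current CUDA device through PyTorch; None if unavailable."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


if __name__ == "__main__":
    print("FP8 supported:", supports_fp8(current_device_capability()))
```

Comparing the full `(major, minor)` tuple rather than only the major version is what lets SM89 through while still rejecting SM80 and SM86 (Ampere).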

facebook-github-bot · 2024-08-21T08:39:20Z

Hi @leeeizhang!

Thank you for your pull request and welcome to our community.

Action Required

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

Process

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with CLA signed. The tagging process may take up to 1 hour after signing. Please give it that time before contacting us about it.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!

facebook-github-bot · 2024-08-21T08:41:44Z

Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!

torchtitan/float8.py

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>

awgu

Thank you!

relax the CUDA arch limit to SM89

a2a62aa

leeeizhang changed the title ~~[WIP] relax the FP8 CUDA arch limitation to SM89~~ [MRG] relax the FP8 CUDA arch limitation to SM89 Aug 21, 2024

facebook-github-bot added the CLA Signed label Aug 21, 2024

awgu reviewed Aug 21, 2024

torchtitan/float8.py (outdated, resolved)

refactor cuda arch comments

f2da9d0

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>

awgu approved these changes Aug 21, 2024

awgu merged commit 90c889e into pytorch:main Aug 21, 2024
5 checks passed
