Add minimum capability requirement for AWQ #1064
Conversation
@WoosukKwon I think the setup.py file might need a compute capability check to decide whether or not to build the quant kernel.
@esmeetu Thanks for the suggestion! I've instead added a guard to prevent compilation on unsupported GPUs.
Left some comments.
namespace vllm {
namespace awq {
Why is this namespace needed?
They are optional, but follow better coding convention. From the Google C++ Style Guide:
With few exceptions, place code in a namespace.
Namespaces prevent naming conflicts, so they're pretty useful for external code like the AWQ kernels.
capability = torch.cuda.get_device_capability()
capability = capability[0] * 10 + capability[1]
if capability < quant_config.get_min_capability():
    raise ValueError(
        f"The quantization method {model_config.quantization} is not "
        "supported for the current GPU. "
        f"Minimum capability: {quant_config.get_min_capability()}. "
        f"Current capability: {capability}.")
Why do we need the assert false in C++ if we have the check here?
Just saw the comments in the PR. Can we just change setup.py instead of the C++ files?
It's problematic when we want to build the wheel for all GPU architectures (e.g., for PyPI publication or building a Docker image). In such a case, we cannot selectively include the extension according to the architecture. Therefore, I believe this is an easier solution; in fact, we already use this kind of guard for the bfloat16 attention kernels, which do not support Turing and Volta GPUs.
LGTM! Thanks for the fix!
Closes #1063