chore: use scalar type to dispatch to different gptq_marlin kernels
#689
Use `ScalarType` instead of `num_bits` (in combination with `has_zp`) to perform the dispatching for `gptq_marlin`. This sets the stage for folding `fp8_marlin.cu` into `gptq_marlin.cu`, since we can now move `dequant_8bit` in `fp8_marlin.cu` to a `dequant<T, vllm::kFE4M3fn.id()>(int q)` specialization. I did not fold `fp8_marlin.cu` into `gptq_marlin.cu` in this PR, to avoid excessive compile times for `gptq_marlin.cu` for now.

To support passing a scalar type as a template parameter in C++17, the type has to be serialized to something that can be passed as a template parameter. This PR introduces the concept of serializing the type into a 64-bit int `id` (that can be passed as a template parameter) alongside a deserialization routine (`from_id`, for when the template needs to access the traits of the type). If/when we make C++20 the lowest supported standard, this serialization/deserialization can be removed, as C++20 introduces passing literal class types as template parameters (see: https://en.cppreference.com/w/cpp/language/template_parameters).
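
A minimal sketch of the serialize/deserialize idea, for illustration only. `ToyScalarType`, its fields, the bit layout of `id()`, and the helper functions are all hypothetical names and choices made for this example; they are not the actual `ScalarType` from this PR. The sketch shows a constexpr type descriptor packed into a 64-bit integer, used as a C++17 non-type template parameter, and unpacked again inside the template to recover its traits.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a scalar type descriptor. Field names and the
// bit layout below are assumptions, not the real ScalarType in this PR.
struct ToyScalarType {
  std::uint8_t exponent;  // exponent bits (0 for integer types)
  std::uint8_t mantissa;  // mantissa bits (value bits for integer types)
  bool is_signed;         // whether there is a sign bit
  std::int32_t bias;      // stored values are offset by this bias

  constexpr int size_bits() const {
    return exponent + mantissa + (is_signed ? 1 : 0);
  }

  // Serialize: pack all fields into one 64-bit integer, which C++17 accepts
  // as a non-type template argument (a class-type object is not accepted).
  constexpr std::int64_t id() const {
    return static_cast<std::int64_t>(exponent) |
           (static_cast<std::int64_t>(mantissa) << 8) |
           (static_cast<std::int64_t>(is_signed ? 1 : 0) << 16) |
           (static_cast<std::int64_t>(static_cast<std::uint32_t>(bias)) << 17);
  }

  // Deserialize: rebuild the descriptor from its id.
  static constexpr ToyScalarType from_id(std::int64_t id) {
    return ToyScalarType{
        static_cast<std::uint8_t>(id & 0xFF),
        static_cast<std::uint8_t>((id >> 8) & 0xFF),
        ((id >> 16) & 1) != 0,
        static_cast<std::int32_t>(
            static_cast<std::uint32_t>((id >> 17) & 0xFFFFFFFF))};
  }
};

// Hypothetical example types: 4-bit unsigned with bias 8, 8-bit unsigned
// with bias 128, and an 8-bit float with 4 exponent and 3 mantissa bits.
constexpr ToyScalarType kToyU4B8{0, 4, false, 8};
constexpr ToyScalarType kToyU8B128{0, 8, false, 128};
constexpr ToyScalarType kToyFE4M3{4, 3, true, 0};

// The template takes the serialized id and calls from_id to read the traits.
template <std::int64_t type_id>
constexpr int packed_values_per_int32() {
  constexpr ToyScalarType t = ToyScalarType::from_id(type_id);
  static_assert(t.size_bits() > 0 && 32 % t.size_bits() == 0,
                "type must pack evenly into 32 bits");
  return 32 / t.size_bits();
}

// Runtime dispatch: compare a runtime id against the known compile-time ids.
// Returns -1 for an id that matches none of the example types.
inline int packed_values_for(std::int64_t runtime_id) {
  if (runtime_id == kToyU4B8.id())
    return packed_values_per_int32<kToyU4B8.id()>();
  if (runtime_id == kToyU8B128.id())
    return packed_values_per_int32<kToyU8B128.id()>();
  if (runtime_id == kToyFE4M3.id())
    return packed_values_per_int32<kToyFE4M3.id()>();
  return -1;
}

// The round trip through id() and from_id() preserves the descriptor.
static_assert(ToyScalarType::from_id(kToyU4B8.id()).bias == 8, "round trip");
static_assert(ToyScalarType::from_id(kToyFE4M3.id()).is_signed, "round trip");

// Example usage of the runtime dispatch.
inline void example_usage() {
  assert(packed_values_for(kToyU4B8.id()) == 8);
  assert(packed_values_for(kToyFE4M3.id()) == 4);
}
```

With C++20, a sketch like this could presumably take the descriptor itself as the template parameter (`template <ToyScalarType t>`), which is why the `id()`/`from_id` pair is described above as removable once C++20 is the minimum standard.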