chore: use scalar type to dispatch to different `gptq_marlin` kernels #689

AlpinDale · 2024-09-11T08:04:17Z

Use `ScalarType` instead of `num_bits` (in combination with `has_zp`) to perform the dispatching for `gptq_marlin`. This sets the stage for folding `fp8_marlin.cu` into `gptq_marlin.cu`, since `dequant_8bit` in `fp8_marlin.cu` can now become a `dequant<T, vllm::kFE4M3fn.id()>(int q)` specialization. I did not fold `fp8_marlin.cu` into `gptq_marlin.cu` in this PR, to avoid excessive compile times for `gptq_marlin.cu` for now.

In order to pass a scalar type as a template parameter in C++17, it has to be serialized into something that is allowed as a template parameter. This PR introduces serializing the type into a 64-bit integer id (which can be passed as a template parameter), alongside a deserialization routine (`from_id`) for templates that need to access the traits of the type. If/when C++20 becomes the lowest supported standard, this serialization/deserialization can be removed, since C++20 allows literal class types as template parameters (see: https://en.cppreference.com/w/cpp/language/template_parameters).
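The id round trip described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `ScalarType` fields, the bit layout of the id, the `dequant_int` helper, and the constants other than the name `kFE4M3fn` are all assumptions made up for the example.

```cpp
#include <cstdint>

// Hypothetical, simplified type descriptor. The field set and the bit layout
// of the id are assumptions for illustration only.
struct ScalarType {
  using Id = std::int64_t;

  std::uint8_t exponent;  // exponent bits (0 for integer types)
  std::uint8_t mantissa;  // mantissa bits (value bits for integer types)
  bool is_signed;         // whether a sign bit is present
  std::uint32_t bias;     // stored value = real value + bias

  constexpr ScalarType(std::uint8_t e, std::uint8_t m, bool s, std::uint32_t b)
      : exponent(e), mantissa(m), is_signed(s), bias(b) {}

  constexpr int size_bits() const {
    return exponent + mantissa + (is_signed ? 1 : 0);
  }

  // Serialize into a plain integer, which C++17 accepts as a non-type
  // template parameter. Layout: bits 0-7 exponent, 8-15 mantissa,
  // 16 sign flag, 17-48 bias.
  constexpr Id id() const {
    return static_cast<Id>(exponent) |
           (static_cast<Id>(mantissa) << 8) |
           (static_cast<Id>(is_signed ? 1 : 0) << 16) |
           (static_cast<Id>(bias) << 17);
  }

  // Deserialize, so a template that only received the id can still
  // query the traits of the type at compile time.
  static constexpr ScalarType from_id(Id packed) {
    return ScalarType(static_cast<std::uint8_t>(packed & 0xFF),
                      static_cast<std::uint8_t>((packed >> 8) & 0xFF),
                      ((packed >> 16) & 1) != 0,
                      static_cast<std::uint32_t>((packed >> 17) & 0xFFFFFFFF));
  }
};

// Example constants (definitions are assumptions for this sketch).
constexpr ScalarType kU4B8(0, 4, false, 8);      // 4-bit unsigned, bias 8
constexpr ScalarType kU8B128(0, 8, false, 128);  // 8-bit unsigned, bias 128
constexpr ScalarType kFE4M3fn(4, 3, true, 0);    // 8-bit float, e4m3

// Hypothetical dequant helper for biased integer types: dispatch happens on
// the id, and from_id recovers the traits inside the template.
template <ScalarType::Id kTypeId>
constexpr int dequant_int(int q) {
  constexpr ScalarType t = ScalarType::from_id(kTypeId);
  constexpr int mask = (1 << t.size_bits()) - 1;
  return (q & mask) - static_cast<int>(t.bias);
}

// The round trip preserves the traits.
static_assert(ScalarType::from_id(kFE4M3fn.id()).size_bits() == 8, "");
```

With these definitions, a call site would select an instantiation with something like `dequant_int<kU4B8.id()>(q)`. Under C++20 the template could take the `ScalarType` value directly and the `id()`/`from_id` pair would no longer be needed.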

chore: use scalar type to dispatch to different gptq_marlin kernels

b746a59

AlpinDale merged commit 6144150 into main Sep 11, 2024
5 checks passed

AlpinDale deleted the gptq-scalar-type branch September 11, 2024 08:11
