
Does the official plan support Turing NVIDIA GPUs? #712

Open
jason-ji-227 opened this issue Feb 27, 2025 · 1 comment

Comments

@jason-ji-227

Firstly, I am pleased to see the author working so hard to maintain this project: v0.2.2rc1, released just two days ago, already runs normally on Ampere-architecture NVIDIA GPUs. However, I believe the original intention of the KTransformers project is to let individual enthusiasts experience private deployment of local large models at the lowest possible cost. So I have a small request: could the author take lower-end hardware into account in future versions and officially add support for the Turing NVIDIA GPU architecture? I have tried to modify some of the inference code following issues #374 and #452, but when KTransformers starts, the GPU memory load exceeds 22 GB, causing an OOM in my NVIDIA Quadro RTX 6000 environment and making it unusable.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-R1 --optimize_rule_path /workspace/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --gguf_path /workspace/download/DeepSeek-R1-GGUF-Q2_K_XS
......
......
loading blk.41.attn_kv_a_norm.weight to cuda:0
loading blk.41.attn_kv_b.weight to cuda:0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/workspace/ktransformers/ktransformers/local_chat.py", line 183, in <module>
    fire.Fire(local_chat)
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/workspace/ktransformers/ktransformers/local_chat.py", line 110, in local_chat
    optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
  File "/root/workspace/ktransformers/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 522, in load
    self.generate_linear.load(w=w)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 149, in load
    if w is None: w = self.load_weight(device=device)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 97, in load_weight
    tensors = self.load_multi(key, ["weight"], device=device)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 107, in load_multi
    tensors[k] = self.gguf_loader.load_gguf_tensor(key + "." + k, device=device)
  File "/root/workspace/ktransformers/ktransformers/util/custom_gguf.py", line 368, in load_gguf_tensor
    cur_values = GGML_DEQUANTIZE_GPU[ggml_name](data[blocks_begin*block_size : blocks_end*block_size], device, target_dtype)
  File "/root/workspace/ktransformers/ktransformers/util/custom_gguf.py", line 494, in dequantize_q2_k_gpu
    return KTransformersOps.dequantize_q2_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.15 GiB of which 7.19 MiB is free.
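
For reference, the remaining VRAM right before the failing allocation can be checked with a small snippet like this (a generic torch.cuda check I added for illustration, not KTransformers-specific code):

    import torch

    # Report how much memory is left on the device the weights are loaded to.
    free, total = torch.cuda.mem_get_info(0)  # returns (free_bytes, total_bytes)
    print(f"cuda:0 free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB total")

The OOM message above already shows the outcome: only 7.19 MiB of the 22.15 GiB is still free by the time loading reaches block 41.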

Finally, I wish the author smooth work and a happy life, and I wish the KTransformers project continued success, so that more enthusiasts with limited budgets can experience the joy of deploying large models privately.

@Azure-Tang
Contributor

Thank you for your interest in KTransformers and for raising this important question about Turing architecture GPU support. We appreciate your engagement with our project.

Technical Context:
The current limitation with Turing-series GPUs (e.g., the NVIDIA T4 or Quadro RTX 6000) stems from their incompatibility with the Marlin operator optimization. This forces a fallback to standard PyTorch operators, which increases VRAM consumption by approximately 400% compared to Ampere/Ada Lovelace architectures.
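
As a rough illustration only (a minimal sketch of the idea, not the actual KTransformers dispatch code), the architecture gate behind this limitation boils down to a compute-capability check:

    import torch

    def can_use_marlin(device: int = 0) -> bool:
        # Marlin kernels require compute capability 8.0 or newer
        # (Ampere / Ada Lovelace / Hopper); Turing cards such as the T4
        # or Quadro RTX 6000 report 7.5.
        major, minor = torch.cuda.get_device_capability(device)
        return (major, minor) >= (8, 0)

On Turing this check fails, so the quantized weights have to be dequantized and held as ordinary PyTorch tensors on the GPU instead of being consumed directly by the fused Marlin kernel, which is roughly where the extra VRAM consumption comes from.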

Implementation Challenges:
While integrating alternative quantization kernels such as GGML's could theoretically alleviate this, we face significant technical hurdles:

  1. The GGML GPU kernels in llama.cpp are tightly coupled with its custom memory-management system. Decoupling these components would be very time-consuming.
  2. Our current roadmap prioritizes stability improvements and further speed-ups (so that KTransformers becomes a genuinely usable tool rather than a toy).

So I'm sorry that this may not be achieved in the near future. Thank you for your understanding, and please don't hesitate to reach out with other technical inquiries.

Wishing you a good day.
