
Does the official plan support Turing NVIDIA GPUs? #712

Open
jason-ji-227 opened this issue Feb 27, 2025 · 1 comment

Comments

@jason-ji-227

Firstly, I am pleased to see the author working so hard to maintain this project: v0.2.2rc1, released just two days ago, already runs normally on Ampere-architecture NVIDIA GPUs. However, I believe the original intention of the KTransformers project is to let individual enthusiasts experience private deployment of local large models at the lowest possible cost. So I have a small request: could the author take lower-end hardware into account in future versions and officially add support for the Turing NVIDIA GPU architecture? I have tried to modify some of the inference code following issues #374 and #452, but when KTransformers starts, the GPU memory load exceeds 22 GB, causing an OOM in my NVIDIA Quadro RTX 6000 environment and making it unusable.

python -m ktransformers.local_chat --model_path deepseek-ai/DeepSeek-R1 --optimize_rule_path /workspace/ktransformers/ktransformers/optimize/optimize_rules/DeepSeek-V3-Chat.yaml --gguf_path /workspace/download/DeepSeek-R1-GGUF-Q2_K_XS
......
......
loading blk.41.attn_kv_a_norm.weight to cuda:0
loading blk.41.attn_kv_b.weight to cuda:0
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/root/workspace/ktransformers/ktransformers/local_chat.py", line 183, in <module>
    fire.Fire(local_chat)
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 135, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 468, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/root/miniconda3/envs/ktransformers/lib/python3.11/site-packages/fire/core.py", line 684, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/root/workspace/ktransformers/ktransformers/local_chat.py", line 110, in local_chat
    optimize_and_load_gguf(model, optimize_config_path, gguf_path, config)
  File "/root/workspace/ktransformers/ktransformers/optimize/optimize.py", line 131, in optimize_and_load_gguf
    load_weights(module, gguf_loader)
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/base_operator.py", line 63, in load
    utils.load_weights(child, self.gguf_loader, self.key+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 109, in load_weights
    load_weights(child, gguf_loader, prefix+name+".")
  File "/root/workspace/ktransformers/ktransformers/util/utils.py", line 111, in load_weights
    module.load()
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 522, in load
    self.generate_linear.load(w=w)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 149, in load
    if w is None: w = self.load_weight(device=device)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 97, in load_weight
    tensors = self.load_multi(key, ["weight"], device=device)
  File "/root/workspace/ktransformers/ktransformers/operators/linear.py", line 107, in load_multi
    tensors[k] = self.gguf_loader.load_gguf_tensor(key + "." + k, device=device)
  File "/root/workspace/ktransformers/ktransformers/util/custom_gguf.py", line 368, in load_gguf_tensor
    cur_values = GGML_DEQUANTIZE_GPU[ggml_name](data[blocks_begin*block_size : blocks_end*block_size], device, target_dtype)
  File "/root/workspace/ktransformers/ktransformers/util/custom_gguf.py", line 494, in dequantize_q2_k_gpu
    return KTransformersOps.dequantize_q2_k(c_pointer, data.size, block_size, ele_per_blk, device, target_dtype)
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 20.00 MiB. GPU 0 has a total capacity of 22.15 GiB of which 7.19 MiB is free.
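
For reference, the remaining VRAM right before the failing allocation can be checked with a small snippet like this (a generic torch.cuda check I added for illustration, not KTransformers-specific code):

    import torch

    # Report how much memory is left on the device the weights are loaded to.
    free, total = torch.cuda.mem_get_info(0)  # returns (free_bytes, total_bytes)
    print(f"cuda:0 free: {free / 2**30:.2f} GiB of {total / 2**30:.2f} GiB total")

The OOM message above already shows the outcome: only 7.19 MiB of the 22.15 GiB is still free by the time loading reaches block 41.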

Finally, I wish the author smooth work and a happy life, and I wish the KTransformers project continued success, so that more enthusiasts with limited budgets can experience the joy of deploying large models privately.

@Azure-Tang
Contributor

Thank you for your interest in KTransformers and for raising this important question about Turing architecture GPU support. We appreciate your engagement with our project.

Technical Context:
The current limitation with Turing-series GPUs (e.g., the NVIDIA T4 or Quadro RTX 6000) stems from their incompatibility with the Marlin operator optimization. This forces a fallback to standard PyTorch operators, which increases VRAM consumption by approximately 400% compared to Ampere/Ada Lovelace architectures.
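
As a rough illustration only (a minimal sketch of the idea, not the actual KTransformers dispatch code), the architecture gate behind this limitation boils down to a compute-capability check:

    import torch

    def can_use_marlin(device: int = 0) -> bool:
        # Marlin kernels require compute capability 8.0 or newer
        # (Ampere / Ada Lovelace / Hopper); Turing cards such as the T4
        # or Quadro RTX 6000 report 7.5.
        major, minor = torch.cuda.get_device_capability(device)
        return (major, minor) >= (8, 0)

On Turing this check fails, so the quantized weights have to be dequantized and held as ordinary PyTorch tensors on the GPU instead of being consumed directly by the fused Marlin kernel, which is roughly where the extra VRAM consumption comes from.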

Implementation Challenges:
While integrating alternative quantization kernels such as GGML's could theoretically alleviate this, we face significant technical hurdles:

  1. The GGML GPU kernels in llama.cpp are tightly coupled with its custom memory-management system. Decoupling these components would be very time-consuming.
  2. Our current roadmap prioritizes stability improvements and further speed-ups (so that KTransformers becomes a genuinely usable tool rather than a toy).

So I'm sorry that this may not be achieved in the near future. Thank you for your understanding, and please don't hesitate to reach out with other technical inquiries.

Wishing you a good day.
