
Error Loading DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf in oobabooga Web UI: Unknown Pre-Tokenizer Type 'deepseek-r1-qwen' #6710

Open
redflagrul opened this issue Jan 29, 2025 · 8 comments

@redflagrul

Hi everyone! I’ve been experimenting with running small quantized models on my CPU using the oobabooga text-generation-webui, and I recently came across the DeepSeek-R1-Distill-Qwen-1.5B-uncensored model. I saw that there’s a GGUF version available (this one), so I decided to give it a try. Unfortunately, I’m running into an issue when trying to load the model, and I’m not sure what’s going wrong.

Here’s what I’ve done so far:

Downloaded the DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf model from Hugging Face.

Placed the model in the models folder of my text-generation-webui directory.

Started the server using python server.py --cpu.

When I try to load the model, I get this error:

llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_model_load_from_file: failed to load model
ValueError: Failed to load model from file: models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf

I’m not super experienced with this stuff, so I’m not sure what the issue is. Is the deepseek-r1-qwen pre-tokenizer supported by llama.cpp? Or is there something else I need to do to get this model working?
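In case it helps with debugging, the same failure can be reproduced outside the web UI by driving llama-cpp-python directly (a minimal sketch, run from the text-generation-webui folder inside the textgen environment):

```python
# Minimal reproduction without the web UI (sketch; assumes llama-cpp-python
# is importable in the same "textgen" environment the UI uses).
from llama_cpp import Llama

llm = Llama(
    model_path="models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf",
    n_ctx=2048,
)
# On an older llama-cpp-python build this raises the same
# "unknown pre-tokenizer type: 'deepseek-r1-qwen'" failure that the UI shows.
```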

Here’s some info about my setup:

OS: Ubuntu 22.04.5 LTS

CPU: Intel i3-3225 (2 cores / 4 threads, 3.3 GHz)

RAM: 8 GB

Environment: Running everything on CPU with llama.cpp in the oobabooga Web UI.

python server.py --cpu
21:54:13-523178 INFO Starting Text generation web UI

Running on local URL: http://127.0.0.1:7860/

21:54:37-758021 INFO Loading "DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf"
21:54:38-248012 INFO llama.cpp weights detected: "models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf"
llama_model_loader: loaded meta data with 35 key-value pairs and 339 tensors from models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 1.5B Instruct
llama_model_loader: - kv 3: general.organization str = Qwen
llama_model_loader: - kv 4: general.finetune str = Instruct
llama_model_loader: - kv 5: general.basename str = Qwen2.5
llama_model_loader: - kv 6: general.size_label str = 1.5B
llama_model_loader: - kv 7: qwen2.block_count u32 = 28
llama_model_loader: - kv 8: qwen2.context_length u32 = 131072
llama_model_loader: - kv 9: qwen2.embedding_length u32 = 1536
llama_model_loader: - kv 10: qwen2.feed_forward_length u32 = 8960
llama_model_loader: - kv 11: qwen2.attention.head_count u32 = 12
llama_model_loader: - kv 12: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 13: qwen2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 14: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = deepseek-r1-qwen
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.bos_token_id u32 = 151646
llama_model_loader: - kv 21: tokenizer.ggml.eos_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 24: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 18
llama_model_loader: - kv 28: general.url str = https://huggingface.co/mradermacher/D...
llama_model_loader: - kv 29: mradermacher.quantize_version str = 2
llama_model_loader: - kv 30: mradermacher.quantized_by str = mradermacher
llama_model_loader: - kv 31: mradermacher.quantized_at str = 2025-01-23T16:25:41+01:00
llama_model_loader: - kv 32: mradermacher.quantized_on str = back
llama_model_loader: - kv 33: general.source.url str = https://huggingface.co/thirdeyeai/Dee...
llama_model_loader: - kv 34: mradermacher.convert_type str = hf
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q6_K: 198 tensors
llama_model_load: error loading model: error loading model vocabulary: unknown pre-tokenizer type: 'deepseek-r1-qwen'
llama_model_load_from_file: failed to load model
21:54:38-606632 ERROR Failed to load the model.
Traceback (most recent call last):
File "/home/lts/text-generation-webui/modules/ui_model_menu.py", line 214, in load_model_wrapper
shared.model, shared.tokenizer = load_model(selected_model, loader)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lts/text-generation-webui/modules/models.py", line 90, in load_model
output = load_func_map[loader](model_name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lts/text-generation-webui/modules/models.py", line 280, in llamacpp_loader
model, tokenizer = LlamaCppModel.from_pretrained(model_file)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/lts/text-generation-webui/modules/llamacpp_model.py", line 111, in from_pretrained
result.model = Llama(**params)
^^^^^^^^^^^^^^^
File "/home/lts/miniconda3/envs/textgen/lib/python3.11/site-packages/llama_cpp/llama.py", line 369, in init
internals.LlamaModel(
File "/home/lts/miniconda3/envs/textgen/lib/python3.11/site-packages/llama_cpp/_internals.py", line 56, in init
raise ValueError(f"Failed to load model from file: {path_model}")
ValueError: Failed to load model from file: models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf

Exception ignored in: <function LlamaCppModel.__del__ at 0x7fb234b5cc20>
Traceback (most recent call last):
File "/home/lts/text-generation-webui/modules/llamacpp_model.py", line 62, in del
del self.model
^^^^^^^^^^
AttributeError: 'LlamaCppModel' object has no attribute 'model'
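
For completeness, the pre-tokenizer the file declares can also be checked without starting the UI, using the gguf Python package (a sketch; the field decoding follows the layout used by gguf-py's dump script and may differ slightly between gguf versions):

```python
# Inspect the GGUF header for the declared pre-tokenizer
# (sketch; assumes `pip install gguf`).
from gguf import GGUFReader

reader = GGUFReader("models/DeepSeek-R1-Distill-Qwen-1.5B-uncensored.Q6_K.gguf")
field = reader.fields["tokenizer.ggml.pre"]
# For string-valued fields the last part holds the raw UTF-8 bytes.
print(bytes(field.parts[-1]).decode("utf-8"))  # expected: deepseek-r1-qwen
```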

@redflagrul (Author)

Quoting mradermacher (Owner), posted 18 minutes ago:

> Almost certainly you need a new enough version of oobabooga to run this model. Check for an update.
>
> ((I don't know if this is the problem))

@DanBoyDan

I cannot run any DeepSeek R1 GGUF distills, even though I have the latest version.

@jepjoo

jepjoo commented Jan 29, 2025

There is an existing issue for this:
#6679

The problem is that llama-cpp-python needs to be updated; that's what we are waiting for. It looks like it's being worked on right now, and new wheels may be available soon.
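
In the meantime, a quick way to see which llama-cpp-python wheel the UI is actually using (a sketch; run inside the textgen environment):

```python
# Print the version of the installed Python bindings for llama.cpp.
import llama_cpp
print(llama_cpp.__version__)
```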

@jmzhaojq

Yes, I am waiting for the update and looking forward to it.

@homoluden

[offtopic] Is there a chance that the multimodal extension will become compatible with GGUF VL-capable models?
I've read that a llama.cpp update is required for VL GGUF support.

@dan-just

dan-just commented Jan 29, 2025

Just updated to the latest commit (thanks Ooba for the quick response). Testing now, but it should work in llama.cpp.

@YakuzaSuske

YakuzaSuske commented Jan 30, 2025

I updated everything already and I still get an error while using the Q6_K_L of the 14B.

I tried lowering the context size and the GPU layers; nothing helped...

This is Bartowski's GGUF.

[screenshot of the error]

EDIT/Update:

After removing the folder called "installer_files" and letting it reinstall everything, it fixed itself and I can now load the model. However, for some reason the UI now freezes a few seconds into generation: CPU usage drops to 0 and the model just stops generating. The UI still says it is typing/generating, but it becomes unresponsive, and I cannot unload the model, change it, etc. This now happens with any model; if I restore the previous installer_files it works fine. I tried setting the CPU threads to 12 to see if lowering it to half of my CPU helps. I get this:

[screenshot]

Turning off text streaming doesn't fix it either. I think llama-cpp 3.7 is bugged right now. I also tried a fresh install of the whole UI and it still gets stuck when generating. I also tried reinstalling with both CUDA 12.1 and 11.8; the issue persists on both.
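
If it helps to narrow this down, generation can also be driven through llama-cpp-python directly, bypassing the UI entirely (a sketch; the model path and thread count are placeholders):

```python
# Run a short generation outside the UI to see whether llama.cpp itself stalls
# (sketch; model_path and n_threads are placeholders for this setup).
from llama_cpp import Llama

llm = Llama(
    model_path="models/DeepSeek-R1-Distill-Qwen-14B-Q6_K_L.gguf",
    n_threads=12,
    n_ctx=4096,
    verbose=True,
)
out = llm("Hello, my name is", max_tokens=32)
print(out["choices"][0]["text"])
```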

@Recruiter2

Recruiter2 commented Feb 4, 2025

> I cannot run any DeepSeek R1 GGUF distills, even though I have the latest version.

DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf works here now... with an 8 GB card. I loaded up to a 32B q4, but that likely overflows into RAM or uses the SSD. Also, I am not starting it with python server.py; I use the web UI.
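
For reference, the partial offload described above comes down to the n_gpu_layers setting, which the web UI exposes as a slider and llama-cpp-python accepts directly (a sketch with placeholder values):

```python
# Keep only part of the model on an 8 GB card and let the rest spill to RAM
# (sketch; the path prefix and layer count are placeholders).
from llama_cpp import Llama

llm = Llama(
    model_path="models/DeepSeek-R1-Distill-Qwen-7B-Q6_K.gguf",
    n_gpu_layers=20,  # lower this if you still hit out-of-memory errors
    n_ctx=4096,
)
```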
