
Error running speculative inference #441

Open
pwilkin opened this issue Feb 17, 2025 · 6 comments
Labels
more-info-needed Need more information to diagnose the problem

Comments


pwilkin commented Feb 17, 2025

I've tried to use the Speculative Inference function, but using qwen2.5-14b-instruct-1m with qwen2.5-coder-0.5b-instruct as the draft model yields the following error:

2025-02-17 10:14:59 [DEBUG] 
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
2025-02-17 10:14:59 [DEBUG] 
common_init_from_params: failed to load model 'D:\models\lmstudio-community\Qwen2.5-Coder-0.5B-Instruct-GGUF\Qwen2.5-Coder-0.5B-Instruct-Q4_K_M.gguf'
2025-02-17 10:14:59 [DEBUG] 
[10388:0217/101459.695:ERROR:crashpad_client_win.cc(868)] not connected

which causes the following in the app:

[2025-02-17 10:15:00.030] [error] [LMSInternal][Client=LM Studio][Endpoint=sendMessage] Error in RPC handler: Error: Rehydrated error
Model has unloaded or crashed.
    at _0x8702b8.<computed> (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:453:109001)
    at _0x2584f6.subscriber (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:126:1877)
    at _0x2584f6.notifier (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:287:139189)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
- Caused By: Error: Channel Error
    at <computed> (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:148:61183)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
- Caused By: Error: Model has unloaded or crashed.
    at _0x3a48f8._0x38ddde (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:40:283917)
    at _0x3a48f8.emit (node:events:531:35)
    at _0x3a48f8.onChildExit (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:40:245731)
    at _0xc979e6.<anonymous> (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:40:244990)
    at _0xc979e6.emit (node:events:519:28)
    at ForkUtilityProcess.<anonymous> (C:\Program Files\LM Studio\resources\app\.webpack\main\index.js:201:7864)
    at ForkUtilityProcess.emit (node:events:519:28)
    at ForkUtilityProcess.a.emit (node:electron/js2c/browser_init:2:71438)

Running 0.3.10beta4 on Windows.
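The "invalid vector subscript" text appears to be the what() message of the std::out_of_range exception that MSVC's standard library throws from std::vector::at on an out-of-bounds index; llama.cpp catches exceptions during model loading and logs them as "error loading model: ...". A minimal sketch of how such a message would surface (illustrative only, not llama.cpp's actual loading code):

```cpp
#include <iostream>
#include <stdexcept>
#include <vector>

int main() {
    // Stand-in for some size-limited per-model table built at load time.
    std::vector<float> table(4);

    try {
        // On MSVC, an out-of-bounds .at() throws std::out_of_range whose
        // what() text is "invalid vector subscript".
        float f = table.at(100);
        (void) f;
    } catch (const std::exception & e) {
        // llama.cpp wraps model loading in a try/catch and logs the
        // exception text, which is how the message reaches the log above.
        std::cerr << "llama_model_load: error loading model: " << e.what() << "\n";
    }
    return 0;
}
```

That the wording is MSVC-specific lines up with the later report in this thread that the same setup works on Linux.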

yagil (Member) commented Feb 17, 2025

Hi, thanks for the bug report.

Does qwen2.5-coder-0.5b-instruct load successfully on its own when you pick it from the main model loader? (ctrl + L)

@yagil yagil transferred this issue from lmstudio-ai/lms Feb 17, 2025
@yagil yagil added the more-info-needed Need more information to diagnose the problem label Feb 18, 2025
ka-admin commented Feb 20, 2025

llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 4080) - 0 MiB free
2025-02-20 15:46:06 [DEBUG]
llama_model_loader: loaded meta data with 26 key-value pairs and 435 tensors from F:\LMModels\Qwen\Qwen2.5-Coder-3B-Instruct-GGUF\qwen2.5-coder-3b-instruct-q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen2.5 Coder 3B Instruct AWQ
llama_model_loader: - kv 3: general.finetune str = Instruct-AWQ
llama_model_loader: - kv 4: general.basename str = Qwen2.5-Coder
llama_model_loader: - kv 5: general.size_label str = 3B
llama_model_loader: - kv 6: qwen2.block_count u32 = 36
llama_model_loader: - kv 7: qwen2.context_length u32 = 32768
llama_model_loader: - kv 8: qwen2.embedding_length u32 = 2048
llama_model_loader: - kv 9: qwen2.feed_forward_length u32 = 11008
llama_model_loader: - kv 10: qwen2.attention.head_count u32 = 16
llama_model_loader: - kv 11: qwen2.attention.head_count_kv u32 = 2
llama_model_loader: - kv 12: qwen2.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 13: qwen2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 14: general.file_type u32 = 7
llama_model_loader: - kv 15: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 16: tokenizer.ggml.pre str = qwen2
2025-02-20 15:46:06 [DEBUG]
llama_model_loader: - kv 17: tokenizer.ggml.tokens arr[str,151936] = ["!", """, "#", "$", "%", "&", "'", ...
2025-02-20 15:46:06 [DEBUG]
llama_model_loader: - kv 18: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
2025-02-20 15:46:06 [DEBUG]
llama_model_loader: - kv 19: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 20: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 21: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 22: tokenizer.ggml.bos_token_id u32 = 151643
llama_model_loader: - kv 23: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 24: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 25: general.quantization_version u32 = 2
llama_model_loader: - type f32: 181 tensors
llama_model_loader: - type q8_0: 254 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 3.36 GiB (8.50 BPW)
2025-02-20 15:46:07 [DEBUG]
init_tokenizer: initializing tokenizer for type 2
2025-02-20 15:46:07 [DEBUG]
load: control token: 151661 '<|fim_suffix|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151649 '<|box_end|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151647 '<|object_ref_end|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151654 '<|vision_pad|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151659 '<|fim_prefix|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151648 '<|box_start|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: control token: 151644 '<|im_start|>' is not marked as EOG
load: control token: 151646 '<|object_ref_start|>' is not marked as EOG
load: control token: 151650 '<|quad_start|>' is not marked as EOG
load: control token: 151651 '<|quad_end|>' is not marked as EOG
load: control token: 151652 '<|vision_start|>' is not marked as EOG
load: control token: 151653 '<|vision_end|>' is not marked as EOG
load: control token: 151655 '<|image_pad|>' is not marked as EOG
load: control token: 151656 '<|video_pad|>' is not marked as EOG
load: control token: 151660 '<|fim_middle|>' is not marked as EOG
2025-02-20 15:46:07 [DEBUG]
load: special tokens cache size = 22
2025-02-20 15:46:07 [DEBUG]
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 2048
print_info: n_layer = 36
print_info: n_head = 16
print_info: n_head_kv = 2
print_info: n_rot = 128
print_info: n_swa = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 8
print_info: n_embd_k_gqa = 256
print_info: n_embd_v_gqa = 256
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: n_ff = 11008
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 2
print_info: rope scaling = linear
2025-02-20 15:46:07 [DEBUG]
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 3B
print_info: model params = 3.40 B
print_info: general.name = Qwen2.5 Coder 3B Instruct AWQ
print_info: vocab type = BPE
print_info: n_vocab = 151936
print_info: n_merges = 151387
print_info: BOS token = 151643 '<|endoftext|>'
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 148848 'ÄĬ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
2025-02-20 15:46:07 [DEBUG]
llama_model_load: error loading model: invalid vector subscript
llama_model_load_from_file_impl: failed to load model
2025-02-20 15:46:07 [DEBUG]
common_init_from_params: failed to load model 'F:\LMModels\Qwen\Qwen2.5-Coder-3B-Instruct-GGUF\qwen2.5-coder-3b-instruct-q8_0.gguf'
2025-02-20 15:46:07 [DEBUG]
[13252:0220/154607.088:ERROR:crashpad_client_win.cc(868)] not connected

All models load successfully, but any attempt to send a prompt crashes and unloads the main model.

pwilkin (Author) commented Feb 20, 2025

On an interesting note, I tried an LM Studio install on Linux (on the same box, with dual boot) and it works OK.

pwilkin (Author) commented Feb 20, 2025

And yes, I even did the test with both models already loaded.

pwilkin (Author) commented Feb 21, 2025

Can confirm that on Windows, 0.3.10b6 still fails with the same error message (llama_model_load: error loading model: invalid vector subscript).

Ryvix commented Mar 3, 2025

I've had this happen as well. I've found that if I set the context length to 16384 instead of the maximum 131072 (which the button offers to set automatically when loading with custom settings), and then select the smaller model on the Inference tab, it loads without errors.

Also, I've found that if I save the preset there and later load it, the smaller model isn't loaded unless I manually select it again in the dropdown menu.
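Ryvix's observation fits the loader logs above: the draft models report n_ctx_train = 32768, while the 1M main model invites far larger context settings. If the draft model is loaded with the main model's context length, an out-of-bounds index into a table sized for the draft's own training context could produce exactly this MSVC message. A hypothetical guard (illustrative only; not LM Studio's or llama.cpp's actual code) would clamp the draft's context before loading:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: choose the context length for the draft model,
// never exceeding what it was trained for (n_ctx_train). The names here
// are illustrative and do not come from LM Studio or llama.cpp.
uint32_t choose_draft_n_ctx(uint32_t requested_n_ctx, uint32_t draft_n_ctx_train) {
    return std::min(requested_n_ctx, draft_n_ctx_train);
}

// With these inputs, a 131072-token request against a draft trained at
// 32768 is clamped to 32768 -- the same range in which Ryvix's manual
// 16384 setting succeeds.
```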
