Embedding fails to run on vulkan backend #7130
It happens right here: Line 11229 in b6aa670
Also reproducible using the exe from the release page
Does the same issue happen with the server? Or is it just isolated to main?
Same error when I run
Let me summarize the investigation so far:
- With my OS and PC settings, the embedding computation always tries to first allocate a buffer with size 0, here: Line 11222 in b6aa670
- That is because of Lines 625 to 631 in b228aba
- For the Vulkan backend: Lines 6031 to 6043 in b228aba
- And because
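To make the zero-size allocation point concrete, here is a minimal sketch of the failure mode and the guard that avoids it. This is illustrative only, not the actual ggml source, and the helper name is hypothetical; the key fact is that Vulkan drivers may reject a zero-size buffer/memory request, so a 0-byte request has to be short-circuited before it reaches the driver.

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical stand-in for the backend's pinned-host allocation path.
// A size of 0 must never reach vkCreateBuffer/vkAllocateMemory, otherwise
// the driver can fail with ErrorInitializationFailed as in the logs below.
static void * backend_host_alloc(std::size_t size) {
    if (size == 0) {
        return nullptr; // skip the Vulkan call entirely for empty buffers
    }
    // ... the real code would create a VkBuffer and bind device memory ...
    return std::malloc(size);
}

int main() {
    // The CPU output buffer of a pooled-embedding run can legitimately be
    // 0 bytes ("CPU output buffer size = 0.00 MiB" in the logs), so this
    // path is actually hit in practice.
    void * p = backend_host_alloc(0);
    std::printf("zero-size request handled: %p\n", p);
    return 0;
}
```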
Embedding works for a short prompt:
Log 1

main: build = 2864 (cbf7589) main: built with Clang 18.1.4 for main: seed = 1715575791 llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = bert llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2 llama_model_loader: - kv 2: bert.block_count u32 = 6 llama_model_loader: - kv 3: bert.context_length u32 = 512 llama_model_loader: - kv 4: bert.embedding_length u32 = 384 llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536 llama_model_loader: - kv 6: bert.attention.head_count u32 = 12 llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 17 llama_model_loader: - kv 9: bert.attention.causal bool = false llama_model_loader: - kv 10: bert.pooling_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 14: tokenizer.ggml.model str = bert llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 63 tensors llama_model_loader: - type f16: 1 tensors llama_model_loader: - type q5_1: 28 tensors llama_model_loader: - type q8_0: 3 tensors llama_model_loader: - type q5_K: 4 tensors llama_model_loader: - type q6_K: 2 tensors llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = bert llm_load_print_meta: vocab type = WPM llm_load_print_meta: n_vocab = 30522 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 512 llm_load_print_meta: n_embd = 384 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 12 llm_load_print_meta: n_layer = 6 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_embd_head_k = 32 llm_load_print_meta: n_embd_head_v = 32 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 384 llm_load_print_meta: n_embd_v_gqa = 384 llm_load_print_meta: f_norm_eps = 1.0e-12 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 1536 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 0 llm_load_print_meta: pooling type = 1 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 512 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 22M llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 22.57 M llm_load_print_meta: model size = 19.99 MiB (7.43 BPW) llm_load_print_meta: general.name = all-MiniLM-L6-v2 llm_load_print_meta: BOS token = 101 '[CLS]' llm_load_print_meta: EOS token = 102 '[SEP]' llm_load_print_meta: UNK token = 100 '[UNK]' llm_load_print_meta: SEP token = 102 '[SEP]' llm_load_print_meta: PAD token = 0 '[PAD]' llm_load_print_meta: CLS token = 101 '[CLS]' llm_load_print_meta: MASK token = 103 '[MASK]' llm_load_print_meta: LF token = 0 '[PAD]' ggml_vulkan: Found 1 Vulkan devices: Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64 llm_load_tensors: ggml ctx size = 0.05 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/7 layers to GPU llm_load_tensors: CPU buffer size = 19.99 MiB ............................ llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 2048 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 4.50 MiB llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB WARNING: failed to allocate 0.00 MB of pinned memory ggml_vulkan: Failed to allocate pinned memory. 
ggml_vulkan: Null Pointer: ErrorInitializationFailed llama_new_context_with_model: CPU output buffer size = 0.00 MiB ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB llama_new_context_with_model: graph nodes = 221 llama_new_context_with_model: graph splits = 100 ggml_gallocr_needs_realloc: graph has different number of nodes ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | embedding 0: -0.078424 0.061774 0.122099 0.071252 -0.013703 -0.013969 0.057376 -0.043510 -0.059822 0.018061 0.005385 -0.043010 0.038214 -0.014732 0.027173 -0.001804 cosine similarity matrix: 1.00 0.22 llama_print_timings: load time = 104.76 ms

But it doesn't work for a longer prompt:
For the debug build, an MSVC runtime error shows up: "Expression: can't dereference invalidated vector iterator". This error isn't specific to this case though; I think I have seen it when running the llama.cpp main debug build too.

Log 2

main: build = 2864 (cbf7589) main: built with Clang 18.1.4 for main: seed = 1715576013 llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = bert llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2 llama_model_loader: - kv 2: bert.block_count u32 = 6 llama_model_loader: - kv 3: bert.context_length u32 = 512 llama_model_loader: - kv 4: bert.embedding_length u32 = 384 llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536 llama_model_loader: - kv 6: bert.attention.head_count u32 = 12 llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 17 llama_model_loader: - kv 9: bert.attention.causal bool = false llama_model_loader: - kv 10: bert.pooling_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 14: tokenizer.ggml.model str = bert llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 63 tensors llama_model_loader: - type f16: 1 tensors llama_model_loader: - type q5_1: 28 tensors llama_model_loader: - type q8_0: 3 tensors llama_model_loader: - type q5_K: 4 tensors llama_model_loader: - type q6_K: 2 tensors llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = bert llm_load_print_meta: vocab type = WPM llm_load_print_meta: n_vocab = 30522 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 512 llm_load_print_meta: n_embd = 384 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 12 llm_load_print_meta: n_layer = 6 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_embd_head_k = 32 llm_load_print_meta: n_embd_head_v = 32 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 384 llm_load_print_meta: n_embd_v_gqa = 384 llm_load_print_meta: f_norm_eps = 1.0e-12 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 1536 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 0 llm_load_print_meta: pooling type = 1 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 512 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 22M llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 22.57 M llm_load_print_meta: model size = 19.99 MiB (7.43 BPW) llm_load_print_meta: general.name = all-MiniLM-L6-v2 llm_load_print_meta: BOS token = 101 '[CLS]' llm_load_print_meta: EOS token = 102 '[SEP]' llm_load_print_meta: UNK token = 100 '[UNK]' llm_load_print_meta: SEP token = 102 '[SEP]' llm_load_print_meta: PAD token = 0 '[PAD]' llm_load_print_meta: CLS token = 101 '[CLS]' llm_load_print_meta: MASK token = 103 '[MASK]' llm_load_print_meta: LF token = 0 '[PAD]' ggml_vulkan: Found 1 Vulkan devices: Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64 llm_load_tensors: ggml ctx size = 0.05 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/7 layers to GPU llm_load_tensors: CPU buffer size = 19.99 MiB ............................ llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 2048 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 4.50 MiB llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB WARNING: failed to allocate 0.00 MB of pinned memory ggml_vulkan: Failed to allocate pinned memory. 
ggml_vulkan: Null Pointer: ErrorInitializationFailed llama_new_context_with_model: CPU output buffer size = 0.00 MiB ggml_gallocr_reserve_n: reallocating Vulkan0 buffer from size 0.00 MiB to 16.86 MiB ggml_gallocr_reserve_n: reallocating Vulkan_Host buffer from size 0.00 MiB to 3.50 MiB llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB llama_new_context_with_model: graph nodes = 221 llama_new_context_with_model: graph splits = 100 ggml_gallocr_needs_realloc: graph has different number of nodes ggml_gallocr_alloc_graph: cannot reallocate multi buffer graph automatically, call reserve ggml_backend_sched_alloc_splits: failed to allocate graph, reserving
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

For the release build, here is the error on the terminal:

Log 3

main: build = 2864 (cbf7589) main: built with Clang 18.1.4 for main: seed = 1715576579 llama_model_loader: loaded meta data with 24 key-value pairs and 101 tensors from ..\..\..\models\all-MiniLM-L6-v2-Q5_K_M.gguf (version GGUF V3 (latest)) llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output. llama_model_loader: - kv 0: general.architecture str = bert llama_model_loader: - kv 1: general.name str = all-MiniLM-L6-v2 llama_model_loader: - kv 2: bert.block_count u32 = 6 llama_model_loader: - kv 3: bert.context_length u32 = 512 llama_model_loader: - kv 4: bert.embedding_length u32 = 384 llama_model_loader: - kv 5: bert.feed_forward_length u32 = 1536 llama_model_loader: - kv 6: bert.attention.head_count u32 = 12 llama_model_loader: - kv 7: bert.attention.layer_norm_epsilon f32 = 0.000000 llama_model_loader: - kv 8: general.file_type u32 = 17 llama_model_loader: - kv 9: bert.attention.causal bool = false llama_model_loader: - kv 10: bert.pooling_type u32 = 1 llama_model_loader: - kv 11: tokenizer.ggml.token_type_count u32 = 2 llama_model_loader: - kv 12: tokenizer.ggml.bos_token_id u32 = 101 llama_model_loader: - kv 13: tokenizer.ggml.eos_token_id u32 = 102 llama_model_loader: - kv 14: tokenizer.ggml.model str = bert llama_model_loader: - kv 15: tokenizer.ggml.tokens arr[str,30522] = ["[PAD]", "[unused0]", "[unused1]", "... llama_model_loader: - kv 16: tokenizer.ggml.scores arr[f32,30522] = [-1000.000000, -1000.000000, -1000.00... llama_model_loader: - kv 17: tokenizer.ggml.token_type arr[i32,30522] = [3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ... llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32 = 100 llama_model_loader: - kv 19: tokenizer.ggml.seperator_token_id u32 = 102 llama_model_loader: - kv 20: tokenizer.ggml.padding_token_id u32 = 0 llama_model_loader: - kv 21: tokenizer.ggml.cls_token_id u32 = 101 llama_model_loader: - kv 22: tokenizer.ggml.mask_token_id u32 = 103 llama_model_loader: - kv 23: general.quantization_version u32 = 2 llama_model_loader: - type f32: 63 tensors llama_model_loader: - type f16: 1 tensors llama_model_loader: - type q5_1: 28 tensors llama_model_loader: - type q8_0: 3 tensors llama_model_loader: - type q5_K: 4 tensors llama_model_loader: - type q6_K: 2 tensors llm_load_vocab: mismatch in special tokens definition ( 7104/30522 vs 5/30522 ).
llm_load_print_meta: format = GGUF V3 (latest) llm_load_print_meta: arch = bert llm_load_print_meta: vocab type = WPM llm_load_print_meta: n_vocab = 30522 llm_load_print_meta: n_merges = 0 llm_load_print_meta: n_ctx_train = 512 llm_load_print_meta: n_embd = 384 llm_load_print_meta: n_head = 12 llm_load_print_meta: n_head_kv = 12 llm_load_print_meta: n_layer = 6 llm_load_print_meta: n_rot = 32 llm_load_print_meta: n_embd_head_k = 32 llm_load_print_meta: n_embd_head_v = 32 llm_load_print_meta: n_gqa = 1 llm_load_print_meta: n_embd_k_gqa = 384 llm_load_print_meta: n_embd_v_gqa = 384 llm_load_print_meta: f_norm_eps = 1.0e-12 llm_load_print_meta: f_norm_rms_eps = 0.0e+00 llm_load_print_meta: f_clamp_kqv = 0.0e+00 llm_load_print_meta: f_max_alibi_bias = 0.0e+00 llm_load_print_meta: f_logit_scale = 0.0e+00 llm_load_print_meta: n_ff = 1536 llm_load_print_meta: n_expert = 0 llm_load_print_meta: n_expert_used = 0 llm_load_print_meta: causal attn = 0 llm_load_print_meta: pooling type = 1 llm_load_print_meta: rope type = 2 llm_load_print_meta: rope scaling = linear llm_load_print_meta: freq_base_train = 10000.0 llm_load_print_meta: freq_scale_train = 1 llm_load_print_meta: n_yarn_orig_ctx = 512 llm_load_print_meta: rope_finetuned = unknown llm_load_print_meta: ssm_d_conv = 0 llm_load_print_meta: ssm_d_inner = 0 llm_load_print_meta: ssm_d_state = 0 llm_load_print_meta: ssm_dt_rank = 0 llm_load_print_meta: model type = 22M llm_load_print_meta: model ftype = Q5_K - Medium llm_load_print_meta: model params = 22.57 M llm_load_print_meta: model size = 19.99 MiB (7.43 BPW) llm_load_print_meta: general.name = all-MiniLM-L6-v2 llm_load_print_meta: BOS token = 101 '[CLS]' llm_load_print_meta: EOS token = 102 '[SEP]' llm_load_print_meta: UNK token = 100 '[UNK]' llm_load_print_meta: SEP token = 102 '[SEP]' llm_load_print_meta: PAD token = 0 '[PAD]' llm_load_print_meta: CLS token = 101 '[CLS]' llm_load_print_meta: MASK token = 103 '[MASK]' llm_load_print_meta: LF token = 0 '[PAD]' ggml_vulkan: Found 1 Vulkan devices: Vulkan0: AMD Radeon(TM) 780M | uma: 1 | fp16: 1 | warp size: 64 llm_load_tensors: ggml ctx size = 0.05 MiB llm_load_tensors: offloading 0 repeating layers to GPU llm_load_tensors: offloaded 0/7 layers to GPU llm_load_tensors: CPU buffer size = 19.99 MiB ............................ llama_new_context_with_model: n_ctx = 512 llama_new_context_with_model: n_batch = 2048 llama_new_context_with_model: n_ubatch = 2048 llama_new_context_with_model: flash_attn = 0 llama_new_context_with_model: freq_base = 10000.0 llama_new_context_with_model: freq_scale = 1 llama_kv_cache_init: CPU KV buffer size = 4.50 MiB llama_new_context_with_model: KV self size = 4.50 MiB, K (f16): 2.25 MiB, V (f16): 2.25 MiB WARNING: failed to allocate 0.00 MB of pinned memory ggml_vulkan: Failed to allocate pinned memory. ggml_vulkan: Null Pointer: ErrorInitializationFailed llama_new_context_with_model: CPU output buffer size = 0.00 MiB llama_new_context_with_model: Vulkan0 compute buffer size = 16.86 MiB llama_new_context_with_model: Vulkan_Host compute buffer size = 3.50 MiB llama_new_context_with_model: graph nodes = 221 llama_new_context_with_model: graph splits = 100
system_info: n_threads = 8 / 16 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
Thank you for the detailed report and the investigation, and apologies for not getting back to you sooner. I'll look into it and let you know what I find.
@Adriankhl Can you check whether #7360 fixes your issues?
@0cc4m hi, if the prompt is long, I still get a similar VC++ error in the debug build. In the release build the run finishes, but it gives a NaN vector:
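A quick way to pin down this failure mode is to scan the returned embedding for NaNs before computing similarities. A minimal sketch; the helper below is illustrative and not part of llama.cpp:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative helper: true if any component of an embedding is NaN,
// which is the symptom reported above for long prompts on Vulkan.
static bool has_nan(const std::vector<float> & emb) {
    for (float x : emb) {
        if (std::isnan(x)) {
            return true;
        }
    }
    return false;
}

int main() {
    std::vector<float> emb = {0.1f, std::nanf(""), 0.3f};
    std::printf("embedding %s NaNs\n", has_nan(emb) ? "contains" : "has no");
    return 0;
}
```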
Another interesting observation: if I set
I can see that NaN error; it only happens when no layers are offloaded. Otherwise it seems to work fine. The NaNs only happen on certain hardware and are caused by some clean-up issue that shows up in the Vulkan validation layer. I'll try to fix that soon.
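For anyone who wants to reproduce the validation-layer finding: with the Vulkan SDK installed, the Khronos validation layer can be enabled for prebuilt binaries via the VK_INSTANCE_LAYERS=VK_LAYER_KHRONOS_validation environment variable. In code it amounts to the following plain-Vulkan sketch (llama.cpp creates its instance internally, so this is only an illustration of the mechanism, not its actual setup):

```cpp
#include <vulkan/vulkan.h>
#include <cstdio>

int main() {
    // Request the standard validation layer at instance creation.
    const char * layers[] = { "VK_LAYER_KHRONOS_validation" };

    VkApplicationInfo app = {};
    app.sType      = VK_STRUCTURE_TYPE_APPLICATION_INFO;
    app.apiVersion = VK_API_VERSION_1_2;

    VkInstanceCreateInfo ci = {};
    ci.sType               = VK_STRUCTURE_TYPE_INSTANCE_CREATE_INFO;
    ci.pApplicationInfo    = &app;
    ci.enabledLayerCount   = 1;
    ci.ppEnabledLayerNames = layers;

    VkInstance instance = VK_NULL_HANDLE;
    VkResult res = vkCreateInstance(&ci, nullptr, &instance);
    std::printf("vkCreateInstance: %d\n", (int) res);

    if (res == VK_SUCCESS) {
        vkDestroyInstance(instance, nullptr);
    }
    return 0;
}
```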
@Adriankhl I fixed the NaN issue on my end, can you try running #7360 again?
@0cc4m seems to be working fine 🎊 I will do a bit more testing later on. One additional problem: I have figured out the cause of the debug build error, it happens here: Lines 625 to 646 in e23b974
Because of the MSVC bug, the vector size is detected wrongly in a debug build, even when
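For context, MSVC's debug runtime raises exactly that assertion whenever an iterator obtained before a std::vector reallocation is dereferenced afterwards. A minimal standalone example of the error class (illustrative only, not the ggml code):

```cpp
#include <vector>

int main() {
    std::vector<int> v = {1, 2, 3};
    auto it = v.begin();  // iterator into the current storage
    v.push_back(4);       // may reallocate, invalidating `it`
    // MSVC debug builds assert here ("can't dereference invalidated vector
    // iterator"); release builds silently read through a stale iterator.
    return *it;
}
```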
Thank you for checking!
I can't, sorry. I don't use Windows, so I wouldn't be able to verify that, and it's outside the scope of my PR. If you think it's a useful addition you can open a separate PR for it.
Thanks for this, and it also fixes the gibberish problem I encountered when the generated text exceeds the context size. |
System information: Windows 11, AMD 7840U CPU with 780M APU
Vulkan build:
cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_VULKAN=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
CPU build:
cmake .. -GNinja -DCMAKE_C_COMPILER=clang-cl -DCMAKE_CXX_COMPILER=clang-cl -DCMAKE_EXPORT_COMPILE_COMMANDS=1 -DLLAMA_NATIVE=OFF -DCMAKE_BUILD_TYPE=Release
Model: https://huggingface.co/second-state/All-MiniLM-L6-v2-Embedding-GGUF/tree/main
I think something is wrong with the support of embedding models.
Observations:
- main runs fine on Vulkan backend with a normal LLM model such as llama 3
- embedding works on CPU backend with embedding models such as All-MiniLM
- embedding "works" on Vulkan backend with a normal LLM model such as llama 3, though the output is not meaningful (see the cosine-similarity sketch after this list)
- embedding fails to run on Vulkan backend with embedding models such as All-MiniLM, with the following log
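For reference, the "cosine similarity matrix" that the embedding example prints (e.g. "1.00 0.22" in Log 1 above) is the standard cosine similarity between the returned vectors, so meaningless embeddings show up directly as degenerate similarity values. A minimal sketch of that computation; llama.cpp's embedding example has its own equivalent helper:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Cosine similarity of two embedding vectors: dot(a, b) / (|a| * |b|).
// The small epsilon guards against division by zero for all-zero vectors.
static float cosine_sim(const std::vector<float> & a, const std::vector<float> & b) {
    float dot = 0.0f, na = 0.0f, nb = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    return dot / (std::sqrt(na) * std::sqrt(nb) + 1e-9f);
}

int main() {
    std::vector<float> a = {0.1f, 0.7f, -0.2f};
    std::vector<float> b = {0.2f, 0.5f,  0.1f};
    std::printf("cosine similarity: %.2f\n", cosine_sim(a, b));
    return 0;
}
```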