[SYCL] GGML_ASSERT issue when running llama.cpp with SYCL on A770 #5513
Comments
have you tried This is weird, because the dGPU usually appears as the first device, but in your case it is device 3 and 5. Can you run the following and paste the output here?
source /PATH/TO/ONEAPI/setvars.sh
sycl-ls
I guess the issue is that you selected the OpenCL device in oneAPI, but we have only fully verified Level Zero (which should usually be the first, default device). |
Yes, I tried I got this:
and when I run the sycl device executable, I get this:
I don't think I'm selecting the OpenCL device in oneAPI; it's clearly mentioned in the logs that this is level_zero. In your PR #5208, you got the build on Windows working, but did you try running it on multiple Windows platforms to ensure that it's properly functioning on Windows ? |
Device selection has a problem when an iGPU and an Arc GPU are in the same PC. Please try this:
Thank you! |
After the change, only level_zero devices are being shown but the issue is still there:
|
How about device_id=0? |
|
@NeoZhangJianyu On a different note: When I try latest llama.cpp on MTL iGPU Windows, the code hangs with no output. When I switch to jordankanter@8c4aa67, I can run llama2-7B-Q4_0.gguf fully on iGPU (only 5.8GB) and with really good text output quality:
|
@airMeng using the latest branch with the #5624 changes eliminates the hang. However, I'm now in the same situation as with jordankanter@8c4aa67: when offloading all layers the model generates nothing, and when offloading only a few layers the output is gibberish. For the Arc A770, the GGML_ASSERT issue is still there though xd.. |
Cannot replicate this. I'm testing with 201294a on my Intel Arc a770 and everything works as expected |
I think my issue is probably related to what @NeoZhangJianyu mentioned above. @mudler are u in the same setting ? |
I have an AMD CPU. |
@NeoZhangJianyu @airMeng I tried on 2 other systems, each one having an A770/A770M with Intel Iris Xe graphics igpu on Windows, and I successfully reproduced this issue. This needs deeper investigation to know what's going on here. All of our systems have an igpu, and this will become a blocker sooner or later.. Also, got access to an AMD Ryzen9 CPU with A770 card, and I can confirm it's running out of the box without this issue. |
@aahouzi |
@NeoZhangJianyu I saw u created a revert PR, is #5901 merged or there is no change yet ? |
It was merged by mistake. The author will re-implement it with a different method but the same effect. You can try #5901 locally. |
@airMeng I think I'll wait until it's re-implemented |
Hi all @airMeng @NeoZhangJianyu, I am running into similar trouble with this issue.
GPU device: Arc A770
sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]
I used ollama to run llama.cpp SYCL inference with this PR https://github.com/ollama/ollama/pull/2458/files, but I got the error below:
ollama run example "What is your favourite condiment?"
!##"##! "!▅
▅
"! $ #"# ## ▅"#!
It's weird that this PR works on Arch Linux but not on Ubuntu. Could you please give me some debugging advice? |
Hi @airMeng, it still doesn't work... I think my bug is really weird (it works well on Arch Linux but not on Ubuntu) ☹
time=2024-03-12T20:06:03.311+08:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-12T20:06:03.311+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-12T20:06:03.346+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 oneapi cpu]"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-12T20:06:03.350+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27191.42]"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GIN] 2024/03/12 - 20:06:08 | 200 | 18.859µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/12 - 20:06:10 | 200 | 30.154µs | 127.0.0.1 | HEAD "/api/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09"
[GIN] 2024/03/12 - 20:06:13 | 200 | 3.154779213s | 127.0.0.1 | POST "/api/create"
[GIN] 2024/03/12 - 20:06:24 | 200 | 10.269µs | 127.0.0.1 | HEAD "/"
[GIN] 2024/03/12 - 20:06:24 | 200 | 120.482µs | 127.0.0.1 | POST "/api/show"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama2419540237/oneapi/libext_server.so
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2419540237/oneapi/libext_server.so"
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
| | | |compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 1.3| 512| 1024| 32| 16225243136|
| 1| [opencl:gpu:0]| Intel(R) Arc(TM) A770 Graphics| 3.0| 512| 1024| 32| 16225243136|
| 2| [opencl:cpu:0]| 13th Gen Intel(R) Core(TM) i9-13900K| 3.0| 32| 8192| 64| 67143290880|
| 3| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 32|67108864| 64| 67143290880|
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/arda/.ollama/models/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.name str = LLaMA v2
llama_model_loader: - kv 2: llama.context_length u32 = 4096
llama_model_loader: - kv 3: llama.embedding_length u32 = 4096
llama_model_loader: - kv 4: llama.block_count u32 = 32
llama_model_loader: - kv 5: llama.feed_forward_length u32 = 11008
llama_model_loader: - kv 6: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 7: llama.attention.head_count u32 = 32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32 = 32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 10: general.file_type u32 = 2
llama_model_loader: - kv 11: tokenizer.ggml.model str = llama
llama_model_loader: - kv 12: tokenizer.ggml.tokens arr[str,32000] = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv 13: tokenizer.ggml.scores arr[f32,32000] = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv 14: tokenizer.ggml.token_type arr[i32,32000] = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv 15: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 16: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 17: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 18: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 19: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 20: general.quantization_version u32 = 2
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: n_embd_k_gqa = 4096
llm_load_print_meta: n_embd_v_gqa = 4096
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 0
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = Q4_0
llm_load_print_meta: model params = 6.74 B
llm_load_print_meta: model size = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL0 buffer size = 3577.56 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 2048
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL0 KV buffer size = 1024.00 MiB
llama_new_context_with_model: KV self size = 1024.00 MiB, K (f16): 512.00 MiB, V (f16): 512.00 MiB
llama_new_context_with_model: SYCL_Host input buffer size = 13.02 MiB
llama_new_context_with_model: SYCL0 compute buffer size = 164.00 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 8.00 MiB
llama_new_context_with_model: graph splits (measure): 2 |
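The KV buffer sizes reported in this log can be checked by hand. The sketch below assumes the cache holds one f16 value (2 bytes) per context position, per layer, per K (or V) embedding element, which matches the `n_ctx`, `n_layer` and `n_embd_k_gqa` values printed above. The helper names `kv_half_bytes` and `to_mib` are made up for illustration and are not llama.cpp functions.

```cpp
#include <cstdint>

// Bytes needed for the K cache alone (the V cache has the same size here),
// assuming f16 storage, i.e. 2 bytes per element. Hypothetical helper.
constexpr std::uint64_t kv_half_bytes(std::uint64_t n_ctx,
                                      std::uint64_t n_layer,
                                      std::uint64_t n_embd_gqa) {
    return n_ctx * n_layer * n_embd_gqa * 2;
}

// Convert a byte count to MiB, the unit used in the log.
constexpr double to_mib(std::uint64_t bytes) {
    return static_cast<double>(bytes) / (1024.0 * 1024.0);
}

// Values from the log: n_ctx = 2048, n_layer = 32, n_embd_k_gqa = 4096.
// K alone: 2048 * 32 * 4096 * 2 bytes = 536870912 bytes = 512 MiB.
// K + V:   twice that, i.e. the reported "KV self size = 1024.00 MiB".
static_assert(kv_half_bytes(2048, 32, 4096) == 536870912ULL,
              "K cache should be 512 MiB for this configuration");
```

With `n_ctx = 512`, as in the later Windows log, the same arithmetic gives 128 MiB each for K and V, 256 MiB in total.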
@sgwhat |
The wrong result is caused by errors in the OPs. |
Sorry to bother you, but may I ask what these OPs are? Is the latest GGML lib the same as what I get by building the latest llama.cpp? |
Suggestion:
Thank you! |
I failed in step2, and I opened a new issue for it #6036. |
@aahouzi |
@NeoZhangJianyu I'm tracking your PR, you still didn't merge #6073, so I don't think it will work. I see that it's been merged, I will do my tests and keep you updated ;-) |
C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2447 (c47cf414)
main: built with IntelLLVM 2024.0.2 for
main: seed = 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
| | | |Compute |Max compute|Max work|Max sub| |
|ID| Device Type| Name|capability|units |group |group |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]| Intel(R) Iris(R) Xe Graphics| 1.3| 96| 512| 32| 3097038848|
| 1|[level_zero:gpu:1]| Intel(R) Arc(TM) A770M Graphics| 1.3| 512| 1024| 32| 3819835392|
| 2| [opencl:gpu:0]| Intel(R) Arc(TM) A770M Graphics| 3.0| 512| 1024| 32| 3819835392|
| 3| [opencl:gpu:1]| Intel(R) Iris(R) Xe Graphics| 3.0| 96| 512| 32| 3097038848|
| 4| [opencl:cpu:0]| 12th Gen Intel(R) Core(TM) i7-12700H| 3.0| 20| 8192| 64| 3846729728|
| 5| [opencl:acc:0]| Intel(R) FPGA Emulation Device| 1.2| 20|67108864| 64| 3846729728|
...
ggml_backend_sycl_set_mul_device_mode: true
+ detect 1 SYCL GPUs: [1] with top Max compute units:512 (A770M and not iGPU)
llm_load_tensors: ggml ctx size = 0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors: SYCL1 buffer size = 3577.56 MiB
llm_load_tensors: CPU buffer size = 70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: n_batch = 512
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: SYCL1 KV buffer size = 256.00 MiB
llama_new_context_with_model: KV self size = 256.00 MiB, K (f16): 128.00 MiB, V (f16): 128.00 MiB
llama_new_context_with_model: SYCL_Host output buffer size = 62.50 MiB
llama_new_context_with_model: SYCL1 compute buffer size = 70.50 MiB
llama_new_context_with_model: SYCL_Host compute buffer size = 9.00 MiB
llama_new_context_with_model: graph splits: 2
system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1
Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher on search engines like Google when someone searches for information related to what you offer. This includes creating content that is optimized for SEO, making sure that each page has a meta description and keyword tags (if applicable), and ensuring that all images have alt text descriptions attached to them. You should also link out from other websites where appropriate—this helps build authority with search engines while simultaneously giving users relevant information about topics they might be interested in reading more about later on down
llama_print_timings: load time = 8562.90 ms
llama_print_timings: sample time = 47.18 ms / 400 runs ( 0.12 ms per token, 8478.53 tokens per second)
llama_print_timings: prompt eval time = 242.04 ms / 19 tokens ( 12.74 ms per token, 78.50 tokens per second)
llama_print_timings: eval time = 20663.94 ms / 399 runs ( 51.79 ms per token, 19.31 tokens per second)
llama_print_timings: total time = 21107.17 ms / 418 tokens
Log end
|
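The `detect 1 SYCL GPUs: [1] with top Max compute units:512` line in the log above suggests that only the GPU with the highest compute-unit count is kept, so the A770M wins over the Iris Xe iGPU. The following is a minimal sketch of that kind of selection. It assumes only level_zero GPUs are considered, as reported earlier in the thread; `DeviceInfo`, `pick_top_level_zero_gpus` and `devices_from_log` are hypothetical names, and this is not the actual llama.cpp code. The device table is copied from the log.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical record for one row of the "found N SYCL devices" table.
struct DeviceInfo {
    int id;
    std::string backend;   // "level_zero" or "opencl"
    std::string type;      // "gpu", "cpu" or "acc"
    int max_compute_units;
};

static bool is_level_zero_gpu(const DeviceInfo& d) {
    return d.backend == "level_zero" && d.type == "gpu";
}

// Return the IDs of all level_zero GPUs that share the highest
// max-compute-units value; empty if there is no level_zero GPU.
std::vector<int> pick_top_level_zero_gpus(const std::vector<DeviceInfo>& devices) {
    int top = 0;
    for (const DeviceInfo& d : devices) {
        if (is_level_zero_gpu(d)) top = std::max(top, d.max_compute_units);
    }
    std::vector<int> ids;
    for (const DeviceInfo& d : devices) {
        if (is_level_zero_gpu(d) && d.max_compute_units == top) ids.push_back(d.id);
    }
    return ids;
}

// The six devices listed in the log (ID, backend, type, max compute units).
std::vector<DeviceInfo> devices_from_log() {
    return {
        {0, "level_zero", "gpu", 96},   // Iris Xe Graphics (iGPU)
        {1, "level_zero", "gpu", 512},  // Arc A770M
        {2, "opencl", "gpu", 512},      // Arc A770M via OpenCL
        {3, "opencl", "gpu", 96},       // Iris Xe via OpenCL
        {4, "opencl", "cpu", 20},       // Core i7-12700H
        {5, "opencl", "acc", 20},       // FPGA emulation device
    };
}
```

Calling `pick_top_level_zero_gpus(devices_from_log())` yields a single ID, 1, which agrees with the `[1]` in the log line.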
@aahouzi To set the GPU, please refer to the script:
|
@NeoZhangJianyu I'm using the latest code, and the issue is still there ;) |
@aahouzi Could you provide the whole log including cmd? |
@NeoZhangJianyu here is the whole log including cmd:
|
Could you please give some details about your config? It seems I have a similar system, but in my case it is not working... I am using openSUSE Tumbleweed. Thanks in advance, Sergio
|
This issue was closed because it has been inactive for 14 days since being marked as stale. |
Current Behavior:
Built llama.cpp with sycl backend for Windows by following instructions in README-sycl.md.
The build completes successfully, the conversion and everything works fine.
When running main, the code errors out due to a GGML_ASSERT failure. I tried to debug it, and it seems that when the function get_device_index_by_id is called, the returned id is -1; the error then happens when the assert statement GGML_ASSERT(res>=0); finds res=-1. My device number is 5, as you can see in the logs.
@airMeng @NeoZhangJianyu cc here, tried all tricks for known issues in the README-sycl, but this didn't lead anywhere..
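The failure mode described here can be pictured with a small sketch. It assumes the backend keeps a list of selected device IDs and looks the requested ID up in that list, returning -1 when it is missing. The function `find_device_index`, the helper `lookup_would_assert` and the list contents used with them are hypothetical; this is not the real `get_device_index_by_id`.

```cpp
#include <vector>

// Hypothetical lookup: position of `id` in `selected`, or -1 if absent.
int find_device_index(const std::vector<int>& selected, int id) {
    for (int i = 0; i < static_cast<int>(selected.size()); ++i) {
        if (selected[i] == id) return i;
    }
    return -1;
}

// Mirrors the reported check: a negative result means the requested
// device was never added to the selected list, so an assert such as
// GGML_ASSERT(res >= 0) would fire.
bool lookup_would_assert(const std::vector<int>& selected, int id) {
    return find_device_index(selected, id) < 0;
}
```

Under this assumption, a selected list that does not contain the requested device (for example, a list holding only ID 3 when device 5 is requested) produces -1 and trips the assertion, which is consistent with the behaviour reported above.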
Steps To Reproduce:
Same steps in README-sycl.md
Environment: