[SYCL] GGML_ASSERT issue when running llama.cpp with SYCL on A770 #5513

aahouzi · 2024-02-15T17:52:38Z

Current Behavior:

I built llama.cpp with the SYCL backend for Windows by following the instructions in README-sycl.md.
The build completes successfully, and the model conversion works fine.
When running main, the program aborts on a GGML_ASSERT. While debugging, I found that get_device_index_by_id returns -1 when it is called, so the statement GGML_ASSERT(res>=0); fails with res=-1. My device number is 5, as you can see in the logs.
@airMeng @NeoZhangJianyu cc'ing you here. I tried all the tricks listed under known issues in README-sycl.md, but none of them helped.
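
The failure mode described here can be modelled in a few lines. This is only an illustrative Python sketch, not the code from ggml-sycl.cpp: the `registered_ids` table, the `select_device` helper and the function body are assumptions made up for the example. Only the name `get_device_index_by_id`, the `-1` return value and the `res >= 0` check come from the report.

```python
# Hypothetical model of an id -> index lookup that signals "not found" with -1.
# The real implementation in ggml-sycl.cpp is not shown in this report.

def get_device_index_by_id(registered_ids, device_id):
    """Return the position of device_id in registered_ids, or -1 if absent."""
    for index, known_id in enumerate(registered_ids):
        if known_id == device_id:
            return index
    return -1


def select_device(registered_ids, device_id):
    """Look up a device and abort if the lookup failed."""
    res = get_device_index_by_id(registered_ids, device_id)
    # Mirrors the reported GGML_ASSERT(res>=0).
    assert res >= 0, f"device id {device_id} is not in {registered_ids}"
    return res


if __name__ == "__main__":
    # Assumed scenario: the table holds fewer ids than the device list printed
    # at startup, so a requested id can be missing from it.
    registered = [0]
    print(get_device_index_by_id(registered, 0))
    print(get_device_index_by_id(registered, 5))
```

If the table that the lookup searches is populated differently from the list printed at startup, a device can be listed and selected in the log and still fail the lookup. Whether that is what happens here is not established by the report.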

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=5 && build\bin\main.exe -m %LLAMA2%\ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap -ngl 33 --ignore-eos
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 1708016072
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 5 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\intel\.cache\huggingface\hub\models--meta-llama--Llama-2-7b-chat-hf\snapshots\c1b0db933684edbfe29a06fa47eb19cc48025e93\ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
GGML_ASSERT: C:/Users/intel/Desktop/aahouzi/llama.cpp/ggml-sycl.cpp:9364: res>=0

Steps To Reproduce:

Same steps as in README-sycl.md.

Environment:

OS: Win11
HW: Intel ARC A770 dGPU

airMeng · 2024-02-17T14:59:33Z

Have you tried GGML_SYCL_DEVICE=3?

This is weird, because the dGPU usually appears as the first device, but in your case it is device 3 and 5. Can you run the following and paste the output here?

source /PATH/TO/ONEAPI/setvars.sh
sycl-ls

My guess is that you selected the OpenCL device in oneAPI, but we have only fully verified Level Zero (which should usually be the first, default device).

aahouzi · 2024-02-19T09:03:36Z

Yes, I tried GGML_SYCL_DEVICE=3, but I get the same issue.

I got this:

C:\Users\intel>sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [31.0.101.5186]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28044]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28044]

And when I run the ls-sycl-device executable, I get this:

C:\Users\intel\Desktop\aahouzi\llama.cpp>build\bin\ls-sycl-device.exe
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392

I don't think I'm selecting the OpenCL device in oneAPI; the logs clearly show that this is level_zero. In your PR #5208 you got the build working on Windows, but did you try running it on multiple Windows machines to make sure it functions properly?
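
To make the device list easier to read, the sycl-ls output above can be grouped by backend. The sketch below is a small Python parser written for this thread; the helper names `parse_sycl_ls` and `backends_exposing` are made up for illustration. It shows that each GPU is exposed twice, once through OpenCL and once through Level Zero, which accounts for the six devices.

```python
import re

# sycl-ls output copied from the comment above.
SYCL_LS_OUTPUT = """\
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) UHD Graphics 770 OpenCL 3.0 NEO  [31.0.101.5186]
[opencl:gpu:3] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [31.0.101.5186]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) UHD Graphics 770 1.3 [1.3.28044]
[ext_oneapi_level_zero:gpu:1] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.28044]
"""

# Matches the "[backend:kind:index] description" prefix of each line.
LINE_RE = re.compile(
    r"^\[(?P<backend>[^:\]]+):(?P<kind>[^:\]]+):(?P<index>\d+)\]\s+(?P<rest>.*)$"
)


def parse_sycl_ls(text):
    """Return a list of (backend, kind, index, description) tuples."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(
                (match["backend"], match["kind"], int(match["index"]), match["rest"])
            )
    return entries


def backends_exposing(entries, device_name):
    """List (backend, index) pairs whose description mentions device_name."""
    return [(backend, index) for backend, _kind, index, rest in entries
            if device_name in rest]


if __name__ == "__main__":
    entries = parse_sycl_ls(SYCL_LS_OUTPUT)
    for name in ("Arc(TM) A770", "UHD Graphics 770"):
        print(name, "->", backends_exposing(entries, name))
```

The version numbers at the end of each sycl-ls line (OpenCL 3.0, Level-Zero 1.3) appear to match the "compute capability" values printed by ls-sycl-device. If that reading is right, device 5 (A770, 1.3) is the Level Zero entry and device 3 (A770, 3.0) is the OpenCL one. This is an inference from the two listings, not something either tool states.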

NeoZhangJianyu · 2024-02-19T09:18:30Z

Device selection has an issue when an iGPU and an Arc GPU are in the same PC.
It has been fixed in the multiple-GPU support feature, but that feature is still in progress and not merged yet.

Please try this:

export ONEAPI_DEVICE_SELECTOR="level_zero:gpu"
export GGML_SYCL_DEVICE=0  # or 1

Thank you!
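
The two `export` lines are POSIX shell syntax, while the reporter is on Windows cmd. One shell-independent way to apply the same settings is to pass them through the process environment. The following is a minimal Python sketch under that assumption; `sycl_env` and `run_main` are hypothetical helper names, and the executable path and arguments are placeholders to be filled in by the caller.

```python
import os
import subprocess


def sycl_env(device_selector, sycl_device, base_env=None):
    """Return a copy of the environment with both variables set exactly."""
    env = dict(os.environ if base_env is None else base_env)
    env["ONEAPI_DEVICE_SELECTOR"] = device_selector  # e.g. "level_zero:gpu"
    env["GGML_SYCL_DEVICE"] = str(sycl_device)       # index among visible devices
    return env


def run_main(exe_path, args, device_selector, sycl_device):
    """Launch the llama.cpp binary with the environment prepared above."""
    return subprocess.run(
        [exe_path, *args],
        env=sycl_env(device_selector, sycl_device),
        check=False,
    )


if __name__ == "__main__":
    # Only builds and prints the two settings; nothing is launched here.
    env = sycl_env("level_zero:gpu", 1, base_env={})
    print(env)
```

Setting the values this way avoids depending on shell quoting rules. In cmd, `set NAME=value && ...` keeps every character between `=` and `&&` as part of the value, including quotes and a trailing space; whether that matters depends on how the program parses the variable.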

aahouzi · 2024-02-19T10:21:33Z

After the change, only level_zero devices are shown, but the issue is still there:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set ONEAPI_DEVICE_SELECTOR="level_zero:gpu" && set GGML_SYCL_DEVICE=1 && build\bin\main.exe -m %LLAMA2%\ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --no-mmap -ngl 33 --ignore-eos
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 1708337936
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 2 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 1 (Intel(R) Arc(TM) A770 Graphics) as main device
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from C:\Users\intel\.cache\huggingface\hub\models--meta-llama--Llama-2-7b-chat-hf\snapshots\c1b0db933684edbfe29a06fa47eb19cc48025e93\ggml-model-q4_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  21:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
llm_load_tensors: ggml ctx size =    0.22 MiB
GGML_ASSERT: C:/Users/intel/Desktop/aahouzi/llama.cpp/ggml-sycl.cpp:9364: res>=0

NeoZhangJianyu · 2024-02-20T10:10:18Z

How about device_id=0?
I think it has been supported.
Maybe some new code broke it.
Could you try an older release, like jordankanter@8c4aa67?

aahouzi · 2024-02-21T14:58:55Z

How about device_id=0?
I think it has been supported.
Maybe some new code broke it.

I tried device 0. This time there is no GGML_ASSERT failure, but the execution just hangs with no output. I tried adding the --no-mmap option, but the issue is still there:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0 --no-mmap
Log start
main: build = 2153 (0d417712)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_F16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =    10.01 MiB

My iGPU has 7.8GB of memory, and I think loading llama2-7B-Q4_0 requires about 3.9GB, so I should be fine? While monitoring activity, iGPU usage was near 1% and memory usage went up to 4.2GB. I also tried changing the number of layers offloaded to the iGPU, but this didn't change anything.
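
As a quick sanity check on the memory question, the numbers printed in the log above can be compared directly. The sketch below only does arithmetic on values copied from that log. Whether the "global mem size" reported by the SYCL runtime is a hard limit on what an iGPU with shared memory can allocate is an assumption, not something the log confirms.

```python
MIB = 1024 * 1024

# Values copied from the log above (device 0, Intel UHD Graphics 770).
global_mem_bytes = 3093630976  # "global mem size"
model_buffer_mib = 3577.56     # llm_load_tensors: buffer size
kv_buffer_mib = 256.00         # llama_kv_cache_init: KV buffer size


def mib(num_bytes):
    """Convert a byte count to MiB."""
    return num_bytes / MIB


reported_mib = mib(global_mem_bytes)
needed_mib = model_buffer_mib + kv_buffer_mib

print(f"reported by SYCL : {reported_mib:8.2f} MiB")
print(f"model + KV cache : {needed_mib:8.2f} MiB")
print(f"fits in reported : {needed_mib <= reported_mib}")
```

The reported size is smaller than the 7.8GB mentioned above and smaller than the model plus KV cache, so full offload may exceed what the runtime reports for this device, even if the system-level figure looks sufficient.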

Could you try an older release, like jordankanter@8c4aa67?

With jordankanter@8c4aa67, the GGML_ASSERT issue is still there for the A770. The iGPU works this time, but it doesn't generate any text: it just shows the prompt and stops there (iGPU usage was 96%):

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    77.55 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1:
llama_print_timings:        load time =    8526.53 ms
llama_print_timings:      sample time =      21.66 ms /   400 runs   (    0.05 ms per token, 18462.96 tokens per second)
llama_print_timings: prompt eval time =    2613.48 ms /    19 tokens (  137.55 ms per token,     7.27 tokens per second)
llama_print_timings:        eval time =  124352.76 ms /   399 runs   (  311.66 ms per token,     3.21 tokens per second)
llama_print_timings:       total time =  127047.39 ms /   418 tokens
Log end

When offloading only a few layers to the iGPU (-ngl 15), generation starts but the output is gibberish:

C:\Users\intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m %LLAMA2% -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 15 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 6 SYCL devices:
  Device 0: Intel(R) UHD Graphics 770,  compute capability 1.3,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 32,   max work group size 67108864,   max sub group size 64,  global mem size 3839483904
  Device 2: 13th Gen Intel(R) Core(TM) i9-13900K,       compute capability 3.0,
        max compute_units 32,   max work group size 8192,       max sub group size 64,  global mem size 3839483904
  Device 3: Intel(R) Arc(TM) A770 Graphics,     compute capability 3.0,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
  Device 4: Intel(R) UHD Graphics 770,  compute capability 3.0,
        max compute_units 32,   max work group size 512,        max sub group size 32,  global mem size 3093630976
  Device 5: Intel(R) Arc(TM) A770 Graphics,     compute capability 1.3,
        max compute_units 512,  max work group size 1024,       max sub group size 32,  global mem size 3819835392
Using device 0 (Intel(R) UHD Graphics 770) as main device
...
llm_load_tensors: offloading 15 repeating layers to GPU
llm_load_tensors: offloaded 15/33 layers to GPU
llm_load_tensors:            buffer size =  1628.91 MiB
llm_load_tensors:        CPU buffer size =  3647.87 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   120.00 MiB
llama_kv_cache_init:        CPU KV buffer size =   136.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    74.80 MiB
llama_new_context_with_model:        CPU compute buffer size =    77.55 MiB
llama_new_context_with_model: graph splits (measure): 5

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1:"
$▅▅!



 #
␦#
␦▅▅▅#""$
$
␦
$

!
!"▅
# [end of text]

llama_print_timings:        load time =    6108.94 ms
llama_print_timings:      sample time =      16.16 ms /   168 runs   (    0.10 ms per token, 10393.47 tokens per second)
llama_print_timings: prompt eval time =    1569.78 ms /    19 tokens (   82.62 ms per token,    12.10 tokens per second)
llama_print_timings:        eval time =   35977.66 ms /   167 runs   (  215.44 ms per token,     4.64 tokens per second)
llama_print_timings:       total time =   37609.83 ms /   186 tokens
Log end
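
The KV cache figures in the two iGPU logs can be reproduced from the model parameters printed earlier (n_layer = 32, n_ctx = 512, n_embd_k_gqa = n_embd_v_gqa = 4096, f16 = 2 bytes per element). The Python sketch below is a worked calculation. The proportional split between GPU and CPU is inferred from the logged numbers (120 MiB and 136 MiB with 15 of 32 layers offloaded), not taken from the llama.cpp source, and the function names are made up for this example.

```python
MIB = 1024 * 1024


def kv_cache_mib(n_layer, n_ctx, n_embd_k, n_embd_v, bytes_per_element=2):
    """Return (K size, V size) in MiB for an f16 KV cache by default."""
    k_bytes = n_layer * n_ctx * n_embd_k * bytes_per_element
    v_bytes = n_layer * n_ctx * n_embd_v * bytes_per_element
    return k_bytes / MIB, v_bytes / MIB


def split_kv_cache(total_mib, n_layer, n_gpu_layers):
    """Split the cache in proportion to the offloaded repeating layers."""
    # -ngl may exceed n_layer (e.g. 33 for 32 layers), so clamp it.
    on_gpu = min(n_gpu_layers, n_layer)
    gpu_mib = total_mib / n_layer * on_gpu
    return gpu_mib, total_mib - gpu_mib


if __name__ == "__main__":
    k_mib, v_mib = kv_cache_mib(32, 512, 4096, 4096)
    print("K, V (MiB):", k_mib, v_mib)
    print("-ngl 15 split (GPU, CPU):", split_kv_cache(k_mib + v_mib, 32, 15))
    print("-ngl 33 split (GPU, CPU):", split_kv_cache(k_mib + v_mib, 32, 33))
```

The results agree with the "KV self size" lines in both runs, so the buffer sizes themselves look consistent in the full-offload and the partial-offload case.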

aahouzi · 2024-02-21T16:16:10Z

@NeoZhangJianyu On a different note: when I try the latest llama.cpp on an MTL iGPU on Windows, the code hangs with no output. When I switch to jordankanter@8c4aa67, I can run llama2-7B-Q4_0.gguf fully on the iGPU (only 5.8GB) with really good text output quality:

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m ..\llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2032 (8c4aa67)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
GGML_SYCL_DEBUG=0
ggml_init_sycl: GGML_SYCL_FP16:   no
ggml_init_sycl: SYCL_USE_XMX: yes
found 4 SYCL devices:
  Device 0: Intel(R) Arc(TM) Graphics,  compute capability 1.3,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 1132294144
  Device 1: Intel(R) FPGA Emulation Device,     compute capability 1.2,
        max compute_units 22,   max work group size 67108864,   max sub group size 64,  global mem size 3961389056
  Device 2: Intel(R) Core(TM) Ultra 7 165H,     compute capability 3.0,
        max compute_units 22,   max work group size 8192,       max sub group size 64,  global mem size 3961389056
  Device 3: Intel(R) Arc(TM) Graphics,  compute capability 3.0,
        max compute_units 128,  max work group size 1024,       max sub group size 32,  global mem size 1132294144
Using device 0 (Intel(R) Arc(TM) Graphics) as main device
...
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:            buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
.................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:            KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU input buffer size   =     9.01 MiB
llama_new_context_with_model:            compute buffer size =    77.55 MiB
llama_new_context_with_model:        CPU compute buffer size =     8.80 MiB
llama_new_context_with_model: graph splits (measure): 3

system_info: n_threads = 11 / 22 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 512, n_batch = 512, n_predict = 400, n_keep = 0


 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher in search engine results. This can be done by improving its SEO, which stands for “search engine optimization.” It’s the process of increasing a website’s visibility in online searches. You’ll want to do this because if people can’t find you when they type certain keywords into Google or Bing, then there’s no point in having them visit your site!
There are many ways you could go about optimizing your site for better SE
llama_print_timings:        load time =   17439.87 ms
llama_print_timings:      sample time =      68.66 ms /   400 runs   (    0.17 ms per token,  5826.06 tokens per second)
llama_print_timings: prompt eval time =    1441.73 ms /    19 tokens (   75.88 ms per token,    13.18 tokens per second)
llama_print_timings:        eval time =   52801.55 ms /   399 runs   (  132.33 ms per token,     7.56 tokens per second)
llama_print_timings:       total time =   54530.25 ms /   418 tokens
Log end

Based on this, I think that the theory that some code broke the support might be actually true..

airMeng · 2024-02-22T00:22:00Z

#5624

@aahouzi I think the "hanging" issue has been solved by the above PR. Did you use this commit?

aahouzi · 2024-02-22T13:02:55Z

@airMeng using the latest branch with the #5624 changes eliminates the hang. However, I'm now in the same situation as with jordankanter@8c4aa67: when offloading all layers the model generates nothing, and when offloading only a few layers the output is gibberish.

For ARC A770, the GGML_ASSERT issue is still there though xd..

mudler · 2024-02-23T18:07:12Z

Cannot replicate this. I'm testing with 201294a on my Intel Arc a770 and everything works as expected

aahouzi · 2024-02-23T18:12:56Z

> The device selection is with issue when there are igpu & Arc GPU in same PC.
> It has been fixed in multiple GPU support feature. But this feature is ongoing and not merged.

I think my issue is probably related to what @NeoZhangJianyu mentioned above. @mudler are u in the same setting ?

mudler · 2024-02-23T18:21:04Z

I have an AMD CPU.

aahouzi · 2024-02-26T12:31:32Z

@NeoZhangJianyu @airMeng I tried on 2 other systems, each one having an A770/A770M with Intel Iris Xe graphics igpu on Windows, and I successfully reproduced this issue. This needs deeper investigation to know what's going on here. All of our systems have an igpu, and this will become a blocker sooner or later..

Also, got access to an AMD Ryzen9 CPU with A770 card, and I can confirm it's running out of the box without this issue.

NeoZhangJianyu · 2024-03-06T00:49:42Z

@aahouzi
Could you try with the latest code?
Multi-GPU support has been merged.

aahouzi · 2024-03-08T09:35:53Z

@NeoZhangJianyu I saw u created a revert PR, is #5901 merged or there is no change yet ?

airMeng · 2024-03-08T10:26:20Z

It was merged by mistake. The author will re-implement it with a different method but to the same effect. You can try #5901 locally.

aahouzi · 2024-03-08T11:34:58Z

@airMeng I think I'll wait until it's re-implemented

sgwhat · 2024-03-12T08:31:25Z

Hi all @airMeng @NeoZhangJianyu, I'm also running into similar trouble with this issue.

GPU device: Arc 770
System: Ubuntu

sycl-ls
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2  [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i9-13900K OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix]
[opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics OpenCL 3.0 NEO  [23.35.27191.42]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.27191]

I used ollama to run llama.cpp sycl inference with this PR https://github.com/ollama/ollama/pull/2458/files, but I got error below:

ollama run example "What is your favourite condiment?"
 !##"##!       "!▅


        ▅
 "! $   #"# ##  ▅"#!

It's weird that this PR works on Arch Linux but not on Ubuntu. Could you please give me some debugging advice?

airMeng · 2024-03-12T08:55:53Z

@aahouzi @sgwhat Can you try #6006?

sgwhat · 2024-03-12T09:52:09Z

Hi @airMeng, it still doesn't work... I think my bug is really weird (it works well on Arch Linux but not on Ubuntu) ☹

time=2024-03-12T20:06:03.311+08:00 level=INFO source=routes.go:1021 msg="Listening on 127.0.0.1:11434 (version 0.0.0)"
time=2024-03-12T20:06:03.311+08:00 level=INFO source=payload_common.go:107 msg="Extracting dynamic libraries..."
time=2024-03-12T20:06:03.346+08:00 level=INFO source=payload_common.go:146 msg="Dynamic LLM libraries [cpu_avx cpu_avx2 oneapi cpu]"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:105 msg="Detecting GPU type"
time=2024-03-12T20:06:03.346+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libnvidia-ml.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library librocm_smi64.so"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: []"
time=2024-03-12T20:06:03.347+08:00 level=INFO source=gpu.go:285 msg="Searching for GPU management library libze_intel_gpu.so"
time=2024-03-12T20:06:03.350+08:00 level=INFO source=gpu.go:331 msg="Discovered GPU libraries: [/usr/lib/x86_64-linux-gnu/libze_intel_gpu.so.1.3.27191.42]"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=gpu.go:130 msg="Intel GPU detected"
time=2024-03-12T20:06:03.358+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
[GIN] 2024/03/12 - 20:06:08 | 200 |      18.859µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/12 - 20:06:10 | 200 |      30.154µs |       127.0.0.1 | HEAD     "/api/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09"
[GIN] 2024/03/12 - 20:06:13 | 200 |  3.154779213s |       127.0.0.1 | POST     "/api/create"
[GIN] 2024/03/12 - 20:06:24 | 200 |      10.269µs |       127.0.0.1 | HEAD     "/"
[GIN] 2024/03/12 - 20:06:24 | 200 |     120.482µs |       127.0.0.1 | POST     "/api/show"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
time=2024-03-12T20:06:24.727+08:00 level=INFO source=cpu_common.go:11 msg="CPU has AVX2"
loading library /tmp/ollama2419540237/oneapi/libext_server.so
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:90 msg="Loading Dynamic llm server: /tmp/ollama2419540237/oneapi/libext_server.so"
time=2024-03-12T20:06:24.816+08:00 level=INFO source=dyn_ext_server.go:150 msg="Initializing llama server"
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 4 SYCL devices:
|  |                  |                                             |compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       1.3|        512|    1024|     32|    16225243136|
| 1|    [opencl:gpu:0]|               Intel(R) Arc(TM) A770 Graphics|       3.0|        512|    1024|     32|    16225243136|
| 2|    [opencl:cpu:0]|         13th Gen Intel(R) Core(TM) i9-13900K|       3.0|         32|    8192|     64|    67143290880|
| 3|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         32|67108864|     64|    67143290880|
llama_model_loader: loaded meta data with 21 key-value pairs and 291 tensors from /home/arda/.ollama/models/blobs/sha256:8bd3a3006c4f7aace054efdd717e4b86a05b521a83dc460a9640d7b2f179bf09 (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  19:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  20:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW) 
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [0] with top Max compute units:512
get_memory_info: [warning] ext_intel_free_memory is not supported (export/set ZES_ENABLE_SYSMAN=1 to support), use total memory as free memory
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL0 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL0 KV buffer size =  1024.00 MiB
llama_new_context_with_model: KV self size  = 1024.00 MiB, K (f16):  512.00 MiB, V (f16):  512.00 MiB
llama_new_context_with_model:  SYCL_Host input buffer size   =    13.02 MiB
llama_new_context_with_model:      SYCL0 compute buffer size =   164.00 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     8.00 MiB
llama_new_context_with_model: graph splits (measure): 2

NeoZhangJianyu · 2024-03-12T15:41:13Z

@sgwhat
The log above is incomplete.
There is a delay while the code loads.
Please wait 1-2 minutes.

NeoZhangJianyu · 2024-03-12T15:42:16Z

The garbled output is caused by errors in the OPs.
You could rebase your project onto the latest GGML lib.

sgwhat · 2024-03-12T16:39:02Z

Sorry to bother you, but may I ask what these OPs are? And is using the latest GGML lib the same as building the latest llama.cpp?

NeoZhangJianyu · 2024-03-13T00:53:38Z

@sgwhat

It's hard to know which OPs lead to the wrong result without a deeper check.
We only support llama.cpp issues.
It looks like your issue happens in the project https://github.com/ollama/ollama.
We are not familiar with your project.

Suggestion:

1. Report a new issue for your case.
2. Please run the llama.cpp example according to the guide: https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md.
   It can be used to verify your hardware and software environment.
3. If step 2 passes:
   Please rebase your project onto the llama.cpp/ggml version verified in step 2.
   I guess your project should pass too, as long as there are no further changes to llama.cpp/ggml.
4. If step 2 fails:
   Please provide the whole log file and we will check the issue.

Thank you!

sgwhat · 2024-03-13T08:24:54Z

I failed at step 2, and I opened a new issue for it: #6036.

NeoZhangJianyu · 2024-03-15T01:14:52Z

@aahouzi
Is your issue still present with the latest code?

aahouzi · 2024-03-15T08:36:12Z

@NeoZhangJianyu I'm tracking your PR, you still didn't merge #6073, so I don't think it will work.

I see that it's been merged, I will do my tests and keep you updated ;-)

aahouzi · 2024-03-17T14:01:21Z

@NeoZhangJianyu @airMeng I ran my tests: it works, but not as it should. I have an iGPU + A770M, and whatever ID I pick for GGML_SYCL_DEVICE, it always uses the A770M, so I can't use the iGPU even if I want to.
For example, here I want to use my iGPU, which has id=0. When I pick it, the run goes to the A770M instead ^^'

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2447 (c47cf414)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       1.3|         96|     512|     32|     3097038848|
| 1|[level_zero:gpu:1]|              Intel(R) Arc(TM) A770M Graphics|       1.3|        512|    1024|     32|     3819835392|
| 2|    [opencl:gpu:0]|              Intel(R) Arc(TM) A770M Graphics|       3.0|        512|    1024|     32|     3819835392|
| 3|    [opencl:gpu:1]|                 Intel(R) Iris(R) Xe Graphics|       3.0|         96|     512|     32|     3097038848|
| 4|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i7-12700H|       3.0|         20|    8192|     64|     3846729728|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         20|67108864|     64|     3846729728|
...
ggml_backend_sycl_set_mul_device_mode: true
+ detect 1 SYCL GPUs: [1] with top Max compute units:512 (A770M and not iGPU)
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL1 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =    70.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph splits: 2

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher on search engines like Google when someone searches for information related to what you offer. This includes creating content that is optimized for SEO, making sure that each page has a meta description and keyword tags (if applicable), and ensuring that all images have alt text descriptions attached to them. You should also link out from other websites where appropriate—this helps build authority with search engines while simultaneously giving users relevant information about topics they might be interested in reading more about later on down
llama_print_timings:        load time =    8562.90 ms
llama_print_timings:      sample time =      47.18 ms /   400 runs   (    0.12 ms per token,  8478.53 tokens per second)
llama_print_timings: prompt eval time =     242.04 ms /    19 tokens (   12.74 ms per token,    78.50 tokens per second)
llama_print_timings:        eval time =   20663.94 ms /   399 runs   (   51.79 ms per token,    19.31 tokens per second)
llama_print_timings:       total time =   21107.17 ms /   418 tokens
Log end
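
To illustrate what the `detect 1 SYCL GPUs: [1] with top Max compute units:512` line in the log suggests, here is a minimal Python sketch. It is not llama.cpp's actual code. It assumes the automatic mode only considers Level Zero GPUs and keeps those with the highest compute-unit count; that filter is inferred from the log (device 2 also reports 512 units but is not listed). `SyclDevice`, `gpus_with_top_compute_units`, and `select_devices` are hypothetical names, and the rows are copied from the device table in the log.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyclDevice:
    device_id: int          # "ID" column of the table printed at startup
    device_type: str        # e.g. "[level_zero:gpu:0]"
    name: str
    max_compute_units: int


# Rows copied from the "found 6 SYCL devices" table in the log.
DEVICES = [
    SyclDevice(0, "[level_zero:gpu:0]", "Intel(R) Iris(R) Xe Graphics", 96),
    SyclDevice(1, "[level_zero:gpu:1]", "Intel(R) Arc(TM) A770M Graphics", 512),
    SyclDevice(2, "[opencl:gpu:0]", "Intel(R) Arc(TM) A770M Graphics", 512),
    SyclDevice(3, "[opencl:gpu:1]", "Intel(R) Iris(R) Xe Graphics", 96),
    SyclDevice(4, "[opencl:cpu:0]", "12th Gen Intel(R) Core(TM) i7-12700H", 20),
    SyclDevice(5, "[opencl:acc:0]", "Intel(R) FPGA Emulation Device", 20),
]


def gpus_with_top_compute_units(devices):
    """Return IDs of the Level Zero GPUs that share the highest compute-unit count."""
    # Assumption: only Level Zero GPUs are candidates for the automatic pick.
    level_zero_gpus = [d for d in devices if d.device_type.startswith("[level_zero:gpu:")]
    if not level_zero_gpus:
        return []
    top = max(d.max_compute_units for d in level_zero_gpus)
    return [d.device_id for d in level_zero_gpus if d.max_compute_units == top]


def select_devices(devices, requested_id=None):
    """Hypothetical policy: honor an explicit request, else use the automatic pick."""
    if requested_id is not None:
        if requested_id not in {d.device_id for d in devices}:
            raise ValueError(f"unknown SYCL device id: {requested_id}")
        return [requested_id]
    return gpus_with_top_compute_units(devices)


print(gpus_with_top_compute_units(DEVICES))
print(select_devices(DEVICES, requested_id=0))
```

Under this assumed policy the automatic pick is device 1 regardless of the environment, which matches the log; a selection step that checked the requested ID first would return device 0 for `GGML_SYCL_DEVICE=0`.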

NeoZhangJianyu · 2024-03-18T05:39:11Z

@aahouzi
Good, the result above shows that it works well.
The bug in setting the GPU was fixed last week. Please use the latest code.

To set the GPU, please refer to the script:

./examples/sycl/run-llama2.sh 0
./examples/sycl/run-llama2.sh 1
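
For the Windows runs earlier in this thread, which set `GGML_SYCL_DEVICE` by hand, a rough Python sketch of the same idea follows, assuming the script's argument is the device ID for the run. It only builds the command line and environment. The binary path, model file, and flags are copied from the command shown in this thread, `build_launch` is a hypothetical helper, and whether a given build honors `GGML_SYCL_DEVICE` is an assumption (it is the behavior being debugged here).

```python
import os


def build_launch(device_id, binary, model, prompt, base_env=None):
    """Return (argv, env) for one run pinned to a single SYCL device ID."""
    # Copy the environment so the caller's own process is left untouched.
    env = dict(os.environ if base_env is None else base_env)
    # Assumption: the build reads GGML_SYCL_DEVICE, as in the commands in this thread.
    env["GGML_SYCL_DEVICE"] = str(device_id)
    argv = [binary, "-m", model, "-p", prompt,
            "-n", "400", "-e", "-ngl", "33", "-s", "0"]
    return argv, env


argv, env = build_launch(
    device_id=0,
    binary=r"build\bin\main.exe",
    model="llama-2-7b.Q4_0.gguf",
    prompt=r"Building a website can be done in 10 simple steps:\nStep 1:",
)
print(env["GGML_SYCL_DEVICE"], argv[0])
```

Passing both values to `subprocess.run(argv, env=env)` would start the run with that device ID.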

aahouzi · 2024-03-18T08:30:00Z

@NeoZhangJianyu I'm using the latest code, and the issue is still there ;)

NeoZhangJianyu · 2024-03-19T09:03:01Z

@aahouzi Could you provide the whole log including cmd?

aahouzi · 2024-03-19T09:12:46Z

@NeoZhangJianyu here is the whole log including cmd:

C:\Users\Intel\Desktop\aahouzi\llama.cpp>set GGML_SYCL_DEVICE=0 && build\bin\main.exe -m llama-2-7b.Q4_0.gguf -p "Building a website can be done in 10 simple steps:\nStep 1:" -n 400 -e -ngl 33 -s 0
Log start
main: build = 2447 (c47cf414)
main: built with IntelLLVM 2024.0.2 for
main: seed  = 0
ggml_init_sycl: GGML_SYCL_DEBUG: 0
ggml_init_sycl: GGML_SYCL_F16: no
found 6 SYCL devices:
|  |                  |                                             |Compute   |Max compute|Max work|Max sub|               |
|ID|       Device Type|                                         Name|capability|units      |group   |group  |Global mem size|
|--|------------------|---------------------------------------------|----------|-----------|--------|-------|---------------|
| 0|[level_zero:gpu:0]|                 Intel(R) Iris(R) Xe Graphics|       1.3|         96|     512|     32|     3097038848|
| 1|[level_zero:gpu:1]|              Intel(R) Arc(TM) A770M Graphics|       1.3|        512|    1024|     32|     3819835392|
| 2|    [opencl:gpu:0]|              Intel(R) Arc(TM) A770M Graphics|       3.0|        512|    1024|     32|     3819835392|
| 3|    [opencl:gpu:1]|                 Intel(R) Iris(R) Xe Graphics|       3.0|         96|     512|     32|     3097038848|
| 4|    [opencl:cpu:0]|         12th Gen Intel(R) Core(TM) i7-12700H|       3.0|         20|    8192|     64|     3846729728|
| 5|    [opencl:acc:0]|               Intel(R) FPGA Emulation Device|       1.2|         20|67108864|     64|     3846729728|
llama_model_loader: loaded meta data with 19 key-value pairs and 291 tensors from llama-2-7b.Q4_0.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = LLaMA v2
llama_model_loader: - kv   2:                       llama.context_length u32              = 4096
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32              = 11008
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv   7:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32              = 32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  10:                          general.file_type u32              = 2
llama_model_loader: - kv  11:                       tokenizer.ggml.model str              = llama
llama_model_loader: - kv  12:                      tokenizer.ggml.tokens arr[str,32000]   = ["<unk>", "<s>", "</s>", "<0x00>", "<...
llama_model_loader: - kv  13:                      tokenizer.ggml.scores arr[f32,32000]   = [0.000000, 0.000000, 0.000000, 0.0000...
llama_model_loader: - kv  14:                  tokenizer.ggml.token_type arr[i32,32000]   = [2, 3, 3, 6, 6, 6, 6, 6, 6, 6, 6, 6, ...
llama_model_loader: - kv  15:                tokenizer.ggml.bos_token_id u32              = 1
llama_model_loader: - kv  16:                tokenizer.ggml.eos_token_id u32              = 2
llama_model_loader: - kv  17:            tokenizer.ggml.unknown_token_id u32              = 0
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type q4_0:  225 tensors
llama_model_loader: - type q6_K:    1 tensors
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V2
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 4096
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: n_embd_k_gqa     = 4096
llm_load_print_meta: n_embd_v_gqa     = 4096
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 4096
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = Q4_0
llm_load_print_meta: model params     = 6.74 B
llm_load_print_meta: model size       = 3.56 GiB (4.54 BPW)
llm_load_print_meta: general.name     = LLaMA v2
llm_load_print_meta: BOS token        = 1 '<s>'
llm_load_print_meta: EOS token        = 2 '</s>'
llm_load_print_meta: UNK token        = 0 '<unk>'
llm_load_print_meta: LF token         = 13 '<0x0A>'
ggml_backend_sycl_set_mul_device_mode: true
detect 1 SYCL GPUs: [1] with top Max compute units:512
llm_load_tensors: ggml ctx size =    0.22 MiB
llm_load_tensors: offloading 32 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 33/33 layers to GPU
llm_load_tensors:      SYCL1 buffer size =  3577.56 MiB
llm_load_tensors:        CPU buffer size =    70.31 MiB
..................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:      SYCL1 KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:  SYCL_Host  output buffer size =    62.50 MiB
llama_new_context_with_model:      SYCL1 compute buffer size =    70.50 MiB
llama_new_context_with_model:  SYCL_Host compute buffer size =     9.00 MiB
llama_new_context_with_model: graph splits: 2

system_info: n_threads = 10 / 20 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
sampling:
        repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature
generate: n_ctx = 512, n_batch = 2048, n_predict = 400, n_keep = 1


 Building a website can be done in 10 simple steps:
Step 1: Get Domain and Hosting
The first step is to get your domain name and hosting account. Your domain will serve as the address of your site, while hosting will provide you with space on which to build your site and make it available for people to visit. When you purchase hosting, you’ll also have access to other services like a website builder (which we recommend), a WordPress installer, an SSL certificate, etc.
Once you have these, the next step is to create a website using the tools provided by your web host or by purchasing a third-party site builder, such as Squarespace or Wix. You can do this yourself if you’d like, but we don’t recommend it unless you already know how to code. If not, hire someone who does!
Step 3: Create Your Website Content
The next step is to create your website content by writing text, uploading images and videos, or creating multimedia elements such as slideshows and music tracks (if applicable). This process usually takes about a month if done properly. Once you’ve created all the necessary components of your site—including graphics for headers/footers, menus, etc.—you can begin setting up navigation links between pages using HTML code or through an online tool such as WordPress (or both!).
Step 4: Optimize Your Site to Appear Higher in Search Results
The next step is to optimize your site so that it appears higher on search engines like Google when someone searches for information related to what you offer. This includes creating content that is optimized for SEO, making sure that each page has a meta description and keyword tags (if applicable), and ensuring that all images have alt text descriptions attached to them. You should also link out from other websites where appropriate—this helps build authority with search engines while simultaneously giving users relevant information about topics they might be interested in reading more about later on down
llama_print_timings:        load time =    8855.14 ms
llama_print_timings:      sample time =      47.40 ms /   400 runs   (    0.12 ms per token,  8439.71 tokens per second)
llama_print_timings: prompt eval time =     250.88 ms /    19 tokens (   13.20 ms per token,    75.73 tokens per second)
llama_print_timings:        eval time =   20629.34 ms /   399 runs   (   51.70 ms per token,    19.34 tokens per second)
llama_print_timings:       total time =   21079.05 ms /   418 tokens
Log end
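
As a quick sanity check on the numbers above, the per-token and per-second figures can be recomputed from the raw totals in the `llama_print_timings` lines. This is only a minimal sketch: the regular expression and the `throughput` helper are illustrative names made up for this example, not part of llama.cpp.

```python
import re

# Two lines copied from the llama_print_timings output above.
LOG = """\
llama_print_timings: prompt eval time =     250.88 ms /    19 tokens (   13.20 ms per token,    75.73 tokens per second)
llama_print_timings:        eval time =   20629.34 ms /   399 runs   (   51.70 ms per token,    19.34 tokens per second)
"""

# Captures the stage name, the elapsed milliseconds, and the token/run count.
PATTERN = re.compile(
    r"llama_print_timings:\s+(?P<stage>[a-z ]+?) time =\s+(?P<ms>[\d.]+) ms /"
    r"\s+(?P<count>\d+) (?:tokens|runs)"
)


def throughput(log_text):
    """Return {stage: (ms_per_token, tokens_per_second)} from the raw totals."""
    results = {}
    for match in PATTERN.finditer(log_text):
        ms = float(match.group("ms"))
        count = int(match.group("count"))
        results[match.group("stage")] = (ms / count, count / (ms / 1000.0))
    return results


for stage, (ms_per_token, tokens_per_second) in throughput(LOG).items():
    print(f"{stage}: {ms_per_token:.2f} ms/token, {tokens_per_second:.2f} tokens/s")
```

The recomputed values should agree with the figures printed in the log for both stages, which suggests the roughly 19 tokens per second reported for generation follows directly from 399 runs in about 20.6 seconds.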

SergioVargasRamirez · 2024-05-02T13:18:44Z

Could you please give some details about your config? I seem to have a similar system, but in my case it is not working. I am using openSUSE Tumbleweed.

Thanks in advance,

Sergio

Device selection has an issue when an iGPU and an Arc GPU are in the same PC.
It has been fixed in the multiple-GPU support feature, but that feature is still in progress and has not been merged.
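
To illustrate how an iGPU plus an Arc GPU can trip up device selection, here is a minimal sketch. It assumes the backend keeps only the GPUs whose compute-unit count equals the maximum (which is what the `detect 1 SYCL GPUs: [...] with top Max compute units:512` log line suggests) and then looks the requested ID up in that reduced list. The `Device` class, `top_gpus`, and `index_in_selection` are hypothetical names for this example and are not llama.cpp's actual code. The device data comes from the five entries visible in the original report, and the `is_gpu` flags are inferred from the device names.

```python
from dataclasses import dataclass


@dataclass
class Device:
    index: int          # position in the enumerated SYCL device list
    name: str
    is_gpu: bool        # inferred from the device name (assumption)
    compute_units: int


# Entries visible in the device list of the original report.
DEVICES = [
    Device(0, "Intel(R) UHD Graphics 770", True, 32),
    Device(1, "Intel(R) FPGA Emulation Device", False, 32),
    Device(2, "13th Gen Intel(R) Core(TM) i9-13900K", False, 32),
    Device(3, "Intel(R) Arc(TM) A770 Graphics", True, 512),
    Device(4, "Intel(R) UHD Graphics 770", True, 32),
]


def top_gpus(devices):
    """Return indices of GPUs whose compute-unit count equals the maximum."""
    gpus = [d for d in devices if d.is_gpu]
    if not gpus:
        return []
    best = max(d.compute_units for d in gpus)
    return [d.index for d in gpus if d.compute_units == best]


def index_in_selection(selected, device_id):
    """Hypothetical lookup: position of device_id in the selection, or -1."""
    try:
        return selected.index(device_id)
    except ValueError:
        return -1


selected = top_gpus(DEVICES)
print(selected)                         # only the A770 survives the filter
print(index_in_selection(selected, 3))  # found at position 0
print(index_in_selection(selected, 5))  # not in the selection, so -1
```

Under these assumptions, requesting a device ID that is not in the filtered list yields -1, which is the kind of value that an assertion such as `GGML_ASSERT(res>=0)` would reject.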

I think my issue is probably related to what @NeoZhangJianyu mentioned above. @mudler are you in the same setting?

I have an AMD CPU.

github-actions · 2024-06-17T01:07:18Z

This issue was closed because it has been inactive for 14 days since being marked as stale.

aahouzi added the bug-unconfirmed label Feb 15, 2024

mudler mentioned this issue Feb 17, 2024

deps(llama.cpp): update mudler/LocalAI#1714

Merged

airMeng mentioned this issue Mar 6, 2024

[SYCL] fix error when set main gpu to non-zero #5901

Merged

github-actions bot added the stale label Apr 19, 2024

github-actions bot removed the stale label May 3, 2024

github-actions bot added the stale label Jun 2, 2024

github-actions bot closed this as completed Jun 17, 2024
