
Eval bug: Qwen2.5-Omni outputs GGGGG... when fully offloaded to GPU on Jetson AGX Orin #15923

@shakez0901


Name and Version

$ ./build/bin/llama-mtmd-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
version: 6316 (009b709)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu

Operating systems

Linux

GGML backends

CUDA

Hardware

Jetson AGX Orin 64GB

Models

Qwen2.5-Omni-7B

As a sanity check, I also tried the BF16 GGUF from https://huggingface.co/unsloth/Qwen2.5-Omni-7B-GGUF, as well as Q4_K_M quants of the smaller 3B model from https://huggingface.co/unsloth/Qwen2.5-Omni-3B-GGUF and https://huggingface.co/ggml-org/Qwen2.5-Omni-3B-GGUF.

Problem description & steps to reproduce

Build commands:

cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="87"
cmake --build build --config Release -- -j

Problem:
I am receiving gibberish output, specifically a chain of Gs: "GGGGGGGGGG...".
Interestingly, this only happens when the entire model is offloaded to the GPU and audio input is present. Text-only prompts work fine, unlike similar issues such as #15556 or #15034, where text-only generation was already broken.

Smaller values for -ngl did not lead to this behavior.

Furthermore, the bug seems to be specific to the Qwen2.5-Omni model itself: it occurred with all of the above-mentioned variants, while other audio-capable models such as Ultravox and Voxtral-Small ran fine.

I first noticed the issue with requests to llama-server, but with llama-mtmd-cli I could reproduce it in a simpler way:

./llama-mtmd-cli -hf ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0 -ngl 99 -p "What is being said here?" --audio /home/nvidia/samples/who.mp3

where the audio is just a TTS-generated clip saying "Who are you?". It should happen regardless of the exact audio and prompt.
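
For comparison, the same invocation with only partial offload did not produce the gibberish (the exact -ngl value here is illustrative; any value below full offload behaved normally for me):

./llama-mtmd-cli -hf ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0 -ngl 20 -p "What is being said here?" --audio /home/nvidia/samples/who.mp3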

First Bad Commit

eb39499 is the first bad commit
commit eb39499
Author: Shawn yang <137684499+Yangxiaoz@users.noreply.github.com>
Date: Sat May 31 14:48:04 2025 +0800

CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)

* 1.  add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted code indentation

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fixed incorrect setting of variable types

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the judgment logic

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add a defensive security assert

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Adjusted the support judgment logic.

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* revoke the suggest commit changes due to it's not applicable in jetson_device

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Add parentheses to enforce operator precedence​

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Update ggml/src/ggml-cuda/ggml-cuda.cu

Fix ci bug: add a spaces

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>

ggml/src/ggml-cuda/common.cuh | 1 +
ggml/src/ggml-cuda/ggml-cuda.cu | 20 ++++++++++++++------
2 files changed, 15 insertions(+), 6 deletions(-)
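
For context, a minimal sketch of the kind of change the commit summary above describes; the struct and function names are illustrative rather than the exact llama.cpp code, but prop.integrated is the standard CUDA device property that is non-zero on unified-memory devices such as the Jetson AGX Orin:

#include <cuda_runtime.h>

// Sketch only: detect whether the device is an integrated GPU (iGPU) and let
// that flag influence which buffer types the backend claims to support.
struct cuda_device_info_sketch {
    bool integrated; // true on unified-memory devices like the Jetson AGX Orin
};

static cuda_device_info_sketch query_device(int device) {
    cuda_device_info_sketch info{};
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) == cudaSuccess) {
        info.integrated = prop.integrated != 0; // CUDA reports iGPU vs dGPU here
    }
    return info;
}

// On an integrated device, host (CPU) buffers share physical memory with the
// GPU, so the backend may additionally report host buffer types as supported.
// A change of this kind only affects iGPU systems, which would explain why the
// regression shows up on the Orin but not on discrete GPUs.
static bool supports_buft_sketch(const cuda_device_info_sketch & info, bool buft_is_host) {
    return !buft_is_host || info.integrated;
}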

Relevant log output

with 009b709d:

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_mmproj-Qwen2.5-Omni-7B-Q8_0.gguf
build: 6316 (009b709d) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Orin) - 55943 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 7.6B
llama_model_loader: - kv   3:                            general.license str              = other
llama_model_loader: - kv   4:                       general.license.name str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-O...
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["multimodal", "any-to-any"]
llama_model_loader: - kv   7:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   8:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv   9:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  10:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv  11:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  12:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  13:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% set audio_count = namespace(value=...
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 7.54 GiB (8.50 BPW) 
load: printing all EOG tokens:
load:   - 151643 ('<|endoftext|>')
load:   - 151645 ('<|im_end|>')
load:   - 151662 ('<|fim_pad|>')
load:   - 151663 ('<|repo_name|>')
load:   - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  7165.44 MiB
load_tensors:   CPU_Mapped model buffer size =   552.23 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: kv_unified    = false
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache:      CUDA0 KV buffer size =   224.00 MiB
llama_kv_cache: size =  224.00 MiB (  4096 cells,  28 layers,  1/1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   304.00 MiB
llama_context:  CUDA_Host compute buffer size =    17.02 MiB
llama_context: graph nodes  = 1070
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    1008
clip_model_loader: n_kv:         32

clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          qwen2.5o
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               1280
load_hparams: n_layer:            32
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     3584

--- vision hparams ---
load_hparams: image_size:         1024
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       8

load_hparams: model size:         1476.70 MiB
load_hparams: metadata size:      0.35 MiB
alloc_compute_meta:      CUDA0 compute buffer size =     3.60 MiB
alloc_compute_meta:        CPU compute buffer size =     0.16 MiB
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          qwen2.5o
load_hparams: n_embd:             1280
load_hparams: n_head:             20
load_hparams: n_ff:               5120
load_hparams: n_layer:            32
load_hparams: ffn_op:             gelu_erf
load_hparams: projection_dim:     3584

--- audio hparams ---
load_hparams: n_mel_bins:         128
load_hparams: proj_stack_factor:  0

load_hparams: model size:         1476.70 MiB
load_hparams: metadata size:      0.35 MiB
alloc_compute_meta:      CUDA0 compute buffer size =   200.96 MiB
alloc_compute_meta:        CPU compute buffer size =    10.99 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
encoding audio slice...
audio slice encoded in 609 ms
decoding audio batch 1/1, n_tokens_batch = 750
audio decoded (batch 1/1) in 634 ms

GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^CG


llama_perf_context_print:        load time =    4233.91 ms
llama_perf_context_print: prompt eval time =    1544.74 ms /   766 tokens (    2.02 ms per token,   495.88 tokens per second)
llama_perf_context_print:        eval time =    3109.48 ms /    50 runs   (   62.19 ms per token,    16.08 tokens per second)
llama_perf_context_print:       total time =    5775.97 ms /   816 tokens
llama_perf_context_print:    graphs reused =          0






with e562eece7cb476276bfc4cbb18deb7c0369b2233 (last good commit):
./build/bin/llama-mtmd-cli -hf ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0 -ngl 99 -p "What is being said here?" --audio /home/nvidia/rosc8493/samples/who.mp3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: Orin, compute capability 8.7, VMM: yes
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_mmproj-Qwen2.5-Omni-7B-Q8_0.gguf
build: 5548 (e562eece) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Orin) - 49873 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2vl
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                         general.size_label str              = 7.6B
llama_model_loader: - kv   3:                            general.license str              = other
llama_model_loader: - kv   4:                       general.license.name str              = apache-2.0
llama_model_loader: - kv   5:                       general.license.link str              = https://huggingface.co/Qwen/Qwen2.5-O...
llama_model_loader: - kv   6:                               general.tags arr[str,2]       = ["multimodal", "any-to-any"]
llama_model_loader: - kv   7:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv   8:                        qwen2vl.block_count u32              = 28
llama_model_loader: - kv   9:                     qwen2vl.context_length u32              = 32768
llama_model_loader: - kv  10:                   qwen2vl.embedding_length u32              = 3584
llama_model_loader: - kv  11:                qwen2vl.feed_forward_length u32              = 18944
llama_model_loader: - kv  12:               qwen2vl.attention.head_count u32              = 28
llama_model_loader: - kv  13:            qwen2vl.attention.head_count_kv u32              = 4
llama_model_loader: - kv  14:                     qwen2vl.rope.freq_base f32              = 1000000.000000
llama_model_loader: - kv  15:   qwen2vl.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                          general.file_type u32              = 7
llama_model_loader: - kv  17:            qwen2vl.rope.dimension_sections arr[i32,4]       = [16, 24, 24, 0]
llama_model_loader: - kv  18:               general.quantization_version u32              = 2
llama_model_loader: - kv  19:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  20:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  21:                      tokenizer.ggml.tokens arr[str,152064]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  22:                  tokenizer.ggml.token_type arr[i32,152064]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  23:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  24:                tokenizer.ggml.eos_token_id u32              = 151645
llama_model_loader: - kv  25:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  26:                    tokenizer.chat_template str              = {% set audio_count = namespace(value=...
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type q8_0:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q8_0
print_info: file size   = 7.54 GiB (8.50 BPW) 
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2vl
print_info: vocab_only       = 0
print_info: n_ctx_train      = 32768
print_info: n_embd           = 3584
print_info: n_layer          = 28
print_info: n_head           = 28
print_info: n_head_kv        = 4
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: is_swa_any       = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 7
print_info: n_embd_k_gqa     = 512
print_info: n_embd_v_gqa     = 512
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: f_attn_scale     = 0.0e+00
print_info: n_ff             = 18944
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = -1
print_info: rope type        = 8
print_info: rope scaling     = linear
print_info: freq_base_train  = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 32768
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 7B
print_info: model params     = 7.62 B
print_info: general.name     = n/a
print_info: vocab type       = BPE
print_info: n_vocab          = 152064
print_info: n_merges         = 151387
print_info: BOS token        = 11 ','
print_info: EOS token        = 151645 '<|im_end|>'
print_info: EOT token        = 151645 '<|im_end|>'
print_info: PAD token        = 151643 '<|endoftext|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|endoftext|>'
print_info: EOG token        = 151645 '<|im_end|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors:        CUDA0 model buffer size =  7165.44 MiB
load_tensors:   CPU_Mapped model buffer size =   552.23 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch       = 2048
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = 0
llama_context: freq_base     = 1000000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context:  CUDA_Host  output buffer size =     0.58 MiB
llama_kv_cache_unified:      CUDA0 KV buffer size =   224.00 MiB
llama_kv_cache_unified: size =  224.00 MiB (  4096 cells,  28 layers,  1 seqs), K (f16):  112.00 MiB, V (f16):  112.00 MiB
llama_context:      CUDA0 compute buffer size =   304.00 MiB
llama_context:  CUDA_Host compute buffer size =    15.01 MiB
llama_context: graph nodes  = 1098
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant

clip_model_loader: model name:   
clip_model_loader: description:  
clip_model_loader: GGUF version: 3
clip_model_loader: alignment:    32
clip_model_loader: n_tensors:    1008
clip_model_loader: n_kv:         32

clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          qwen2.5o
load_hparams: n_embd:             1280
load_hparams: n_head:             16
load_hparams: n_ff:               1280
load_hparams: n_layer:            32
load_hparams: ffn_op:             silu
load_hparams: projection_dim:     3584

--- vision hparams ---
load_hparams: image_size:         1024
load_hparams: patch_size:         14
load_hparams: has_llava_proj:     0
load_hparams: minicpmv_version:   0
load_hparams: proj_scale_factor:  0
load_hparams: n_wa_pattern:       8

load_hparams: model size:         1476.70 MiB
load_hparams: metadata size:      0.35 MiB
alloc_compute_meta:      CUDA0 compute buffer size =     2.77 MiB
alloc_compute_meta:        CPU compute buffer size =     0.16 MiB
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector:          qwen2.5o
load_hparams: n_embd:             1280
load_hparams: n_head:             20
load_hparams: n_ff:               5120
load_hparams: n_layer:            32
load_hparams: ffn_op:             gelu_erf
load_hparams: projection_dim:     3584

--- audio hparams ---
load_hparams: n_mel_bins:         128
load_hparams: proj_stack_factor:  0

load_hparams: model size:         1476.70 MiB
load_hparams: metadata size:      0.35 MiB
alloc_compute_meta:      CUDA0 compute buffer size =   200.96 MiB
alloc_compute_meta:        CPU compute buffer size =    10.99 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
    https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
encoding audio slice...
audio slice encoded in 618 ms
decoding audio batch 1/1, n_tokens_batch = 750
audio decoded (batch 1/1) in 634 ms

The text in the image says "Who are you?"


llama_perf_context_print:        load time =    3096.77 ms
llama_perf_context_print: prompt eval time =    1545.53 ms /   766 tokens (    2.02 ms per token,   495.62 tokens per second)
llama_perf_context_print:        eval time =     717.54 ms /    11 runs   (   65.23 ms per token,    15.33 tokens per second)
llama_perf_context_print:       total time =    3285.56 ms /   777 tokens
