Description
Name and Version
$ ./build/bin/llama-mtmd-cli --version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
version: 6316 (009b709)
built with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
Jetson AGX Orin 64GB
Models
Qwen2.5-Omni-7B
- GGUFs from ggml-org/Qwen2.5-Omni-7B-GGUF
As a sanity check, I also tried the BF16 GGUF from https://huggingface.co/unsloth/Qwen2.5-Omni-7B-GGUF as well as Q4_K_M quants of the smaller 3B model from https://huggingface.co/unsloth/Qwen2.5-Omni-3B-GGUF and https://huggingface.co/ggml-org/Qwen2.5-Omni-3B-GGUF.
Problem description & steps to reproduce
Build commands:
cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="87"
cmake --build build --config Release -- -j
Problem:
I am receiving gibberish output, specifically a chain of Gs: "GGGGGGGGGG...".
Interestingly, this only starts happening if the entire model is offloaded to the GPU and audio input is present. Text-only prompts still work fine, unlike in similar issues such as #15556 or #15034, where text-only output was already broken.
Smaller values for -ngl did not lead to this behavior.
Furthermore, the bug seems to be specific to the Qwen2.5-Omni model itself: it occurred with all of the models listed above, while other audio-capable models such as Ultravox and Voxtral-Small ran fine.
I first noticed the issue with requests to llama-server, but I could reproduce it more simply with llama-mtmd-cli:
./llama-mtmd-cli -hf ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0 -ngl 99 -p "What is being said here?" --audio /home/nvidia/samples/who.mp3
where the audio is just a TTS-generated clip saying "Who are you?". It should happen regardless of the exact audio and prompt, though.
First Bad Commit
eb39499 is the first bad commit
commit eb39499
Author: Shawn yang <137684499+Yangxiaoz@users.noreply.github.com>
Date: Sat May 31 14:48:04 2025 +0800
CUDA: add a prop in ggml_cuda_device_infor for distinguish iGPU or dGPU in cuda (#13856) (#13895)
* 1. add "integrated" in ggml_cuda_device_info for distinguish whether it is Intergrate_gpu or discrete_gpu
2. Adjust the func:"ggml_backend_cuda_device_supports_buft" for this new feature
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted code indentation
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fixed incorrect setting of variable types
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the judgment logic
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* add a host_buft assert in case of integrated_cuda_device with func:'evaluate_and_capture_cuda_graph()'
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add a defensive security assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Adjusted the support judgment logic.
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* revoke the suggest commit changes due to it's not applicable in jetson_device
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Add parentheses to enforce operator precedence
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Update ggml/src/ggml-cuda/ggml-cuda.cu
Fix ci bug: add a spaces
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: yangxiao <yang_xl@tju.edu.cn>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: yangxiao <yangxl_zz@qq.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
ggml/src/ggml-cuda/common.cuh | 1 +
ggml/src/ggml-cuda/ggml-cuda.cu | 20 ++++++++++++++------
2 files changed, 15 insertions(+), 6 deletions(-)
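For context, my rough reading of that change (simplified, renamed identifiers; not the actual diff): the CUDA backend now records whether a device is an integrated GPU, and an integrated device additionally reports support for host buffer types, so on the Orin (iGPU) data that previously went through a dedicated device buffer may now be used from host memory directly. A minimal C++ sketch of that kind of check, under those assumptions:

// Hedged sketch only -- simplified names, not the actual llama.cpp code.
#include <cstdio>

struct device_info_sketch {
    int  cc;          // compute capability
    bool integrated;  // true for an iGPU like the Orin, false for a dGPU
};

// Rough shape of the buffer-type check the commit adjusts: a discrete GPU only
// accepts its own CUDA device buffers, while an integrated GPU additionally
// accepts host buffers, since it shares physical memory with the CPU.
static bool supports_buft_sketch(const device_info_sketch & dev,
                                 bool buft_is_cuda_device,
                                 bool buft_is_host) {
    return buft_is_cuda_device || (dev.integrated && buft_is_host);
}

int main() {
    device_info_sketch orin{87, true};
    std::printf("host buft accepted on iGPU: %d\n",
                supports_buft_sketch(orin, false, true));
    return 0;
}

If that host-buffer path is what full offload plus the audio mmproj ends up exercising on the Orin, it would line up with the bisect result, but I have not verified the exact code path.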
Relevant log output
with 009b709d (first bad commit):
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_mmproj-Qwen2.5-Omni-7B-Q8_0.gguf
build: 6316 (009b709d) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Orin) - 55943 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 7.6B
llama_model_loader: - kv 3: general.license str = other
llama_model_loader: - kv 4: general.license.name str = apache-2.0
llama_model_loader: - kv 5: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-O...
llama_model_loader: - kv 6: general.tags arr[str,2] = ["multimodal", "any-to-any"]
llama_model_loader: - kv 7: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 8: qwen2vl.block_count u32 = 28
llama_model_loader: - kv 9: qwen2vl.context_length u32 = 32768
llama_model_loader: - kv 10: qwen2vl.embedding_length u32 = 3584
llama_model_loader: - kv 11: qwen2vl.feed_forward_length u32 = 18944
llama_model_loader: - kv 12: qwen2vl.attention.head_count u32 = 28
llama_model_loader: - kv 13: qwen2vl.attention.head_count_kv u32 = 4
llama_model_loader: - kv 14: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.chat_template str = {% set audio_count = namespace(value=...
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 7.54 GiB (8.50 BPW)
load: printing all EOG tokens:
load: - 151643 ('<|endoftext|>')
load: - 151645 ('<|im_end|>')
load: - 151662 ('<|fim_pad|>')
load: - 151663 ('<|repo_name|>')
load: - 151664 ('<|file_sep|>')
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = n/a
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 7165.44 MiB
load_tensors: CPU_Mapped model buffer size = 552.23 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: kv_unified = false
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache: CUDA0 KV buffer size = 224.00 MiB
llama_kv_cache: size = 224.00 MiB ( 4096 cells, 28 layers, 1/1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: CUDA0 compute buffer size = 304.00 MiB
llama_context: CUDA_Host compute buffer size = 17.02 MiB
llama_context: graph nodes = 1070
llama_context: graph splits = 2
common_init_from_params: added <|endoftext|> logit bias = -inf
common_init_from_params: added <|im_end|> logit bias = -inf
common_init_from_params: added <|fim_pad|> logit bias = -inf
common_init_from_params: added <|repo_name|> logit bias = -inf
common_init_from_params: added <|file_sep|> logit bias = -inf
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 1008
clip_model_loader: n_kv: 32
clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 1280
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 3584
--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: CUDA0 compute buffer size = 3.60 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 20
load_hparams: n_ff: 5120
load_hparams: n_layer: 32
load_hparams: ffn_op: gelu_erf
load_hparams: projection_dim: 3584
--- audio hparams ---
load_hparams: n_mel_bins: 128
load_hparams: proj_stack_factor: 0
load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: CUDA0 compute buffer size = 200.96 MiB
alloc_compute_meta: CPU compute buffer size = 10.99 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
encoding audio slice...
audio slice encoded in 609 ms
decoding audio batch 1/1, n_tokens_batch = 750
audio decoded (batch 1/1) in 634 ms
GGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGG^CG
llama_perf_context_print: load time = 4233.91 ms
llama_perf_context_print: prompt eval time = 1544.74 ms / 766 tokens ( 2.02 ms per token, 495.88 tokens per second)
llama_perf_context_print: eval time = 3109.48 ms / 50 runs ( 62.19 ms per token, 16.08 tokens per second)
llama_perf_context_print: total time = 5775.97 ms / 816 tokens
llama_perf_context_print: graphs reused = 0
with e562eece7cb476276bfc4cbb18deb7c0369b2233 (last good commit):
./build/bin/llama-mtmd-cli -hf ggml-org/Qwen2.5-Omni-7B-GGUF:Q8_0 -ngl 99 -p "What is being said here?" --audio /home/nvidia/rosc8493/samples/who.mp3
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: Orin, compute capability 8.7, VMM: yes
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
curl_perform_with_retry: HEAD https://huggingface.co/ggml-org/Qwen2.5-Omni-7B-GGUF/resolve/main/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf (attempt 1 of 1)...
common_download_file_single: using cached file: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_mmproj-Qwen2.5-Omni-7B-Q8_0.gguf
build: 5548 (e562eece) with cc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 for aarch64-linux-gnu
llama_model_load_from_file_impl: using device CUDA0 (Orin) - 49873 MiB free
llama_model_loader: loaded meta data with 27 key-value pairs and 339 tensors from /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen2vl
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.size_label str = 7.6B
llama_model_loader: - kv 3: general.license str = other
llama_model_loader: - kv 4: general.license.name str = apache-2.0
llama_model_loader: - kv 5: general.license.link str = https://huggingface.co/Qwen/Qwen2.5-O...
llama_model_loader: - kv 6: general.tags arr[str,2] = ["multimodal", "any-to-any"]
llama_model_loader: - kv 7: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 8: qwen2vl.block_count u32 = 28
llama_model_loader: - kv 9: qwen2vl.context_length u32 = 32768
llama_model_loader: - kv 10: qwen2vl.embedding_length u32 = 3584
llama_model_loader: - kv 11: qwen2vl.feed_forward_length u32 = 18944
llama_model_loader: - kv 12: qwen2vl.attention.head_count u32 = 28
llama_model_loader: - kv 13: qwen2vl.attention.head_count_kv u32 = 4
llama_model_loader: - kv 14: qwen2vl.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: qwen2vl.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 16: general.file_type u32 = 7
llama_model_loader: - kv 17: qwen2vl.rope.dimension_sections arr[i32,4] = [16, 24, 24, 0]
llama_model_loader: - kv 18: general.quantization_version u32 = 2
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,152064] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,152064] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 151643
llama_model_loader: - kv 26: tokenizer.chat_template str = {% set audio_count = namespace(value=...
llama_model_loader: - type f32: 141 tensors
llama_model_loader: - type q8_0: 198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q8_0
print_info: file size = 7.54 GiB (8.50 BPW)
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch = qwen2vl
print_info: vocab_only = 0
print_info: n_ctx_train = 32768
print_info: n_embd = 3584
print_info: n_layer = 28
print_info: n_head = 28
print_info: n_head_kv = 4
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 7
print_info: n_embd_k_gqa = 512
print_info: n_embd_v_gqa = 512
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-06
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 18944
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = -1
print_info: rope type = 8
print_info: rope scaling = linear
print_info: freq_base_train = 1000000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 32768
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 7B
print_info: model params = 7.62 B
print_info: general.name = n/a
print_info: vocab type = BPE
print_info: n_vocab = 152064
print_info: n_merges = 151387
print_info: BOS token = 11 ','
print_info: EOS token = 151645 '<|im_end|>'
print_info: EOT token = 151645 '<|im_end|>'
print_info: PAD token = 151643 '<|endoftext|>'
print_info: LF token = 198 'Ċ'
print_info: FIM PRE token = 151659 '<|fim_prefix|>'
print_info: FIM SUF token = 151661 '<|fim_suffix|>'
print_info: FIM MID token = 151660 '<|fim_middle|>'
print_info: FIM PAD token = 151662 '<|fim_pad|>'
print_info: FIM REP token = 151663 '<|repo_name|>'
print_info: FIM SEP token = 151664 '<|file_sep|>'
print_info: EOG token = 151643 '<|endoftext|>'
print_info: EOG token = 151645 '<|im_end|>'
print_info: EOG token = 151662 '<|fim_pad|>'
print_info: EOG token = 151663 '<|repo_name|>'
print_info: EOG token = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: CUDA0 model buffer size = 7165.44 MiB
load_tensors: CPU_Mapped model buffer size = 552.23 MiB
.......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 4096
llama_context: n_ctx_per_seq = 4096
llama_context: n_batch = 2048
llama_context: n_ubatch = 512
llama_context: causal_attn = 1
llama_context: flash_attn = 0
llama_context: freq_base = 1000000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (4096) < n_ctx_train (32768) -- the full capacity of the model will not be utilized
llama_context: CUDA_Host output buffer size = 0.58 MiB
llama_kv_cache_unified: CUDA0 KV buffer size = 224.00 MiB
llama_kv_cache_unified: size = 224.00 MiB ( 4096 cells, 28 layers, 1 seqs), K (f16): 112.00 MiB, V (f16): 112.00 MiB
llama_context: CUDA0 compute buffer size = 304.00 MiB
llama_context: CUDA_Host compute buffer size = 15.01 MiB
llama_context: graph nodes = 1098
llama_context: graph splits = 2
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
mtmd_cli_context: chat template example:
<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
clip_model_loader: model name:
clip_model_loader: description:
clip_model_loader: GGUF version: 3
clip_model_loader: alignment: 32
clip_model_loader: n_tensors: 1008
clip_model_loader: n_kv: 32
clip_model_loader: has vision encoder
clip_model_loader: has audio encoder
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 16
load_hparams: n_ff: 1280
load_hparams: n_layer: 32
load_hparams: ffn_op: silu
load_hparams: projection_dim: 3584
--- vision hparams ---
load_hparams: image_size: 1024
load_hparams: patch_size: 14
load_hparams: has_llava_proj: 0
load_hparams: minicpmv_version: 0
load_hparams: proj_scale_factor: 0
load_hparams: n_wa_pattern: 8
load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: CUDA0 compute buffer size = 2.77 MiB
alloc_compute_meta: CPU compute buffer size = 0.16 MiB
clip_ctx: CLIP using CUDA0 backend
load_hparams: projector: qwen2.5o
load_hparams: n_embd: 1280
load_hparams: n_head: 20
load_hparams: n_ff: 5120
load_hparams: n_layer: 32
load_hparams: ffn_op: gelu_erf
load_hparams: projection_dim: 3584
--- audio hparams ---
load_hparams: n_mel_bins: 128
load_hparams: proj_stack_factor: 0
load_hparams: model size: 1476.70 MiB
load_hparams: metadata size: 0.35 MiB
alloc_compute_meta: CUDA0 compute buffer size = 200.96 MiB
alloc_compute_meta: CPU compute buffer size = 10.99 MiB
init_audio: audio input is in experimental stage and may have reduced quality:
https://github.com/ggml-org/llama.cpp/discussions/13759
main: loading model: /home/nvidia/.cache/llama.cpp/ggml-org_Qwen2.5-Omni-7B-GGUF_Qwen2.5-Omni-7B-Q8_0.gguf
encoding audio slice...
audio slice encoded in 618 ms
decoding audio batch 1/1, n_tokens_batch = 750
audio decoded (batch 1/1) in 634 ms
The text in the image says "Who are you?"
llama_perf_context_print: load time = 3096.77 ms
llama_perf_context_print: prompt eval time = 1545.53 ms / 766 tokens ( 2.02 ms per token, 495.62 tokens per second)
llama_perf_context_print: eval time = 717.54 ms / 11 runs ( 65.23 ms per token, 15.33 tokens per second)
llama_perf_context_print: total time = 3285.56 ms / 777 tokens