
Hard crash in ggml_backend_dev_count #75

Closed
AsbjornOlling opened this issue Jan 8, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@AsbjornOlling
Contributor

...this was also an issue before using the ggml stuff to enumerate GPUs, back when we were using wgpu for the same task.

Our code panics because a C++ exception is thrown:

> terminate called after throwing an instance of 'vk::IncompatibleDriverError'
>   what():  vk::createInstance: ErrorIncompatibleDriver

This is reproducible in GitHub CI, on PR #71.

It also happens on my machine, when running the integration tests in the nix sandbox. (The same integration test runs fine outside of the nix sandbox, using the exact same build files).

Since we can't catch C++ exceptions in Rust, we need to come up with some sort of guard clause. Either that, or we need to catch the exception on the other side of the FFI barrier.

@AsbjornOlling
Contributor Author

AsbjornOlling commented Jan 8, 2025

I figured that the sane behavior here would be to fall back to CPU, but now I'm not so sure that's even possible. We built llama.cpp for Vulkan, and it doesn't seem to be able to initialize any kind of Vulkan driver.

EDIT: yeah... if I hard-code a false return in has_discrete_gpu, it just fails in the exact same way once we get to LlamaModel::load_from_file (throwing vk::createInstance: ErrorIncompatibleDriver).

@AsbjornOlling AsbjornOlling added the bug Something isn't working label Jan 8, 2025
@AsbjornOlling
Contributor Author

It's a bit weird that the cargo test unit tests work fine in the same (or, well... extremely similar) environment.
They should depend on a Vulkan driver just as much. Why would this change when we're running inside a Godot project? Weird.

Also weird:
I tried checking for Vulkan availability using the "ash" crate (which provides Vulkan bindings for Rust), and it seems to find a valid Vulkan driver. It even reports the same API version when I run it with amdgpu as it does when I run it in the nix sandbox.

@AsbjornOlling
Contributor Author

Okay so the environment where the unit tests pass and the environment where the integration test fails are not very alike after all.

Another way of reproducing the issue without running our code is to run vulkaninfo --summary (with vulkaninfo from the vulkan-tools package). It fails with ERROR_INCOMPATIBLE_DRIVER in the integration-test environment and succeeds in the unit-test environment.

@volesen
Contributor

volesen commented Jan 9, 2025

Could we maybe check for the availability of Vulkan drivers up front?

@volesen
Contributor

volesen commented Jan 9, 2025

There is functionality to query for Vulkan-compatible devices without crashing, via the VK_KHR_portability_enumeration extension. llama.cpp actually supported this, but it was removed in ggerganov/llama.cpp#5757 to support older Vulkan drivers.

I think we should document the error and try to apply an upstream fix.

@volesen volesen closed this as completed Jan 9, 2025