
Hard crash in ggml_backend_dev_count #75

Closed
AsbjornOlling opened this issue Jan 8, 2025 · 5 comments
Labels
bug Something isn't working

Comments

@AsbjornOlling
Contributor

...this was also an issue before using the ggml stuff to enumerate GPUs, back when we were using wgpu for the same task.

Our code panics because a C++ exception is thrown:

> terminate called after throwing an instance of 'vk::IncompatibleDriverError'
>   what():  vk::createInstance: ErrorIncompatibleDriver

This is reproducible in GitHub CI, on PR #71.

It also happens on my machine, when running the integration tests in the nix sandbox. (The same integration test runs fine outside of the nix sandbox, using the exact same build files).

Since we can't catch C++ exceptions in Rust, we need to come up with some sort of guard clause. Either that, or we need to catch the exception on the other side of the FFI barrier.

@AsbjornOlling
Contributor Author

AsbjornOlling commented Jan 8, 2025

I figured that the sane behavior here would be to fall back to CPU, but now I'm not so sure that's even possible. We built llama.cpp for Vulkan, and it doesn't seem to be able to initialize any kind of Vulkan driver.

EDIT: yeah... if I hard-code a false return in has_discrete_gpu, it just fails in the exact same way once we get to LlamaModel::load_from_file (throwing vk::createInstance: ErrorIncompatibleDriver).

@AsbjornOlling AsbjornOlling added the bug Something isn't working label Jan 8, 2025
@AsbjornOlling
Contributor Author

It's a bit weird that the cargo test unit tests work fine in the same (or, well... extremely similar) environment.
They should depend on a Vulkan driver just as much. Why would this change when we're running inside a Godot project? Weird.

Also weird:
I tried checking for Vulkan availability using the "ash" crate (which provides Vulkan bindings for Rust), and it seems to find a valid Vulkan driver. It even reports the same API version when I run it with amdgpu as it does when I run it in the nix sandbox.

@AsbjornOlling
Contributor Author

Okay so the environment where the unit tests pass and the environment where the integration test fails are not very alike after all.

Another way of reproducing the issue without running our code is to run vulkaninfo --summary (with vulkaninfo from the vulkan-tools package). It fails with ERROR_INCOMPATIBLE_DRIVER in the integration-test environment and succeeds in the unit-test environment.

@volesen
Contributor

volesen commented Jan 9, 2025

Could we maybe check for the availability of Vulkan drivers up front?

@volesen
Contributor

volesen commented Jan 9, 2025

There is functionality to query for Vulkan-compatible devices without crashing, via the VK_KHR_portability_enumeration extension. llama.cpp actually supported this, but it was removed in ggerganov/llama.cpp#5757 to support older Vulkan drivers.

I think we should document the error and try to apply an upstream fix.

@volesen volesen closed this as completed Jan 9, 2025