Feature Request: Switch backends dynamically at runtime? #264
It sounds like there are two parts to this request:

1. The ability to configure which backend is used before anything is loaded.
2. The ability to switch backends at runtime, after a backend has already been loaded.
|
It looks like you're already trying to load them one at a time in the static API... what would be the harm in loading all of the ones that are found and just putting the IntPtrs into an array? We could set the one we want to use via a static enum property, and the IntPtr could just be a read-only property that returns the appropriate IntPtr based on which enum value is selected (or throws an exception if a missing one was selected). We could have another static method along the same lines; that would allow hot-swapping of the DLLs at runtime.

Edit: I suppose you wouldn't want the backend to switch out from under any already-created LLMs... but even then, you could still load them all into an array and use an enum (just don't do the static property stuff). Just pass the enum in as a model param, and it grabs the backend pointer at creation time. That would allow you to use multiple backends at the same time. |
Loading all of the backends at once wouldn't work with the current system, because the native methods are written like this:

```csharp
[DllImport("libllama")]
public static extern void demo_method();
```

That will use the already-loaded libllama, so the static API can only ever talk to one backend.

If we just wanted to allow unloading one backend DLL and loading another, I think that could probably be done while keeping the current static API, by freeing the loaded native library and loading a different one in its place.

If we wanted to allow multiple backends, that's a lot more complex. I'm not even 100% sure it's possible - is it guaranteed that two versions of libllama.dll don't try to use some per-process state internally? Assuming it is, we'd probably have to define a non-static "LLamaBackend" object and then fetch all of the native methods as function pointers at runtime.

If you're interested in experimenting with any of this stuff I'm happy to help with that. Some prototypes in a fork showing some proof-of-concept unloading/reloading/multi-loading would be great!

P.S. one extra complexity I just thought of: |
What if we wrote a bunch of delegates as a sort of wrapped API, and bound them to function pointers fetched from whichever native library is loaded? However, the performance hit could be heavy, and cleanup could become troublesome.
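Roughly, the idea could look like this (a minimal sketch only - the class name and the single export shown are illustrative, and a real wrapper would need to cover every llama.cpp export):

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper around one backend binary; each instance owns its own
// library handle and exposes the native functions through delegates.
public sealed class LLamaBackend : IDisposable
{
    [UnmanagedFunctionPointer(CallingConvention.Cdecl)]
    private delegate IntPtr PrintSystemInfoDelegate();

    private readonly IntPtr _handle;
    private readonly PrintSystemInfoDelegate _printSystemInfo;

    public LLamaBackend(string libraryPath)
    {
        // Load the chosen backend binary and resolve each export into a delegate.
        _handle = NativeLibrary.Load(libraryPath);
        _printSystemInfo = Marshal.GetDelegateForFunctionPointer<PrintSystemInfoDelegate>(
            NativeLibrary.GetExport(_handle, "llama_print_system_info"));
    }

    public string? SystemInfo => Marshal.PtrToStringAnsi(_printSystemInfo());

    public void Dispose() => NativeLibrary.Free(_handle);
}
```

Each model could then hold on to the backend instance it was created with, which is what makes the multi-backend case thinkable at all.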
|
As I understand it, that's roughly equivalent to what I was suggesting with the non-static backend object. A few problems that I'm not sure about with loading multiple backends:
|
Just support for swapping would be handy. I'm one of the few who run LLamaSharp on servers with GPUs but don't want to use the GPUs, as they are busy running other apps, so I need the CPU backend. Just hoping we include a way to override the incoming auto-detect code. In the past |
To me, either supporting multiple backends or swapping backends seems too aggressive for now. The largest problem is that there is memory held inside the native library which we know little about; things may be out of our control when swapping the backend. For now I suggest only allowing the user to select a preferred backend once, before the llama model is loaded (the "ability to configure the backend before it is loaded" discussed above).

In the condition @saddam213 mentioned, there's a point which needs to be discussed further: what should happen to a model that has already been loaded when the backend it was created on is swapped out.

Though I'm against dynamically switching the backend after loading for now, I think this issue is a commonly encountered one when building a service based on LLamaSharp and is worth working on. I was just saying we shouldn't make the changes until we eliminate the risks. |
I'm working on an Electron/LLamaSharp application with .NET 6 and was wondering if it would be possible to give the user the option to select the backend in the app settings via a dropdown. The dll would then be downloaded from GitHub (https://github.com/ggerganov/llama.cpp/releases/latest) and unpacked, and the application would then need to be restarted. When starting the application, LLamaSharp just needs to know in which directory the unpacked dll is located. I noticed this code, where an alternative path to the backend dll could possibly be loaded. I haven't tested the NativeLibraryConfig yet, but this looks very promising for dynamic loading of the backend.

Edit:

Edit2: Now the correct path is loaded, but the exception is still thrown. Probably I'm loading the wrong dll or I'm missing something. I will check it tomorrow ^^'. |
Please use `NativeLibraryConfig` for that.
I think so, it was designed to do this :)
If you would like to, you can remove the check here to load the library from anywhere at runtime, to help check whether there are memory leaks or other bad behaviours. |
Looks like you've already worked it out, but please note that you cannot just download the latest DLL from llama.cpp - you must download exactly the right commit version. There is absolutely no compatibility from version to version! |
Please ensure you loaded a library with the name the bindings expect (`libllama`). |
Ok, I couldn't fall asleep without testing it again :D |
That's exactly why I was suggesting having all of the DLLs present loaded at once -- so that a model which is already loaded would not be affected; it would just continue working against whichever DLL was passed in to it when it was instantiated. Yes, that will increase memory usage... but if each model is tied to the single version of the backend it was created on, then each model was going to consume that memory anyway.
Doesn't that make it like... super easy to do what I'm proposing then? Before you scoff and think "wow, this will be a lot more typing and a lot more code to maintain" -- please remember that T4 templates exist, and literally all of this can be automated to run at build time... to the point where you just pass an enum to the model at creation time, and everything else is done for you. All you would need to do is define one instance of the class (with the regular extern imports in C#, which you are already doing) and which DLLs map to which enum values... the enum itself could even be generated by the T4 template -- so in that case you would just need a definition file, something like:
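(The original example isn't preserved in this copy of the thread; a hypothetical definition file might look something like this, with made-up keys and DLL names:)

```json
{
  "backends": {
    "Cpu":    "libllama-avx2.dll",
    "Cuda11": "libllama-cuda11.dll",
    "Cuda12": "libllama-cuda12.dll"
  }
}
```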
And your T4 template could just read that JSON definition file as an input, along with a specified C# file to modify, and it could generate the rest at compile time.

**Setup**

Let's consider the following setup as a very basic version of what we're talking about: let's say I've got two builds of the same native library sitting in different folders (e.g. `1337\library1.dll` and `42000\library1.dll`). Now, in C# I could do it quick and dirty like so:
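(The quick-and-dirty snippet didn't survive in this copy of the thread; judging by the reply quoted further down, it was roughly along these lines - treat the paths and function name as illustrative:)

```csharp
using System.Runtime.InteropServices;

// Quick and dirty: each class hard-codes a DllImport path to one build of the
// native library, so callers are statically bound to whichever class they use.
static class Library1_V1337
{
    [DllImport(@"relative\runtime\path\to\1337\library1.dll")]
    public static extern int SomeFunction();
}

static class Library1_V42000
{
    [DllImport(@"relative\runtime\path\to\42000\library1.dll")]
    public static extern int SomeFunction();
}
```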
And that works... but you can't switch between the two implementations dynamically.

**Simple Interfaces**

Below is the dirty version of what I'm suggesting (this would basically be a simplified version of the output of your T4 template -- I can assist in writing the template itself if you are interested in going this route):
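(Again, the original snippet isn't preserved here; a minimal sketch of the interface idea might look like this - all names are hypothetical and only one native function is shown:)

```csharp
using System;
using System.Runtime.InteropServices;

// Which native build a model should run against.
public enum Backend { V1337, V42000 }

// Everything above the native layer talks to this interface instead of the statics.
public interface ILibrary1
{
    int SomeFunction();
}

// One thin wrapper per backend; a T4 template could generate these.
public sealed class Library1V1337 : ILibrary1
{
    [DllImport(@"relative\runtime\path\to\1337\library1.dll", EntryPoint = "SomeFunction")]
    private static extern int SomeFunctionNative();

    public int SomeFunction() => SomeFunctionNative();
}

public sealed class Library1V42000 : ILibrary1
{
    [DllImport(@"relative\runtime\path\to\42000\library1.dll", EntryPoint = "SomeFunction")]
    private static extern int SomeFunctionNative();

    public int SomeFunction() => SomeFunctionNative();
}

public static class BackendFactory
{
    // A model would call this once at creation time and hold on to the result.
    public static ILibrary1 Create(Backend backend) => backend switch
    {
        Backend.V1337  => new Library1V1337(),
        Backend.V42000 => new Library1V42000(),
        _ => throw new ArgumentOutOfRangeException(nameof(backend)),
    };
}
```

The model would receive the enum in its params, grab an `ILibrary1` from the factory, and never touch the static imports directly.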
So from here, the user just needs to specify the backend when they create their objects (again, preferably via an enum that is managed by the T4 template itself). Everything else can talk to the unmanaged code via the instance of the interface that was passed in to them.

**Motivation & Usage**

This design fulfills every letter of the SOLID acronym for good software design IMO, so that's part of why I'm suggesting it. It will be really obvious to the user which model is running on which backend, and the code doesn't break when you switch -- all you do is create a new instance with the desired model, transfer the state over, and dispose the old one. Transferring the state would be the user's responsibility in this case (assuming those bugs were to get fixed). |
@BrainSlugs83 Thank you very much for these suggestions. We're always open to feature proposals. :) Before we start a deep discussion about it, could you please describe further why you want to switch dynamically between the CPU backend and the CUDA backend? If the reason is to switch between a GPU-offloaded model and a CPU-offloaded one, in #298 we found that llama.cpp already supports running purely on the CPU with the CUDA-built library. |
I do think this is a viable design: the backend could be specified when you load the model, and from then on it can be handled automatically within LLamaSharp (LLamaWeights would hold a reference to the backend instance, and when you create a context that would in turn hold a reference to the backend, etc). That said, I do think it comes with a very large "complexity cost" for a relatively small benefit, to be honest. I'm not against adding it, but I'm also not rushing to do all the necessary work 😆

Just a note about this:

```csharp
[DllImport(@"relative\runtime\path\to\1337\library1.dll")]
public static extern int SomeFunction();
```

As far as I know it must just be the name of the DLL, not an entire path. That's not a fatal flaw though, you could make it work by mashing the whole path down into the name, like:

```csharp
class Foo {
    [DllImport(@"1337_library1.dll")]
    public static extern int SomeFunction();
}

class Bar {
    [DllImport(@"42000_library1.dll")]
    public static extern int SomeFunction();
}
```

and renaming your libraries as appropriate. |
I tested the above with both absolute and relative paths, and they both work. The DLL doesn't need to be in the same directory as the executable -- however, if using relative paths, the path needs to be relative to the current directory.
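(As an aside, .NET can also redirect the logical DllImport name to an arbitrary path at runtime via NativeLibrary.SetDllImportResolver, which avoids renaming files. A sketch, assuming the bindings use the name "libllama"; the helper and its parameters are hypothetical:)

```csharp
using System;
using System.Reflection;
using System.Runtime.InteropServices;

static class BackendResolver
{
    // Register before the first native call; the resolver applies to P/Invokes
    // declared in the assembly you pass in (i.e. the assembly containing the
    // [DllImport("libllama")] declarations).
    public static void UseLibraryAt(Assembly assemblyWithPInvokes, string fullPath)
    {
        NativeLibrary.SetDllImportResolver(assemblyWithPInvokes,
            (libraryName, assembly, searchPath) =>
                libraryName == "libllama"
                    ? NativeLibrary.Load(fullPath)  // redirect to a concrete file
                    : IntPtr.Zero);                 // default resolution otherwise
    }
}
```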
That's actually great information that I was not aware of before -- for the time being it sounds like I can just stick with the CUDA backend. (But there's no guarantee that will work for folks who don't have CUDA installed, for the release version of my app...)

Previously I'd run into bugs and instabilities with either the CUDA or non-CUDA builds depending on the version of LLamaSharp, and in the interest of trying to narrow down the bug, I found it very painful to have to constantly change NuGet packages. I was also thinking of other benefits, such as runtime switching for an end-user application (because performance seems to vary wildly between the two from version to version 🫤), and of other targets like Metal or a non-AVX CPU lib -- or, down the road, third-party backend targets if they exist (such as OpenCL, Windows ONNX acceleration, or even a Vulkan backend). It would be great for compatibility to be able to switch between them on the fly. |
Yes, we are cautious about this feature, so we still separate the CPU and CUDA backend packages. However, I believe that llama.cpp feature is meant to support running on non-CUDA devices. Besides, in the situation you described, I think the CUDA auto-detection included in v0.8.0 already supports it. We are thinking about gathering all the native libraries into one backend package (but haven't made the final decision) and automatically choosing a library; it would choose the CUDA backend only when CUDA is available on the device. |
Couldn't we just automatically load only the binaries that the CPU/GPU supports, based on interrogating the OS? Then calculate the memory required, and run on the GPU if it's available and on the CPU if not?
That's actually what we already do. CUDA binaries are loaded based on the version of CUDA installed, and failing that the best CPU binary is loaded (based on which AVX version your CPU supports).
There was some discussion in #42 about automatic layer count calculation. It seems like that's complicated even for llama.cpp (which has more information to work it out with than LLamaSharp does), so that probably won't happen any time soon, or ever, unfortunately. |
Ok, well LM Studio is doing something, because if I add both the CPU and CUDA packages to a project, it always uses the CPU and ignores CUDA being there; and if I have CUDA and I run out of memory for the request, you just get an out-of-memory message and your app dies. In LM Studio, it uses CUDA first if the context fits into GPU VRAM, and if it doesn't, it falls back to the CPU. I can see it doing this with Task Manager. It would be nice to have the same behaviour so that our apps aren't so brittle. |
Yeah, the other issue is there are multiple versions of CUDA, right? So even if I just packaged an app with CUDA and were to set GPU layers to 0, I would still have to make two separate release builds of my application for the GPU users, depending on whether the user had CUDA 11 or CUDA 12. So again, it would be good to be able to switch at runtime... without needing separate NuGet packages for each backend. |
We already support detecting the CUDA version to load a suitable library. The only thing left is to integrate all the backend packages into one package; actually, I intended to make this breaking change to the backend library in v1.0.0. I could add support for specifying a base directory in v0.8.1, so that you can take advantage of this feature by keeping a certain file structure for your native libraries. v0.8.1 will be out within 1 or 2 days. |
Would this allow graceful fallback? Would it allow for an OpenCL version of the library as well (for AMD)? Ideally we just want it to work as fast as it can, and not have to expose knowledge of the running machine to be able to use this library. So it needs to just work, like LM Studio does. |
I don't think we do anything specific for OpenCL at the moment, but this:

> Ideally we just want it to work as fast as it can, and not have to expose knowledge of the running machine to be able to use this library.

is definitely the long-term intention of the loading system. As much of it should be automated as possible. |
@BrainSlugs83 v0.8.1 is out now, which allows specifying the search directories for native libraries. |
Oh, dang, I got confused by this: LLamaSharp/LLama/Native/NativeApi.Load.cs, line 199 in 884f5ad.
Looking closer, I see that you are correct; this is like a fallback for when CUDA isn't detected. So that makes sense.
I'm fine with waiting for a proper merged library if it's coming in 1.0.0. 🙂
The llama.cpp folks already have an OpenCL version (and maybe a ROCm one IIUC), and I think they are planning a Vulkan backend as well (saw it under their discussions...) -- but the Vulkan build is a bit stalled.
Nice, I'll take a look! 🙂 |
OpenCL support will be merged in #479 and will probably be included in the next release (some work is still needed to create the new NuGet packages). |
#670 suggests a way to use LLamaSharp without backend packages, which might be related to the point of this issue. Flagging it here for your attention. :) |
Right now, to switch backends, I have to uninstall one NuGet package and then install another. This means that if I make an app, I would have to create different builds for every configuration.
Instead I'd rather just give someone a dropdown box in a UI and let them pick the backend, and have the app switch on the fly to whatever they have selected.
Ideally this would just be an enum in the ModelParams, something like `Backend = LLamaSharp.Cuda` or `Backend = LLamaSharp.Cpu`, etc. (Heck, even if it has to be a global static parameter, that's better than having to make separate builds.)

One possible way it could be implemented is by shipping all of the libllama.dll files in the output directory with a slightly different name (i.e. libllama-cuda.dll, libllama-avx.dll, etc.) and dynamically loading the DLL into memory based on which backend is being requested.
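A minimal sketch of that idea, purely illustrative - the enum, file names, and helper below are hypothetical and not an existing LLamaSharp API:

```csharp
using System;
using System.IO;
using System.Runtime.InteropServices;

public enum LLamaBackendKind { Cpu, Cuda }

public static class BackendLoader
{
    // Maps the requested backend to one of the renamed DLLs shipped alongside
    // the app and loads it into the process.
    public static IntPtr Load(LLamaBackendKind backend)
    {
        string fileName = backend switch
        {
            LLamaBackendKind.Cuda => "libllama-cuda.dll",
            LLamaBackendKind.Cpu  => "libllama-avx.dll",
            _ => throw new ArgumentOutOfRangeException(nameof(backend)),
        };
        string path = Path.Combine(AppContext.BaseDirectory, fileName);
        return NativeLibrary.Load(path);
    }
}
```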