Add ability to remove loras and save them in the context cache #3
Conversation
Just realised that copying pointers won't work here because the memory that backs those structs is explicitly freed with ggml_free(ctx) at the end of the method. The only way (although it requires a double computation) is to deep copy the tensor data into some vector.
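A minimal sketch of that deep-copy approach, assuming the lora tensors are still alive in the lora context when it runs (lora_cache and cache_lora_tensor are hypothetical names, not the exact members/functions added in this PR):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

#include "ggml.h"

// Hypothetical cache, keyed by lora tensor name.
static std::unordered_map<std::string, std::vector<uint8_t>> lora_cache;

// Deep-copy the raw bytes of a lora tensor out of the ggml context so the
// data survives ggml_free(lora_ctx) at the end of the apply-lora method.
static void cache_lora_tensor(const std::string & name, const struct ggml_tensor * t) {
    std::vector<uint8_t> buf(ggml_nbytes(t));      // byte size works for any dtype
    std::memcpy(buf.data(), t->data, buf.size());  // copy the data, not the pointer
    lora_cache[name] = std::move(buf);
}
```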
Also tried passing the model context (which survives for the whole run) instead of the lora context (which gets cleared at the end of the apply-lora method) so that the memory is not freed. But this just leads to OOM. I haven't looked at the implementation, but it's trying to allocate the whole 13 GB of memory again when I call ...
Another tradeoff I need to think about is whether to simply store ... Our aim here is to improve load/reload speed, so for now I will go with caching.
Not knowledgeable enough to consider a CPU-to-GPU memory copy for the cache, so I'll look into it later after a couple of GPT-4 tutor sessions.
The overhead of copying floats to a vector is around 28 ms per tensor. All in all it adds around 4 seconds of extra overhead and increases the first-time lora loading time from 2 to 6 seconds. Figuring out ways to make it faster.

llama_apply_lora_from_file_internal: r = 16, alpha = 16, scaling = 1.00
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wq.weight' in 27.32 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wk.weight' in 28.59 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wv.weight' in 28.06 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wo.weight' in 29.14 ms
...
...
done (6343.69 ms)
Ok, actually I am a bit dumb: the overhead of storing the full lora matrices would be around 8 GB, while just storing the individual A & B matrices is around 64 MB. Should be a no-brainer for both speed and memory.

Edit: Yep, now the copying is really fast - just 0.12 ms per layer, only 15 ms of total overhead compared to 4 seconds previously. LFG!
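For a rough sanity check on those numbers (assuming 7B-class dimensions d = 4096, rank r = 16, fp32 storage, and on the order of 128 lora tensors; none of these values are stated above, so treat them as illustrative):

full BA delta per tensor: 4096 × 4096 × 4 B ≈ 64 MiB → ≈ 8 GiB across 128 tensors
A and B per tensor: 2 × 4096 × 16 × 4 B ≈ 0.5 MiB → ≈ 64 MiB across 128 tensors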
Finally got it to work with the cache after commit hash ... You can see the sample weights get recovered perfectly. Compared MD5 hashes as well to make sure all weights were indeed the same after each operation.

Next: performance improvement, because it's taking exactly the same time as before to swap, which means all this is for nothing.
Wait, lmao, can't even offload to GPU with LoRAs enabled. Processing layers in parallel also won't give any perf improvement since the ... Might have to come up with a different way to cook this lora dish.
TODO: Do the swap in a single graph computation instead of multiple. Now that we have caching, I can just take the union of layers from both the lora to be removed and the lora to be added, and keep doing add/sub based on some metadata in a single graph. This should cut the time down from 4 to 2 seconds.
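A rough sketch of what that single-graph swap could look like (written against the ~mid-2023 ggml API this fork targets, so signatures may differ upstream; layers_to_swap, its fields, lora_ctx and n_threads are hypothetical names for the cached metadata, and the model weights are assumed to be f32):

```cpp
struct ggml_cgraph gf = {};
gf.n_threads = n_threads;

for (auto & layer : layers_to_swap) {              // union of old + new lora layers
    struct ggml_tensor * W = layer.model_weight;   // base weight, patched in place
    struct ggml_tensor * r = W;

    if (layer.has_old) {                           // undo the currently applied lora
        struct ggml_tensor * BA_old = ggml_mul_mat(lora_ctx, layer.old_A, layer.old_B);
        BA_old = ggml_scale_inplace(lora_ctx, BA_old, ggml_new_f32(lora_ctx, layer.old_scale));
        r = ggml_sub_inplace(lora_ctx, r, BA_old); // W -= scale_old * (B_old A_old)
    }
    if (layer.has_new) {                           // apply the new lora from the cache
        struct ggml_tensor * BA_new = ggml_mul_mat(lora_ctx, layer.new_A, layer.new_B);
        BA_new = ggml_scale_inplace(lora_ctx, BA_new, ggml_new_f32(lora_ctx, layer.new_scale));
        r = ggml_add_inplace(lora_ctx, r, BA_new); // W += scale_new * (B_new A_new)
    }

    ggml_build_forward_expand(&gf, r);             // every swap goes into one graph
}

ggml_graph_compute(lora_ctx, &gf);                 // single compute pass over all layers
```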
So I tried swapping them in a single graph computation, but it still takes the same amount of time (i.e. 4 seconds), and all 8 cores are getting utilised. The problem seems to be that the graph is just an array of nodes (each representing some output) created using a DFS traversal. This array is iterated one element at a time, and the results are used for the next element in the array. The parallelism is mostly used to quickly perform the calculations for a single matrix (not 100% sure, will need to go deep into the worker spawn logic).

DFS traversal - https://github.com/ggerganov/llama.cpp/blob/8596af427722775f0df4a7c90b9af067ba90d4ef/ggml.c#L15395
Graph compute iteration - ...
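To make the node-at-a-time behaviour concrete, here is a tiny self-contained C++ sketch of that scheduling pattern (this is not ggml's actual code, just an illustration of "serial across nodes, parallel within a node"):

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Node { const char * name; int n_rows; };

// Stand-in for computing one graph node: every thread takes a slice of rows.
static void compute_node(const Node & node, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&node, t, n_threads] {
            for (int r = t; r < node.n_rows; r += n_threads) {
                (void) r; // ... per-row math for this single op would go here ...
            }
        });
    }
    for (auto & w : workers) {
        w.join();   // implicit barrier: the next node cannot start earlier
    }
}

int main() {
    const int n_threads = 8;
    // The "graph" is a flat list of nodes produced by a DFS over tensor
    // dependencies; nodes run strictly one after another.
    std::vector<Node> graph = { {"sub_old_BA", 4096}, {"add_new_BA", 4096} };
    for (const Node & node : graph) {
        compute_node(node, n_threads);
        std::printf("done: %s\n", node.name);
    }
    return 0;
}
```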
Late update: this got insanely fast after the following CUDA support PR - ggerganov#1970. Now you can swap in 200-300 ms.
incredible
Description
Adds the ability to remove loras from the model.
To reduce file I/O, we also cache the lora matrices BA in the llama_context now. These weights can be reused if we want to apply/remove a lora again.
A smol utility to print model tensors, since quantisation and f16 to f32 conversions make casting a bit non-trivial.
Add a lora cache so you don't have to read model files every time.
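A hedged sketch of the usage this enables (llama_apply_lora_from_file is the existing llama.cpp entry point; llama_remove_lora and the lora path are placeholder names, not necessarily the exact API added by this PR):

```cpp
// First apply reads the lora file, patches the weights, and fills the cache.
llama_apply_lora_from_file(ctx, "./loras/alpaca.bin", NULL, 8);

// Removing it later is served from the cached A/B tensors in llama_context,
// so no file I/O is needed (llama_remove_lora is a placeholder name).
llama_remove_lora(ctx, "./loras/alpaca.bin", 8);

// Re-applying the same lora is also served from the cache.
llama_apply_lora_from_file(ctx, "./loras/alpaca.bin", NULL, 8);
```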