
Add ability to remove loras and save them in the context cache #3

Closed

wants to merge 5 commits

Conversation

@ghost commented Jun 17, 2023

Description

  • Adds the ability to remove LoRAs from the model

  • To reduce file I/O, we also cache the LoRA matrix s.BA in the llama_context now. These weights can be reused if we want to apply/remove the LoRA again

  • A smol utility to print model tensors, since quantisation and f16-to-f32 conversions make casting a bit non-trivial

  • Adds a LoRA cache so you don't have to read the adapter files every time (a sketch of the idea is below)
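
Conceptually, applying a LoRA adds a scaled low-rank update to each target weight and removing it subtracts the same update, so keeping that update around makes removal cheap and file-free. A minimal sketch of the idea, assuming f32 weights and hypothetical names (`lora_cache_entry`, `lora_apply`, `lora_remove`) that are illustrative rather than the actual structures in this PR:

```cpp
// Apply:  W' = W + s * (B @ A)      Remove:  W = W' - s * (B @ A)
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct lora_cache_entry {
    std::vector<float> delta;   // the scaled update s * (B @ A), flattened
    int64_t ne0 = 0, ne1 = 0;   // delta dimensions
};

// hypothetical member of llama_context, keyed by tensor name:
// std::map<std::string, lora_cache_entry> lora_cache;

// add the cached delta to the base weight (both assumed f32 here)
inline void lora_apply(float * w, const lora_cache_entry & e) {
    for (size_t i = 0; i < e.delta.size(); ++i) w[i] += e.delta[i];
}

// subtract the same delta to restore the original weight
inline void lora_remove(float * w, const lora_cache_entry & e) {
    for (size_t i = 0; i < e.delta.size(); ++i) w[i] -= e.delta[i];
}
```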

@ghost (Author) commented Jun 17, 2023

Just realised that copying pointers won't work here, because the memory that backs those structs is explicitly freed with ggml_free(ctx) at the end of the method.

The only way (although it costs extra work) is to deep-copy the tensor data into some vector.
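
A minimal sketch of that deep copy, assuming the LoRA tensors are f32 and using only `ggml_nelements()`/`ggml_nbytes()` from ggml (the context bookkeeping around it is omitted):

```cpp
#include <cstring>
#include <vector>
#include "ggml.h"

// Copy a ggml tensor's payload into owned storage so it survives ggml_free(ctx).
// Assumes the tensor holds f32 data; quantized types would need their raw bytes instead.
static std::vector<float> copy_tensor_f32(const struct ggml_tensor * t) {
    std::vector<float> out(ggml_nelements(t));
    std::memcpy(out.data(), t->data, ggml_nbytes(t));
    return out;
}
```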

@ghost (Author) commented Jun 17, 2023

Also tried passing the model context (which survives for the whole run) instead of the lora context (which gets freed at the end of the apply-lora method), so that the memory is not freed.

But this just leads to an OOM. I haven't looked at the implementation, but it's trying to allocate the whole 13 GB again when I call ggml_new_tensor_2d for a ~64 MB tensor.

@ghost (Author) commented Jun 17, 2023

Another tradeoff I need to think about is whether to simply store the s.BA matrix, which allows faster application, OR to store B and A separately, which has much less memory overhead because BA is 4096 × 4096 floats while B and A are 16 × 4096 floats each.

Our aim here is to improve load/reload speed, so for now I will go with caching s.BA.

@ghost (Author) commented Jun 17, 2023

Not knowledgeable enough to consider the CPU-to-GPU memory copy for the cache, so I'll look into it later after a couple of GPT-4 tutor sessions.

@ghost (Author) commented Jun 17, 2023

The overhead of copying the floats into a vector is around 28 ms per tensor.

All in all it adds around 4 seconds of extra overhead and increases the first-time LoRA loading time from 2 to 6 seconds.

Figuring out ways to make it faster.

llama_apply_lora_from_file_internal: r = 16, alpha = 16, scaling = 1.00
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wq.weight' in 27.32 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wk.weight' in 28.59 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wv.weight' in 28.06 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wo.weight' in 29.14 ms
...
...

done (6343.69 ms)

@ghost (Author) commented Jun 17, 2023

Ok, actually I am a bit dumb.

The overhead of storing the full LoRA matrices would be around 8 GB, while just storing the individual A & B matrices is around 64 MB.

Should be a no-brainer for both speed and memory (the arithmetic is sketched below).
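
Back-of-the-envelope check of those numbers, assuming LLaMA-7B attention dims (4096 × 4096), rank r = 16, f32 storage, and 32 layers × 4 attention tensors:

```cpp
#include <cstdio>

int main() {
    const long long n_tensors = 32 * 4;               // wq, wk, wv, wo per layer
    const long long full_ba   = 4096LL * 4096 * 4;    // s.BA per tensor: 64 MiB
    const long long a_plus_b  = 2LL * 16 * 4096 * 4;  // A and B per tensor: 512 KiB
    printf("cache s.BA : %lld MiB\n", (n_tensors * full_ba)  >> 20); // ~8192 MiB
    printf("cache A, B : %lld MiB\n", (n_tensors * a_plus_b) >> 20); // ~64 MiB
    return 0;
}
```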

Edit:

Yep, now the copying is really fast - just ~0.12 ms per tensor

llama_apply_lora_from_file_internal: r = 16, alpha = 16, scaling = 1.00
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wq.weight' in 0.11 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wk.weight' in 0.12 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wv.weight' in 0.12 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wo.weight' in 0.11 ms

Only ~15 ms of total overhead compared to the previous 4 seconds. LFG!
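
For reference, a per-tensor cache entry with this approach could look roughly like the sketch below; the names are hypothetical, not the exact structures in this branch:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical per-tensor cache entry: keep the low-rank factors instead of the
// full delta, so memory stays at r * n per factor rather than n * n.
struct llama_lora_tensor_cache {
    std::vector<float> lora_a;  // r x n  (e.g. 16 x 4096)
    std::vector<float> lora_b;  // n x r
    float scaling;              // s = alpha / r
};

// keyed by target tensor name, e.g. "layers.0.attention.wq.weight"
using llama_lora_cache = std::map<std::string, llama_lora_tensor_cache>;
```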

@ghost (Author) commented Jun 17, 2023

Finally got it to work with the cache after commit hash 3129d11.

You can see the sample weights get recovered perfectly. I compared MD5 hashes as well to make sure all the weights were indeed the same after each operation.

Next: performance improvement - because it's taking exactly the same time as before to swap, which means all this is for nothing.

Acquiring lock
Removing lora from Path: models/lora/ggml-adapter-model.bin
llama_apply_lora_from_cache_internal: deactivating lora adapter from cache - please wait ...
.............Layer Name: layers.0.attention.wk.weight, Metadata: Deactivating lora from cache
Tensor Type 1
Data Length Num Elements 16777216
Data Length Bytes 33554432
0 : -0.031586
1 : 0.025620
2 : -0.002708
3 : 0.012741
4 : 0.041321
5 : 0.051758
6 : 0.011909
7 : -0.012199
8 : -0.025803
9 : 0.003624

................... done (2696.39 ms)
Applying lora from Path: models/lora/ggml-adapter-model.bin
llama_apply_lora_from_cache_internal: applying lora adapter from cache - please wait ...
.............Layer Name: layers.0.attention.wk.weight, Metadata: Applying lora from cache
Tensor Type 1
Data Length Num Elements 16777216
Data Length Bytes 33554432
0 : -0.030716
1 : 0.025238
2 : -0.001369
3 : 0.013504
4 : 0.039795
5 : 0.053345
6 : 0.014351
7 : -0.010208
8 : -0.023529
9 : 0.002390

................... done (2349.43 ms)
Acquiring lock
Removing lora from Path: models/lora/ggml-adapter-model.bin
llama_apply_lora_from_cache_internal: deactivating lora adapter from cache - please wait ...
.............Layer Name: layers.0.attention.wk.weight, Metadata: Deactivating lora from cache
Tensor Type 1
Data Length Num Elements 16777216
Data Length Bytes 33554432
0 : -0.031586
1 : 0.025620
2 : -0.002708
3 : 0.012741
4 : 0.041321
5 : 0.051758
6 : 0.011909
7 : -0.012199
8 : -0.025803
9 : 0.003624

................... done (2744.64 ms)
Applying lora from Path: models/lora/ggml-adapter-model.bin
llama_apply_lora_from_cache_internal: applying lora adapter from cache - please wait ...
.............Layer Name: layers.0.attention.wk.weight, Metadata: Applying lora from cache
Tensor Type 1
Data Length Num Elements 16777216
Data Length Bytes 33554432
0 : -0.030716
1 : 0.025238
2 : -0.001369
3 : 0.013504
4 : 0.039795
5 : 0.053345
6 : 0.014351
7 : -0.010208
8 : -0.023529
9 : 0.002390
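One lightweight way to do that kind of round-trip check is to hash the tensor bytes before apply and after apply+remove; the sketch below uses a simple FNV-1a checksum purely for illustration (the comparison above actually used MD5):

```cpp
#include <cstdint>
#include <cstddef>

// FNV-1a over a tensor's raw bytes; equal checksums before apply and after
// apply+remove indicate the weights were restored bit-for-bit.
static uint64_t checksum_bytes(const void * data, size_t n) {
    const uint8_t * p = static_cast<const uint8_t *>(data);
    uint64_t h = 1469598103934665603ULL;        // FNV offset basis
    for (size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= 1099511628211ULL;                  // FNV prime
    }
    return h;
}
```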

@ghost (Author) commented Jun 17, 2023

Lmao, all the time seems to be spent in calculating the final weight matrix.

So this caching only alleviates like 5% of the pain.

time taken till matmul layers.14.attention.wv.weight =     0.29 ms
time taken till scaling layers.14.attention.wv.weight =     0.30 ms
time taken till neg layers.14.attention.wv.weight =     0.32 ms
time taken till add inplace layers.14.attention.wv.weight =     0.32 ms
time taken till graph build forward layers.14.attention.wv.weight =     0.37 ms
time taken till graph compute layers.14.attention.wv.weight =    28.67 ms

Graph computation, I guess, can only be made fast by offloading layers to VRAM.

@ghost (Author) commented Jun 17, 2023

Wait

Lmao, can't even offload to GPU with LoRAs enabled

ggerganov#1861

Processing layers in parallel also won't give any perf improvement, since the graph_compute function is already multithreaded and hogging all the cores.

Might have to come up with a different way to cook this lora dish


@ghost (Author) commented Jun 17, 2023

TODO

Do the swap in a single graph computation instead of multiple. Now that we have caching, I can just take the union of layers from both the LoRA to be removed and the LoRA to be added, and keep doing add/sub based on some metadata in a single graph (see the sketch below).

This should cut the time down from 4 to 2 seconds.
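
A rough sketch of that idea for one target weight, written against the ggml graph API as it looked in this PR's timeframe (mid-2023): the argument order of `ggml_mul_mat` follows the existing apply-lora code, the scale factor is passed as a 1-element tensor as in that era, and `ggml_build_forward`/`ggml_graph_compute` have since changed in newer ggml. The A/B tensors and scales are assumed to come from the cache.

```cpp
#include "ggml.h"

// Build ONE graph per target weight that removes the old adapter and applies
// the new one:  w <- w - s_old*(B_old A_old) + s_new*(B_new A_new)
static void lora_swap_one_tensor(
        struct ggml_context * lora_ctx,
        struct ggml_tensor  * w,
        struct ggml_tensor  * a_old, struct ggml_tensor * b_old, float s_old,
        struct ggml_tensor  * a_new, struct ggml_tensor * b_new, float s_new) {
    struct ggml_tensor * ba_old = ggml_mul_mat(lora_ctx, a_old, b_old);
    struct ggml_tensor * ba_new = ggml_mul_mat(lora_ctx, a_new, b_new);

    ba_old = ggml_scale(lora_ctx, ba_old, ggml_new_f32(lora_ctx, s_old));
    ba_old = ggml_neg  (lora_ctx, ba_old);   // subtract the old delta
    ba_new = ggml_scale(lora_ctx, ba_new, ggml_new_f32(lora_ctx, s_new));

    struct ggml_tensor * r = ggml_add_inplace(lora_ctx, w, ba_old);
    r = ggml_add_inplace(lora_ctx, r, ba_new);

    struct ggml_cgraph gf = ggml_build_forward(r);
    ggml_graph_compute(lora_ctx, &gf);       // still the dominant cost
}
```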

@ghost (Author) commented Jun 18, 2023

So,

I tried swapping them in a single graph computation, but it still takes the same amount of time (i.e. 4 seconds).

All 8 cores are getting utilised.

The problem seems to be that the graph is just an array of nodes (each representing some output) created using a DFS traversal.

This array is then iterated one element at a time, and the results are used for the next element in the array.

The parallelism is mostly used to quickly perform the calculations for a single matrix (not 100% sure, I will need to go deeper into the worker-spawn logic).

DFS traversal - https://github.com/ggerganov/llama.cpp/blob/8596af427722775f0df4a7c90b9af067ba90d4ef/ggml.c#L15395

Graph compute iteration -
https://github.com/ggerganov/llama.cpp/blob/8596af427722775f0df4a7c90b9af067ba90d4ef/ggml.c#L16013
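
Roughly, the execution looks like the paraphrase below (heavily simplified, not the actual ggml code; `compute_one_node` is a placeholder standing in for ggml's per-op dispatch):

```cpp
#include "ggml.h"

// Paraphrase of how ggml executes a graph: nodes are collected depth-first into
// cgraph->nodes, then run strictly one after another; the worker threads
// cooperate *within* each node, not across nodes.
typedef void (*node_fn)(struct ggml_tensor * node, int n_threads);

static void graph_compute_paraphrase(struct ggml_cgraph * cgraph, int n_threads,
                                     node_fn compute_one_node) {
    for (int i = 0; i < cgraph->n_nodes; ++i) {
        struct ggml_tensor * node = cgraph->nodes[i];
        // all n_threads workers split the work of this single op,
        // then synchronise before moving on to the next node
        compute_one_node(node, n_threads);
    }
}
```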

@ghost (Author) commented Jun 19, 2023

Another ~300 ms improvement in swap times.

Realised that ggml_neg uses only 1 thread in graph_compute, while ggml_scale uses all the available threads.

Both of them seem to be doing almost the same thing underneath, so it's better to use scale with a negative factor than to use neg (see the snippet below).

Screenshot from 2023-06-19 23-20-25
Screenshot from 2023-06-19 23-20-12
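
In code, the change amounts to something like the sketch below (hedged: in the mid-2023 ggml tree the scale factor is a 1-element tensor created with `ggml_new_f32`; newer ggml takes a plain float):

```cpp
#include "ggml.h"

// Negate the low-rank delta using the multi-threaded scale op instead of the
// single-threaded ggml_neg.
static struct ggml_tensor * negate_delta(struct ggml_context * lora_ctx,
                                         struct ggml_tensor  * ba,
                                         float scaling) {
    // previously: ba = ggml_scale(lora_ctx, ba, ggml_new_f32(lora_ctx, scaling));
    //             ba = ggml_neg(lora_ctx, ba);
    return ggml_scale(lora_ctx, ba, ggml_new_f32(lora_ctx, -scaling));
}
```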

@yacineMTB force-pushed the yacine/node.addon branch from 8ba8703 to 33df016 on June 20, 2023 00:17
@ghost marked this pull request as ready for review on June 20, 2023 14:56
@ghost changed the title from "WIP: Add ability to remove loras and save them in the context cache" to "Add ability to remove loras and save them in the context cache" on Jun 20, 2023
@ghost (Author) commented Jul 17, 2023

Late update: this got insanely fast after the following CUDA support PR - ggerganov#1970

Now you can swap in 200-300 ms.

@yacineMTB (Owner) commented

incredible

@ghost closed this by deleting the head repository on Sep 1, 2023