Add ability to remove loras and save them in the context cache #3
Conversation
Just realised that copying pointers won't work here because the memory that backs those structs is explicitly freed with ggml_free(ctx) at the end of the method. The only way (although it requires a double computation) is to deep copy the tensor data into some vector.
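A minimal sketch of that deep-copy approach, assuming the lora tensors are still alive in the lora context when it runs (lora_cache and cache_lora_tensor are hypothetical names, not the exact members/functions added in this PR):

```cpp
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>
#include <vector>

#include "ggml.h"

// Hypothetical cache, keyed by lora tensor name.
static std::unordered_map<std::string, std::vector<uint8_t>> lora_cache;

// Deep-copy the raw bytes of a lora tensor out of the ggml context so the
// data survives ggml_free(lora_ctx) at the end of the apply-lora method.
static void cache_lora_tensor(const std::string & name, const struct ggml_tensor * t) {
    std::vector<uint8_t> buf(ggml_nbytes(t));      // byte size works for any dtype
    std::memcpy(buf.data(), t->data, buf.size());  // copy the data, not the pointer
    lora_cache[name] = std::move(buf);
}
```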
Also tried passing the model context (which survives for the whole run) instead of the lora context (which gets cleared at the end of the apply-lora method) so that the memory is not freed. But this just leads to OOM. I haven't looked at the implementation, but it's trying to allocate the whole 13 GB of memory again when I call ...
Another tradeoff I need to think about is whether to simply store ... Our aim here is to improve load/reload speed, so for now I will go with caching.
Not knowledgeable enough to consider a CPU-to-GPU memory copy for the cache, so I'll look into it later after a couple of GPT-4 tutor sessions.
The overhead of copying floats to a vector is around 28 ms per tensor. All in all it adds around 4 seconds of extra overhead and increases the first-time lora loading time from 2 to 6 seconds. Figuring out ways to make it faster.

llama_apply_lora_from_file_internal: r = 16, alpha = 16, scaling = 1.00
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wq.weight' in 27.32 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wk.weight' in 28.59 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wv.weight' in 28.06 ms
llama_apply_lora_from_file_internal: copied lora tensor 'layers.0.attention.wo.weight' in 29.14 ms
...
...
done (6343.69 ms)
Ok, actually I am a bit dumb: the overhead of storing the full lora matrices would be around 8 GB, while just storing the individual A & B matrices is around 64 MB. Should be a no-brainer for both speed and memory.

Edit: Yep, now the copying is really fast - just 0.12 ms per layer, only 15 ms of total overhead compared to 4 seconds previously. LFG!
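For a rough sanity check on those numbers (assuming 7B-class dimensions d = 4096, rank r = 16, fp32 storage, and on the order of 128 lora tensors; none of these values are stated above, so treat them as illustrative):

full BA delta per tensor: 4096 × 4096 × 4 B ≈ 64 MiB → ≈ 8 GiB across 128 tensors
A and B per tensor: 2 × 4096 × 16 × 4 B ≈ 0.5 MiB → ≈ 64 MiB across 128 tensors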
Finally got it to work with the cache after commit hash ... You can see the sample weights get recovered perfectly. Compared MD5 hashes as well to make sure all weights were indeed the same after each operation.

Next: performance improvement, because it's taking exactly the same time as before to swap, which means all this is for nothing.
Wait, lmao, can't even offload to GPU with LoRAs enabled. Processing layers in parallel also won't give any perf improvement since the ... Might have to come up with a different way to cook this lora dish.
TODO: Do the swap in a single graph computation instead of multiple. Now that we have caching, I can just take the union of layers from both the lora to be removed and the lora to be added, and keep doing add/sub based on some metadata in a single graph. This should cut the time down from 4 to 2 seconds.
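A rough sketch of what that single-graph swap could look like (written against the ~mid-2023 ggml API this fork targets, so signatures may differ upstream; layers_to_swap, its fields, lora_ctx and n_threads are hypothetical names for the cached metadata, and the model weights are assumed to be f32):

```cpp
struct ggml_cgraph gf = {};
gf.n_threads = n_threads;

for (auto & layer : layers_to_swap) {              // union of old + new lora layers
    struct ggml_tensor * W = layer.model_weight;   // base weight, patched in place
    struct ggml_tensor * r = W;

    if (layer.has_old) {                           // undo the currently applied lora
        struct ggml_tensor * BA_old = ggml_mul_mat(lora_ctx, layer.old_A, layer.old_B);
        BA_old = ggml_scale_inplace(lora_ctx, BA_old, ggml_new_f32(lora_ctx, layer.old_scale));
        r = ggml_sub_inplace(lora_ctx, r, BA_old); // W -= scale_old * (B_old A_old)
    }
    if (layer.has_new) {                           // apply the new lora from the cache
        struct ggml_tensor * BA_new = ggml_mul_mat(lora_ctx, layer.new_A, layer.new_B);
        BA_new = ggml_scale_inplace(lora_ctx, BA_new, ggml_new_f32(lora_ctx, layer.new_scale));
        r = ggml_add_inplace(lora_ctx, r, BA_new); // W += scale_new * (B_new A_new)
    }

    ggml_build_forward_expand(&gf, r);             // every swap goes into one graph
}

ggml_graph_compute(lora_ctx, &gf);                 // single compute pass over all layers
```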
So I tried swapping them in a single graph computation, but it still takes the same amount of time (i.e. 4 seconds), and all 8 cores are getting utilised. The problem seems to be that the graph is just an array of nodes (each representing some output) created using a DFS traversal. This array is iterated one element at a time, and the results are used for the next element in the array. The parallelism is mostly used to quickly perform the calculations for a single matrix (not 100% sure, will need to go deep into the worker spawn logic).

DFS traversal - https://github.com/ggerganov/llama.cpp/blob/8596af427722775f0df4a7c90b9af067ba90d4ef/ggml.c#L15395
Graph compute iteration - ...
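To make the node-at-a-time behaviour concrete, here is a tiny self-contained C++ sketch of that scheduling pattern (this is not ggml's actual code, just an illustration of "serial across nodes, parallel within a node"):

```cpp
#include <cstdio>
#include <thread>
#include <vector>

struct Node { const char * name; int n_rows; };

// Stand-in for computing one graph node: every thread takes a slice of rows.
static void compute_node(const Node & node, int n_threads) {
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([&node, t, n_threads] {
            for (int r = t; r < node.n_rows; r += n_threads) {
                (void) r; // ... per-row math for this single op would go here ...
            }
        });
    }
    for (auto & w : workers) {
        w.join();   // implicit barrier: the next node cannot start earlier
    }
}

int main() {
    const int n_threads = 8;
    // The "graph" is a flat list of nodes produced by a DFS over tensor
    // dependencies; nodes run strictly one after another.
    std::vector<Node> graph = { {"sub_old_BA", 4096}, {"add_new_BA", 4096} };
    for (const Node & node : graph) {
        compute_node(node, n_threads);
        std::printf("done: %s\n", node.name);
    }
    return 0;
}
```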
Late update: this got insanely fast after the following CUDA support PR - ggerganov#1970. Now you can swap in 200-300 ms.
incredible
Description
Adds the ability to remove loras from the model.
To reduce file I/O, we also cache the lora matrices BA in the llama_context now. These weights can be reused if we want to apply/remove a lora again.
A smol utility to print model tensors, since quantisation and f16 to f32 conversions make casting a bit non-trivial.
Add a lora cache so you don't have to read model files every time.
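A hedged sketch of the usage this enables (llama_apply_lora_from_file is the existing llama.cpp entry point; llama_remove_lora and the lora path are placeholder names, not necessarily the exact API added by this PR):

```cpp
// First apply reads the lora file, patches the weights, and fills the cache.
llama_apply_lora_from_file(ctx, "./loras/alpaca.bin", NULL, 8);

// Removing it later is served from the cached A/B tensors in llama_context,
// so no file I/O is needed (llama_remove_lora is a placeholder name).
llama_remove_lora(ctx, "./loras/alpaca.bin", 8);

// Re-applying the same lora is also served from the cache.
llama_apply_lora_from_file(ctx, "./loras/alpaca.bin", NULL, 8);
```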