
Add multiple derived adaptions hosting #8415

Closed
wants to merge 6 commits

Conversation

@zhipenghan commented Jul 10, 2024

Host multiple fine-tuned derived models on memory-constrained devices by splitting the GGUF files into two parts:

  • *-foundation.gguf contains the tensors shared across the derived GGUF models.
  • *-adaptor-taskX.gguf contains the task-specific tensors.

Thanks to mmap, only one copy of the shared tensors is kept in memory; the task-specific tensors are loaded and swapped out dynamically as needed.
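To make the sharing mechanism concrete, here is a conceptual sketch using plain POSIX mmap. This is not the llama.cpp loader; the file names are taken from the example command further down, and the load/switch sequence is an assumption for illustration.

    // mmap_sharing_sketch.cpp -- conceptual only, not the llama.cpp implementation.
    // The foundation file is mapped read-only once; adaptor files are mapped on
    // demand and unmapped when the active task changes.
    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct mapped_file {
        void * addr = nullptr;
        size_t size = 0;
    };

    static mapped_file map_readonly(const char * path) {
        mapped_file mf;
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return mf; }
        struct stat st;
        if (fstat(fd, &st) == 0) {
            // PROT_READ + MAP_SHARED: the OS page cache keeps a single copy of
            // these pages, no matter how many derived models map the same file.
            void * p = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) { mf.addr = p; mf.size = (size_t) st.st_size; }
        }
        close(fd); // the mapping stays valid after the descriptor is closed
        return mf;
    }

    static void unmap(mapped_file & mf) {
        if (mf.addr) { munmap(mf.addr, mf.size); mf = mapped_file{}; }
    }

    int main() {
        // Shared tensors: mapped once for the lifetime of the process.
        mapped_file foundation = map_readonly("Phi-3-mini-4k-instruct-adaptor-base.gguf");

        // Task-specific tensors: mapped when a task is selected.
        mapped_file adaptor = map_readonly("Phi-3-mini-4k-instruct-adaptor-code_writer.gguf");
        // ... serve code-writing requests from foundation + adaptor ...

        unmap(adaptor); // switching tasks releases only the task-specific pages
        adaptor = map_readonly("Phi-3-mini-4k-instruct-adaptor-summarization.gguf");
        // ... serve summarization requests ...

        unmap(adaptor);
        unmap(foundation);
        return 0;
    }

Because the foundation mapping is read-only and backed by the page cache, every derived model reuses the same physical pages; only the adaptor pages come and go when the task is switched.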

An overview of the weights used by llama.cpp is shown below:
[Figure: multi-lora-pr]

To share weights through the same foundation.gguf, the split has to be performed in a customizable way. This change adds support for splitting the GGUF file according to your preferences.
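For illustration only, such a preference-driven split could be expressed as a predicate over tensor names, as in the sketch below. The name patterns and the choice of what counts as task-specific are assumptions, not the rule implemented in this PR.

    // split_rule_sketch.cpp -- hypothetical split policy, not the PR's implementation.
    #include <iostream>
    #include <string>
    #include <vector>

    // Assumed rule: attention and feed-forward weight matrices go to the
    // task-specific adaptor file, everything else stays in the shared foundation.
    static bool is_task_specific(const std::string & name) {
        return name.find("attn") != std::string::npos ||
               name.find("ffn")  != std::string::npos;
    }

    int main() {
        const std::vector<std::string> tensors = {
            "token_embd.weight", "blk.0.attn_q.weight",
            "blk.0.ffn_up.weight", "output_norm.weight",
        };
        for (const auto & t : tensors) {
            std::cout << t << " -> "
                      << (is_task_specific(t) ? "*-adaptor-taskX.gguf" : "*-foundation.gguf")
                      << "\n";
        }
        return 0;
    }

Whatever predicate is chosen, the foundation part has to stay identical across all derived models so that the same mapping can be shared.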

The example gguf files:

Download the GGUF files from the repo above and use the command below to test them.

llama_multi-adaptation.exe -m models\Phi-3-mini-4k-instruct-adaptor-base.gguf \
 -mpa code_writer=models\Phi-3-mini-4k-instruct-adaptor-code_writer.gguf \
 -mpa summarize=models\Phi-3-mini-4k-instruct-adaptor-summarization.gguf

zhipenghan changed the title from "Add support multiple adaption" to "Add multiple adaptions hosting" on Jul 10, 2024
zhipenghan changed the title from "Add multiple adaptions hosting" to "Add multiple derived adaptions hosting" on Jul 10, 2024
mofosyne added the label "Review Complexity : Medium" (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jul 13, 2024
zhipenghan closed this on Jul 15, 2024
zhipenghan reopened this on Jul 15, 2024
zhipenghan marked this pull request as ready for review on Jul 15, 2024 at 19:26
@ggerganov (Owner) commented:

Does this have any advantages over the newly added LoRA support in #8332?

@zhipenghan (Author) commented:

> Does this have any advantages over the newly added LoRA support in #8332?

I checked the PR you pointed to; it calculates Wx + B(Ax) in the model compute graph and applies all the adaptors in that graph. With multiple adaptors the result becomes res = Wx + B1(A1x)*scale1 + B2(A2x)*scale2 + ... . This works if adaptor 2 is fine-tuned on top of adaptor 1's result.
In real scenarios it is common for multiple tasks to be fine-tuned from the same foundation model, with the adaptors working independently of each other. My proposal is to handle multiple adaptors horizontally: the runtime hosts several different scenarios, e.g. adaptor_1 is fine-tuned for an email-writing scenario and adaptor_2 for a code-writing scenario. During inference we choose one adaptor to apply, not all of them. (If there is an inheritance relationship, a set of adaptors could be applied together, but that can be merged offline instead of being computed at runtime.)


Thanks for pointing me to that PR. My proposal expands the weights offline, which saves the LoRA application time at initialization, while the above PR saves GGUF size for the weights at the cost of extra graph compute time. I will do a performance comparison.
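To make the tradeoff concrete, here is a small numeric sketch in plain C++ with toy dimensions (not llama.cpp code): path (1) applies the adaptor in the compute graph as y = Wx + scale*B(Ax), path (2) merges W' = W + scale*B*A offline and then computes y = W'x.

    // lora_tradeoff_sketch.cpp -- toy comparison of runtime LoRA vs. offline merge.
    #include <cstdio>
    #include <vector>

    using vec = std::vector<float>;
    using mat = std::vector<vec>; // row-major: M[row][col]

    static vec matvec(const mat & M, const vec & x) {
        vec y(M.size(), 0.0f);
        for (size_t i = 0; i < M.size(); ++i)
            for (size_t j = 0; j < x.size(); ++j)
                y[i] += M[i][j] * x[j];
        return y;
    }

    int main() {
        const int d = 4, r = 2;       // model dim and LoRA rank (toy sizes)
        const float scale = 0.5f;

        mat W(d, vec(d, 0.1f));       // base weight        (d x d)
        mat A(r, vec(d, 0.2f));       // LoRA down-project  (r x d)
        mat B(d, vec(r, 0.3f));       // LoRA up-project    (d x r)
        vec x(d, 1.0f);

        // (1) Runtime LoRA in the compute graph: y = Wx + scale * B(Ax)
        vec y1  = matvec(W, x);
        vec BAx = matvec(B, matvec(A, x));
        for (int i = 0; i < d; ++i) y1[i] += scale * BAx[i];

        // (2) Offline merge into full-size task weights: W' = W + scale * B*A, then y = W'x
        mat Wm = W;
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                for (int k = 0; k < r; ++k)
                    Wm[i][j] += scale * B[i][k] * A[k][j];
        vec y2 = matvec(Wm, x);

        for (int i = 0; i < d; ++i)
            printf("y1[%d] = %.3f   y2[%d] = %.3f\n", i, y1[i], i, y2[i]);
        return 0;
    }

Both paths produce the same output; the merged path pays the B*A cost once offline and stores full-size task weights (larger adaptor GGUF, faster graph), while the in-graph path keeps the adaptor file small but recomputes B(Ax) on every forward pass.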

@ngxson (Collaborator) commented Jul 21, 2024

Personally I don't see many advantages of this approach vs. merging the LoRA weights offline. After all, most LoRA adapters target all linear modules, so your *-adaptor-taskX.gguf will contain mostly the same number of tensors as the base model *-foundation.gguf, minus embeddings/output and some bias vectors.

For the task-switching ability: I assume there are performance gains here compared to doing LoRA at inference time. But again, this is no different from merging the LoRA into the base model, which I've covered in my recent PR #8607. If users want to run multiple merged models at the same time, they can simply spawn multiple instances of llama.cpp; neither performance nor memory usage will change much compared to your current approach.

The current approach only has an advantage if the adapter has significantly fewer tensors than the base model, for example if the LoRA targets only selected modules or layers, which is possible in theory. In practice I have never seen such a LoRA adapter, since targeting fewer modules means the training loss takes longer to converge.

zhipenghan closed this on Jul 22, 2024