
Add multiple derived adaptions hosting #8415

Closed
wants to merge 6 commits

Conversation

@zhipenghan commented Jul 10, 2024

Host multiple fine-tuned derived models on memory-constrained devices by splitting the GGUF files into two parts:

  • *-foundation.gguf contains the tensors shared across the derived GGUF models.
  • *-adaptor-taskX.gguf contains the task-specific tensors.

Thanks to mmap, only one copy of the shared tensors is kept in memory; the task-specific tensors are loaded and swapped out dynamically as needed.
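To make the sharing mechanism concrete, here is a conceptual sketch using plain POSIX mmap. This is not the llama.cpp loader; the file names are taken from the example command further down, and the load/switch sequence is an assumption for illustration.

    // mmap_sharing_sketch.cpp -- conceptual only, not the llama.cpp implementation.
    // The foundation file is mapped read-only once; adaptor files are mapped on
    // demand and unmapped when the active task changes.
    #include <cstddef>
    #include <cstdio>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    struct mapped_file {
        void * addr = nullptr;
        size_t size = 0;
    };

    static mapped_file map_readonly(const char * path) {
        mapped_file mf;
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror(path); return mf; }
        struct stat st;
        if (fstat(fd, &st) == 0) {
            // PROT_READ + MAP_SHARED: the OS page cache keeps a single copy of
            // these pages, no matter how many derived models map the same file.
            void * p = mmap(nullptr, (size_t) st.st_size, PROT_READ, MAP_SHARED, fd, 0);
            if (p != MAP_FAILED) { mf.addr = p; mf.size = (size_t) st.st_size; }
        }
        close(fd); // the mapping stays valid after the descriptor is closed
        return mf;
    }

    static void unmap(mapped_file & mf) {
        if (mf.addr) { munmap(mf.addr, mf.size); mf = mapped_file{}; }
    }

    int main() {
        // Shared tensors: mapped once for the lifetime of the process.
        mapped_file foundation = map_readonly("Phi-3-mini-4k-instruct-adaptor-base.gguf");

        // Task-specific tensors: mapped when a task is selected.
        mapped_file adaptor = map_readonly("Phi-3-mini-4k-instruct-adaptor-code_writer.gguf");
        // ... serve code-writing requests from foundation + adaptor ...

        unmap(adaptor); // switching tasks releases only the task-specific pages
        adaptor = map_readonly("Phi-3-mini-4k-instruct-adaptor-summarization.gguf");
        // ... serve summarization requests ...

        unmap(adaptor);
        unmap(foundation);
        return 0;
    }

Because the foundation mapping is read-only and backed by the page cache, every derived model reuses the same physical pages; only the adaptor pages come and go when the task is switched.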

An overview of the weights used by llama.cpp is shown below:
[Figure: multi-lora-pr]

To share weights through the same foundation.gguf, the split has to be performed in a customizable way. This change adds support for splitting the GGUF file according to your preferences.
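For illustration only, such a preference-driven split could be expressed as a predicate over tensor names, as in the sketch below. The name patterns and the choice of what counts as task-specific are assumptions, not the rule implemented in this PR.

    // split_rule_sketch.cpp -- hypothetical split policy, not the PR's implementation.
    #include <iostream>
    #include <string>
    #include <vector>

    // Assumed rule: attention and feed-forward weight matrices go to the
    // task-specific adaptor file, everything else stays in the shared foundation.
    static bool is_task_specific(const std::string & name) {
        return name.find("attn") != std::string::npos ||
               name.find("ffn")  != std::string::npos;
    }

    int main() {
        const std::vector<std::string> tensors = {
            "token_embd.weight", "blk.0.attn_q.weight",
            "blk.0.ffn_up.weight", "output_norm.weight",
        };
        for (const auto & t : tensors) {
            std::cout << t << " -> "
                      << (is_task_specific(t) ? "*-adaptor-taskX.gguf" : "*-foundation.gguf")
                      << "\n";
        }
        return 0;
    }

Whatever predicate is chosen, the foundation part has to stay identical across all derived models so that the same mapping can be shared.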

The example gguf files:

Download the GGUF files from the repo above and use the command below to test them.

llama_multi-adaptation.exe -m models\Phi-3-mini-4k-instruct-adaptor-base.gguf \
 -mpa code_writer=models\Phi-3-mini-4k-instruct-adaptor-code_writer.gguf \
 -mpa summarize=models\Phi-3-mini-4k-instruct-adaptor-summarization.gguf

zhipenghan changed the title from "Add support multiple adaption" to "Add multiple adaptions hosting" on Jul 10, 2024
zhipenghan changed the title from "Add multiple adaptions hosting" to "Add multiple derived adaptions hosting" on Jul 10, 2024
mofosyne added the label "Review Complexity : Medium" (generally requires more time to grok, but manageable by beginner to medium expertise level) on Jul 13, 2024
zhipenghan closed this on Jul 15, 2024
zhipenghan reopened this on Jul 15, 2024
zhipenghan marked this pull request as ready for review on Jul 15, 2024 at 19:26
@ggerganov (Owner) commented:

Does this have any advantages over the newly added LoRA support in #8332?

@zhipenghan (Author) commented:

> Does this have any advantages over the newly added LoRA support in #8332?

I checked the PR you pointed to; it calculates Wx + B(Ax) in the model compute graph and applies all the adaptors in that graph. With multiple adaptors the result becomes res = Wx + B1(A1x)*scale1 + B2(A2x)*scale2 + ... . This works if adaptor 2 is fine-tuned on top of adaptor 1's result.
In real scenarios it is common for multiple tasks to be fine-tuned from the same foundation model, with the adaptors working independently of each other. My proposal is to handle multiple adaptors horizontally: the runtime hosts several different scenarios, e.g. adaptor_1 is fine-tuned for an email-writing scenario and adaptor_2 for a code-writing scenario. During inference we choose one adaptor to apply, not all of them. (If there is an inheritance relationship, a set of adaptors could be applied together, but that can be merged offline instead of being computed at runtime.)


Thanks for pointing me to that PR. My proposal expands the weights offline, which saves the LoRA application time at initialization, while the above PR saves GGUF size for the weights at the cost of extra graph compute time. I will do a performance comparison.
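To make the tradeoff concrete, here is a small numeric sketch in plain C++ with toy dimensions (not llama.cpp code): path (1) applies the adaptor in the compute graph as y = Wx + scale*B(Ax), path (2) merges W' = W + scale*B*A offline and then computes y = W'x.

    // lora_tradeoff_sketch.cpp -- toy comparison of runtime LoRA vs. offline merge.
    #include <cstdio>
    #include <vector>

    using vec = std::vector<float>;
    using mat = std::vector<vec>; // row-major: M[row][col]

    static vec matvec(const mat & M, const vec & x) {
        vec y(M.size(), 0.0f);
        for (size_t i = 0; i < M.size(); ++i)
            for (size_t j = 0; j < x.size(); ++j)
                y[i] += M[i][j] * x[j];
        return y;
    }

    int main() {
        const int d = 4, r = 2;       // model dim and LoRA rank (toy sizes)
        const float scale = 0.5f;

        mat W(d, vec(d, 0.1f));       // base weight        (d x d)
        mat A(r, vec(d, 0.2f));       // LoRA down-project  (r x d)
        mat B(d, vec(r, 0.3f));       // LoRA up-project    (d x r)
        vec x(d, 1.0f);

        // (1) Runtime LoRA in the compute graph: y = Wx + scale * B(Ax)
        vec y1  = matvec(W, x);
        vec BAx = matvec(B, matvec(A, x));
        for (int i = 0; i < d; ++i) y1[i] += scale * BAx[i];

        // (2) Offline merge into full-size task weights: W' = W + scale * B*A, then y = W'x
        mat Wm = W;
        for (int i = 0; i < d; ++i)
            for (int j = 0; j < d; ++j)
                for (int k = 0; k < r; ++k)
                    Wm[i][j] += scale * B[i][k] * A[k][j];
        vec y2 = matvec(Wm, x);

        for (int i = 0; i < d; ++i)
            printf("y1[%d] = %.3f   y2[%d] = %.3f\n", i, y1[i], i, y2[i]);
        return 0;
    }

Both paths produce the same output; the merged path pays the B*A cost once offline and stores full-size task weights (larger adaptor GGUF, faster graph), while the in-graph path keeps the adaptor file small but recomputes B(Ax) on every forward pass.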

@ngxson (Collaborator) commented Jul 21, 2024

Personally I don't see many advantages of this approach vs. merging the LoRA weights offline. After all, most LoRA adapters target all linear modules, so your *-adaptor-taskX.gguf will contain mostly the same number of tensors as the base model *-foundation.gguf, minus embeddings/output and some bias vectors.

For the task-switching ability: I assume there are performance gains here compared to doing LoRA at inference time. But again, this is no different from merging the LoRA into the base model, which I've covered in my recent PR #8607. If users want to run multiple merged models at the same time, they can simply spawn multiple instances of llama.cpp; neither performance nor memory usage will change much compared to your current approach.

The current approach only has an advantage if the adapter has significantly fewer tensors than the base model, for example if the LoRA targets only selected modules or layers, which is possible in theory. In practice I have never seen such a LoRA adapter, since targeting fewer modules means the training loss takes longer to converge.

zhipenghan closed this on Jul 22, 2024