Add multiple derived adaptations hosting #8415
Conversation
Does this have any advantages over the newly added LoRA support in #8332?
I checked the PR you pointed to; it applies the LoRA at graph compute time. Thanks for pointing me to this PR. My proposal expands the weights offline, which saves the LoRA application time during initialization, while the above PR saves GGUF file size for the weights and, as a tradeoff, increases the graph compute time. I will do a perf comparison.
Personally I don't see many advantages of this approach vs. merging the LoRA weights offline. After all, most LoRA adapters target all linear modules, so your task-specific file would end up holding expanded copies of nearly all of the weight tensors anyway.

For the task-switching ability: I assume there are performance gains here compared to applying the LoRA at inference time. But again, this is no different from merging the LoRA with the base model, which I've covered in my recent PR #8607. If users want to run multiple merged models at the same time, they can simply spawn multiple instances of llama.cpp; both performance and memory usage will not change much compared to your current approach.

The current approach only has an advantage if the adapter has significantly fewer tensors than the base model, for example if the LoRA only targets selected modules or layers, which is possible in theory. But in practice I have never seen such a LoRA adapter, since targeting fewer modules means the training loss takes longer to converge.
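For context, a minimal numerical sketch of what "merging the LoRA weights offline" means (plain float buffers and illustrative names, not the code from #8607): the low-rank update is folded into the base weight once, so inference afterwards runs on an ordinary dense matrix with no per-token LoRA cost.

```cpp
// lora_merge_sketch.cpp - illustrative only.
// Folds a LoRA update into the base weight: W <- W + scale * (B x A),
// where scale is typically alpha / rank.
#include <vector>

void merge_lora(std::vector<float> & W,        // [n_out * n_in] base weight, updated in place
                const std::vector<float> & A,  // [r * n_in]     LoRA A
                const std::vector<float> & B,  // [n_out * r]    LoRA B
                int n_out, int n_in, int r, float scale) {
    for (int i = 0; i < n_out; ++i) {
        for (int j = 0; j < n_in; ++j) {
            float delta = 0.0f;
            for (int k = 0; k < r; ++k) {
                delta += B[i*r + k] * A[k*n_in + j];
            }
            W[i*n_in + j] += scale * delta;
        }
    }
}

int main() {
    // toy example: 2x2 identity base weight, rank-1 adapter
    std::vector<float> W = {1, 0, 0, 1};
    std::vector<float> A = {1, 1};        // r = 1, n_in = 2
    std::vector<float> B = {0.5f, 0.5f};  // n_out = 2, r = 1
    merge_lora(W, A, B, /*n_out=*/2, /*n_in=*/2, /*r=*/1, /*scale=*/1.0f);
    // W is now {1.5, 0.5, 0.5, 1.5}
    return 0;
}
```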
Host multiple fine-tuned derived models on memory-constrained devices by splitting each GGUF file into two parts:

- `*-foundation.gguf` contains the tensors shared across the derived models.
- `*-adaptor-taskX.gguf` contains the task-specific tensors.

Taking advantage of `mmap`, only one copy of the shared tensors is kept in memory (see the sketch after the figure), while the task-specific tensors are loaded and swapped out dynamically as needed. Overview of the weights used by llama.cpp:
![multi-lora-pr](https://private-user-images.githubusercontent.com/5036905/348828849-3a168455-d6bc-4092-a08d-25f9ad467dd4.png)
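A rough illustration of why the shared tensors stay resident only once (a POSIX-only sketch with a hypothetical file name, not the actual llama.cpp loader code): every read-only mapping of the same foundation file is backed by the same page-cache pages, so each additional derived model only adds the size of its task-specific tensors.

```cpp
// mmap_share_sketch.cpp - illustrative only.
#include <cstdio>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main() {
    const char * path = "model-foundation.gguf";  // hypothetical file name

    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    // A read-only mapping: if another process (or another model instance) maps the
    // same file, the kernel serves both from the same physical page-cache pages.
    void * base = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    // ... hand tensor views into `base` to the loader instead of copying the data ...

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```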
To let multiple derived models share the same foundation.gguf, the split has to be controllable by the user, so that every derived model puts the same tensors into the foundation file. This change adds support for splitting a GGUF file according to your preferences (a sketch of the idea follows).
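One possible shape for such a split tool, sketched against the public `gguf_*`/`ggml_*` API (the name-based predicate, the file layout, and the overall structure here are assumptions for illustration, not the implementation in this PR):

```cpp
// split_gguf_sketch.cpp - illustrative only; not the code added by this PR.
// Splits in.gguf into foundation.gguf (shared tensors) and adaptor.gguf
// (task-specific tensors), deciding per tensor via a caller-supplied predicate.
#include "ggml.h"  // gguf_* API (moved to gguf.h in newer ggml trees)
#include <cstdio>
#include <string>

// Hypothetical rule: treat tensors whose names contain "lora" as task-specific;
// a real tool would take this selection from the user, as described above.
static bool is_task_specific(const std::string & name) {
    return name.find("lora") != std::string::npos;
}

int main(int argc, char ** argv) {
    if (argc < 4) {
        fprintf(stderr, "usage: %s in.gguf foundation.gguf adaptor.gguf\n", argv[0]);
        return 1;
    }

    // load the source file with tensor data resident in a ggml context
    struct ggml_context * ctx_data = nullptr;
    struct gguf_init_params params = { /*no_alloc =*/ false, /*ctx =*/ &ctx_data };
    struct gguf_context * ctx_in = gguf_init_from_file(argv[1], params);
    if (ctx_in == nullptr) {
        fprintf(stderr, "failed to load %s\n", argv[1]);
        return 1;
    }

    struct gguf_context * ctx_found = gguf_init_empty();
    struct gguf_context * ctx_adapt = gguf_init_empty();

    // keep the original metadata in both outputs so either can be inspected standalone
    gguf_set_kv(ctx_found, ctx_in);
    gguf_set_kv(ctx_adapt, ctx_in);

    const int64_t n_tensors = gguf_get_n_tensors(ctx_in);
    for (int64_t i = 0; i < n_tensors; ++i) {
        const char * name = gguf_get_tensor_name(ctx_in, i);
        struct ggml_tensor * t = ggml_get_tensor(ctx_data, name);
        gguf_add_tensor(is_task_specific(name) ? ctx_adapt : ctx_found, t);
    }

    gguf_write_to_file(ctx_found, argv[2], /*only_meta =*/ false);
    gguf_write_to_file(ctx_adapt, argv[3], /*only_meta =*/ false);

    gguf_free(ctx_found);
    gguf_free(ctx_adapt);
    gguf_free(ctx_in);
    ggml_free(ctx_data);
    return 0;
}
```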
Example GGUF files:
Download the GGUF files from the repo above and use the command below to test them.