Replies: 2 comments 17 replies
That would imply keeping both the base model and the fine-tuned model in RAM. Changing the LoRA would then involve freeing the fine-tuned model and making a fresh copy of the base model with the new LoRA applied.
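For illustration, here is a minimal NumPy sketch of that approach (toy code, not llama.cpp's API; `merge_adapter` and the dict-of-weights layout are hypothetical): the base weights stay resident, and each swap frees the old merged copy and rebuilds `W + scale * (B @ A)` for the newly selected adapter.

```python
import numpy as np

def merge_adapter(base_weights, adapter, scale=1.0):
    """Return a fresh merged copy of the base weights with one LoRA applied.

    base_weights: dict name -> (d, k) array, kept resident for the whole run.
    adapter:      dict name -> (A, B) with A of shape (r, k), B of shape (d, r).
    """
    merged = {}
    for name, W in base_weights.items():
        if name in adapter:
            A, B = adapter[name]
            merged[name] = W + scale * (B @ A)   # merge the low-rank update
        else:
            merged[name] = W.copy()
    return merged

# Swapping adapters = drop the old merged copy and rebuild from the base:
# merged = merge_adapter(base_weights, adapter_npc1)
# ... later ...
# merged = merge_adapter(base_weights, adapter_npc2)  # old copy gets freed
```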
@ltoniazzi Hi there, I am also interested in adapter swapping. I saw that you have implemented this on the CPU. Could you provide some information on how to use it, such as how to configure and load the LoRA model? Thank you very much for your time and assistance!
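For context, loading a single LoRA adapter (without runtime swapping) can be sketched as follows with llama-cpp-python; the `lora_path` argument and exact parameter names are from memory and may differ between versions, and the file names are hypothetical. This is the static path only; runtime swapping is what this discussion asks for.

```python
from llama_cpp import Llama

# Load the base GGUF model on CPU and apply one LoRA adapter at startup.
llm = Llama(
    model_path="base-model.gguf",   # hypothetical file names
    lora_path="npc-adapter.gguf",
    n_ctx=2048,
    n_gpu_layers=0,                 # CPU only
)

out = llm("Describe your quest in one sentence.", max_tokens=32)
print(out["choices"][0]["text"])
```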
New Feature?
Compile LLMs to GGUF so that they take an additional integer parameter that allows swapping between different LoRA adapters at runtime.
Questions
1. Is this feature possible to implement with llama.cpp now (if it is not already available)?
2. If yes to 1 (and the feature is not currently available), how difficult would it be to develop?
Related discussions
A discussion was opened about a year ago; its main approach was optimizing tall-skinny matmuls to avoid caching LoRA weights (the unmerged PR #996). Has this project progressed in other directions?
Example
A basic ONNX example is here, to clarify what this feature aims to do.
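As a further illustration, here is a toy NumPy sketch of the requested behaviour (my own example, not tied to GGUF or any real API): the base weight is loaded once, each adapter is a small (A, B) pair, and an integer id selects which adapter is applied at inference time, unmerged, via two tall-skinny matmuls as in the approach referenced above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                       # toy sizes; r is the LoRA rank

W = rng.normal(size=(d, k))             # shared base weight, loaded once
adapters = [                            # one small (A, B) pair per NPC
    (rng.normal(size=(r, k)), rng.normal(size=(d, r)))
    for _ in range(3)
]

def forward(x, adapter_id, scale=1.0):
    """Base layer plus the adapter selected by an integer id, unmerged."""
    A, B = adapters[adapter_id]
    return W @ x + scale * (B @ (A @ x))   # two tall-skinny matmuls

x = rng.normal(size=k)
y_npc0 = forward(x, adapter_id=0)
y_npc2 = forward(x, adapter_id=2)       # "swapping" is just changing the id
```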
Context/Application
I wanted to use this for gaming applications: one can fine-tune a separate LoRA adapter for each NPC, so that all the NPCs benefit from the same large base model being loaded into memory only once.