Potential ideas for LoRA adapters #1101
Replies: 3 comments 2 replies
-
I discarded this idea in the initial implementation because it increases the size of the LoRAs dramatically, which defeats the purpose of supporting on-the-fly LoRA application in the first place. I would look into optimizing the LoRA matmuls before going down this path (for example, #996).
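For context, a minimal NumPy sketch (not ggml code; shapes and scaling follow the standard LoRA formulation) of why keeping the low-rank factors and applying them on the fly is so much smaller than storing a merged delta:

```python
import numpy as np

# Standard LoRA: the weight delta is the low-rank product B @ A scaled by alpha / r,
# so the adapter only needs to store A and B, not a full n x m delta.
n, m, r, alpha = 4096, 4096, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((n, m)).astype(np.float32)            # base weight
A = (rng.standard_normal((r, m)) * 0.01).astype(np.float32)   # adapter factor A
B = (rng.standard_normal((n, r)) * 0.01).astype(np.float32)   # adapter factor B

W_eff = W + (alpha / r) * (B @ A)    # applied on the fly when the adapter is loaded

# Adapter storage vs. a merged/full delta for this layer:
print(r * (n + m), "values vs.", n * m)   # 65,536 vs. 16,777,216 (~0.4%)
```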
I think this may be interesting. I didn't pursue it much because ggml doesn't support f16xf16 matmul anyway, and f16xf32 didn't seem to improve performance significantly.
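To roughly illustrate the dtype trade-off in that matmul (a self-contained NumPy sketch, not ggml's actual kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, m = 8, 4096, 4096
A = (rng.standard_normal((r, m)) * 0.01).astype(np.float16)   # adapter stored in f16
B = (rng.standard_normal((n, r)) * 0.01).astype(np.float16)

delta_f16 = B @ A                                              # f16 x f16 matmul
delta_f32 = B.astype(np.float32) @ A.astype(np.float32)        # upcast, compute in f32

# f16 x f16 halves memory traffic but accumulates in lower precision;
# f16 storage with f32 compute keeps accuracy at the cost of the upcast.
print(np.max(np.abs(delta_f32 - delta_f16.astype(np.float32))))
```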
That's great, and we should definitely support that.
-
I would love to see the attach/detach support land upstream here. Hot-loading LoRAs will be incredibly useful.
-
I've seen a couple of requests elsewhere in the repo for LoRA on top of a quantized base model, but saw that the choice here was to make LoRA adapters fp16-compatible. I noticed HF's PeftModel class allows 8-bit base model loading + LoRA. What's the intuition behind why ggml can only do LoRA in fp16? I'd love to use a pre-trained 13B LLaMA/Vicuna + LoRA on a 4090, but I OOM unless I can load it in 8-bit (~15 GB VRAM).
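The HF pattern I'm referring to looks roughly like this (a sketch; the model id and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Base model loaded in 8-bit via bitsandbytes, LoRA adapter applied on top.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",   # placeholder model id
    load_in_8bit=True,        # int8 weights, roughly halves the fp16 footprint
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
```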
-
Hi,
Thank you for the repo and the initial LoRA adapter support.
We ran a few experiments in the fastLLaMa repo.
What we did:
- We cached and saved these results so that loading the adapters is faster (see the sketch below).
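A rough sketch of the caching idea, assuming the cached results are the precomputed per-layer scaled B·A deltas (the function names and dictionary layout here are illustrative):

```python
import numpy as np

def cache_lora_deltas(adapter, path, alpha, r):
    """adapter: {layer_name: (A, B)} with A of shape (r, m) and B of shape (n, r).
    Precompute the scaled deltas once and save them, so applying the adapter
    later is one add per layer instead of a matmul per layer."""
    deltas = {name: (alpha / r) * (B @ A) for name, (A, B) in adapter.items()}
    np.savez(path, **deltas)

def apply_cached_deltas(weights, path):
    """weights: {layer_name: W}; adds the cached delta to each layer in place."""
    cached = np.load(path)
    for name in cached.files:
        weights[name] += cached[name]
```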
Do these features seem like something that would be relevant to this repo?
If so, we would be happy to help implement them!
Happy hacking :)