Potential ideas for LoRA adapters #1101
Replies: 3 comments 2 replies
-
I discarded this idea in the initial implementation because it increases the size of the LoRAs dramatically, which defeats the purpose of supporting on-the-fly LoRA application in the first place. I would look into optimizing the LoRA matmuls before going down this path (for example, #996).
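For context, a minimal NumPy sketch (not ggml code; shapes and scaling follow the standard LoRA formulation) of why keeping the low-rank factors and applying them on the fly is so much smaller than storing a merged delta:

```python
import numpy as np

# Standard LoRA: the weight delta is the low-rank product B @ A scaled by alpha / r,
# so the adapter only needs to store A and B, not a full n x m delta.
n, m, r, alpha = 4096, 4096, 8, 16
rng = np.random.default_rng(0)
W = rng.standard_normal((n, m)).astype(np.float32)            # base weight
A = (rng.standard_normal((r, m)) * 0.01).astype(np.float32)   # adapter factor A
B = (rng.standard_normal((n, r)) * 0.01).astype(np.float32)   # adapter factor B

W_eff = W + (alpha / r) * (B @ A)    # applied on the fly when the adapter is loaded

# Adapter storage vs. a merged/full delta for this layer:
print(r * (n + m), "values vs.", n * m)   # 65,536 vs. 16,777,216 (~0.4%)
```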
I think this may be interesting. I didn't pursue it much because ggml doesn't support f16xf16 matmul anyway, and f16xf32 didn't seem to improve performance significantly.
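To roughly illustrate the dtype trade-off in that matmul (a self-contained NumPy sketch, not ggml's actual kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
r, n, m = 8, 4096, 4096
A = (rng.standard_normal((r, m)) * 0.01).astype(np.float16)   # adapter stored in f16
B = (rng.standard_normal((n, r)) * 0.01).astype(np.float16)

delta_f16 = B @ A                                              # f16 x f16 matmul
delta_f32 = B.astype(np.float32) @ A.astype(np.float32)        # upcast, compute in f32

# f16 x f16 halves memory traffic but accumulates in lower precision;
# f16 storage with f32 compute keeps accuracy at the cost of the upcast.
print(np.max(np.abs(delta_f32 - delta_f16.astype(np.float32))))
```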
That's great, and we should definitely support that.
-
I would love to see the attach/detach support land upstream here. Hot-loading LoRAs will be incredibly useful.
-
I've seen a couple of requests elsewhere in the repo for LoRA on top of a quantized base model, but saw that the choice here was to make LoRA adapters fp16-compatible. I noticed HF's PeftModel class allows 8-bit base model loading + LoRA. What's the intuition behind why ggml can only do LoRA in fp16? I'd love to use a pre-trained 13B LLaMA/Vicuna + LoRA on a 4090, but I OOM unless I can load it in 8-bit (~15 GB VRAM).
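The HF pattern I'm referring to looks roughly like this (a sketch; the model id and adapter path are placeholders):

```python
from transformers import AutoModelForCausalLM
from peft import PeftModel

# Base model loaded in 8-bit via bitsandbytes, LoRA adapter applied on top.
base = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-13b",   # placeholder model id
    load_in_8bit=True,        # int8 weights, roughly halves the fp16 footprint
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "path/to/lora-adapter")  # placeholder path
```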
-
Hi,
Thank you for the repo and the initial LoRA adapter support.
We ran a few experiments in the fastLLaMa repo.
What we did:
- We cached and saved these results so that loading the adapters is faster (see the sketch below).
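A rough sketch of the caching idea, assuming the cached results are the precomputed per-layer scaled B·A deltas (the function names and dictionary layout here are illustrative):

```python
import numpy as np

def cache_lora_deltas(adapter, path, alpha, r):
    """adapter: {layer_name: (A, B)} with A of shape (r, m) and B of shape (n, r).
    Precompute the scaled deltas once and save them, so applying the adapter
    later is one add per layer instead of a matmul per layer."""
    deltas = {name: (alpha / r) * (B @ A) for name, (A, B) in adapter.items()}
    np.savez(path, **deltas)

def apply_cached_deltas(weights, path):
    """weights: {layer_name: W}; adds the cached delta to each layer in place."""
    cached = np.load(path)
    for name in cached.files:
        weights[name] += cached[name]
```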
Do these features seem like something that would be relevant to this repo?
If so, we would be happy to help implement them!
Happy hacking :)