Add LoRA support #820

Merged · 15 commits · Apr 17, 2023

Conversation

@slaren (Collaborator) commented Apr 6, 2023

This change allows applying LoRA adapters on the fly without having to duplicate the model files.

Instructions:

  • Obtain the HF PEFT LoRA files adapter_config.json and adapter_model.bin of a LoRA adapter and place them in the same directory. For alpaca, these can be found at https://huggingface.co/tloen/alpaca-lora-7b/tree/main
  • Convert it using convert-lora-to-ggml.py to obtain ggml-adapter-model.bin
python convert-lora-to-ggml.py lora/alpaca-lora-7b
  • Use the ggml-adapter-model.bin with --lora
./main -m models/7B/ggml-model-f16.bin --lora lora/alpaca-lora-7b/ggml-adapter-model.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7
  • When using a quantized model, the quality may suffer. To avoid this, specify an f16/f32 model with --lora-base to use as a base. The LoRA adapter will be applied to the corresponding layers of the --lora-base model, and the result will then be quantized to the same format as the model specified with -m. Layers not modified by the LoRA adapter will remain untouched.
./main -m models/7B/ggml-model-q4_0.bin --lora lora/alpaca-lora-7b/ggml-adapter-model.bin --lora-base models/7B/ggml-model-f16.bin --color -f ./prompts/alpaca.txt -ins -b 256 --top_k 10000 --temp 0.2 --repeat_penalty 1 -t 7

Limitations:

  • Using --lora disables mmap since the models have to be modified anyway.
  • When using --lora-base, a ggml_cpy operation is used to quantize the result, which is currently done in a single thread. Parallelizing ggml_cpy would improve loading times.
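
For intuition, the weight update an adapter applies is small and low-rank. Below is a minimal NumPy sketch of the merge math discussed in this thread, with made-up shapes; it is not the actual llama.cpp code, which operates on ggml tensors and handles quantized formats.

import numpy as np

def apply_lora(W, lora_A, lora_B, lora_alpha, r):
    """Return W + scaling * B @ A, the merged weight for one tensor."""
    scaling = lora_alpha / r
    return W + scaling * (lora_B @ lora_A)

# Hypothetical rank-16 adapter for a single 4096x4096 weight:
W = np.random.randn(4096, 4096).astype(np.float32)  # base weight (f16/f32 on disk)
A = np.random.randn(16, 4096).astype(np.float32)    # lora_A has shape (r, n)
B = np.random.randn(4096, 16).astype(np.float32)    # lora_B has shape (n, r)
W_patched = apply_lora(W, A, B, lora_alpha=16, r=16)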

@slaren changed the title from "Add lora support" to "Add LoRA support" on Apr 6, 2023
@MillionthOdin16

Awesome! Loras would be super useful, especially with how easy they're becoming to train right now 🔥

@Piezoid (Contributor) commented Apr 7, 2023

Do you think it is possible (or desirable) to produce quantized versions of the patched tensors?

( f16 llama model, LoRA's tensors) --> f16 patched tensors --> quantized patched tensors

This would bring the speedups from quantization and allow mmapping both files. The pages from the original tensors wouldn't be faulted / loaded into memory (MAP_POPULATE would have to be disabled).

@slaren (Collaborator Author) commented Apr 7, 2023

@Piezoid I am not sure what the best way to handle this is. Ideally, for simplicity, the resulting patched tensors would be in the same format as they were initially, so if you patch a q4_0 model you still end up with a q4_0 model. However, that may affect the quality significantly, and it may be as slow as or slower than just patching the f16 model and quantizing it afterwards on the fly. We need to run more tests; I may try implementing both options to see what works best.

@Piezoid (Contributor) commented Apr 7, 2023

@slaren Like you said, adding the LoRA deltas to a q4 quantized model is most likely very bad for quality. The quantization must happen afterward. My suggestion was to generate a separate model file consisting solely of the patched tensors with the LoRA full-rank weights added, and potentially applying quantization as a final step.

The idea is to save disk space by only requiring the space for the modified tensors. By completing the patching process offline, it's possible that the load time will also decrease.

Your proposal of patching and quantizing during load time is interesting, but it necessitates loading an f16 llama model and quantizing tensors that haven't been modified.
It's possible that I'm mistaken since I'm unsure which tensors are quantized and which ones are patched by LoRA.

@slaren (Collaborator Author) commented Apr 7, 2023

@Piezoid it is not really viable to store the pre-patched tensors because the file size would be nearly the same as the entire model. The advantage of lora is that to patch a 4096x4096 matrix you only need a 16x4096 and a 4096x16 matrix (for rank 16; it could be any other number). Patch them and suddenly your 2x16x4096 values become a full 4096x4096 matrix.
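
As a rough worked example for a single 4096x4096 weight at rank 16 (assuming the adapter and the patched tensor are stored at the same precision):

full_matrix = 4096 * 4096     # 16,777,216 values for the pre-patched tensor
lora_factors = 2 * 16 * 4096  # 131,072 values for the two rank-16 factors
print(full_matrix // lora_factors)  # 128: the adapter is ~128x smaller per patched tensor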

@ggerganov (Owner) commented Apr 7, 2023

Very useful info.

Another approach to think about is to use the distributive property of matrix multiplication: (B+C)A=BA+CA
We can add optional LoRA nodes to the llama computation graph.
Examples:

cur = ggml_mul_mat(ctx0, model.layers[il].wo, cur);

would become:

curL = ggml_mul_mat(ctx0, model.layers[il].wo, cur);
if (lora_enabled) {
    // can be precomputed once at the cost of more memory
    // or we can keep unpacking it each time to save memory
    lora = ggml_mul_mat(ctx0, model.loraB[il], model.loraA_trans[il]);

    lora = ggml_mul_mat(ctx0, lora, cur); // F32 mul_mat
    curL = ggml_add(ctx0, curL, lora);    // F32 add
}
cur = curL;

The drawback is slower inference due to the extra ggml_mul_mat calls, but it would be trivial to dynamically load new LoRAs on the fly. And the fundamental model is unchanged and can remain quantized.
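
A quick NumPy check of that identity with hypothetical shapes (W plays the role of wo, x the role of cur, s the lora scaling); both paths give the same result up to floating-point error:

import numpy as np

n, r = 4096, 16
W = np.random.randn(n, n)        # base weight (stays quantized in the real graph)
A = np.random.randn(r, n)        # lora_A
B = np.random.randn(n, r)        # lora_B
x = np.random.randn(n)           # activations ("cur")
s = 1.0                          # lora_alpha / r

merged     = (W + s * (B @ A)) @ x        # merge into the weights, then one mul_mat
on_the_fly = W @ x + s * (B @ (A @ x))    # two small extra mul_mats per layer, W untouched
assert np.allclose(merged, on_the_fly)

The second form also avoids materializing B @ A: it only ever multiplies by the two thin factors.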

@slaren (Collaborator Author) commented Apr 7, 2023

A small side note: I realized that in some cases it will also be necessary to apply a scaling factor. Specifically, this is what PEFT does to merge the lora:

self.scaling = self.lora_alpha / self.r
if fan_in_fan_out:
    self.weight.data = self.weight.data.T
...
self.weight.data += (
    transpose(self.lora_B.weight @ self.lora_A.weight, self.fan_in_fan_out) * self.scaling
)
...
def transpose(weight, fan_in_fan_out):
    return weight.T if fan_in_fan_out else weight

Where lora_alpha and r (rank) are parameters in adapter_config.json.
In the case of alpaca, lora_alpha = r so this is a no-op, but this is not always the case; for example, in gpt4all lora_alpha=32 and r=8.
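
For reference, the scaling can be computed directly from the adapter config. A small sketch, assuming the PEFT key names lora_alpha and r, and reusing the example directory from the PR description:

import json

with open("lora/alpaca-lora-7b/adapter_config.json") as f:
    cfg = json.load(f)

scaling = cfg["lora_alpha"] / cfg["r"]
# alpaca: lora_alpha == r, so scaling == 1.0 (no-op)
# gpt4all: 32 / 8 == 4.0
print(scaling)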

@slaren (Collaborator Author) commented Apr 7, 2023

@ggerganov In addition to the performance considerations, something to keep in mind is that which tensors the lora is applied to is entirely up to the implementation; for example, alpaca applies it to all of q,k,v,o but gpt4all only to q,v. I imagine that eval would quickly turn to spaghetti if we have to consider every single tensor separately.

@slaren force-pushed the lora branch 2 times, most recently from cd2dbea to a4539e1, on April 8, 2023
@slaren (Collaborator Author) commented Apr 8, 2023

This should work with quantized models now. Patching quantized models doesn't seem so bad; I got a perplexity of 6.6281 on q4_0 with alpaca.

@slaren (Collaborator Author) commented Apr 10, 2023

Now that #801 has been merged, using --lora disables mmap. Loading is a bit slower, but it should work on Windows now.

@MillionthOdin16 commented Apr 10, 2023 via email

@jon-chuang (Contributor) commented Apr 11, 2023

So, to be clear, we will load orig params, and then in a batched fashion:

  1. Load fp16 LoRA for the given matrix
  2. Dequantize orig params to fp16
  3. Apply lora
  4. Requantize to save memory

Any rough estimate for how long this adapter "loading" time is?

using --lora disables mmap

I guess since you may patch an arbitrary fraction of the weights, the original weights for the patched matrices are loaded only once. But mmap might still be useful for the case of a relatively small fraction of patched weights plus hot-swapping LoRAs. Just a thought.

CoW for a large fraction of the weights basically duplicates them, so it's very much unviable.

@slaren (Collaborator Author) commented Apr 11, 2023

Replace fp16 with fp32 and that's pretty close to the way it works at the moment:

  • multiply matrices lora B and lora A in f32
  • scale BA with f32
  • add BAs to the original weights; this is where the dequantizing/requantizing happens if necessary

The time to apply the adapter for me varies from ~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices.

There may be some ways to accelerate this, but at the moment I am more concerned with correctness and supporting all the use cases.
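
In rough pseudocode, the per-tensor flow described above looks like this. This is only a sketch: dequantize and requantize are placeholders for the ggml quantization routines, and with --lora-base the stored weight for patched layers would come from the f16/f32 base model instead.

import numpy as np

def patch_tensor(W_stored, lora_A, lora_B, scaling, dequantize, requantize):
    """Apply one LoRA delta to one stored tensor."""
    BA = scaling * (lora_B.astype(np.float32) @ lora_A.astype(np.float32))  # f32 matmul + scale
    W = dequantize(W_stored) + BA      # add the delta in f32
    return requantize(W)               # back to the model's storage format (e.g. q4_0)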

@MillionthOdin16 commented Apr 11, 2023

I'm trying to troubleshoot some issues on Windows. First, the conversion script and overall process were straightforward, so good job making it simple.

I was able to load the 7B llama and 7B lora fine, but I noticed that I didn't seem to get the responses I expected with the lora applied. This seemed odd because it was behaving as if the lora wasn't present at all.

When I tried testing with the 13B model and 13B lora, I ran into issues when trying to run main. It reported not enough space in the context's memory pool. I have 64GB of system RAM, and it's nowhere close to being maxed out, so I'm confused about what is happening.

C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin --lora D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin
main: seed = 1681243691
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: f16        = 2
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7945693.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: ggml_new_tensor_impl: not enough space in the context's memory pool (needed 105185904, available 104857600)

Any pointers?
Super pumped to get this working because it opens up a ton of possibilities! Also, just an idea, but it might be nice to have the option to fuse the lora into the base model. Once you have a lora that works really well and you use it constantly, it would be nice to bundle it in permanently.

edit (some additional info):

ggml_new_tensor_impl: context memory pool -> (needed 209715232, available 421527552)
ggml_new_tensor_impl: context memory pool -> (needed 419430640, available 421527552)
llama_init_from_file: kv self size  =  400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: ggml_new_tensor_impl: context memory pool -> (needed 163872, available 104857600)
ggml_new_tensor_impl: context memory pool -> (needed 327920, available 104857600)
ggml_new_tensor_impl: context memory pool -> (needed 105185728, available 104857600)
ggml_new_tensor_impl: not enough space in the context's memory pool (needed 105185904, available 104857600)

@slaren (Collaborator Author) commented Apr 11, 2023

@MillionthOdin16 thanks for testing this, it has been a struggle to tell for sure whether the lora I tried had any meaningful effect, but I think I found a problem. Can you see if the latest changes fix your issues?

@MillionthOdin16 commented Apr 11, 2023

@MillionthOdin16 thanks for testing this, it has been a struggle to tell for sure whether the lora I tried had any meaningful effect, but I think I found a problem. Can you see if the latest changes fix your issues?

Awesome! Memory allocation issues are fixed and now things are running smoothly.

I'm not getting the responses I'd expect lora-wise, so I suspect there is something off about how the lora is applied. Now that I can run my 13B model, it's much easier to see when the lora is working correctly (13B is my best trained lora). Is there anything I can do to help troubleshoot?

I have a lora that's 25MB that significantly improves the output when applied to the plain llama model. I don't know if a lora that is fully merged into the base model might help as well (I don't know if we can compare effective weights between this implementation and the lora-fused model?).

Once this works as expected it will be huge. Moving around 25MB loras vs base models is so much easier. And there's lots to be evaluated with layering loras and scaling them based on ratios :D

@slaren (Collaborator Author) commented Apr 11, 2023

Are you using an f16 model? Trying to apply a lora to a quantized model may be a terrible idea after all.

@MillionthOdin16

You're right. The output works as expected when the llama model is f32. Nice job!

Now I'm trying to figure out the best way to make it usable. After the model is merged completely with the lora and quantized to 4 bits, it still produces good output (my point being that eventually we will want to get these fully quantized).

So we're merging at f32 to keep precision? I'm wondering what the best approach is for allowing this to work on quantized models. The ability to have a lora run on top of the base model in llama.cpp is in itself huge, because moving significant variations of llama becomes trivial. Having a way for a user to specify a lora and have it fused into the model, which could then be quantized down to 4 bits, would be really helpful. It's not as streamlined as real-time loading of loras, but it makes the use of loras significantly easier.

Do you have any thoughts on how the quantization could be handled in memory? Has anyone tested whether a quantized lora still has a useful impact on a quantized base model?

Extra Info

This works:

PS C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m "D:\models\LLaMA\13B\ggml-model-f32.bin" --lora "D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin" --interactive-first         
main: seed = 1681250916
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-f32.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 0 (all F32)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 50843293.73 KB
llama_model_load_internal: mem required  = 51699.65 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: .......... done (18393.01 ms)

system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

This doesn't:

PS C:\Users\Bradarr\Documents\GitHub\llama.cpp> ./build/bin/main -m "D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin" --lora "D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin" --interactive-first
main: seed = 1681251252
llama.cpp: loading model from D:\models\LLaMA\13B\ggml-model-q4_0-nmap.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 512
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7945693.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
....................................................................................................
llama_init_from_file: kv self size  =  400.00 MB
llama_apply_lora_from_file: applying lora adapter from 'D:\models\loras\bradarr-lora\13B\ShareGPTUnchained\ggml-adapter-model.bin' - please wait ...
llama_apply_lora_from_file: r = 8, alpha = 16, scaling = 2.00
llama_apply_lora_from_file: .......... done (10663.88 ms)

system_info: n_threads = 4 / 24 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: interactive mode on.
sampling: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.100000
generate: n_ctx = 512, n_batch = 8, n_predict = 128, n_keep = 0

@slaren (Collaborator Author) commented Apr 11, 2023

Good to hear that it is working!

Regarding creating pre-merged models, it is already possible to do that in python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual with convert-pth-to-ggml.py. I am not sure that it is worth replicating the same feature in llama.cpp, but I am not entirely opposed to it if it can bring some convenience.

I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical. So you could do something like main -m models/7B/q4_0.bin --lora-base models/7B/f16.bin --lora mylora.bin, and it would keep the unmodified layers from the q4_0 model, but any layers modified by the lora would be loaded from the f16, patched and then quantized to q4_0 or whatever is the format of the model specified in -m.

@MillionthOdin16

I suspect that loading the layers modified by the lora in f16 and then quantizing them back into the same format as the model may be fast enough to be practical. So you could do something like main -m models/7B/q4_0.bin --lora-base models/7B/f16.bin --lora mylora.bin, and it would keep the unmodified layers from the q4_0 model, but any layers modified by the lora would be loaded from the f16, patched and then quantized to q4_0 or whatever is the format of the model specified in -m.

Okay, I see. Just to note, I tested f32, f16, and q4_0 base llama models with the same lora file. f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to show any variation resulting from the lora. I haven't checked the code to know if this is expected.

Do you think applying a quantized lora to a quantized model might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade accuracy for speed).

Regarding creating pre-merged models, it is already possible to do that in python by using a script similar to this one from alpaca-lora that merges the lora and then exports the model as pth, which can then be converted to ggml as usual with convert-pth-to-ggml.py. I am not sure that it is worth replicating the same feature in llama.cpp, but I am not entirely opposed to it if it can bring some convenience.

Yes, I've seen the scripts, but I think for most users the understanding of model file formats and what they currently have vs what format they need is very confusing. My thought is that loras have the ability to significantly change the model outputs, are super lightweight, and are becoming more accessible and easier to train with projects like @lxe/simple-llm-finetuner. If we are able to streamline the use of loras and conversion of a lora adapter to a ggml model format they are familiar with, we can make learning about language models much easier (abstracting away as much pytorch/GPU/heavy ML stuff as possible). I know you already know this haha, I'm just excited about how this and similar projects make very technical areas easy to play with.

@MillionthOdin16

Also, I've noticed a scaling factor in the console output, and you've mentioned it a few times. Is this something that directly affects the 'impact' of the lora weights on the overall model? If so, it could be useful to break it out as an argument to make experimentation easier. With stable diffusion they've done some pretty cool things with mixing different lora layers (so I'm thinking about this for down the line).

@slaren (Collaborator Author) commented Apr 11, 2023

f32 was definitely lora-ized, f16 was definitely lora-ized (although I don't know how the output quality differs from f32), and q4_0 didn't seem to show any variation resulting from the lora. I haven't checked the code to know if this is expected.

From what I understand, the llama models are natively f16, so I wouldn't expect much benefit from using an f32 model.

Do you think applying a quantized lora to a quantized mode might have any merit? Sometimes we get interesting results, and it would definitely be faster (assuming you want to trade the accuracy for the speed).

The problem with doing that is that loras make very small modifications to the weights, and the changes may be completely lost in the noise when applied to a quantized model. Using a quantized lora as well just makes the problem worse; I don't think that would work at all.

Also, I've noticed a scaling factor in console and you've mentioned it some. Is this something that directly affects the 'impact' of the lora weights on the overall model?

This is just something that the PEFT library does based on the lora_alpha parameter and the rank of the lora, and I don't think it should be modified at all, but who knows what effect it might have. Applying loras on top of other loras seems very suspect to me; I wouldn't expect it to work at all, but I guess in some cases it might? Anyway, I would leave that experimentation to the GPU people; if they find something worthwhile we can back-port it here.

@jon-chuang (Contributor)

~5 seconds with a small lora adapter on 7B to upwards of a minute with a larger lora on 30B. The slowest part by far is multiplying the lora matrices.

Is this already parallelised?

@ggerganov (Owner) left a comment

One idea for improvement in the future is to add an option for specifying a subset of tensors to be loaded via llama_model_load(), for example with a list of tensor names. With this, we can avoid loading the entire base model and only load the tensors that will be adapted by the LoRA.

@slaren (Collaborator Author) commented Apr 17, 2023

There are a lot of things that could be improved about this, but since it is already functional and there is very little risk of breaking anything unrelated to this change, let's merge it to allow other people to contribute their improvements, and also to receive broader feedback.

@slaren merged commit 315a95a into ggerganov:master on Apr 17, 2023
@iplayfast

I REALLY wish someone would make a video tutorial showing how these models can interact with each other. The tech is moving so fast it's hard to keep up.

@wassname commented Apr 19, 2023

@slaren Like you said, adding the LoRA deltas to a q4 quantized model is most likely very bad for quality. The quantization must happen afterward.

Has anyone tested this? NNs are robust to lots of operations, and it's not clear if they are robust to adding an int16 delta to an int4 weight.

“…it works but the quality is predictably not good”

Just to be clear, this is still the case?

@slaren deleted the lora branch on April 19, 2023 at 23:31
@slaren (Collaborator Author) commented Apr 19, 2023

I don't think that perplexity is a good way to test if a LoRA has been applied successfully. You need to test if the changes that the LoRA makes are still there, and in my tests they mostly aren't.

@wassname commented Apr 19, 2023

I'm not quite sure what you mean. Maybe you mean that perplexity over wiki.test.raw is not a good way to test it? If so, I agree. But perplexity over a LoRA-specific prompt would work well, and a quantitative measure like perplexity seems better than eyeballing it. Although eyeballing might be sufficient.

For example, the following would be much more likely in a LoRA model, so it should have a lower perplexity.

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
What is 2+2?
### Response:
4

@slaren (Collaborator Author) commented Apr 20, 2023

The perplexity over some subset of the dataset used to train the LoRA could work.
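
For anyone who wants to try that: perplexity is just the exponentiated average negative log-likelihood of the evaluated tokens, so any per-token log-probabilities will do. A minimal, model-agnostic sketch:

import math

def perplexity(token_logprobs):
    """token_logprobs: natural-log probability the model assigned to each observed token."""
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)

# Evaluate the same held-out, instruction-formatted text with and without the LoRA;
# a successfully applied LoRA should give a noticeably lower perplexity on it.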

@wassname

I'm with you on that!

Thanks for putting this PR together, btw. Do you still see those quality issues, or did you manage to resolve them in your latest commits? Sorry, I couldn't work it out by reading the thread.

@slaren (Collaborator Author) commented Apr 20, 2023

The issues are still there. The solution was adding the --lora-base option to take the layers from an unquantized model. I didn't run any formal tests like the ones you are suggesting, but simply from observing the outputs of the model it was evident that the quality was not good. Any additional research into this could be interesting, though.

@wassname

By the way, you may have already seen it, but this is the prompt format used to train most of the Alpaca LoRAs. Sometimes people use regular chat and get poor results until they switch to this format.

@wassname commented Apr 20, 2023

The issues are still there. The solution was...

I see! Thanks for explaining

Development

Successfully merging this pull request may close these issues:

  • Support for Loading a Subset of Tensors for LoRA Models

8 participants