Replies: 2 comments 17 replies
That would imply keeping both the base model and the fine-tuned model in RAM. Changing the LoRA would then involve freeing the fine-tuned model and making a fresh copy of the base model with the new LoRA applied.
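For illustration, here is a minimal NumPy sketch of that approach (toy code, not llama.cpp's API; `merge_adapter` and the dict-of-weights layout are hypothetical): the base weights stay resident, and each swap frees the old merged copy and rebuilds `W + scale * (B @ A)` for the newly selected adapter.

```python
import numpy as np

def merge_adapter(base_weights, adapter, scale=1.0):
    """Return a fresh merged copy of the base weights with one LoRA applied.

    base_weights: dict name -> (d, k) array, kept resident for the whole run.
    adapter:      dict name -> (A, B) with A of shape (r, k), B of shape (d, r).
    """
    merged = {}
    for name, W in base_weights.items():
        if name in adapter:
            A, B = adapter[name]
            merged[name] = W + scale * (B @ A)   # merge the low-rank update
        else:
            merged[name] = W.copy()
    return merged

# Swapping adapters = drop the old merged copy and rebuild from the base:
# merged = merge_adapter(base_weights, adapter_npc1)
# ... later ...
# merged = merge_adapter(base_weights, adapter_npc2)  # old copy gets freed
```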
@ltoniazzi Hi there, I am also interested in adapter swapping. I saw that you have implemented this on the CPU. Could you provide some information on how to use it, such as how to configure and load the LoRA model? Thank you very much for your time and assistance!
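For context, loading a single LoRA adapter (without runtime swapping) can be sketched as follows with llama-cpp-python; the `lora_path` argument and exact parameter names are from memory and may differ between versions, and the file names are hypothetical. This is the static path only; runtime swapping is what this discussion asks for.

```python
from llama_cpp import Llama

# Load the base GGUF model on CPU and apply one LoRA adapter at startup.
llm = Llama(
    model_path="base-model.gguf",   # hypothetical file names
    lora_path="npc-adapter.gguf",
    n_ctx=2048,
    n_gpu_layers=0,                 # CPU only
)

out = llm("Describe your quest in one sentence.", max_tokens=32)
print(out["choices"][0]["text"])
```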
New Feature?
Compile LLMs to GGUF so that they take an additional integer parameter that allows swapping between different LoRA adapters at runtime.
Questions
1. Is this feature possible to implement with llama.cpp now (if it is not already available)?
2. If yes to 1 (and the feature is not currently available), how difficult would it be to develop?
Related discussions
A discussion was opened about a year ago; its main approach was optimizing tall-skinny matmuls to avoid caching LoRA weights (the unmerged PR #996). Has this project progressed in other directions?
Example
A basic ONNX example is here, to clarify what this feature aims to do.
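As a further illustration, here is a toy NumPy sketch of the requested behaviour (my own example, not tied to GGUF or any real API): the base weight is loaded once, each adapter is a small (A, B) pair, and an integer id selects which adapter is applied at inference time, unmerged, via two tall-skinny matmuls as in the approach referenced above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                       # toy sizes; r is the LoRA rank

W = rng.normal(size=(d, k))             # shared base weight, loaded once
adapters = [                            # one small (A, B) pair per NPC
    (rng.normal(size=(r, k)), rng.normal(size=(d, r)))
    for _ in range(3)
]

def forward(x, adapter_id, scale=1.0):
    """Base layer plus the adapter selected by an integer id, unmerged."""
    A, B = adapters[adapter_id]
    return W @ x + scale * (B @ (A @ x))   # two tall-skinny matmuls

x = rng.normal(size=k)
y_npc0 = forward(x, adapter_id=0)
y_npc2 = forward(x, adapter_id=2)       # "swapping" is just changing the id
```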
Context/Application
I wanted to use this for gaming applications: one can fine-tune a separate LoRA adapter for each NPC, so that all the NPCs benefit from the same large base model being loaded into memory only once.