Prerequisites
Please answer the following questions for yourself before submitting an issue.
I reviewed the Discussions, and have a new bug or useful enhancement to share.
Feature Description
API to allow manipulation of token-level input embeddings.
Motivation
I work for Protopia AI, a company that offers an LLM privacy solution that works by transforming LLM input embeddings. We have a client that uses llama.cpp, and we are interested in seeing how our PyTorch-based solution can integrate with llama.cpp. I have been looking at llama-cpp-python as an avenue to understand llama.cpp's APIs.
I've tried to understand the current embedding API with little luck:
llama-cpp-python discussion: Embedding for sentence
llama-cpp-python discussion: Load 70b model only once -- for embedding and for completion
llama.cpp discussion: What exactly does llama_get_embeddings return?
llama.cpp issue: Bug: Invalid Embeddings if GPU offloaded (CUDA)
llama.cpp discussion: Where does the embedding come from?
llama.cpp discussion: How to use embedding ?
llama.cpp discussion: How do I get input embeddings?
llama.cpp discussion: How to get and modify weights between activated neurons?
How much work would it be to expose access to the token-level input embeddings for manipulation at inference? Is this possible using llama.cpp's existing APIs?
You can use the embd field of llama_batch to pass in the input embeddings, or modified versions of them. In my experience you can only decode one token at a time this way, though that may have changed since I last updated.
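For concreteness, here is a minimal, untested sketch of that approach based on my reading of llama.h: calling llama_batch_init with a non-zero embd argument allocates the batch's embd buffer (n_tokens * n_embd floats), and llama_decode then consumes those embeddings instead of token ids. The helper name decode_custom_embeddings is mine, and I assume the transformed embeddings (my_embd) are produced elsewhere; how to obtain the original token-level input embeddings to transform is exactly what this issue is asking to expose.

```c
// Sketch (untested): feed externally produced per-token embeddings to
// llama_decode via llama_batch.embd instead of token ids.
//
// Assumptions: the model and context are already loaded, `my_embd` holds
// n_tokens * n_embd floats, and all tokens belong to sequence 0.

#include <stdio.h>
#include <string.h>

#include "llama.h"

static int decode_custom_embeddings(struct llama_context * ctx,
                                     const struct llama_model * model,
                                     const float * my_embd,   // [n_tokens * n_embd]
                                     int32_t n_tokens) {
    const int32_t n_embd = llama_n_embd(model);

    // a non-zero second argument makes llama_batch_init allocate batch.embd
    // (n_tokens * n_embd floats) instead of batch.token
    struct llama_batch batch = llama_batch_init(n_tokens, n_embd, 1);
    batch.n_tokens = n_tokens;

    for (int32_t i = 0; i < n_tokens; ++i) {
        memcpy(batch.embd + (size_t) i * n_embd,
               my_embd    + (size_t) i * n_embd,
               (size_t) n_embd * sizeof(float));
        batch.pos[i]       = i;   // position in the sequence
        batch.n_seq_id[i]  = 1;
        batch.seq_id[i][0] = 0;
        batch.logits[i]    = (i == n_tokens - 1); // only request logits for the last token
    }

    const int ret = llama_decode(ctx, batch);
    if (ret != 0) {
        fprintf(stderr, "llama_decode failed: %d\n", ret);
    }

    llama_batch_free(batch);
    return ret;
}
```

If this works as I expect, sampling can then proceed from the last token's logits as usual (e.g. via llama_get_logits_ith).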