Skip to content
This repository has been archived by the owner on Jun 24, 2024. It is now read-only.

Embedding extraction #72

Merged: 8 commits merged into main on Mar 26, 2023

Conversation

@setzer22 (Collaborator) commented Mar 24, 2023

Implements #56.

I ported the llama.cpp code to allow extracting word embeddings and logits from a call to evaluate. I validated this using an ad_hoc_test (currently hard-coded in main) and the results seem to make sense: the dot product between two embeddings is higher the more similar the two words are, which is exactly how embeddings should behave.

This serves as a proof of concept, but we need to discuss the API before we can merge. For now, I added an EvaluateOutputRequest struct so we can expand this in the future, allowing retrieval of other interesting bits of the inference process. However, these values are not easily obtainable through the regular APIs (i.e. feed_prompt, infer_next_token). I'm not sure if that's a problem: are we OK with users having to drop down to the lower-level evaluate function when they need to retrieve this kind of information?
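
For context, this is roughly the shape I have in mind for the struct; the field names below are placeholders for illustration, not the final definition:

```rust
/// Extra outputs a caller can ask `evaluate` to fill in.
/// NOTE: sketch only; the field names here are illustrative assumptions.
#[derive(Default)]
pub struct EvaluateOutputRequest {
    /// The logits for every evaluated token, if requested.
    pub all_logits: Option<Vec<f32>>,
    /// The output embeddings for every evaluated token, if requested.
    pub embeddings: Option<Vec<f32>>,
}
```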

On a different note, I would really like someone with a better understanding of embeddings to validate that the results here are correct. Perhaps @hlhr202 can shed some light there?

Finally, should we consider exposing this to llama-cli at all?

llama-cli/src/main.rs (outdated; resolved)
Comment on lines 378 to 379
// Try other words: 'dog', 'cat', 'potato', '$' -> To see decreasingly lower dot product values.
let dog2 = model.tokenize(&vocab, "dog", false).unwrap();
@setzer22 (Collaborator, Author):

What I'm doing here is feeding the following two sentences through the transformer:

  • "My favourite animal is the dog"
  • "I just adopted a cute dog"

Afterwards, I retrieve the embeddings for the last token (dog), and compute their similarity with a simple dot product.

Then, I tried changing the second sentence from 'dog' to 'cat', 'potato', '$' respectively, and the semantic similarity dropped accordingly, with $ ranking the lowest.
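
The similarity measure is nothing fancy: just a plain dot product between the two embedding vectors. A self-contained sketch of the comparison (the vectors are hard-coded stand-ins so the example runs on its own; in the real test they come from evaluate):

```rust
/// Plain dot product between two embedding vectors of equal length.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

fn main() {
    // Stand-ins for the 4096-dimensional embeddings of the last token of
    // each sentence; the real values come from the call to `evaluate`.
    let dog_sentence_1 = vec![0.8_f32, 0.1, 0.3];
    let dog_sentence_2 = vec![0.7_f32, 0.2, 0.4];
    let dollar_sentence_2 = vec![-0.2_f32, 0.9, -0.5];

    // A higher score means the two contexts are more semantically similar.
    println!("dog vs dog: {}", dot(&dog_sentence_1, &dog_sentence_2));
    println!("dog vs $:   {}", dot(&dog_sentence_1, &dollar_sentence_2));
}
```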

@hlhr202 (Contributor):

@setzer22 Will feeding the prompt before eval produce different embeddings compared to evaluating all the tokens together?

@setzer22 (Collaborator, Author) Mar 26, 2023

@hlhr202 The embeddings wouldn't be affected, but you shouldn't call evaluate with the whole prompt like that for a couple reasons:

  • A call to evaluate runs all the tokens you give it as a single batch, which requires more memory. For very long prompts, this could become very expensive.
  • The output will contain the output embeddings for every token you fed through eval. This means you would be retrieving a lot more embedding data than for just the word "dog".

This is why the test code uses feed_prompt first, to set up the context, and then makes a call to evaluate with a single token to retrieve the embeddings for a single word.
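
Roughly, the pattern looks like this (a sketch only; the exact feed_prompt/evaluate signatures are simplified, and session, params, and output are illustrative names):

```rust
// Sketch of the pattern; not the exact llama-rs signatures.

// 1. Feed everything except the word we care about. This only updates the
//    model's context/memory, so no per-token outputs need to be kept around.
session.feed_prompt(&model, &vocab, &params, "My favourite animal is the", |_| {});

// 2. Evaluate just the final token, asking for its embeddings. Since only one
//    token goes through this call, exactly one embedding vector comes back.
let dog = model.tokenize(&vocab, "dog", false).unwrap();
let mut output = EvaluateOutputRequest {
    embeddings: Some(Vec::new()), // ask `evaluate` to fill this in
    ..Default::default()
};
model.evaluate(&mut session, &params, &dog, &mut output);
```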

@hlhr202 (Contributor):

@setzer22 I understand your comments now. This means we can only extract embeddings for a single token or part of the text (which will also have hidden information from the context of the whole sentence mixed in). That is a little different from OpenAI's embedding function: as I understand it, OpenAI's embedding is for the whole sentence, yet it is returned as a fixed-size tensor... that is quite beyond my knowledge, though.

@hlhr202 (Contributor):

Well, I think I may have found a possible way to implement such a 'sentence embedding': I will try adding a special end token and extracting the hidden layer once the end token has been evaluated. Not sure if it will work, but it is worth a try.

llama-rs/src/lib.rs (outdated; resolved)
llama-rs/src/lib.rs (outdated; resolved)
@philpax (Collaborator) commented Mar 25, 2023

LGTM once the other review feedback's sorted out.

For exposing it from the CLI, I'm not sure... people might use it as a step in a CLI pipeline (get the embeddings of two texts and then compare them), but I'm not sure what that would look like or how people would do it. (What output format would we use?)

Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.

@hlhr202 (Contributor) commented Mar 25, 2023

> LGTM once the other review feedback's sorted out.
>
> For exposing it from the CLI, I'm not sure... people might use it as a step in a CLI pipeline (get the embeddings of two texts and then compare them), but I'm not sure what that would look like or how people would do it. (What output format would we use?)
>
> Unless someone can suggest a "standard" output format for this, I'd suggest leaving it out for now and figuring it out later.

It would be nice for me to have the get-embeddings function exposed in the library crate; I don't actually care much about exposing it in the CLI. From what I've seen, llama.cpp provides an --embedding parameter for output purposes, but they still haven't worked out a way to expose it, which is why I currently can't get the embeddings from their CLI.
I have only tested a few cases against OpenAI's embeddings. There are some differences, but I think those are caused by the different model.

@KerfuffleV2 (Contributor):

@hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.

@hlhr202 (Contributor) commented Mar 25, 2023

> @hlhr202 The CLI is just a consumer of the library crate, so when using the library you'll be able to get the embeddings.

Yes, absolutely. Since I'm porting llama-rs to llama-node, I just need the public library function to be exposed.
It doesn't make sense to expose embeddings in the CLI anyway.

philpax mentioned this pull request on Mar 25, 2023
@setzer22 (Collaborator, Author):

I've now addressed the review feedback and removed the ad-hoc test code. So I take it the plan is to merge this as-is, keep embedding extraction as a low-level feature of llama-rs, and simply not expose it in the CLI?

@philpax (Collaborator) left a review comment:

LGTM, ready to merge after the comment's fixed

@KerfuffleV2 (Contributor):

Since I added the --dump-prompt-tokens option, you can probably guess I like exposing information. :) I know people have asked about being able to show the embeddings with llama.cpp's CLI, so there does seem to be some demand for it in a CLI.

@philpax (Collaborator) commented Mar 26, 2023

If there's demand, I'm happy to do so - just not sure what the output format should be. JSON array or newline-delimited floats?

@KerfuffleV2 (Contributor) commented Mar 26, 2023

Is it a lot of data? You could probably just print it in the normal Rust debug format, which should look like a comma-separated list if it's in a Vec or something. That would be pretty easy to transform into other formats without needing to write extra code or pull in dependencies.
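
For example, something as simple as this (assuming the embeddings come back as a Vec<f32>):

```rust
// Debug-print a Vec<f32>: prints `[0.1, 0.2, 0.3]`, trivially parseable downstream.
let embeddings: Vec<f32> = vec![0.1, 0.2, 0.3];
println!("{:?}", embeddings);
```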

This is the related issue: ggerganov/llama.cpp#224 (there was actually only one person who wanted it as an option)

@setzer22 (Collaborator, Author) commented Mar 26, 2023

> Is it a lot of data?

It is quite a lot of data to comfortably print to stdout: 4096 floats per token. Not that it wouldn't work, but it's a bit unwieldy.

@KerfuffleV2 (Contributor):

Ahh, then it seems like it probably isn't worth bothering to add it to the CLI right now unless someone comes along and requests it. Or they could just write their own little utility to load a model, feed a prompt, and print out the embeddings however they want.

philpax merged commit a067431 into main on Mar 26, 2023
@rpbrokaw commented Apr 2, 2023

> Ahh, then it seems like it probably isn't worth bothering to add it to the CLI right now unless someone comes along and requests it. Or they could just write their own little utility to load a model, feed a prompt, and print out the embeddings however they want.

I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.

@hlhr202 (Contributor) commented Apr 2, 2023

> > Ahh, then it seems like it probably isn't worth bothering to add it to the CLI right now unless someone comes along and requests it. Or they could just write their own little utility to load a model, feed a prompt, and print out the embeddings however they want.
>
> I would love that in the CLI! Perhaps with a parameter that specifies an output file. I need the embeddings to build a vector database based on some local files. Any chance you could take a look? It has been many years since I programmed C/C++.

The vector is around 4096 elements for a single token, so it isn't very suitable for printing nicely in a CLI. I think you'll need to call it through the Rust API.

@setzer22 (Collaborator, Author) commented Apr 2, 2023

I'm open to adding a way for the CLI to output embeddings if people find this is an interesting use case. The main blocker here is that the use case is not clear to me and thus I can't figure out the right API and output format.

What we need here is someone who understands how embeddings in an LLM like LLaMA work, has a clear use case for extracting them, and can tell us how they would expect an API like this to work. If anyone wants to open an issue with a clear description of what we need to provide, I'd be happy to add an implementation 🙂

@hlhr202 (Contributor) commented Apr 2, 2023

@setzer22 I have made a new embedding extraction example; you can check it here: https://github.com/hlhr202/llama-node/blob/develop/packages/core/example/semantic-compare/compare.py
I noticed that llama.cpp uses "\n" as the end token, so I do the same. The results are quite close to OpenAI's text-embedding-ada-002.
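
For anyone reproducing the comparison: this kind of comparison is typically done with cosine similarity over the two returned vectors. A self-contained Rust sketch (the linked example does the same thing in Python and may differ in detail):

```rust
/// Cosine similarity between two embedding vectors of equal length.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    dot / (norm_a * norm_b)
}

fn main() {
    // Stand-ins for the embeddings of the "\n" end token of two sentences.
    let sentence_a = vec![0.8_f32, 0.1, 0.3];
    let sentence_b = vec![0.7_f32, 0.2, 0.4];
    println!("similarity: {}", cosine_similarity(&sentence_a, &sentence_b));
}
```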

@turbo commented Apr 3, 2023

I'm working on a large dense-vector embedding database (about 2 million data points from books), which is currently using OpenAI's Ada embeddings (~1600 dimensions). I can do a comparison of performance between those and the 4k LLaMa embeds if needed.

> has a clear use case for extracting them and can tell us how they would expect an API like this to work

From an ops perspective, ideally one could provide a batch input and get a batch output (just like OpenAI's API) via the CLI. The format doesn't matter much; it can be JSONL or a binary format. I'd personally recommend sticking to those two, since they're supported by most VSS databases (e.g. Redis RediSearch).
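
To make the JSONL option concrete, here is a minimal sketch of the output side, assuming serde/serde_json for serialization; the record fields are made up for illustration, not a proposed spec:

```rust
use serde::Serialize;

/// One record per input text, written as one JSON object per line (JSONL).
/// The field names here are illustrative only.
#[derive(Serialize)]
struct EmbeddingRecord {
    text: String,
    embedding: Vec<f32>,
}

fn main() -> Result<(), serde_json::Error> {
    // Stand-in batch; in a real CLI these vectors would come from the model.
    let batch = vec![
        ("first input text", vec![0.1_f32, 0.2, 0.3]),
        ("second input text", vec![0.4_f32, 0.5, 0.6]),
    ];

    // One JSON object per line: the JSONL format most VSS databases ingest.
    for (text, embedding) in batch {
        let record = EmbeddingRecord {
            text: text.to_string(),
            embedding,
        };
        println!("{}", serde_json::to_string(&record)?);
    }
    Ok(())
}
```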

@merlinvn commented May 8, 2023

My use case: if you have a set of documents and can get embeddings for them, then whenever a new question comes in you can embed the question and find the most relevant documents to send along with your prompt. So basically you can have a natural Q&A chatbot over your own data.
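
A minimal sketch of that retrieval step, assuming the document and question embeddings have already been extracted (the similarity score and the values are stand-ins):

```rust
/// Score a document embedding against the question embedding.
/// A dot product is used here for brevity; cosine similarity also works.
fn score(question: &[f32], doc: &[f32]) -> f32 {
    question.iter().zip(doc).map(|(q, d)| q * d).sum()
}

fn main() {
    // Stand-ins for precomputed document embeddings and a question embedding.
    let docs = vec![
        vec![0.9_f32, 0.1, 0.0], // doc 0
        vec![0.1_f32, 0.8, 0.1], // doc 1
        vec![0.2_f32, 0.2, 0.9], // doc 2
    ];
    let question = vec![0.1_f32, 0.9, 0.0];

    // Rank documents from most to least relevant; the top match is what
    // you would paste into the prompt alongside the question.
    let mut ranked: Vec<usize> = (0..docs.len()).collect();
    ranked.sort_by(|&i, &j| {
        score(&question, &docs[j])
            .partial_cmp(&score(&question, &docs[i]))
            .unwrap()
    });
    println!("most relevant document: {}", ranked[0]); // expected: 1
}
```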

philpax deleted the feat/extract_embeddings branch on July 16, 2023