This repository has been archived by the owner on Jun 24, 2024. It is now read-only.
Embedding extraction #72
What I'm doing here is feeding the following two sentences through the transformer:
Afterwards, I retrieve the embeddings for the last token (dog), and compute their similarity with a simple dot product.
Then I tried changing the last word of the second sentence from 'dog' to 'cat', 'potato', and '$' respectively, and the semantic similarity dropped accordingly, with '$' ranking the lowest.
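A minimal sketch of that similarity computation (assuming the extracted embeddings come back as plain `f32` vectors; the values below are made-up placeholders, not real model output):

```rust
/// Dot-product similarity between two embedding vectors of equal length.
/// (Cosine similarity would additionally divide by the two vector norms.)
fn dot_product(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "embeddings must have the same dimension");
    a.iter().zip(b.iter()).map(|(x, y)| x * y).sum()
}

fn main() {
    // Placeholder values standing in for the "dog" embeddings extracted from
    // each sentence; in the real test these are n_embd-sized vectors read out
    // of the model after evaluating the token.
    let dog_in_sentence_1: Vec<f32> = vec![0.12, -0.48, 0.33];
    let dog_in_sentence_2: Vec<f32> = vec![0.10, -0.51, 0.29];

    println!("similarity = {}", dot_product(&dog_in_sentence_1, &dog_in_sentence_2));
}
```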
@setzer22 Will feeding the prompt before eval give different embeddings compared to evaluating all the tokens together?
@hlhr202 The embeddings wouldn't be affected, but you shouldn't call `evaluate` with the whole prompt like that: an eval over the full prompt means you would be retrieving a lot more embedding data than for just the word "dog". This is why the test code uses `feed_prompt` first, to set up the context, and then makes a call to `evaluate` with a single token to retrieve the embeddings for a single word.
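As a rough sketch of that order of operations (the `Session` type and method signatures below are simplified stand-ins for illustration, not the actual llama-rs API):

```rust
/// Simplified stand-in for an inference session; the real `feed_prompt` /
/// `evaluate` signatures differ, this only shows the flow.
struct Session {
    hidden_state: Vec<f32>, // n_embd values for the last evaluated position
}

impl Session {
    /// Run the transformer over the context tokens. Only the internal
    /// state matters here; no embeddings are read out.
    fn feed_prompt(&mut self, _tokens: &[u32]) {
        // ... update the model state token by token ...
    }

    /// Evaluate a single token and return the embeddings for just that
    /// position, rather than for the entire prompt.
    fn evaluate(&mut self, _token: u32) -> Vec<f32> {
        // ... one forward step, then copy out the hidden state ...
        self.hidden_state.clone()
    }
}

/// Feed everything except the word of interest as context, then evaluate
/// that single word and keep only its embedding.
fn last_word_embedding(session: &mut Session, context: &[u32], word: u32) -> Vec<f32> {
    session.feed_prompt(context);
    session.evaluate(word)
}
```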
@setzer22 I just understood your comments here. This means we can only extract embeddings for a single word (which may also carry hidden information mixed in from the context of the whole sentence). That seems a little different from OpenAI's embedding function: as I understand it, OpenAI's embedding covers the whole sentence, but is still returned as a fixed-size tensor... that is quite beyond my knowledge though.
Well, I guess I might find a possible way to implement such a 'sentence embedding': I will try adding a special end token and extracting the hidden layer once the end token has been evaluated. Not sure if it works, but it should be worth a try.
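Roughly, reusing the hypothetical `Session` stand-in from the sketch above (the end-token id is a made-up placeholder):

```rust
/// Hypothetical sentence-embedding flow: evaluate the whole sentence as
/// context, then append a special end token and keep its hidden state as a
/// fixed-size summary of the sentence.
fn sentence_embedding(session: &mut Session, sentence_tokens: &[u32]) -> Vec<f32> {
    const END_TOKEN: u32 = 2; // made-up placeholder id for the special end token
    session.feed_prompt(sentence_tokens);
    session.evaluate(END_TOKEN)
}
```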