-
Notifications
You must be signed in to change notification settings - Fork 16.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Instructor Model to Embeddins #771
Conversation
Was broken, should be fixed now. |
Hi @enoreyes. Seems that we cross paths a bit and have duplicate PRs (#756) for the Instructor embeddings (they are very cool). I am not expert but your structure looked better (and you had tests!!) - but I noticed that there were a few differences in function. Have commented directly in the PR. If I can help in any way let me know. |
|
||
from pydantic import BaseModel, Extra | ||
|
||
from langchain.embeddings.base import Embeddings | ||
|
||
DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2" | ||
DEFAULT_INSTRUCTION = "Represent the following text:" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reading the Instructor paper - they support asymmetric instructions (ie. the instruction for embedding and the instruction for retrieval are different). From their repo, "Represent the Wikipedia document for retrieval: " is used for the original embedding and "Represent the Wikipedia question for retrieving supporting documents: " is used when constructing the query embedding. Would be good to support this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, however adding in the ability to do this makes the code a little wonky. Would love to get this merged first as a V1 and then add the asymmetric instructions component later. Alternatively, if you can figure out a good way to do this feel free to add it to this PR. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the idea of asymmetric prompts for an embed/query class like this feels somewhat generic - so my instinct would be just to have two parameters to the class embed_instuction
and query_instruction
instead of just instruction
. They can both default to DEFAULT_INSTRUCTION
but then the embed and query methods can just reference the correct one. I am not a good judge if this is 'wonky' though.
I agree that this would be ok to merge as is for v1 though - so maybe time for @hwchase17 to have a look again
Co-authored-by: seanaedmiston <seane999@gmail.com>
Co-authored-by: seanaedmiston <seane999@gmail.com>
@enoreyes @seanaedmiston what do you think about making this a separate class entirely? the reason i think this might be nice is the logic seems to diverge kinda substantially (theres 3 separate places with the if... else... loop). i can also take a stab at this and see what i might look like |
@enoreyes @seanaedmiston took a stab at it here: #811 i think i kinda like it! the different in the embedding logic is different for every method, so i dont actually know what we gain by having them in the same class Thoughts? that pr still needs to be cleaned up (and to work with the test here, etc) so if you like/agree maybe you could take over from there? |
@hwchase17 I agree with your separation of Classes. I'll pick this up and try to add in the separation of instruction for query vs. embed. |
closing as other pr landed! |
Add new Instructor model to embeddings