Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Instructor Model to Embeddins #771

Closed
wants to merge 8 commits into from

Conversation

enoreyes
Copy link
Contributor

Add new Instructor model to embeddings

@enoreyes
Copy link
Contributor Author

Was broken, should be fixed now.

@seanaedmiston
Copy link
Contributor

Hi @enoreyes. Seems that we cross paths a bit and have duplicate PRs (#756) for the Instructor embeddings (they are very cool). I am not expert but your structure looked better (and you had tests!!) - but I noticed that there were a few differences in function. Have commented directly in the PR. If I can help in any way let me know.


from pydantic import BaseModel, Extra

from langchain.embeddings.base import Embeddings

DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
DEFAULT_INSTRUCTION = "Represent the following text:"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the Instructor paper - they support asymmetric instructions (ie. the instruction for embedding and the instruction for retrieval are different). From their repo, "Represent the Wikipedia document for retrieval: " is used for the original embedding and "Represent the Wikipedia question for retrieving supporting documents: " is used when constructing the query embedding. Would be good to support this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree, however adding in the ability to do this makes the code a little wonky. Would love to get this merged first as a V1 and then add the asymmetric instructions component later. Alternatively, if you can figure out a good way to do this feel free to add it to this PR. Thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the idea of asymmetric prompts for an embed/query class like this feels somewhat generic - so my instinct would be just to have two parameters to the class embed_instuction and query_instruction instead of just instruction. They can both default to DEFAULT_INSTRUCTION but then the embed and query methods can just reference the correct one. I am not a good judge if this is 'wonky' though.

I agree that this would be ok to merge as is for v1 though - so maybe time for @hwchase17 to have a look again

enoreyes and others added 2 commits January 30, 2023 12:31
Co-authored-by: seanaedmiston <seane999@gmail.com>
Co-authored-by: seanaedmiston <seane999@gmail.com>
@hwchase17
Copy link
Contributor

@enoreyes @seanaedmiston what do you think about making this a separate class entirely? the reason i think this might be nice is the logic seems to diverge kinda substantially (theres 3 separate places with the if... else... loop). i can also take a stab at this and see what i might look like

@hwchase17
Copy link
Contributor

@enoreyes @seanaedmiston took a stab at it here: #811

i think i kinda like it! the different in the embedding logic is different for every method, so i dont actually know what we gain by having them in the same class

Thoughts?

that pr still needs to be cleaned up (and to work with the test here, etc) so if you like/agree maybe you could take over from there?

@enoreyes
Copy link
Contributor Author

@hwchase17 I agree with your separation of Classes. I'll pick this up and try to add in the separation of instruction for query vs. embed.

@hwchase17
Copy link
Contributor

closing as other pr landed!

@hwchase17 hwchase17 closed this Feb 2, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants