
Add Instructor Model to Embeddings #771

Closed
wants to merge 8 commits into from
61 changes: 49 additions & 12 deletions langchain/embeddings/huggingface.py
@@ -1,17 +1,23 @@
"""Wrapper around HuggingFace embedding models."""
from typing import Any, List
from enum import Enum

from pydantic import BaseModel, Extra

from langchain.embeddings.base import Embeddings

DEFAULT_MODEL_NAME = "sentence-transformers/all-mpnet-base-v2"
DEFAULT_INSTRUCTION = "Represent the following text:"
Contributor:

Reading the Instructor paper - they support asymmetric instructions (i.e. the instruction for embedding and the instruction for retrieval are different). From their repo, "Represent the Wikipedia document for retrieval: " is used for the original embedding and "Represent the Wikipedia question for retrieving supporting documents: " is used when constructing the query embedding. It would be good to support this.

Contributor (Author):

I agree; however, adding the ability to do this makes the code a little wonky. I would love to get this merged first as a v1 and then add the asymmetric-instructions component later. Alternatively, if you can figure out a good way to do this, feel free to add it to this PR. Thoughts?

Contributor:

I think the idea of asymmetric prompts for an embed/query class like this feels somewhat generic - so my instinct would be to have two parameters on the class, embed_instruction and query_instruction, instead of a single instruction. They can both default to DEFAULT_INSTRUCTION, and the embed and query methods can then reference the correct one. I am not a good judge of whether this is 'wonky' though.

I agree that this would be ok to merge as is for v1 though - so maybe time for @hwchase17 to have a look again
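The two-parameter approach the reviewer suggests could be sketched roughly as below. This is a hypothetical illustration, not code from this PR: the instruction strings and helper names (build_document_pairs, build_query_pair) are assumptions, and the defaults are adapted from the examples in the Instructor repo quoted above.

```python
# Hypothetical defaults; the Instructor repo uses task-specific instructions
# like "Represent the Wikipedia document for retrieval: ".
DEFAULT_EMBED_INSTRUCTION = "Represent the document for retrieval:"
DEFAULT_QUERY_INSTRUCTION = "Represent the question for retrieving supporting documents:"


def build_document_pairs(texts, embed_instruction=DEFAULT_EMBED_INSTRUCTION):
    """Pair each document with the embed-side instruction."""
    return [[embed_instruction, text] for text in texts]


def build_query_pair(text, query_instruction=DEFAULT_QUERY_INSTRUCTION):
    """Pair a query with the query-side instruction."""
    return [query_instruction, text]
```

With this split, embed_documents would encode `build_document_pairs(texts)` and embed_query would encode `build_query_pair(text)`, so the two sides can use different instructions while sharing one client.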


class MODEL_TYPE(Enum):
    SENTENCE_TRANSFORMER = 1
    INSTRUCTION_EMBEDDING = 2

class HuggingFaceEmbeddings(BaseModel, Embeddings):
    """Wrapper around sentence_transformers embedding models.

    To use sentence transformers, you should have the ``sentence_transformers``
    python package installed. To use Instructor, you should have the
    ``InstructorEmbedding`` python package installed.

    Example:
        .. code-block:: python

@@ -23,20 +29,35 @@ class HuggingFaceEmbeddings(BaseModel, Embeddings):

    client: Any  #: :meta private:
    model_name: str = DEFAULT_MODEL_NAME
    """Model name to use."""
    model_type: MODEL_TYPE = MODEL_TYPE.SENTENCE_TRANSFORMER
    instruction: str = DEFAULT_INSTRUCTION

    def __init__(self, **kwargs: Any):
        """Initialize the sentence_transformers or Instructor client."""
        super().__init__(**kwargs)
        if "instructor" in self.model_name:
            try:
                from InstructorEmbedding import INSTRUCTOR

                self.model_type = MODEL_TYPE.INSTRUCTION_EMBEDDING
                self.client = INSTRUCTOR(self.model_name)
            except ImportError:
                raise ValueError(
                    "Could not import InstructorEmbedding python package. "
                    "Please install it with `pip install InstructorEmbedding`."
                )
        else:
            try:
                import sentence_transformers

                self.client = sentence_transformers.SentenceTransformer(
                    self.model_name
                )
            except ImportError:
                raise ValueError(
                    "Could not import sentence_transformers python package. "
                    "Please install it with `pip install sentence_transformers`."
                )
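The constructor's branching amounts to dispatching on the model name. A minimal sketch of that dispatch logic, for illustration only (`select_backend` is a hypothetical helper, not part of the diff):

```python
def select_backend(model_name: str) -> str:
    """Mirror the constructor's branching: Instructor checkpoints are
    published under names containing 'instructor' (e.g.
    hkunlp/instructor-large); anything else is treated as a
    sentence-transformers model."""
    if "instructor" in model_name:
        return "instructor"
    return "sentence_transformer"
```

One consequence of name-based dispatch is that an Instructor model whose name lacks the substring "instructor" would be silently routed to sentence-transformers, which is worth keeping in mind when choosing model names.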

    class Config:
        """Configuration for this pydantic object."""

@@ -53,18 +74,34 @@ def embed_documents(self, texts: List[str]) -> List[List[float]]:

            List of embeddings, one for each text.
        """
        texts = list(map(lambda x: x.replace("\n", " "), texts))
        if self.model_type == MODEL_TYPE.INSTRUCTION_EMBEDDING:
            instruction_pairs = [[self.instruction, text] for text in texts]
            embeddings = self.client.encode(instruction_pairs)
        else:
            embeddings = self.client.encode(texts)
        return embeddings.tolist()
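Concretely, the Instructor branch pairs every input text with the configured instruction before encoding. A minimal illustration of that transformation (the instruction string here is the module's DEFAULT_INSTRUCTION; the variable names are illustrative):

```python
instruction = "Represent the following text:"
texts = ["first doc", "second doc"]

# Each element becomes [instruction, text], the input shape
# INSTRUCTOR.encode expects for instruction-conditioned embeddings.
instruction_pairs = [[instruction, t] for t in texts]
```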

    def embed_query(self, text: str) -> List[float]:
        """Compute query embeddings using a HuggingFace transformer model.

        Args:
            text: The text to embed.

        Returns:
            Embeddings for the text.
        """
        text = text.replace("\n", " ")
        if self.model_type == MODEL_TYPE.INSTRUCTION_EMBEDDING:
            # Instructor expects a list of [instruction, text] pairs; take the
            # single result so a 1-D embedding is returned, matching the
            # List[float] return type.
            embedding = self.client.encode([[self.instruction, text]])[0]
        else:
            embedding = self.client.encode(text)
        return embedding.tolist()
15 changes: 15 additions & 0 deletions tests/integration_tests/embeddings/test_huggingface.py
@@ -21,3 +21,18 @@ def test_huggingface_embedding_query() -> None:
    embedding = HuggingFaceEmbeddings()
    output = embedding.embed_query(document)
    assert len(output) == 768

def test_huggingface_instructor_embedding_documents() -> None:
    """Test huggingface instructor embeddings on documents."""
    documents = ["foo bar"]
    embedding = HuggingFaceEmbeddings(
        model_name="hkunlp/instructor-large", instruction="Represent the text"
    )
    output = embedding.embed_documents(documents)
    assert len(output) == 1
    assert len(output[0]) == 768

def test_huggingface_instructor_embedding_query() -> None:
    """Test huggingface instructor embeddings on a query."""
    query = "foo bar"
    embedding = HuggingFaceEmbeddings(
        model_name="hkunlp/instructor-large", instruction="Represent the text"
    )
    output = embedding.embed_query(query)
    assert len(output) == 768