Description
Problem
When I use any of the similarity_search* methods, I cannot retrieve all fields of my Elasticsearch index. A custom doc_builder has no effect, because the fields I want to return are already missing from the hits returned by the underlying search call.
I investigated and found that the fields parameter of the underlying search method is never passed through by the similarity_search* methods.
For example, this is one of the places where it happens:
langchain-elastic/libs/elasticsearch/langchain_elasticsearch/_sync/vectorstores.py
Line 567 in 03da5e1
hits = self._store.search(
I tested this: if you copy the body of the similarity_search* method, pass the field list to the search call, and supply a custom doc_builder, you can retrieve fields other than just your text_field and metadata.
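To illustrate why a custom doc_builder alone cannot help, here is a minimal, dependency-free sketch. `fake_search`, `INDEX`, and the dict-based builder are hypothetical stand-ins, not the library's code; the sketch assumes that the search call restricts `_source` to the text field and metadata unless more fields are requested, which matches the behavior described above.

```python
from typing import Any, Dict, List, Optional

# Hypothetical stand-in for one indexed document; not real library data.
INDEX: List[Dict[str, Any]] = [
    {
        "_id": "1",
        "_source": {
            "text": "hello",
            "metadata": {"lang": "en"},
            "EMBEDDING_VECTOR": [0.1, 0.2],
        },
    },
]


def fake_search(fields: Optional[List[str]] = None) -> List[Dict[str, Any]]:
    """Mimic source filtering: only requested fields survive in `_source`."""
    # Assumption: without explicit fields, only text and metadata come back.
    wanted = set(fields or []) | {"text", "metadata"}
    return [
        {
            "_id": doc["_id"],
            "_source": {k: v for k, v in doc["_source"].items() if k in wanted},
        }
        for doc in INDEX
    ]


def custom_doc_builder(hit: Dict[str, Any]) -> Dict[str, Any]:
    # A plain dict replaces the langchain Document to keep this self-contained.
    return {
        "page_content": hit["_source"].get("text", ""),
        "embedding": hit["_source"].get("EMBEDDING_VECTOR"),
    }


# Same builder, different search arguments.
without_fields = [custom_doc_builder(h) for h in fake_search()]
with_fields = [custom_doc_builder(h) for h in fake_search(["EMBEDDING_VECTOR"])]

print(without_fields[0]["embedding"])  # None: the field never reached the builder
print(with_fields[0]["embedding"])     # [0.1, 0.2]
```

The builder is identical in both calls; only the field list given to the search decides whether the extra field is available.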
Example code
...
vector_store = ...
query_embedding = ...
k = ...
response = vector_store.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding,
    k=k,
)
Hot fix
...
def custom_doc_builder(hit: Dict) -> Document:
    doc = Document(
        page_content=hit["_source"].get("text", ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    # Copy the extra field into the document metadata.
    doc.metadata["EMBEDDING_VECTOR"] = hit["_source"].get("EMBEDDING_VECTOR", "")
    return doc
vector_store = ...
query_embedding = ...
k = ...
filter = ...
custom_query = ...
fields = ["text", "metadata", "EMBEDDING_VECTOR"]
# Call the underlying store directly so that the field list is passed through.
hits = vector_store._store.search(
    query=None,
    query_vector=query_embedding,
    k=k,
    filter=filter,
    fields=fields,
    custom_query=custom_query,
)
docs = _hits_to_docs_scores(
    hits=hits,
    content_field=vector_store.query_field,
    doc_builder=custom_doc_builder,
)
Solution/Fix
I suggest updating this call:
langchain-elastic/libs/elasticsearch/langchain_elasticsearch/_sync/vectorstores.py
Line 567 in 03da5e1
hits = self._store.search(
to this:
...
def similarity_search_by_vector_with_relevance_scores(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[List[Dict]] = None,
    fields: Optional[List[str]] = None,  # <===
    *,
    custom_query: Optional[
        Callable[[Dict[str, Any], Optional[str]], Dict[str, Any]]
    ] = None,
    doc_builder: Optional[Callable[[Dict], Document]] = None,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
    ...
    hits = self._store.search(
        query=None,
        query_vector=embedding,
        k=k,
        filter=filter,
        fields=fields,  # <===
        custom_query=custom_query,
    )
    ...
in every similarity_search* occurrence in this file.
If you agree, I can create a PR with the fix.
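A PR for this could include a small regression test. The following is only a sketch of the idea: `RecordingStore` and `SketchVectorStore` are hypothetical stubs that model nothing but the argument forwarding, and they are not the library's classes.

```python
from typing import Any, Dict, List, Optional


class RecordingStore:
    """Hypothetical stub that records the keyword arguments given to search()."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def search(self, **kwargs: Any) -> List[Dict[str, Any]]:
        self.calls.append(kwargs)
        return []  # No hits are needed to check the forwarding.


class SketchVectorStore:
    """Simplified stand-in for the vector store; only forwarding is modeled."""

    def __init__(self, store: RecordingStore) -> None:
        self._store = store

    def similarity_search_by_vector_with_relevance_scores(
        self,
        embedding: List[float],
        k: int = 4,
        filter: Optional[List[Dict]] = None,
        fields: Optional[List[str]] = None,
        *,
        custom_query: Any = None,
    ) -> List[Dict[str, Any]]:
        # The proposed change: pass `fields` on to the underlying search.
        return self._store.search(
            query=None,
            query_vector=embedding,
            k=k,
            filter=filter,
            fields=fields,
            custom_query=custom_query,
        )


store = RecordingStore()
vector_store = SketchVectorStore(store)
vector_store.similarity_search_by_vector_with_relevance_scores(
    [0.1, 0.2], k=2, fields=["text", "EMBEDDING_VECTOR"]
)
forwarded_fields = store.calls[0]["fields"]
print(forwarded_fields)
```

A real test would use the project's own fixtures; the point is only that each similarity_search* method should hand the caller's field list to the store unchanged.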
EDIT
After thinking about it some more, I believe there should be a standard document builder that looks more like this:
# Specify in user code
fields = ["text", "metadata", "EMBEDDING_VECTOR"]
....
def doc_builder(hit: Dict, fields: List[str]) -> Document:
    # content_field is assumed to be available from the enclosing scope.
    doc = Document(
        page_content=hit["_source"].get(content_field, ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    # Copy every requested field into the document metadata.
    for field_key in fields:
        doc.metadata[field_key] = hit["_source"].get(field_key, None)
    return doc
Then one can simply specify the fields and the documents come out with all of them. I also think that the id of each element should be returned at this point.
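Here is a self-contained sketch of that idea, including the id. The `Document` dataclass is a minimal stand-in for the langchain class, and the `content_field` parameter and the `_id` metadata key are my assumptions, not existing library behavior. Since doc_builder is currently typed as `Callable[[Dict], Document]`, `functools.partial` is used to bind the field list.

```python
from dataclasses import dataclass, field
from functools import partial
from typing import Any, Dict, List


@dataclass
class Document:
    """Minimal stand-in for langchain's Document, to keep this runnable."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def doc_builder(
    hit: Dict[str, Any], fields: List[str], content_field: str = "text"
) -> Document:
    source = hit["_source"]
    doc = Document(
        page_content=source.get(content_field, ""),
        metadata=dict(source.get("metadata", {})),  # copy, do not mutate the hit
    )
    for field_key in fields:
        if field_key in (content_field, "metadata"):
            continue  # already represented by page_content and metadata
        doc.metadata[field_key] = source.get(field_key)
    # Assumption: expose the Elasticsearch document id under "_id".
    doc.metadata["_id"] = hit.get("_id")
    return doc


fields = ["text", "metadata", "EMBEDDING_VECTOR"]
# Bind the field list so the builder matches the one-argument signature.
builder = partial(doc_builder, fields=fields)

hit = {
    "_id": "abc",
    "_source": {
        "text": "hello",
        "metadata": {"lang": "en"},
        "EMBEDDING_VECTOR": [0.1, 0.2],
    },
}
doc = builder(hit)
print(doc.page_content)
print(doc.metadata)
```

Unlike the snippet above, this version skips the content field and the metadata field inside the loop so they are not duplicated in the metadata; that is a design choice, not a requirement.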
This is the current implementation:
def default_doc_builder(hit: Dict) -> Document: