
Elasticsearch 'fields' parameter not usable in similarity_search* methods #62

@krauhen

Problem

When I use any of the similarity_search* methods, I cannot retrieve all fields of my Elasticsearch index. A custom doc_builder has no effect, because the fields I want to return are already missing from the hits returned by the underlying search call.
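To illustrate why the builder alone cannot help, here is a minimal sketch. It assumes the search response carries only the text field and metadata in _source. The hit is hand-written for illustration and is not real Elasticsearch output, and Document is a small stand-in for the LangChain class so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Stand-in for the LangChain Document class (assumption, keeps the sketch runnable)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def custom_doc_builder(hit: Dict[str, Any]) -> Document:
    source = hit["_source"]
    doc = Document(
        page_content=source.get("text", ""),
        metadata=dict(source.get("metadata", {})),
    )
    # .get() yields None when the field was never part of the hit.
    doc.metadata["EMBEDDING_VECTOR"] = source.get("EMBEDDING_VECTOR")
    return doc


# Hypothetical hit: "_source" holds only the text field and metadata.
hit = {
    "_id": "1",
    "_score": 0.9,
    "_source": {"text": "hello", "metadata": {"lang": "en"}},
}
doc = custom_doc_builder(hit)
```

The builder runs without error, but the extra field comes back empty because it was never present in the hit.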

I investigated and found that the fields parameter of the underlying search method is not handled correctly.
For example, this is one of the places where it happens:

I tested this: if you copy the body of a similarity_search* method, pass the field list to the search call, and specify a custom doc_builder, you can retrieve fields other than just your text_field and metadata.

Example code

...
vector_store = ...
query_embedding = ...
k = ...

response = vector_store.similarity_search_by_vector_with_relevance_scores(
    embedding=query_embedding,
    k=k
)

Hot fix

...
def custom_doc_builder(hit: Dict) -> Document:
    doc = Document(
        page_content=hit["_source"].get("text", ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    doc.metadata["EMBEDDING_VECTOR"] = hit["_source"].get("EMBEDDING_VECTOR", "")
    return doc

vector_store = ...
query_embedding = ...
k = ...
filter = ...
custom_query = ...

fields = ["text", "metadata", "EMBEDDING_VECTOR"]

hits = vector_store._store.search(
    query=None,
    query_vector=query_embedding,
    k=k,
    filter=filter,
    fields=fields,
    custom_query=custom_query,
)
docs = _hits_to_docs_scores(
    hits=hits,
    content_field=vector_store.query_field,
    doc_builder=custom_doc_builder,
)

Solution/Fix

So I would suggest updating this line:


to this:

...
def similarity_search_by_vector_with_relevance_scores(
    self,
    embedding: List[float],
    k: int = 4,
    filter: Optional[List[Dict]] = None,
    fields: Optional[List[str]] = None,           # <===
    *,
    custom_query: Optional[
        Callable[[Dict[str, Any], Optional[str]], Dict[str, Any]]
    ] = None,
    doc_builder: Optional[Callable[[Dict], Document]] = None,
    **kwargs: Any,
) -> List[Tuple[Document, float]]:
...
hits = self._store.search(
    query=None,
    query_vector=embedding,
    k=k,
    filter=filter,
    fields=fields,                                      # <===
    custom_query=custom_query,
)
...

in every similarity_search* occurrence in this file.
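A check for this change could be sketched as follows. FakeStore and search_by_vector are hypothetical names used only for this illustration. They are not part of the library and only mimic the shape of the search call shown above.

```python
from typing import Any, Dict, List, Optional


class FakeStore:
    """Hypothetical stand-in for the object behind self._store; records the last call."""

    def __init__(self) -> None:
        self.last_call: Dict[str, Any] = {}

    def search(self, **kwargs: Any) -> List[Dict[str, Any]]:
        self.last_call = kwargs
        return []


def search_by_vector(
    store: FakeStore,
    embedding: List[float],
    k: int = 4,
    filter: Optional[List[Dict]] = None,
    fields: Optional[List[str]] = None,
) -> List[Dict[str, Any]]:
    # Forward `fields` to the store instead of dropping it.
    return store.search(
        query=None,
        query_vector=embedding,
        k=k,
        filter=filter,
        fields=fields,
    )


store = FakeStore()
wanted = ["text", "metadata", "EMBEDDING_VECTOR"]
search_by_vector(store, [0.1, 0.2], k=2, fields=wanted)
```

After the call, store.last_call["fields"] holds the requested list, which is the behavior the proposed signature change is meant to guarantee.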

If you agree, I can create a PR with the fix.

EDIT

I thought about it some more, and I think in general there should be a standard document builder that looks more like this:

# Specify in user code
fields = ["text", "metadata", "EMBEDDING_VECTOR"]
....
def doc_builder(hit: Dict, fields: List[str]) -> Document:
    doc = Document(
        page_content=hit["_source"].get(content_field, ""),
        metadata=hit["_source"].get("metadata", {}),
    )
    for field_key in fields:
        doc.metadata[field_key] = hit["_source"].get(field_key, None)
    return doc

Then one can simply specify the fields, and the documents come out with all of them. I also think that at this point the id of each element should somehow be returned as well.
This is the current implementation:

def default_doc_builder(hit: Dict) -> Document:
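As a rough sketch of the idea from the EDIT, a builder that copies the requested fields and also returns the id could look like the following. make_doc_builder is a hypothetical helper name, Document is a stand-in for the LangChain class, and storing the id under the "_id" metadata key is an assumption, not existing library behavior.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Document:
    """Stand-in for the LangChain Document class (assumption, keeps the sketch runnable)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_doc_builder(
    content_field: str, fields: List[str]
) -> Callable[[Dict[str, Any]], Document]:
    """Return a builder that copies every requested field into the metadata."""

    def doc_builder(hit: Dict[str, Any]) -> Document:
        source = hit.get("_source", {})
        metadata = dict(source.get("metadata", {}))
        for key in fields:
            # The content field and the metadata dict are handled separately.
            if key in (content_field, "metadata"):
                continue
            metadata[key] = source.get(key)
        # Assumption: expose the Elasticsearch document id under "_id".
        metadata["_id"] = hit.get("_id")
        return Document(
            page_content=source.get(content_field, ""),
            metadata=metadata,
        )

    return doc_builder


builder = make_doc_builder("text", ["text", "metadata", "EMBEDDING_VECTOR"])
doc = builder(
    {
        "_id": "abc",
        "_source": {
            "text": "hello",
            "metadata": {"lang": "en"},
            "EMBEDDING_VECTOR": [0.1, 0.2],
        },
    }
)
```

Binding the field list in a closure keeps the builder's call signature at a single hit argument, so it stays compatible with the doc_builder parameter shown in the signature above.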
