Conversation

@krauhen commented Mar 5, 2025

See Issue #62 for the description of this PR.
@giuliabaldini

filter: Optional[List[dict]] = None,
*,
fields: Optional[List[str]] = None,

There's still a random indent here.

page_content=hit["_source"].get(content_field, ""),
metadata=hit["_source"].get("metadata", {}),
)
for field_key in fields:

Lines 84-88 (now 87-91) do exactly that, so wouldn't lines 80-88 be removed then? Maybe there are other cases in which people create more sophisticated document builders, and the fields should be added anyway as they are being added in those lines.

In my current code I just made the changes from above and everything works as expected without any changes in this file.

Maybe the maintainers can have a look at this.
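For context, the behavior being proposed can be sketched with plain dictionaries standing in for Elasticsearch hits. The helper name and merge order below are illustrative only, not the PR's actual code:

```python
# Illustrative sketch: build a document's metadata from the stored
# "metadata" object, then copy any extra top-level _source fields
# requested via `fields`. Names here are hypothetical and only mirror,
# not reproduce, the PR's implementation.

def hit_to_doc(hit, content_field="text", fields=None):
    source = hit["_source"]
    page_content = source.get(content_field, "")
    metadata = dict(source.get("metadata", {}))  # copy, don't mutate the hit
    for field_key in fields or []:
        if field_key in source:
            metadata[field_key] = source[field_key]
    return page_content, metadata


hit = {"_source": {"text": "hello", "metadata": {"a": 1}, "author": "kim"}}
content, meta = hit_to_doc(hit, fields=["author"])
# content == "hello", meta == {"a": 1, "author": "kim"}
```

With `fields=None` the function reduces to the existing behavior, which is why the change is backwards compatible for indices that only use the `metadata` object.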


miguelgrinberg commented Mar 12, 2025

I'm not sure this is needed. I find it confusing that this change is mixing up independent fields (which are not supported by LangChain) with metadata fields, which are the proper way to store additional fields when working with LangChain.

How do you plan on addressing collisions if you are just adding standalone fields to the search results as metadata?

Maybe I need some clarification on how you ended up with an Elasticsearch index that has extra fields, given that the LangChain integration does not add them, so that I better understand the use case that your PR is trying to address. Thanks!

@sbartlett97

> I'm not sure this is needed. I find it confusing that this change is mixing up independent fields (which are not supported by LangChain) with metadata fields, which are the proper way to store additional fields when working with LangChain.
>
> How do you plan on addressing collisions if you are just adding standalone fields to the search results as metadata?
>
> Maybe I need some clarification on how you ended up with an Elasticsearch index that has extra fields, given that the LangChain integration does not add them, so that I better understand the use case that your PR is trying to address. Thanks!

I think the changes to the vectorstore.py file are necessary, as it makes sense to be able to specify fields to be extracted as metadata universally; I've had to patch a version locally because I need this functionality. However, I agree that the changes to _utilities.py are unnecessary. @krauhen, would you be able to remove those changes so the maintainers can review/approve this PR?


miguelgrinberg commented Apr 23, 2025

@sbartlett97 I understand that this solves a problem that you and a few others have, but you are not addressing the concerns I raised. LangChain has its own solution for extra fields attached to a document, and this integration tries to adapt to that solution.

Please correct me if I'm wrong, but my guess is that you are trying to use an Elasticsearch index that was not created with the LangChain integration. My thinking is that there are two options available to you:

  • Create your Elasticsearch index using this integration with LangChain, which will ensure that your metadata fields are written in the way that LangChain likes and that this integration fully supports.
  • Keep using your Elasticsearch as it is, but drop this LangChain integration and use the Elasticsearch client package directly. The ES client is lower level and does not impose any constraints on how your document and metadata should be structured.
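For illustration, the second option might look roughly like the sketch below. The index name, query, and field names are hypothetical; the actual client call is shown only as a comment, with a minimal stand-in response used in its place:

```python
# Sketch of querying Elasticsearch with the official client and shaping
# the results yourself, with no LangChain document conventions imposed.
#
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# resp = es.search(index="my-index", query={"match": {"body": "hello"}})

# Minimal stand-in for the client's response shape:
resp = {
    "hits": {
        "hits": [
            {"_score": 1.2, "_source": {"body": "hello world", "author": "kim"}},
        ]
    }
}

docs = [
    # Any _source layout works: there is no required "metadata" object.
    (hit["_source"]["body"],
     {k: v for k, v in hit["_source"].items() if k != "body"})
    for hit in resp["hits"]["hits"]
]
# docs == [("hello world", {"author": "kim"})]
```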

My concerns with this change:

  • This change is asymmetrical. The change adds a second format only for the search/read side, but the write side is not being updated in the same way.
  • Given that there is a metadata format already implemented and this change is adding a second format that is not compatible, I don't see a clear way to handle metadata field collisions if an index were to have both types of metadata, either intentionally or more likely by mistake, given the above issue with asymmetry between the read and the write sides.

If I'm missing anything with my analysis please clarify.
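The collision concern can be made concrete with a small sketch (the document shown is hypothetical): if an index stores a standalone field and a metadata key with the same name, a read-side merge has to silently pick a winner.

```python
# Hypothetical index document where the standalone field "source" and
# the metadata key "source" collide. Whichever value is written last
# silently wins, and nothing on the write side prevents this state.
hit = {
    "_source": {
        "text": "some content",
        "source": "raw-ingest-pipeline",            # standalone field
        "metadata": {"source": "langchain-loader"},  # LangChain-style metadata
    }
}

metadata = dict(hit["_source"]["metadata"])
# Read-side merge of requested standalone fields:
for key in ["source"]:
    metadata[key] = hit["_source"][key]  # overwrites the metadata value

# metadata["source"] is now "raw-ingest-pipeline"; the original
# "langchain-loader" value is lost with no warning.
```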

@sbartlett97
Copy link

@miguelgrinberg I understand your concerns, and indeed, my use case is integrating data from an existing Elasticsearch index that wasn't set up using this package.

>   • This change is asymmetrical. The change adds a second format only for the search/read side, but the write side is not being updated in the same way.

However, if the fields functionality is already implemented in the max_marginal_relevance_search, then why not the other retrieval methods?

>   • Given that there is a metadata format already implemented and this change is adding a second format that is not compatible, I don't see a clear way to handle metadata field collisions if an index were to have both types of metadata, either intentionally or more likely by mistake, given the above issue with asymmetry between the read and the write sides.

Also, the changes I proposed were to leave the metadata handling in _hits_to_docs_scores at its current implementation, while just adding the ability to specify the specific fields to use if they are not stored under metadata in the index. If this is an issue, then surely the current implementation in max_marginal_relevance should be removed as well, since in theory it has the same implications for your concerns about metadata collisions?

@pquentin changed the title Elastisearch 'fields' parameter not usable in similarity_search* methods Elasticsearch 'fields' parameter not usable in similarity_search* methods Apr 25, 2025

miguelgrinberg commented Apr 27, 2025

> However, if the fields functionality is already implemented in the max_marginal_relevance_search, then why not the other retrieval methods?

I need you to clarify why you think this is an inconsistency. The max_marginal_relevance_search method is not in this package, it is in the Elasticsearch client. As I said above, the ES client does not use the LangChain document structure.

I guess it is possible that there are inconsistencies in this LangChain integration (which has been worked on by several people and not just by myself), but at least the one example you pointed out does not seem inconsistent. As I said above, if the support we provide directly in our official client works better for your needs, then there is really no need for you to be using the LangChain integration.

Edit: sorry, actually I do see the inconsistency now, which exists in the max_marginal_relevance() function that accepts fields as an argument in this package. Unfortunately my thinking does not change; this is a bit problematic as it does not provide a method of handling collisions, but we'll have a look at this.

But as I mentioned several times above, why do you need this integration package if you can just work with the ES client directly and not be constrained to the LangChain document structure which you obviously do not care about?
