Conversation

@krauhen commented Mar 5, 2025

See Issue #62 for the description of this PR.
@giuliabaldini

filter: Optional[List[dict]] = None,
*,
fields: Optional[List[str]] = None,

There's still a random indent here.

page_content=hit["_source"].get(content_field, ""),
metadata=hit["_source"].get("metadata", {}),
)
for field_key in fields:

Lines 84-88 (now 87-91) do exactly that, so wouldn't lines 80-88 be removed then? Maybe there are other cases in which people create more sophisticated document builders, and the fields should be added anyway as they are being added in those lines.

In my current code I just made the changes from above and everything works as expected without any changes in this file.

Maybe the maintainers can have a look at this.
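For context, the behavior being proposed can be sketched with plain dictionaries standing in for Elasticsearch hits. The helper name and merge order below are illustrative only, not the PR's actual code:

```python
# Illustrative sketch: build a document's metadata from the stored
# "metadata" object, then copy any extra top-level _source fields
# requested via `fields`. Names here are hypothetical and only mirror,
# not reproduce, the PR's implementation.

def hit_to_doc(hit, content_field="text", fields=None):
    source = hit["_source"]
    page_content = source.get(content_field, "")
    metadata = dict(source.get("metadata", {}))  # copy, don't mutate the hit
    for field_key in fields or []:
        if field_key in source:
            metadata[field_key] = source[field_key]
    return page_content, metadata


hit = {"_source": {"text": "hello", "metadata": {"a": 1}, "author": "kim"}}
content, meta = hit_to_doc(hit, fields=["author"])
# content == "hello", meta == {"a": 1, "author": "kim"}
```

With `fields=None` the function reduces to the existing behavior, which is why the change is backwards compatible for indices that only use the `metadata` object.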


miguelgrinberg commented Mar 12, 2025

I'm not sure this is needed. I find it confusing that this change is mixing up independent fields (which are not supported by LangChain) with metadata fields, which are the proper way to store additional fields when working with LangChain.

How do you plan on addressing collisions if you are just adding standalone fields to the search results as metadata?

Maybe I need some clarification on how you ended up with an Elasticsearch index that has extra fields, given that the LangChain integration does not add them, so that I better understand the use case that your PR is trying to address. Thanks!

@sbartlett97

> I'm not sure this is needed. I find it confusing that this change is mixing up independent fields (which are not supported by LangChain) with metadata fields, which are the proper way to store additional fields when working with LangChain.
>
> How do you plan on addressing collisions if you are just adding standalone fields to the search results as metadata?
>
> Maybe I need some clarification on how you ended up with an Elasticsearch index that has extra fields, given that the LangChain integration does not add them, so that I better understand the use case that your PR is trying to address. Thanks!

I think the changes to the vectorstore.py file are necessary, as it makes sense to be able to specify fields to be extracted as metadata universally; I've had to patch a version locally because I need this functionality. However, I agree that the changes to _utilities.py are unnecessary. @krauhen, would you be able to remove those changes so the maintainers can review/approve this PR?


miguelgrinberg commented Apr 23, 2025

@sbartlett97 I understand that this solves a problem that you and a few others have, but you are not addressing the concerns I raised. LangChain has its own solution for extra fields attached to a document, and this integration tries to adapt to that solution.

Please correct me if I'm wrong, but my guess is that you are trying to use an Elasticsearch index that was not created with the LangChain integration. My thinking is that there are two options available to you:

  • Create your Elasticsearch index using this integration with LangChain, which will ensure that your metadata fields are written in the way that LangChain likes and that this integration fully supports.
  • Keep using your Elasticsearch as it is, but drop this LangChain integration and use the Elasticsearch client package directly. The ES client is lower level and does not impose any constraints on how your document and metadata should be structured.
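For illustration, the second option might look roughly like the sketch below. The index name, query, and field names are hypothetical; the actual client call is shown only as a comment, with a minimal stand-in response used in its place:

```python
# Sketch of querying Elasticsearch with the official client and shaping
# the results yourself, with no LangChain document conventions imposed.
#
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# resp = es.search(index="my-index", query={"match": {"body": "hello"}})

# Minimal stand-in for the client's response shape:
resp = {
    "hits": {
        "hits": [
            {"_score": 1.2, "_source": {"body": "hello world", "author": "kim"}},
        ]
    }
}

docs = [
    # Any _source layout works: there is no required "metadata" object.
    (hit["_source"]["body"],
     {k: v for k, v in hit["_source"].items() if k != "body"})
    for hit in resp["hits"]["hits"]
]
# docs == [("hello world", {"author": "kim"})]
```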

My concerns with this change:

  • This change is asymmetrical. The change adds a second format only for the search/read side, but the write side is not being updated in the same way.
  • Given that there is a metadata format already implemented and this change is adding a second format that is not compatible, I don't see a clear way to handle metadata field collisions if an index were to have both types of metadata, either intentionally or more likely by mistake, given the above issue with asymmetry between the read and the write sides.

If I'm missing anything with my analysis please clarify.
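The collision concern can be made concrete with a small sketch (the document shown is hypothetical): if an index stores a standalone field and a metadata key with the same name, a read-side merge has to silently pick a winner.

```python
# Hypothetical index document where the standalone field "source" and
# the metadata key "source" collide. Whichever value is written last
# silently wins, and nothing on the write side prevents this state.
hit = {
    "_source": {
        "text": "some content",
        "source": "raw-ingest-pipeline",            # standalone field
        "metadata": {"source": "langchain-loader"},  # LangChain-style metadata
    }
}

metadata = dict(hit["_source"]["metadata"])
# Read-side merge of requested standalone fields:
for key in ["source"]:
    metadata[key] = hit["_source"][key]  # overwrites the metadata value

# metadata["source"] is now "raw-ingest-pipeline"; the original
# "langchain-loader" value is lost with no warning.
```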

@sbartlett97
Copy link

@miguelgrinberg I understand your concerns, and indeed, my use case is integrating data from an existing Elasticsearch index that wasn't set up using this package.

>   • This change is asymmetrical. The change adds a second format only for the search/read side, but the write side is not being updated in the same way.

However, if the fields functionality is already implemented in the max_marginal_relevance_search, then why not the other retrieval methods?

>   • Given that there is a metadata format already implemented and this change is adding a second format that is not compatible, I don't see a clear way to handle metadata field collisions if an index were to have both types of metadata, either intentionally or more likely by mistake, given the above issue with asymmetry between the read and the write sides.

Also, the changes I proposed were to leave the metadata handling in _hits_to_docs_scores at its current implementation, while just adding the ability to specify the specific fields to use if they are not stored under metadata in the index. If this is an issue, then surely the current implementation in max_marginal_relevance should be removed as well, since in theory it has the same implications for your concerns about metadata collisions?

@pquentin changed the title Elastisearch 'fields' parameter not usable in similarity_search* methods Elasticsearch 'fields' parameter not usable in similarity_search* methods Apr 25, 2025

miguelgrinberg commented Apr 27, 2025

> However, if the fields functionality is already implemented in the max_marginal_relevance_search, then why not the other retrieval methods?

I need you to clarify why you think this is an inconsistency. The max_marginal_relevance_search method is not in this package, it is in the Elasticsearch client. As I said above, the ES client does not use the LangChain document structure.

I guess it is possible that there are inconsistencies in this LangChain integration (which has been worked on by several people and not just by myself), but at least the one example you pointed out does not seem inconsistent. As I said above, if the support we provide directly in our official client works better for your needs, then there is really no need for you to be using the LangChain integration.

Edit: sorry, actually I do see the inconsistency now, which exists in the max_marginal_relevance() function that accepts fields as an argument in this package. Unfortunately my thinking does not change; this is a bit problematic as it does not provide a method of handling collisions, but we'll have a look at this.

But as I mentioned several times above, why do you need this integration package if you can just work with the ES client directly and not be constrained to the LangChain document structure which you obviously do not care about?
