Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add metadata filtering support and fix multi-document and metadata issue #22

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Athe-kunal
Copy link

This PR includes the following changes

  1. If we pass multiple metadata for each document in a folder, the add_to_index will error out because in the below code, it tries to index the ith element, but it is a dictionary, hence it will raise KeyError. Fixed it by renaming metadata as doc_metadata and removed the indexing
for i, item in enumerate(input_items):
     current_doc_id = doc_ids[i] if doc_ids else self.highest_doc_id + 1 + i
     current_metadata = metadata[i] if metadata else None
  1. Added support for filtering based on metadata, so that users can pass a filter_metadata field in the search and get ColPali will only search from these documents.
results = model.search(query, k=5,filter_metadata={"file_name":"attention.pdf"})


print(f"Search results for '{query}':")
for result in results:
   print(f"Doc ID: {result.doc_id}, Page: {result.page_num}, Score: {result.score}")

Results

Search results for 'what's the BLEU score of this new strange method?':
Doc ID: 4, Page: 9, Score: 19.75
Doc ID: 4, Page: 8, Score: 19.5
Doc ID: 4, Page: 11, Score: 17.75
Doc ID: 4, Page: 1, Score: 17.625
Doc ID: 4, Page: 14, Score: 17.375

Now if we check doc ID to metadata

model.model.doc_id_to_metadata

Results

{0: {'file_name': 'attention_table.png'},
 1: {'file_name': 'product_c.png'},
 2: {'file_name': 'financial_report.pdf'},
 3: {'file_name': 'attention_with_a_mustache.pdf'},
 4: {'file_name': 'attention.pdf'}}

It only pulled from Doc ID 4, which we intended.

Please let me know about your suggestions. Looking forward to your collaboration

@Athe-kunal
Copy link
Author

@bclavie
Let me know about the PR, thanks

@bclavie
Copy link
Contributor

bclavie commented Sep 23, 2024

Hey, thanks for this, this PR is helpful!

I have just a couple change requests

  • The policy is to never deprecate/modify an API, unless it is absolutely critical. As such, could you revert the changes so doc_metadata is back to metadata? The slight added clarity isn't worth making a breaking change.
  • Could you update the code to be compatible with the rest of the updated codebase (using the newer colpali engine)? Apologies for this, things move fast and the backend API changed slightly for which we acommodated!

I'll play with this further in the next few days just to make sure nothing's broken. Thank you again!

@nuschandra
Copy link

nuschandra commented Sep 28, 2024

@bclavie @Athe-kunal I was trying to use byaldi for my project and realized that this PR still has an issue with the metadata when there are multiple documents.

In line 340,
doc_metadata = metadata[doc_id] if metadata else None -> We already get the metadata associated with the given doc_id which is a dictionary.

In line 406, we once again do current_metadata = metadata[i] if metadata else None -> This will give a key error.

This happens specifically in v0.0.4

@Athe-kunal
Copy link
Author

Hi @nuschandra and @bclavie
Sorry had a busy week. I got back to this issue and will implement the changes

@Athe-kunal
Copy link
Author

@bclavie
I have addressed the PR comments, and also updated the tutorial quick_overview.ipynb for metadata based filtering. Looking forward to your suggestions.

@Athe-kunal
Copy link
Author

@nuschandra
The bug has been fixed in my current fork. Hoping to merge it soon. Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants