The raw() of DenseSearchResult and PRFDenseSearchResult #2057

Open
Xnhyacinth opened this issue Jan 8, 2025 · 5 comments

Comments

@Xnhyacinth

Hello. I wonder whether this works only with prebuilt indexes. The code below raises the following error: AttributeError: 'FaissSearcher' object has no attribute 'ssearcher'. Did you mean: 'search'?

from pyserini.search.faiss import FaissSearcher
from pyserini.search.lucene import LuceneSearcher

# Dense searcher over a local Faiss index
searcher = FaissSearcher(
    'indexes/index',                # path to the index
    'facebook/contriever-msmarco',  # model name
)
# hits = searcher.search('what is a lobster roll')
searcher.doc(0)  # this call raises the AttributeError

I cannot get the raw content through doc(doc_id), although searcher.num_docs works.

I notice that usage_fetch.md only shows how to fetch raw content with LuceneSearcher. Do I need to index the corpus twice and load two searchers?

@lintool
Member

lintool commented Jan 8, 2025

Hi @Xnhyacinth thanks for opening this issue. ssearcher seems like a reference to a class that's been deprecated/refactored... so this might be a bug.

However, the actual answer to your question is that only LuceneSearcher provides access to the raw text of the corpus. The Faiss indexes don't, and the implementation for prebuilt indexes should simply dispatch to the corresponding Lucene prebuilt index.
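For anyone landing here with the same question, the two-searcher setup described above could look roughly like the sketch below. It is only an illustration, not Pyserini's own implementation: `search_with_raw` is a hypothetical helper, the second index path is a placeholder, and it assumes a Lucene index built over the same corpus so that docids match between the two indexes.

```python
from typing import Any, Dict, List


def search_with_raw(dense_searcher: Any, text_searcher: Any,
                    query: str, k: int = 10) -> List[Dict[str, Any]]:
    """Retrieve with a dense searcher, then look up raw text in a Lucene index.

    Assumes both indexes were built from the same corpus, so docids match.
    """
    results = []
    for hit in dense_searcher.search(query, k=k):
        doc = text_searcher.doc(hit.docid)  # None if the docid is absent
        results.append({
            'docid': hit.docid,
            'score': hit.score,
            'raw': doc.raw() if doc is not None else None,
        })
    return results


# Hypothetical usage; 'indexes/lucene-index' is a placeholder path:
# from pyserini.search.faiss import FaissSearcher
# from pyserini.search.lucene import LuceneSearcher
# dense = FaissSearcher('indexes/index', 'facebook/contriever-msmarco')
# text = LuceneSearcher('indexes/lucene-index')
# rows = search_with_raw(dense, text, 'what is a lobster roll')
```

The helper only relies on `search()`, `doc()`, and `raw()`, so the dense index is used for ranking and the Lucene index purely as a document store.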

@Xnhyacinth
Author

Hi @lintool, thanks for your answer. Another issue is retrieval time. Retrieval with FaissSearcher takes a very long time: processing 750k data requires more than 200 hours. Is this normal, or do I need to optimize it somehow?

@lintool
Member

lintool commented Jan 9, 2025

> I found that it takes a large amount of time to retrieve using faiss searcher: processing 750k data requires more than 200 hours. Is this normal or do I need to optimize it somehow?

What does "750k data" mean? What does this translate into in terms of queries per second?

@Xnhyacinth
Author

Xnhyacinth commented Jan 12, 2025

> What does "750k data" mean? What does this translate into in terms of queries per second?

“750k data” represents 750k queries. I found that FaissSearcher requires 200+ hours on the enwiki 2021 corpus, while LuceneSearcher only takes 1 hour.

FaissSearcher takes more than 2 seconds to process a single query, which is a significant delay.
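To put the totals in terms of queries per second, here is the rough arithmetic, treating "200+ hours" as exactly 200 hours (a lower bound) and assuming the queries run one after another. `queries_per_second` is just an illustrative helper.

```python
def queries_per_second(num_queries: int, hours: float) -> float:
    """Average throughput for a run of num_queries that took `hours` hours."""
    return num_queries / (hours * 3600)


faiss_qps = queries_per_second(750_000, 200)  # about 1.04 queries/s
lucene_qps = queries_per_second(750_000, 1)   # about 208 queries/s
faiss_latency = 1 / faiss_qps                 # 0.96 s per query on average

# At 2 s per query, 750k sequential queries would need about 417 hours.
hours_at_2s = 750_000 * 2 / 3600
```

So 200 hours corresponds to about one query per second, while a latency above 2 seconds per query would put the full run above roughly 417 hours.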

@lintool
Member

lintool commented Jan 12, 2025

I see. We haven't profiled the performance of FaissSearcher, but there's likely room for improvement...
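As a starting point for profiling, a small timing harness along these lines could measure throughput on a sample of queries. This is a generic sketch, not part of Pyserini: `measure_throughput` is a hypothetical helper that accepts any callable, and `sample_queries` in the usage comment is a placeholder.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_throughput(search_fn: Callable[[str], object],
                       queries: Iterable[str]) -> Tuple[float, float]:
    """Return (queries per second, mean seconds per query) for search_fn."""
    count = 0
    start = time.perf_counter()
    for query in queries:
        search_fn(query)
        count += 1
    elapsed = time.perf_counter() - start
    if count == 0:
        raise ValueError('need at least one query')
    return count / elapsed, elapsed / count


# Hypothetical usage with an already constructed searcher:
# qps, latency = measure_throughput(
#     lambda q: searcher.search(q, k=10), sample_queries)
```

Running it on a few hundred queries should be enough to check the per-query figure before committing to the full 750k-query run.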
