The raw() of DenseSearchResult and PRFDenseSearchResult #2057

Xnhyacinth · 2025-01-08T06:02:49Z

Hello. Does this only apply to prebuilt indexes? Running the code below raises AttributeError: 'FaissSearcher' object has no attribute 'ssearcher'. Did you mean: 'search'?

from pyserini.search.faiss import FaissSearcher
from pyserini.search.lucene import LuceneSearcher
searcher = FaissSearcher( # dense
    'indexes/index', # file path
    'facebook/contriever-msmarco' # model name
    )
#hits = searcher.search('what is a lobster roll')
searcher.doc(0)

I cannot get the raw content through doc(doc_id), although searcher.num_docs works.

I notice that usage_fetch.md only shows how to fetch raw content with LuceneSearcher. Do I need to index the corpus twice and load two searchers?


lintool · 2025-01-08T14:22:31Z

Hi @Xnhyacinth thanks for opening this issue. ssearcher seems like a reference to a class that's been deprecated/refactored... so this might be a bug.

However, the actual answer to your question is that only LuceneSearcher provides access to the raw text of the corpus. The Faiss indexes don't, and the implementation for prebuilt indexes should simply dispatch to the corresponding Lucene prebuilt index.
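A minimal sketch of the two-searcher pattern this implies: search with the dense index, then look up each hit's raw text in a Lucene index. It assumes a Lucene index built over the same corpus with the same docids and with raw documents stored; the path 'indexes/lucene-index' and the helper name search_with_raw are hypothetical, not part of Pyserini.

```python
def search_with_raw(dense_searcher, lucene_searcher, query, k=10):
    """Run a dense search, then fetch each hit's raw text from a Lucene index.

    Both searchers are passed in, so any objects exposing search() and doc()
    in the Pyserini style will work.
    """
    results = []
    for hit in dense_searcher.search(query, k=k):
        doc = lucene_searcher.doc(hit.docid)  # None when the docid is not in the index
        raw = doc.raw() if doc is not None else None
        results.append((hit.docid, hit.score, raw))
    return results


def main():
    # Requires pyserini. The Lucene index path below is a hypothetical example.
    from pyserini.search.faiss import FaissSearcher
    from pyserini.search.lucene import LuceneSearcher

    dense = FaissSearcher('indexes/index', 'facebook/contriever-msmarco')
    lucene = LuceneSearcher('indexes/lucene-index')
    for docid, score, raw in search_with_raw(dense, lucene, 'what is a lobster roll'):
        print(docid, score, raw)
```

Call main() once both indexes exist. The cost is indexing the corpus twice, once for Faiss and once for Lucene, which matches the question asked above.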

Xnhyacinth · 2025-01-09T05:31:24Z

Hi @lintool, thanks for your answer. Another issue is retrieval time. Retrieving with the Faiss searcher is very slow: processing 750k data requires more than 200 hours. Is this normal, or do I need to optimize it somehow?

lintool · 2025-01-09T14:33:04Z

> I found that it takes a large amount of time to retrieve using faiss searcher: processing 750k data requires more than 200 hours. Is this normal or do I need to optimize it somehow.

What does "750k data" mean? What does this translate into in terms of queries per second?

Xnhyacinth · 2025-01-12T13:21:04Z

> > I found that it takes a large amount of time to retrieve using faiss searcher: processing 750k data requires more than 200 hours. Is this normal or do I need to optimize it somehow.
>
> What does "750k data" mean? What does this translate into in terms of queries per second?

“750k data” means 750k queries. On the enwiki 2021 corpus, FaissSearcher needs 200+ hours for them, while LuceneSearcher takes only 1 hour.

FaissSearcher takes more than 2 seconds per query, which is a significant delay.
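To put the figures above in queries per second, here is a small worked calculation. It takes the stated 200 hours and 1 hour at face value and assumes the 750k queries ran sequentially; the function name throughput is only for illustration.

```python
def throughput(num_queries, hours):
    """Return (queries per second, seconds per query) for a run of the given length."""
    seconds = hours * 3600
    return num_queries / seconds, seconds / num_queries


faiss_qps, faiss_latency = throughput(750_000, 200)
lucene_qps, lucene_latency = throughput(750_000, 1)

print(f"FaissSearcher:  {faiss_qps:.2f} queries/s, {faiss_latency:.3f} s/query")
print(f"LuceneSearcher: {lucene_qps:.2f} queries/s, {lucene_latency:.4f} s/query")

# At the observed 2 s/query, the full run would take this many hours:
hours_at_2s = 750_000 * 2 / 3600
print(f"750k queries at 2 s each: {hours_at_2s:.0f} hours")
```

So 200 hours corresponds to roughly 1 query per second, against roughly 208 per second for Lucene. A latency above 2 seconds per query would put the full run above 400 hours, so "200+" is a lower bound.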

lintool · 2025-01-12T20:13:06Z

I see. We haven't profiled the performance of FaissSearcher, but there's likely room for improvement...
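One thing that may be worth trying, as an untested assumption rather than a measured fix, is sending queries in batches through FaissSearcher.batch_search instead of calling search once per query. A minimal sketch follows; the helper name batched_dense_search, the batch size, and the thread count are illustrative choices, not recommendations from the maintainers.

```python
def batched_dense_search(searcher, queries, k=10, batch_size=64, threads=4):
    """Search a list of query strings in fixed-size batches.

    Returns a dict mapping a query id (the query's position, as a string)
    to the list of hits returned for that query.
    """
    results = {}
    for start in range(0, len(queries), batch_size):
        chunk = queries[start:start + batch_size]
        # Use the position in the full list as a stable query id.
        q_ids = [str(i) for i in range(start, start + len(chunk))]
        results.update(searcher.batch_search(chunk, q_ids, k=k, threads=threads))
    return results
```

Whether this narrows the gap to LuceneSearcher depends on the index type, the encoder, and the hardware, so it needs to be timed on a small sample of queries first.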

Xnhyacinth mentioned this issue Jan 8, 2025

Add raw() to DenseSearchResult and PRFDenseSearchResult #1876

Closed
