Retrieving speed for large set of documents #25
Comments
@StalVars what's the latency you got for ~400K documents? Also, what's the memory usage? |
@dorianbrown the library is slow at retrieving from ~350K samples. Can you please advise on what to do here? |
You may want to use a library such as Gensim, which builds a dictionary mapping from words to ids and then indexes the documents using a sparse matrix. This makes indexing slower, but retrieval is much faster than rank-bm25, because it can use fast matrix operations: piskvorky/gensim#3304 Alternatively, use industry-strength packages such as pyserini or elasticsearch. Rank-bm25 is not built for speed. |
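To illustrate the idea described above (a word-to-id dictionary plus a sparse index built once at indexing time), here is a minimal pure-Python sketch. It is not Gensim's actual implementation: nested dictionaries stand in for a real sparse matrix, the function names `build_index` and `search` are hypothetical, and the non-negative IDF formula and the `k1`/`b` defaults are assumptions that may differ from what rank-bm25 uses.

```python
import math
from collections import Counter, defaultdict


def build_index(tokenized_docs, k1=1.5, b=0.75):
    """Build a word -> id dictionary and a sparse term -> {doc: weight} index."""
    n_docs = len(tokenized_docs)
    avgdl = sum(len(doc) for doc in tokenized_docs) / n_docs
    term_ids = {}                 # word -> integer id
    postings = defaultdict(dict)  # term id -> {doc id: BM25 weight}

    for doc_id, doc in enumerate(tokenized_docs):
        # Length normalisation depends only on the document, so compute it once.
        norm = k1 * (1 - b + b * len(doc) / avgdl)
        for word, freq in Counter(doc).items():
            tid = term_ids.setdefault(word, len(term_ids))
            postings[tid][doc_id] = freq * (k1 + 1) / (freq + norm)

    # Fold the IDF into the stored weights (assumed non-negative IDF variant).
    for docs in postings.values():
        df = len(docs)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        for doc_id in docs:
            docs[doc_id] *= idf
    return term_ids, postings, n_docs


def search(query_tokens, term_ids, postings, top_k=3):
    """Score only the documents that contain at least one query term."""
    scores = defaultdict(float)
    for word in query_tokens:
        tid = term_ids.get(word)
        if tid is None:  # word never seen at indexing time
            continue
        for doc_id, weight in postings[tid].items():
            scores[doc_id] += weight
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


corpus = [["hello", "world"], ["fast", "bm25", "search"], ["hello", "bm25"]]
term_ids, postings, n_docs = build_index(corpus)
print(search(["bm25", "search"], term_ids, postings))
```

Indexing does more work up front, but a query only touches the postings of its own terms instead of looping over every document. A real sparse-matrix implementation would replace the inner loops with vectorised operations.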
thanks @Witiko. What's the downside of using pyserini? Also what's the use case of Rank-bm25? Ease of use? |
Pyserini is a Python binding for the Anserini Java library. Therefore, you need to have Java installed, which makes Pyserini more difficult to install than rank-bm25, which is essentially just a single Python file. |
Rank-bm25 is a simple solution for use cases where speed is not a concern. |
@Witiko thanks a lot for the feedback, really appreciate it 🙏 So there is no python solution that is fast? |
Sounds like I need to try Gensim. But if loading documents is slow with Gensim, it's not a good fit for me either. I need a response time under ~250 ms during retrieval. |
Gensim is a pure Python solution that uses accelerated Python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:
See piskvorky/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon. |
Great. Would I get the same retrieval accuracy with Gensim in comparison to |
@Witiko what would Gensim's performance be when loading the 500K documents? Would that be competitive with |
The algorithm is exactly the same as in rank-bm25, so accuracy should also be the same. Loading may be slightly slower than in rank-bm25, because we need to build a dictionary, but retrieval should be significantly faster. Try it out and let me know. |
I ran the command But it does not install and fails with the following error message:
|
I am not sure what the issue is. It works fine in the python:3.7 Docker image. |
@Witiko when I install like Thanks for the feedback. I will try with python:3.7 |
It's not an issue with the branch. The release version of gensim has a precompiled wheel, which circumvents your compiler. |
A little tangential, but I found another interesting speed issue. I made a refactored/simplified version of BM25Okapi from rank_bm25 at https://github.com/jankovicsandras/plpgsql_bm25/blob/main/mybm25okapi.py (e.g. it's possible to compute half of lines 119-120 in rank_bm25.py beforehand: score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) / (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl))) ) |
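The precomputation idea from the comment above can be sketched as follows. This is a minimal illustration, not the code from rank_bm25 or from the linked repositories: the helper names (`toy_idf`, `naive_scores`, `precompute`, `fast_scores`), the `K1`/`B` values, and the IDF formula are assumptions made for the example. Only the per-term expression mirrors the one quoted in the comment.

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # assumed parameter values for this sketch


def toy_idf(docs):
    """Hypothetical IDF helper (non-negative variant), for illustration only."""
    n = len(docs)
    df = Counter(word for doc in docs for word in set(doc))
    return {w: math.log(1 + (n - d + 0.5) / (d + 0.5)) for w, d in df.items()}


def naive_scores(query, docs, idf, avgdl):
    """Recompute the whole expression for every (query term, document) pair."""
    scores = []
    for doc in docs:
        freqs = Counter(doc)
        score = 0.0
        for q in query:
            q_freq = freqs.get(q, 0)
            score += idf.get(q, 0.0) * (q_freq * (K1 + 1) /
                     (q_freq + K1 * (1 - B + B * len(doc) / avgdl)))
        scores.append(score)
    return scores


def precompute(docs, idf, avgdl):
    """Store the query-independent weight of each term in each document once."""
    table = []
    for doc in docs:
        norm = K1 * (1 - B + B * len(doc) / avgdl)  # static per document
        table.append({w: idf.get(w, 0.0) * (f * (K1 + 1) / (f + norm))
                      for w, f in Counter(doc).items()})
    return table


def fast_scores(query, table):
    """At query time, only look up and add the stored weights."""
    return [sum(weights.get(q, 0.0) for q in query) for weights in table]


docs = [["a", "b", "a"], ["b", "c"], ["c", "c", "d"]]
idf = toy_idf(docs)
avgdl = sum(len(d) for d in docs) / len(docs)
table = precompute(docs, idf, avgdl)
print(naive_scores(["a", "c"], docs, idf, avgdl))
print(fast_scores(["a", "c"], table))
```

Everything in the expression except the choice of query terms is known at indexing time, so the query-time work shrinks to dictionary lookups and additions. The trade-off is extra memory for the per-document weight tables.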
Sounds interesting, I haven't had a close look at the changes yet, but basically you're just precalculating all the static terms from the scoring function? And I guess since the initial stuff was done with the |
Thanks for your answer! 😊 I opened a PR: #46 |
I made an optimized rewrite with all 3 algorithms: https://github.com/jankovicsandras/bm25opt Comparative testing shows it runs approximately 30-40x faster than rank_bm25 while producing exactly the same scores. |
I found retrieval very slow for ~20 million documents (Wikipedia). Is this expected?