Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Retrieving speed for large set of documents #25

Open
StalVars opened this issue Jul 13, 2022 · 22 comments
Open

Retrieving speed for large set of documents #25

StalVars opened this issue Jul 13, 2022 · 22 comments

Comments

@StalVars
Copy link

I found the retrieval very slow for ~ 20 million documents (wikipedia). Is it the case?

@nashid
Copy link

nashid commented Aug 25, 2022

@StalVars whats the latency you got for ~400K documents? Also whats the memory usage?

@ramsey-coding
Copy link

@StalVars @Witiko I found the library to be super slow during retrieval from ~350K samples. Please help 🙏 🙏 🙏

@ramsey-coding
Copy link

@dorianbrown the library is slow to retrieval from ~350K samples. Can you please guide what to do here?

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

You may want to use a library such as Gensim, which builds a dictionary mapping from words to ids and then indexes the documents using a sparse matrix. This makes indexing slower, but retrieval is much faster than rank-bm25, because it can use fast matrix operations: piskvorky/gensim#3304

Alternatively, use industry-strength packages such as pyserini or elasticsearch. Rank-bm25 is not built for speed.

@ramsey-coding
Copy link

thanks @Witiko. What's the downside of using pyserini?

Also what's the use case of Rank-bm25? Ease of use?

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.

1 similar comment
@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

Pyserini is a python binding for the anserini java library. Therefore, you need to have Java installed, which makes pyserini more difficult to install than rank-bm25, which is essentially just a single Python file.

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

Rank-bm25 is a simple solution for use cases, where speed is not a concern.

@ramsey-coding
Copy link

@Witiko thanks a lot for the feedback, really appreciate it 🙏

So there is no python solution that is fast?

@ramsey-coding
Copy link

Sounds like need to try with Gensim. But if the loading of documents are slow with Gensim, that's not a good fit for me either.

I need to <~250 ms response time during retrieval.

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:

pip install git+https://github.com/witiko/gensim.git@feature/bm25

See piskvorky/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

@ramsey-coding
Copy link

great. Would I get same retrieval accuracy with Gensim in comparison to rank_bm25?

@ramsey-coding
Copy link

@Witiko what would be the performance of Gensim during loading the 500k documents? Would that be competitive with rank_bm25?

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

The algorithm is exactly the same aa rank-bm25, so accuracy should also be the same. The loading may be slightly slower than in rank-bm25, because we need to build a dictionary, but the retrieval should be significantly faster. Try it out and let me know.

@ramsey-coding
Copy link

ramsey-coding commented Aug 27, 2022

@Witiko

Gensim is a pure Python solution that uses accelerated python libraries such as SciPy and NumPy. It is quite fast in the retrieval stage. Support for BM25 in Gensim is still experimental; you can install it as follows:

pip install git+https://github.com/witiko/gensim.git@feature/bm25

See RaRe-Technologies/gensim#3304 for an example of how you would use it. If you find it useful, please put a comment there, so that Gensim developers know that it is valuable to users and will merge the support for BM25 soon.

I ran the command pip install git+https://github.com/witiko/gensim.git@feature/bm25.

But it does not install and fails with the following error message:

      building 'gensim.models.nmf_pgd' extension
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/models/nmf_pgd.c -o build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -shared -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib -Wl,-rpath,/root/miniconda/envs/codex-env/lib -Wl,-rpath-link,/root/miniconda/envs/codex-env/lib -L/root/miniconda/envs/codex-env/lib build/temp.linux-x86_64-cpython-39/gensim/models/nmf_pgd.o -o build/lib.linux-x86_64-cpython-39/gensim/models/nmf_pgd.cpython-39-x86_64-linux-gnu.so
      building 'gensim.similarities.fastss' extension
      creating build/temp.linux-x86_64-cpython-39/gensim/similarities
      gcc -pthread -B /root/miniconda/envs/codex-env/compiler_compat -Wno-unused-result -Wsign-compare -DNDEBUG -O2 -Wall -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -I/root/miniconda/envs/codex-env/include -fPIC -O2 -isystem /root/miniconda/envs/codex-env/include -fPIC -I/root/miniconda/envs/codex-env/include/python3.9 -I/root/miniconda/envs/codex-env/lib/python3.9/site-packages/numpy/core/include -c gensim/similarities/fastss.c -o build/temp.linux-x86_64-cpython-39/gensim/similarities/fastss.o
      gensim/similarities/fastss.c: In function ‘ceditdist’:
      gensim/similarities/fastss.c:725:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH tmpi = 0; tmpi <= len_s1; tmpi++) row2[tmpi] = tmpi;
               ^
      gensim/similarities/fastss.c:725:9: note: use option -std=c99 or -std=gnu99 to compile your code
      gensim/similarities/fastss.c:727:9: error: ‘for’ loop initial declarations are only allowed in C99 mode
               for (WIDTH i2 = 0; i2 < len_s2; i2++) {
               ^
      gensim/similarities/fastss.c:738:13: error: ‘for’ loop initial declarations are only allowed in C99 mode
                   for (WIDTH i1 = 0; i1 < len_s1; i1++) {
                   ^
      error: command '/usr/bin/gcc' failed with exit code 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> gensim

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

I am not sure what the issue is. It works fine in the python:3.7 Docker image.

@ramsey-coding
Copy link

@Witiko when I install like pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

Thanks for the feedback. I will try with python:3.7

@Witiko
Copy link
Contributor

Witiko commented Aug 27, 2022

@Witiko when I install like pip install --upgrade gensim, it just works. Looks like an issue with the https://github.com/witiko/gensim.git@feature/bm25 branch.

It's not an issue with the branch. The release version of gensim has a precompiled wheel, which circumvents your compiler.

@jankovicsandras
Copy link

A little tangential, but I found another interesting speed issue. I made a refactored/simplified version of BM25Okapi from rank_bm25 to https://github.com/jankovicsandras/plpgsql_bm25/blob/main/mybm25okapi.py
This is without numpy (!), using just math and precalculates stuff in __init__ to simplify the query function. Still it's 2-4x faster than rank_bm25 in this quick-and-dirty test, so it might be possible to speed up rank_bm25 a lot.

https://github.com/jankovicsandras/plpgsql_bm25/blob/main/plpgsql_bm25_comparison_with_paradedb_pg_search.ipynb

(E.g. it's possible to compute half of the lines 119-120 in rank_bm25.py beforehand.

     score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))

)

@dorianbrown
Copy link
Owner

dorianbrown commented Oct 7, 2024

Sounds interesting, I haven't had a close look at the changes yet, but basically you're just precalculating all the static terms from the scoring function?

And I guess since the initial stuff was done with the math functions, vectorizing it with numpy would result in larger speed gains? Sounds feasible and promising, I'll try and double check the change when I've got some time, and run some tests. But feel free to create a PR if you're interested!

@jankovicsandras
Copy link

Thanks for your answer! 😊 I opened a PR: #46

@jankovicsandras
Copy link

I made an optimized rewrite with all 3 algorithms: https://github.com/jankovicsandras/bm25opt

Comparative testing shows it runs approx 30-40 x faster than rank_bm25 while producing exactly same scores.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants