Can I feed 500K documents into rank_bm25? #27

Open
ramsey-coding opened this issue Aug 25, 2022 · 9 comments

Comments

@ramsey-coding

Thanks for this awesome library.

I am curious to know whether rank_bm25 can handle 500K documents. Each document has around 1000 words.

Looking forward to your feedback. I want to use the following functionality with rank_bm25:

```python
from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)
result = bm25.get_top_n(tokenized_query, corpus, n=1)

print(result)
```

@ramsey-coding
Author

@Witiko can you please provide any insight?

@Witiko
Contributor

Witiko commented Aug 26, 2022

@ramsey-coding I don't see a reason why it shouldn't. Have you tried?

@ramsey-coding
Author

@Witiko the problem is that each call to bm25.get_top_n is very slow :-(

It takes ~5 seconds per call.

@ramsey-coding
Author

@dorianbrown the library is slow when retrieving from ~350K samples. Can you please advise on what to do here?
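
For readers who hit the same slowdown: one common way to speed up BM25 queries is an inverted index, so that each query only touches the documents that contain at least one query term instead of visiting the whole corpus. The sketch below is a minimal, standard-library-only illustration of that idea. It is not rank_bm25's implementation, and the class name `InvertedBM25`, its `top_n` method, the parameter defaults, and the Lucene-style IDF are all assumptions made for this example, so its scores will not exactly match `BM25Okapi`.

```python
import heapq
import math
from collections import Counter, defaultdict


class InvertedBM25:
    """Illustrative BM25 over an inverted index (hypothetical, not rank_bm25's API)."""

    def __init__(self, tokenized_corpus, k1=1.5, b=0.75):
        self.k1 = k1
        self.b = b
        self.doc_len = [len(doc) for doc in tokenized_corpus]
        self.n_docs = len(tokenized_corpus)
        self.avgdl = sum(self.doc_len) / self.n_docs if self.n_docs else 0.0
        # Inverted index: term -> list of (doc_id, term frequency in that doc).
        self.postings = defaultdict(list)
        for doc_id, doc in enumerate(tokenized_corpus):
            for term, tf in Counter(doc).items():
                self.postings[term].append((doc_id, tf))

    def _idf(self, term):
        # Assumed Lucene-style IDF; it is never negative.
        df = len(self.postings.get(term, ()))
        return math.log(1.0 + (self.n_docs - df + 0.5) / (df + 0.5))

    def top_n(self, tokenized_query, n=5):
        # Accumulate scores only for documents that share a term with the query.
        scores = defaultdict(float)
        for term in tokenized_query:
            posting = self.postings.get(term)
            if not posting:
                continue
            idf = self._idf(term)
            for doc_id, tf in posting:
                length_norm = 1.0 - self.b + self.b * self.doc_len[doc_id] / self.avgdl
                scores[doc_id] += idf * tf * (self.k1 + 1.0) / (tf + self.k1 * length_norm)
        # heapq.nlargest avoids sorting every scored document.
        return heapq.nlargest(n, scores.items(), key=lambda item: item[1])


corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    "How is the weather today?",
]
index = InvertedBM25([doc.split(" ") for doc in corpus])
best = index.top_n("windy London".split(" "), n=1)
print([corpus[doc_id] for doc_id, _ in best])
```

With this layout the per-query cost depends on the length of the posting lists for the query terms rather than on the corpus size, which is usually where the gain comes from on a few hundred thousand documents. Very common terms still produce long posting lists, so this is a starting point rather than a tuned engine.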

@AmenRa

AmenRa commented Nov 17, 2022

Hi @ramsey-coding,

I have just released a new Python-based search engine called retriv.
It only takes ~40ms to query 8M documents on my machine.
If you try it, please, let me know if it works for your use case.

@nashid

nashid commented Nov 17, 2022

@AmenRa I am also interested in this feature. I will try out retriv.

@nocoolsandwich

Better to use Elasticsearch. The Python version can be slow enough to drive you crazy.

@AmenRa

AmenRa commented Apr 19, 2023

@nocoolsandwich

You should try my library retriv.
It takes 10 ms to search 10 million documents with BM25.

@dorianbrown
Owner

dorianbrown commented Oct 8, 2024

This library started as a side project and gained a fair amount of traction organically. It was designed as a fairly simple implementation of these retrieval algorithms, and it won't compare to something like the mentioned retriv package, which has had a lot more effort put into it and will perform much better in large-scale use cases.

I've now also added a remark in the README to direct users to retriv if they are looking for a performant implementation in Python.


6 participants