Differences in results compared to Lucene #53

Answered by xhluca
ignorejjj asked this question in Q&A

Yeah, that's normal; you can see the difference in NDCG@10 in the report: https://arxiv.org/abs/2407.03618

There could be a few reasons:

  1. We use a different tokenizer from the Lucene library, whose exact implementation I was not able to find
  2. The choice of stemmer and stopwords can also affect the final scores
  3. Our scoring method might differ: we base ours on the Kamphuis et al. survey, whereas Pyserini uses Lucene behind the scenes, which has its own variant of the algorithm (if you can find Lucene's exact scoring code, please feel free to share it here); a sketch of this kind of difference follows the list
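
To make point 3 concrete, here is a sketch of the kind of variant difference the Kamphuis et al. survey catalogues; it is general BM25 background, not a confirmed diagnosis of this exact issue, and the notation ($N$ documents, $\mathrm{df}$, $\mathrm{tf}$, $k_1$, $b$, $\mathrm{avgdl}$) is standard rather than copied from either codebase. Lucene's `BM25Similarity` drops the constant $(k_1 + 1)$ numerator factor and uses a smoothed log IDF:

$$
\mathrm{score}_{\text{Lucene}}(q, d) = \sum_{t \in q} \ln\!\left(1 + \frac{N - \mathrm{df}(t) + 0.5}{\mathrm{df}(t) + 0.5}\right) \cdot \frac{\mathrm{tf}(t, d)}{\mathrm{tf}(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$

whereas ATIRE/Robertson-style variants keep the $(k_1 + 1)$ factor in the numerator, which rescales the scores without changing the ranking:

$$
\mathrm{score}(q, d) = \sum_{t \in q} \mathrm{IDF}(t) \cdot \frac{(k_1 + 1)\,\mathrm{tf}(t, d)}{\mathrm{tf}(t, d) + k_1\left(1 - b + b\,\frac{|d|}{\mathrm{avgdl}}\right)}
$$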

Overall, I think the best way to verify whether this gives the exact BM25 scoring you want is to implement BM25 manually (it should be less than 50 lines) and compare it against one of the suppo…
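
For that manual comparison, here is a minimal, self-contained sketch of the Lucene-style scoring shown above. It is not the exact bm25s or Lucene implementation: the whitespace tokenizer, the parameters `k1=1.5` and `b=0.75`, and the toy corpus are all illustrative assumptions, and there is no stemming or stopword removal.

```python
import math
from collections import Counter

def tokenize(text):
    # Naive lowercase whitespace tokenizer; real libraries use more
    # elaborate analyzers, which is reason 1 for score differences.
    return text.lower().split()

def bm25_scores(query, corpus, k1=1.5, b=0.75):
    """Score every document in `corpus` against `query` using the
    Lucene-style BM25 variant (no (k1+1) factor, ln(1 + ...) IDF)."""
    docs = [tokenize(doc) for doc in corpus]
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency: number of documents containing each term.
    df = Counter()
    for d in docs:
        df.update(set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for t in tokenize(query):
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] / (tf[t] + norm)
        scores.append(score)
    return scores

if __name__ == "__main__":
    corpus = [
        "a cat is a feline and likes to purr",
        "a dog is the human's best friend and loves to play",
        "a bird is a beautiful animal that can fly",
    ]
    print(bm25_scores("does the fish purr like a cat?", corpus))
```

Running the same corpus and queries through the library with an identical tokenizer should let you separate tokenization and stemming effects (reasons 1 and 2) from the scoring variant itself (reason 3).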
