Negative scores from DirichletLM #248

ojwb opened this issue Jun 2, 2024 · 3 comments

ojwb commented Jun 2, 2024

In modules/core/src/main/java/org/terrier/matching/models/DirichletLM.java a comment says:

This model is formulated such that all scores are > 0.

and also:

This has one parameter, mu > 0.

The score is calculated by:

        public double score(double tf, double docLength) {
                return WeightingModelLibrary.log(1 + (tf/(mu * (super.termFrequency / numberOfTokens))) ) + WeightingModelLibrary.log(mu/(docLength+mu));
        }

As above, mu is positive; also docLength should be >= 0, so mu/(docLength+mu) will be <= 1 and the second log will be <= 0. It seems to me that, given a large enough docLength, the calculated score can end up negative.

There are restrictions on the relationships between the possible values of some of the variables in the equation (e.g. numberOfTokens is the sum of docLength over all documents) so let's try to find an actual example where we get a negative score.

Rearranging, the score is zero when (1 + (tf * numberOfTokens) / (mu * super.termFrequency)) * mu / (docLength + mu) = 1, which simplifies to docLength = tf * numberOfTokens / super.termFrequency (independently of mu), so the score goes negative when:

docLength > tf * numberOfTokens / super.termFrequency

Say we have 10 documents, one of length 11 and the rest of length 1 (so numberOfTokens is 20). One term occurs once in all of them (so its super.termFrequency is 10), and 10 other distinct terms occur once in the long document. This satisfies the criterion I derived above: 11 > 1 * 20 / 10

Checking the value calculated by the formula in score() for the common term in the long document (with mu = 2500, Terrier's default, and noting that WeightingModelLibrary.log is a base-2 log, hence log2 below), we indeed get a negative score:

$ python -c 'from math import *; print(log2(1 + (1/(2500 * (10 / 20))) )+log2(2500/(11+2500)))'
-0.005180239105680202
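
As a further check of the threshold derived above, here is a quick sketch treating the formula purely as a function of docLength, with the other statistics held at the toy collection's values and mu = 2500 again (not every docLength below would be realisable with those fixed collection statistics; this is just evaluating the formula):

    from math import log2

    mu = 2500.0                       # Terrier's default mu
    tf, term_frequency, number_of_tokens = 1.0, 10.0, 20.0

    def score(tf, doc_length):
        # same computation as DirichletLM.score(), with the log taken base 2
        return (log2(1 + tf / (mu * (term_frequency / number_of_tokens)))
                + log2(mu / (doc_length + mu)))

    for doc_length in (1, 2, 3, 11):
        print(doc_length, score(tf, doc_length))
    # positive at 1, (approximately) zero at the threshold 2, negative at 3 and 11

The sign flips exactly at docLength = tf * numberOfTokens / super.termFrequency = 2, and since that threshold doesn't involve mu, changing mu only changes how negative the score gets, not where it turns negative.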

Have I misunderstood something here?


ojwb commented Jun 4, 2024

Looking at equation 6 in the Zhai & Lafferty 2004 paper linked from the source file, I think the formula Terrier is using isn't quite right either.

As the paper notes, "the last term on the righthand side is independent of the document d, and thus can be ignored in ranking".

The problem is that the second term is n log α_d (where α_d = mu/(docLength+mu) for Dirichlet smoothing and n is the number of terms in the query), but Terrier instead adds log α_d to the score computed for each matching term, so it effectively multiplies log α_d by the number of matching query terms, which differs between documents that don't match all of the query terms.
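
To make the difference concrete, here's a small sketch (not Terrier code; mu = 2500, the collection statistics and the document contents are made-up values, and each query term is assumed to occur once in the query) comparing the two formulations for a document that matches both terms of a two-term query against one that matches only one of them:

    from math import log2

    mu = 2500.0
    number_of_tokens = 100000.0              # total tokens in the collection
    coll_freq = {"t1": 200.0, "t2": 50.0}    # collection frequency of each query term
    n_query_terms = 2                        # query is {t1, t2}

    def term_part(term, tf):
        # log(1 + c(w;d) / (mu * p(w|C))) -- identical in both formulations
        return log2(1 + tf / (mu * (coll_freq[term] / number_of_tokens)))

    def zhai_lafferty(doc_tfs, doc_length):
        # equation 6: the log(alpha_d) penalty appears n times, where n is the
        # query length, regardless of how many query terms the document matches
        return (sum(term_part(t, tf) for t, tf in doc_tfs.items())
                + n_query_terms * log2(mu / (doc_length + mu)))

    def terrier_style(doc_tfs, doc_length):
        # what the issue describes Terrier as doing: log(alpha_d) is added once
        # per *matching* term
        return sum(term_part(t, tf) + log2(mu / (doc_length + mu))
                   for t, tf in doc_tfs.items())

    doc_a = ({"t1": 1.0, "t2": 1.0}, 300.0)  # matches both query terms
    doc_b = ({"t1": 3.0}, 300.0)             # matches only t1
    for name, (tfs, dl) in (("A", doc_a), ("B", doc_b)):
        print(name, zhai_lafferty(tfs, dl), terrier_style(tfs, dl))

The two formulations agree on document A but differ by one factor of log α_d on document B, and since that discrepancy depends on the document length, it can change the relative ordering of documents that match different numbers of query terms.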

cmacdonald (Contributor) commented

Hi Olly (@ojwb). Good to hear from you, and sorry for the delay. I missed this coming in.

So I wasn't involved in the Dirichlet formulation in Terrier. I suspect the original authors were aiming for an implementation that always gives a positive score, as Terrier (at that point, and perhaps other systems?) assumed that scores should be positive (e.g. for pruning).

Happy to test other formulations on our test collections.


ojwb commented Aug 31, 2024

Hi Craig, hope you're well.

Meanwhile I've dug further into this and (I think) now understand things better. For background, I was trying to fix Xapian's LM weighting implementation, which was also wrong, and was looking at Terrier's implementation as well as the original papers for guidance.

I've now fixed Xapian's implementation. The trick used to avoid negative per-term weight contributions is to add an extra component to each term's contribution so that it's always positive, then subtract the sum of these components so that the formula gives the same result. We arrange for that sum to depend only on the document, so it becomes a term-independent weight contribution, w(doc) in this formula:

weight(doc) = sum_over_term(w(term,doc)) + w(doc)

We can also add a contribution that is fixed for a given query and document collection to every weight (which won't affect the ranking) such that w(doc) >= 0 (for pruning).
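
Roughly, the split looks something like this for the Dirichlet case (just a sketch of the idea, not Xapian's actual code; max_doc_length stands for the length of the longest document in the collection and mu for the smoothing parameter):

    from math import log2

    def per_term_weight(tf, coll_freq, number_of_tokens, mu):
        # w(term,doc): always >= 0, and the only part that depends on the term
        return log2(1 + tf / (mu * (coll_freq / number_of_tokens)))

    def per_document_weight(doc_length, n_query_terms, max_doc_length, mu):
        # w(doc): n * log(mu / (doc_length + mu)) is <= 0, so shift it by the
        # constant n * log((max_doc_length + mu) / mu), which is the same for
        # every document and so doesn't affect the ranking; the result is >= 0
        return n_query_terms * log2((max_doc_length + mu) / (doc_length + mu))

    def document_weight(matching_terms, doc_length, n_query_terms,
                        number_of_tokens, max_doc_length, mu=2500.0):
        # weight(doc) = sum_over_term(w(term,doc)) + w(doc), both parts >= 0
        return (sum(per_term_weight(tf, cf, number_of_tokens, mu)
                    for tf, cf in matching_terms)
                + per_document_weight(doc_length, n_query_terms,
                                      max_doc_length, mu))

    # e.g. a document of length 300 matching both terms of a 2-term query
    print(document_weight([(1.0, 200.0), (1.0, 50.0)], 300.0, 2, 100000.0, 5000.0))

With this split the per-term contributions stay positive, and the document-level correction absorbs the n * log(α_d) part.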

However, this approach doesn't seem to work within Terrier's existing framework, which doesn't appear to support a term-independent weight component (LM with Jelinek-Mercer smoothing is the only LM model I've looked at that doesn't need one).
