Negative scores from DirichletLM #248

ojwb opened this issue Jun 2, 2024 · 3 comments

ojwb commented Jun 2, 2024

In modules/core/src/main/java/org/terrier/matching/models/DirichletLM.java a comment says:

This model is formulated such that all scores are > 0.

and also:

This has one parameter, mu > 0.

The score is calculated by:

        public double score(double tf, double docLength) {
                return WeightingModelLibrary.log(1 + (tf/(mu * (super.termFrequency / numberOfTokens))) ) + WeightingModelLibrary.log(mu/(docLength+mu));
        }

As above, mu is positive; also docLength should be >= 0, so mu/(docLength+mu) will be <= 1 and the second log will be <= 0. It seems to me that, given a large enough docLength, the calculated score can end up negative.

There are restrictions on the relationships between the possible values of some of the variables in the equation (e.g. numberOfTokens is the sum of docLength over all documents) so let's try to find an actual example where we get a negative score.

Rearranging, the score is zero when (1 + (tf * numberOfTokens) / (mu * super.termFrequency)) * mu / (docLength + mu) = 1, which simplifies to docLength = tf * numberOfTokens / super.termFrequency (independently of mu), so the score goes negative when:

docLength > tf * numberOfTokens / super.termFrequency

Say we have 10 documents, one of length 11 and the rest of length 1 (so numberOfTokens is 20). One term occurs once in all of them (so its super.termFrequency is 10), and 10 other distinct terms occur once in the long document. This satisfies the criterion I derived above: 11 > 1 * 20 / 10

Checking the value calculated by the formula in score() for the common term in the long document (with mu = 2500, Terrier's default, and noting that WeightingModelLibrary.log is a base-2 log, hence log2 below), we indeed get a negative score:

$ python -c 'from math import *; print(log2(1 + (1/(2500 * (10 / 20))) )+log2(2500/(11+2500)))'
-0.005180239105680202
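
As a further check of the threshold derived above, here is a quick sketch treating the formula purely as a function of docLength, with the other statistics held at the toy collection's values and mu = 2500 again (not every docLength below would be realisable with those fixed collection statistics; this is just evaluating the formula):

    from math import log2

    mu = 2500.0                       # Terrier's default mu
    tf, term_frequency, number_of_tokens = 1.0, 10.0, 20.0

    def score(tf, doc_length):
        # same computation as DirichletLM.score(), with the log taken base 2
        return (log2(1 + tf / (mu * (term_frequency / number_of_tokens)))
                + log2(mu / (doc_length + mu)))

    for doc_length in (1, 2, 3, 11):
        print(doc_length, score(tf, doc_length))
    # positive at 1, (approximately) zero at the threshold 2, negative at 3 and 11

The sign flips exactly at docLength = tf * numberOfTokens / super.termFrequency = 2, and since that threshold doesn't involve mu, changing mu only changes how negative the score gets, not where it turns negative.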

Have I misunderstood something here?


ojwb commented Jun 4, 2024

Looking at equation 6 in the Zhai & Lafferty 2004 paper linked from the source file, I think the formula Terrier is using isn't quite right either.

As the paper notes, "the last term on the righthand side is independent of the document d, and thus can be ignored in ranking".

The problem is that the second term is n log α_d (where α_d = mu/(docLength+mu) for Dirichlet smoothing and n is the number of terms in the query), but Terrier instead adds log α_d to the score computed for each matching term, so it effectively multiplies log α_d by the number of matching query terms, which differs between documents that don't match all of the query terms.
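
To make the difference concrete, here's a small sketch (not Terrier code; mu = 2500, the collection statistics and the document contents are made-up values, and each query term is assumed to occur once in the query) comparing the two formulations for a document that matches both terms of a two-term query against one that matches only one of them:

    from math import log2

    mu = 2500.0
    number_of_tokens = 100000.0              # total tokens in the collection
    coll_freq = {"t1": 200.0, "t2": 50.0}    # collection frequency of each query term
    n_query_terms = 2                        # query is {t1, t2}

    def term_part(term, tf):
        # log(1 + c(w;d) / (mu * p(w|C))) -- identical in both formulations
        return log2(1 + tf / (mu * (coll_freq[term] / number_of_tokens)))

    def zhai_lafferty(doc_tfs, doc_length):
        # equation 6: the log(alpha_d) penalty appears n times, where n is the
        # query length, regardless of how many query terms the document matches
        return (sum(term_part(t, tf) for t, tf in doc_tfs.items())
                + n_query_terms * log2(mu / (doc_length + mu)))

    def terrier_style(doc_tfs, doc_length):
        # what the issue describes Terrier as doing: log(alpha_d) is added once
        # per *matching* term
        return sum(term_part(t, tf) + log2(mu / (doc_length + mu))
                   for t, tf in doc_tfs.items())

    doc_a = ({"t1": 1.0, "t2": 1.0}, 300.0)  # matches both query terms
    doc_b = ({"t1": 3.0}, 300.0)             # matches only t1
    for name, (tfs, dl) in (("A", doc_a), ("B", doc_b)):
        print(name, zhai_lafferty(tfs, dl), terrier_style(tfs, dl))

The two formulations agree on document A but differ by one factor of log α_d on document B, and since that discrepancy depends on the document length, it can change the relative ordering of documents that match different numbers of query terms.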

cmacdonald (Contributor) commented

Hi Olly (@ojwb). Good to hear from you, and sorry for the delay. I missed this coming in.

So I wasn't involved in the Dirichlet formulation in Terrier. I suspect the original authors were aiming for an implementation that always gives a positive score, as Terrier (at that point, and perhaps other systems?) assumed that scores should be positive (e.g. for pruning).

Happy to test other formulations on our test collections.


ojwb commented Aug 31, 2024

Hi Craig, hope you're well.

Meanwhile I've dug further into this and (I think) now understand things better. For background, I was trying to fix Xapian's LM weighting implementation, which was also wrong, and was looking at Terrier's implementation as well as the original papers for guidance.

I've now fixed Xapian's implementation. The trick used to avoid negative per-term weight contributions is to add an extra component to each term's contribution so that it's always positive, then subtract the sum of these components so that the formula gives the same result. We arrange for that sum to depend only on the document, so it becomes a term-independent weight contribution, w(doc) in this formula:

weight(doc) = sum_over_term(w(term,doc)) + w(doc)

We can also add a contribution that is fixed for a given query and document collection to every weight (which won't affect the ranking) such that w(doc) >= 0 (for pruning).
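
Roughly, the split looks something like this for the Dirichlet case (just a sketch of the idea, not Xapian's actual code; max_doc_length stands for the length of the longest document in the collection and mu for the smoothing parameter):

    from math import log2

    def per_term_weight(tf, coll_freq, number_of_tokens, mu):
        # w(term,doc): always >= 0, and the only part that depends on the term
        return log2(1 + tf / (mu * (coll_freq / number_of_tokens)))

    def per_document_weight(doc_length, n_query_terms, max_doc_length, mu):
        # w(doc): n * log(mu / (doc_length + mu)) is <= 0, so shift it by the
        # constant n * log((max_doc_length + mu) / mu), which is the same for
        # every document and so doesn't affect the ranking; the result is >= 0
        return n_query_terms * log2((max_doc_length + mu) / (doc_length + mu))

    def document_weight(matching_terms, doc_length, n_query_terms,
                        number_of_tokens, max_doc_length, mu=2500.0):
        # weight(doc) = sum_over_term(w(term,doc)) + w(doc), both parts >= 0
        return (sum(per_term_weight(tf, cf, number_of_tokens, mu)
                    for tf, cf in matching_terms)
                + per_document_weight(doc_length, n_query_terms,
                                      max_doc_length, mu))

    # e.g. a document of length 300 matching both terms of a 2-term query
    print(document_weight([(1.0, 200.0), (1.0, 50.0)], 300.0, 2, 100000.0, 5000.0))

With this split the per-term contributions stay positive, and the document-level correction absorbs the n * log(α_d) part.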

However, this approach doesn't seem to work within Terrier's existing framework, which doesn't appear to support a term-independent weight component (LM with Jelinek-Mercer smoothing is the only LM model I've looked at that doesn't need one).
