Negative scores from DirichletLM #248
Looking at equation 6 in the Zhai & Lafferty 2004 paper linked to from the source file, I think the formula Terrier is using isn't quite right either. As the paper notes, "the last term on the right-hand side is independent of the document d, and thus can be ignored in ranking". The problem is that the second term is n * log(alpha_d), where n is the number of terms in the query, but Terrier instead adds log(alpha_d) to the formula computed for each matching term, so effectively multiplies log(alpha_d) by the number of matching query terms for each document, which differs between documents which don't match all query terms.
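The mismatch can be seen with a small numeric sketch (all statistics here are made-up assumptions, not taken from any real collection; `mu`, the tf and the term probability are placeholders):

```java
// Hypothetical example: a 2-term query against one document that matches
// only the first term (m = 1 of n = 2). All statistics here are made up.
public class DirichletMismatch {
    static final double MU = 2500.0;

    // Per-matching-term part of Zhai & Lafferty's eq. 6: log(1 + tf/(mu*p_t))
    static double termPart(double tf, double pt) {
        return Math.log(1.0 + tf / (MU * pt));
    }

    public static void main(String[] args) {
        int n = 2;                 // query length
        int m = 1;                 // number of query terms the document matches
        double docLength = 100.0;
        double logAlpha = Math.log(MU / (MU + docLength));  // log(alpha_d)

        double eq6     = termPart(3.0, 0.001) + n * logAlpha;  // paper's form
        double terrier = termPart(3.0, 0.001) + m * logAlpha;  // one per match

        // The gap, (m - n) * log(alpha_d), depends on docLength, so it
        // cannot be ignored in ranking when m varies between documents.
        System.out.println(terrier - eq6);  // = (m - n) * logAlpha > 0 here
    }
}
```

Since the gap varies with `docLength`, the two formulations can order documents differently whenever documents match different numbers of query terms.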
Hi Olly (@ojwb). Good to hear from you, and sorry for the delay — I missed this coming in. I wasn't involved in the Dirichlet formulation in Terrier. I suspect that the original authors were aiming for an implementation that always gave a positive score, as Terrier (at that point) (and other systems?) assumed that scores should be positive (e.g. for pruning). Happy to test other formulations on our test collections.
Hi Craig, hope you're well. Meanwhile I've dug further into this, and (I think) now understand things better. For background, I was trying to fix Xapian's LM weighting implementation, which was also wrong, and looking at Terrier's implementation as well as the original papers for guidance. I've now fixed Xapian's implementation. The trick used to avoid negative per-term weight contributions is to add an extra component to each term's contribution such that it's always positive, then subtract the sum of these components so the formula gives the same result — we arrange that this sum only depends on the document, so it's a term-independent weight contribution.

We can also add a fixed (for a given query and document collection) contribution to every weight (which won't affect the ranking) such that all scores are non-negative. However, this approach doesn't seem to work within Terrier's existing framework, as it doesn't seem to support a term-independent weight component (LM with Jelinek-Mercer smoothing is the only LM model I've looked at which doesn't seem to need this).
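The rearrangement described above can be sketched as follows (my own notation with assumed numbers, not Xapian's actual code): the naive per-term Dirichlet weight `log((tf + mu*p_t) / ((docLength + mu) * p_t))` can go negative, but it decomposes exactly into a non-negative per-term part plus a part that depends only on the document:

```java
// Verify the decomposition numerically with placeholder statistics:
//   log((tf + mu*p_t) / ((L + mu) * p_t))
//     == log(1 + tf/(mu*p_t)) + log(mu/(L + mu))
public class DirichletDecomposition {
    public static void main(String[] args) {
        double mu = 2500.0, tf = 3.0, pt = 0.001, docLength = 100.0;

        double naive = Math.log((tf + mu * pt) / ((docLength + mu) * pt));

        double positivePart = Math.log(1.0 + tf / (mu * pt)); // always >= 0
        double docPart = Math.log(mu / (docLength + mu));     // term-independent

        // Identical up to floating-point rounding:
        System.out.println(Math.abs(naive - (positivePart + docPart)) < 1e-12);
    }
}
```

The per-term parts are all non-negative, and the `docPart` contributions can be summed once per document, which is exactly the term-independent weight component that Terrier's framework doesn't seem to support.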
In modules/core/src/main/java/org/terrier/matching/models/DirichletLM.java a comment says: […] and also: […]

The score is calculated by:

    log(1 + tf / (mu * super.termFrequency / numberOfTokens)) + log(mu / (docLength + mu))

As above, `mu` is positive; also `docLength` should be >= 0, so `mu/(docLength+mu)` will be <= 1 and the second log will be <= 0. It seems to me that given a large enough value for `docLength` the calculated score can end up negative. There are restrictions on the relationships between the possible values of some of the variables in the equation (e.g. `numberOfTokens` is the sum of `docLength` over all documents), so let's try to find an actual example where we get a negative score. Rearranging, a negative score happens when:

    docLength > tf * numberOfTokens / super.termFrequency
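A quick sanity check of this rearrangement (using natural log and assuming `mu` = 2500, which I believe is Terrier's default; the crossover point itself doesn't depend on `mu`): with `tf` = 1, `numberOfTokens` = 20 and `termFrequency` = 10, the threshold is at `docLength` = 2, where the score should be exactly zero, turning negative just above it.

```java
// Check that the score crosses zero exactly at the rearranged threshold
// docLength == tf * numberOfTokens / termFrequency.
public class ThresholdCheck {
    static double score(double tf, double termFreq, double numTokens,
                        double docLength, double mu) {
        return Math.log(1.0 + tf / (mu * termFreq / numTokens))
             + Math.log(mu / (docLength + mu));
    }

    public static void main(String[] args) {
        double mu = 2500.0, tf = 1.0, numTokens = 20.0, termFreq = 10.0;
        double threshold = tf * numTokens / termFreq;  // = 2.0

        System.out.println(score(tf, termFreq, numTokens, threshold, mu));      // ~ 0
        System.out.println(score(tf, termFreq, numTokens, threshold + 1, mu) < 0);
    }
}
```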
Say we have 10 documents, one of length 11 and the rest of length 1 (so `numberOfTokens` is 20). One term occurs once in all of them (so its `super.termFrequency` is 10), and 10 other different terms occur once in the long document. This satisfies the criterion I derived above: 11 > 1 * 20 / 10.

Checking the value calculated by the formula in `score()` in this case for the common term in the long document, we indeed seem to get a negative score: […]

Have I misunderstood something here?
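For what it's worth, plugging the example's statistics into that score formula confirms the sign (natural log here; Terrier's `WeightingModelLibrary.log` is base 2, which rescales the value but can't change its sign; `mu` = 2500 is assumed to be the default):

```java
// Evaluate the DirichletLM score for the common term in the long document
// of the 10-document example above.
public class NegativeScoreCheck {
    public static void main(String[] args) {
        double mu = 2500.0;             // assumed Terrier default for DirichletLM
        double tf = 1.0;                // common term occurs once in the long doc
        double numberOfTokens = 20.0;   // one doc of length 11 + 9 of length 1
        double termFrequency = 10.0;    // common term occurs in all 10 docs
        double docLength = 11.0;

        double score = Math.log(1.0 + tf / (mu * termFrequency / numberOfTokens))
                     + Math.log(mu / (docLength + mu));
        System.out.println(score < 0);  // the score comes out slightly negative
    }
}
```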