feat: Add discovery p0 test statistic #520
Conversation
Had a brief look-over. Seems fine to me. We'll need to figure out a proper way to handle switching out test statistics for these things in the future (something a bit more generic -- perhaps akin to how we handle interpolation code configuration with a string -> function mapping). We will want tests for this, however, with some expected outputs to assert that it's written correctly and computes the right answer.
That sounds reasonable. So one should then have, for each test statistic, one function that calculates the test statistic and one function that calculates the p-values? (Since this differs for e.g. qmu vs qtilde.) Should I go ahead with just another flag for q0 (I called it `use_q0`)?
I would tabulate some values in `test_validation.py`.
That's roughly the way to go. We'd want to make sure it isn't too slow to calculate. It might be better to add a unit test around the function itself, rather than an integration test (which test_validation does).
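For illustration, a tabulated unit test along these lines might look as follows (a sketch only: it assumes the `q0` helper from this PR's `utils.py` with the `(mu, data, pdf, init_pars, par_bounds)` signature shown in the diff below, and the reference value is a placeholder rather than an actual ROOT number):

```python
import pytest
import pyhf

def test_q0_single_bin():
    # single-bin model with one uncorrelated background systematic
    model = pyhf.simplemodels.hepdata_like(
        signal_data=[12.0], bkg_data=[50.0], bkg_uncerts=[3.0]
    )
    data = [55.0] + model.config.auxdata
    q0_obs = pyhf.utils.q0(
        0.0,  # mu is fixed to 0 for the discovery test statistic
        data,
        model,
        model.config.suggested_init(),
        model.config.suggested_bounds(),
    )
    # placeholder value: in practice tabulated from ROOT/HistFactory
    assert q0_obs == pytest.approx(1.234, rel=3e-4)
```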
Thanks @nikoladze for the PR. I agree this is much-needed functionality. I'll review this. At first glance it seems like it just touches utils.py and doesn't interfere with #519 (which will be quite a bit of a change on the pdf object side). One thing that would be important is to add tests that check against ROOT, like we do in test_validation.py / the validation dir (there we extracted some hardcoded numbers from ROOT via run_single.py that we check against; at some point a more proper framework would be needed).
src/pyhf/utils.py (Outdated)

```
@@ -274,18 +351,21 @@ def hypotest(
    init_pars = init_pars or pdf.config.suggested_init()
    par_bounds = par_bounds or pdf.config.suggested_bounds()
    test_stat = qmu if not use_q0 else q0
    if qtilde and use_q0:
```
Probably this should become a better lookup generally, now that we expand the number of test statistics (more a criticism of the original code rather than the PR):

```python
test_stats = {
    'q_mu': ..
    'q_tilde_mu': ..
    'q_0': ..
}
func = test_stats[mystat]
```
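A slightly fuller sketch of such a lookup, with placeholder implementations (none of these names are actual pyhf API):

```python
# placeholders standing in for the real test-statistic functions
def qmu(mu, data, pdf, init_pars, par_bounds):
    ...

def qtilde(mu, data, pdf, init_pars, par_bounds):
    ...

def q0(mu, data, pdf, init_pars, par_bounds):
    ...

# string -> function lookup, analogous to how pyhf maps interpolation
# code names to interpolator implementations
test_stats = {
    'q_mu': qmu,
    'q_tilde_mu': qtilde,
    'q_0': q0,
}

def get_test_stat(name):
    """Look up a test-statistic function by name (illustrative only)."""
    try:
        return test_stats[name]
    except KeyError:
        raise ValueError(f"unknown test statistic: {name!r}")
```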
This is a good idea, but there are probably more places where a distinction has to be made, for example the way p-values are calculated from the test statistic (mainly different between q_mu and q_tilde) and whether one wants CLs or p-values (CLs doesn't really make sense for q_0).
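For reference, a minimal sketch of the asymptotic formulas at play here (following Cowan et al., arXiv:1007.1727; the function names are illustrative, not pyhf API):

```python
from scipy.stats import norm

def p0_asymptotic(q0_obs):
    # discovery p-value under asymptotics: p0 = 1 - Phi(sqrt(q0))
    return 1 - norm.cdf(q0_obs ** 0.5)

def cls_asymptotic(qmu_obs, qmu_asimov):
    # exclusion: CLs = CLs+b / CLb, with CLb computed relative to the
    # Asimov value of the test statistic
    sqrt_qmu, sqrt_qmu_A = qmu_obs ** 0.5, qmu_asimov ** 0.5
    CLsb = 1 - norm.cdf(sqrt_qmu)
    CLb = 1 - norm.cdf(sqrt_qmu - sqrt_qmu_A)
    return CLsb / CLb
```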
I think the best way for now is to keep it as it is, and then clean it up after this PR.
src/pyhf/utils.py (Outdated)

```
    qmu = tensorlib.where(muhatbhat[pdf.config.poi_index] > mu, [0], qmu)
    return qmu


def q0(mu, data, pdf, init_pars, par_bounds):
```
Why does this have a `mu` parameter? Shouldn't that be fixed to 0?
Mhh, probably there is no real use in having this as a free parameter. On the other hand, when we want to switch to selecting the test statistic from a mapping, it might be simpler if they all have the same parameters.
I think a way to do this is to have a kind of test statistic factory which takes kwargs, e.g.

```python
def get_teststat(type, **opts):
    ..

teststat = get_teststat(type='qmu', mu=mu)
teststat(data, pdf)
```

or

```python
teststat = get_teststat(type='q0')
teststat(data, pdf)
```

i.e. `mu` is not part of the test stat API (which just takes the data and the pdf), but part of the API that gives you the test stat.
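A minimal sketch of what that factory could look like, e.g. binding `mu` with `functools.partial` (all names here are illustrative stand-ins, not actual pyhf API):

```python
from functools import partial

# stand-ins for the real test-statistic implementations
def qmu(mu, data, pdf):
    ...

def q0(data, pdf):
    ...

def get_teststat(type, **opts):
    """Return a callable with the uniform (data, pdf) signature."""
    if type == 'qmu':
        return partial(qmu, opts['mu'])  # mu is bound here, not at call time
    if type == 'q0':
        return q0
    raise ValueError(f"unknown test statistic: {type!r}")

# both test statistics now look the same to the caller:
# get_teststat(type='qmu', mu=1.0)(data, pdf)
# get_teststat(type='q0')(data, pdf)
```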
I added one test for single-bin, one-systematic now, to have at least something working, but it involved quite some copy & pasting and some not so elegant if-switches for q_0. Also, at least for some numbers I get quite a high relative difference between pyhf and the values I tabulated from HistFactory (I had to set the tolerance to 3e-4).
Not good. We should make sure we understand the difference if there is meant to be one (probably not!)
I'll try to check what the root of the discrepancy is.
Pun intended? ;)
Hi @nikoladze, sorry for the long silence; we're in a pretty big refactoring right now, but I'll come back to this asap.
@lukasheinrich No problem, I'm also quite busy right now, but I will have a look again when I have time.
Force-pushed from 097c4bf to 40d8243.
It would be nice to revive this PR (as @lukasheinrich, @kratsg, and I were just talking about discovery stats today). @nikoladze, I may go ahead and do some rebasing and cleaning on your branch (so please fetch if needed). Additionally, maybe it would be nicer if we could move all of these ROOT files used for validation into the …
I put it in a separate git repository for now. Here is a script with wrapper functions (e.g. to run the HistFactory chain in a tempdir).
I think I know more now about the strange drop in p-values for ROOT/HistFactory in this plot, so I would say that issue is resolved. However, the discrepancies in the validation test I set up for this PR persist (they concern a reasonable parameter range for the gamma parameter). But I think these can be attributed to differences between fitting with scipy and minuit: when I fit with minuit in pyhf I get results much closer to ROOT (1e-6 relative difference). But the "exact" asymptotic formula derived from the closed forms of the MLEs is closer to pyhf+scipy! The study can be found in my notebook.

Edit: Note that I made all these plots after also patching in the change from #902, to avoid that issue coming in here.
We should work on rebasing this. @nikoladze, do you have any problem with us just taking another branch at the state of …
@nikoladze I've been working on test statistic stuff in #1005, which hopefully should make it straightforward to add `q0`.
```
@@ -109,30 +119,38 @@ def hypotest(
    )

    _returns = [CLs]
    if use_q0:
```
What follows after here is a little bit of a mess, now with even more if-else distinctions due to the `q0`. One might consider disentangling the calculation of p-values and CLs?
```
    )
    sqrtqmu_v = tensorlib.sqrt(qmu_v)

    def teststat_func(*args, **kwargs):
        "make function signature without poi_test to be consistent with q0"
```
The reason for making a function signature without `poi_test` is that `q0` doesn't take a `poi_test` argument, since it is always 0. The alternative would be to change `q0` to take a `poi_test` argument, even though it only makes sense for it to be 0.
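Schematically, the wrapper approach described here (a sketch of the idea, not the exact diff):

```python
# stand-ins for the real test statistics
def qmu(poi_test, data, pdf, init_pars, par_bounds):
    ...

def q0(data, pdf, init_pars, par_bounds):
    ...

def make_teststat_func(poi_test=None):
    """Bake poi_test into qmu so both test statistics end up with the
    (data, pdf, init_pars, par_bounds) signature that q0 already has."""
    if poi_test is None:
        return q0

    def teststat_func(*args, **kwargs):
        return qmu(poi_test, *args, **kwargs)

    return teststat_func
```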
```
@@ -226,3 +226,56 @@ def tmu_tilde(mu, data, pdf, init_pars, par_bounds):
        + 'Use the tmu test statistic (pyhf.infer.test_statistics.tmu) instead.'
    )
    return _tmu_like(mu, data, pdf, init_pars, par_bounds)


def q0(data, pdf, init_pars, par_bounds):
```
`q0` still needs a separate function since it needs to clip for `muhat < 0` instead of `muhat > mu` as done in `_qmu_like`.
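For comparison, the two clipping conditions side by side, schematically (following the asymptotics paper, with `tmu` standing in for the profile likelihood ratio test statistic):

```python
def qmu_clipped(mu, muhat, tmu):
    # qmu is clipped to zero when muhat > mu: an upward fluctuation past
    # the tested mu is not evidence against the signal hypothesis
    return 0.0 if muhat > mu else tmu(mu)

def q0_clipped(muhat, tmu):
    # q0 is clipped to zero when muhat < 0: only a positive best-fit
    # signal strength counts as evidence for discovery
    return 0.0 if muhat < 0 else tmu(0.0)
```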
@matthewfeickert, @lukasheinrich I would be fine with starting over bit-by-bit from master, but actually there are only a few places where I had to resolve minor conflicts after merging. I added a few inline comments.
```python
if kwargs.get('return_expected_set') or kwargs.get('return_expected'):
    n_sigma_list = [2, 1, 0, -1, -2]
    if not kwargs.get('return_expected_set'):
        n_sigma_list = [0]
```
I added this to avoid having to write out the code for calculating p0 and CLs (and doing the if-else) a second time for the case where only one expected value should be returned.
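For illustration, the reuse works roughly like this (with a hypothetical `compute_expected` standing in for the per-band p-value computation):

```python
def compute_expected(n_sigma):
    """Hypothetical stand-in for the expected p-value at a given band."""
    ...

return_expected_set = False  # as if only return_expected were requested
n_sigma_list = [2, 1, 0, -1, -2]
if not return_expected_set:
    n_sigma_list = [0]  # only the median (n_sigma = 0) value is needed
expected = [compute_expected(n_sigma) for n_sigma in n_sigma_list]
```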
I'm going to start working on this again as it would be really nice to get this in.
@nikoladze We've closed this PR as we migrated it over to PR #1232. Thanks for your contribution here, which was a big help! 🚀
Description
It would be nice to have the ability to calculate discovery p-values. I started implementing the q_0 test statistic for discovery of a positive signal. To avoid code duplication I moved the calculation of the profile log-likelihood ratio into a separate function.

This could potentially then be included as a separate flag (maybe similar to `qtilde`) in `pyhf.utils.hypotest`, but I still have to figure out what will be the best way to do this. Also, I don't have any tests for this yet.
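For reference, the discovery test statistic in question, in the notation of Cowan, Cranmer, Gross, and Vitells (arXiv:1007.1727), where λ is the profile likelihood ratio:

```latex
q_0 =
  \begin{cases}
    -2 \ln \lambda(0) & \hat{\mu} \ge 0 \\
    0                 & \hat{\mu} < 0
  \end{cases}
\qquad
\lambda(\mu) = \frac{L(\mu, \hat{\hat{\theta}}(\mu))}{L(\hat{\mu}, \hat{\theta})}
```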
ReadTheDocs build: https://pyhf.readthedocs.io/en/docs-dev-p0-trigger