feat: Add discovery p0 test statistic #520

Closed
wants to merge 23 commits into diana-hep:master from nikoladze:dev-p0

Conversation

@nikoladze (Contributor) commented Jul 28, 2019

Description

It would be nice to have the ability to calculate discovery p-values. I started implementing the q_0 test statistic for discovery of a positive signal. To avoid code duplication I moved the calculation of the profile log likelihood ratio into a separate function.
This could potentially then be included as a separate flag (maybe similar to qtilde) in pyhf.utils.hypotest, but I still have to figure out what the best way to do this will be.
Also, I don't have any tests for this yet.

ReadTheDocs build: https://pyhf.readthedocs.io/en/docs-dev-p0-trigger
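For context, the q_0 statistic targeted here follows the standard asymptotic-formulae prescription (Cowan, Cranmer, Gross, Vitells): q0 = -2 ln(lambda(0)) when the best-fit signal strength muhat >= 0 and q0 = 0 otherwise, with discovery p-value p0 = 1 - Phi(sqrt(q0)). A minimal sketch of that final p-value step, independent of the pyhf internals this PR touches:

import numpy as np
from scipy.stats import norm

def p0_from_q0(q0):
    """Discovery p-value from q0 under the asymptotic approximation: p0 = 1 - Phi(sqrt(q0))."""
    return 1.0 - norm.cdf(np.sqrt(q0))

# q0 = 9.0 corresponds to a 3 sigma excess: p0 ~ 1.35e-3
print(p0_from_q0(9.0))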

Checklist Before Requesting Reviewer

  • Tests are passing
  • "WIP" removed from the title of the pull request
  • Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR

@coveralls commented Jul 28, 2019

Coverage Status

Coverage increased (+0.004%) to 93.567% when pulling a110d96 on nikoladze:dev-p0 into 5255252 on diana-hep:master.

Review thread on src/pyhf/utils.py (outdated, resolved)
@kratsg (Contributor) commented Jul 29, 2019

Had a brief look-over. Seems fine to me. We'll need to figure out a proper way to handle switching out test statistics for these things in the future (something a bit more generic -- perhaps akin to how we handle interpolation code configuration with strings->function mapping).

We will want tests for this however, with some expected outputs to assert that it's written correctly and computes the right answer.

@nikoladze (Contributor, Author):

> Had a brief look-over. Seems fine to me. We'll need to figure out a proper way to handle switching out test statistics for these things in the future (something a bit more generic -- perhaps akin to how we handle interpolation code configuration with strings->function mapping).

That sounds reasonable. So one should then have, for each test statistic, one function that calculates the test statistic and one function that calculates the p-values? (Since this is different for e.g. qmu vs qtilde.) Should I go ahead with just another flag for q0 (I called it use_q0 now due to the name clash with the function) and make another iteration for improving the mapping later, or resolve this as well in this PR?

> We will want tests for this however, with some expected outputs to assert that it's written correctly and computes the right answer.

I would tabulate some values in test_validation.py similar to what is already there for CL_s and validate them against HistFactory?

@kratsg (Contributor) commented Jul 29, 2019

> I would tabulate some values in test_validation.py similar to what is already there for CL_s and validate them against HistFactory?

That's roughly the way to go. We'd want to make sure it isn't too slow to calculate. It might be better to add a unit test around the function itself, rather than an integration test (which test_validation does).
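As a rough sketch of what such a unit test around the statistic itself could look like — the expected value below is a placeholder rather than a tabulated HistFactory number, and the q0 signature and simplemodels helper used here are the ones from later pyhf releases, not necessarily this PR's branch:

import pytest
import pyhf
from pyhf.infer import test_statistics

def test_q0_single_bin_single_systematic():
    # one-bin counting experiment with a single uncorrelated background uncertainty
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
    )
    data = [65.0] + model.config.auxdata

    observed_q0 = test_statistics.q0(
        0.0,  # q0 is defined for mu = 0
        data,
        model,
        model.config.suggested_init(),
        model.config.suggested_bounds(),
        model.config.suggested_fixed(),
    )

    # placeholder reference value -- a real validation test would tabulate this
    # from ROOT/HistFactory as discussed in this thread
    expected_q0 = 3.94
    assert observed_q0 == pytest.approx(expected_q0, rel=3e-4)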

@lukasheinrich (Contributor):

Thanks @nikoladze for the PR. I agree this is much-needed functionality. I'll review this. At first glance it seems like it just touches utils.py and doesn't interfere with #519 (which will be quite a bit of a change on the pdf object side).

One thing that would be important is to add tests that check against ROOT, like we do in test_validation.py / the validation dir (here we extracted some hardcoded numbers from ROOT via run_singlepy that we check against; at some point a more proper framework would be needed).

Review thread on src/pyhf/utils.py (outdated, resolved)
@@ -274,18 +351,21 @@ def hypotest(

init_pars = init_pars or pdf.config.suggested_init()
par_bounds = par_bounds or pdf.config.suggested_bounds()
test_stat = qmu if not use_q0 else q0
if qtilde and use_q0:
@lukasheinrich (Contributor) commented Aug 1, 2019

probably this should become a better lookup generally now that we expand the number of test statistics (more a criticism of the original code rather than the PR)

test_stats = {
    'q_mu': ...,
    'q_tilde_mu': ...,
    'q_0': ...,
}
func = test_stats[mystat]

Reply from @nikoladze (Contributor, Author):

This is a good idea, but there are probably more places where a distinction has to be made. For example, how to calculate p-values from the test statistic (mainly different between q_mu and q_tilde) and whether one wants CLs or p-values (CLs doesn't really make sense for q_0).

Reply from a contributor:

I think the best way for now is to keep it as it is, and then clean it up after this PR.

qmu = tensorlib.where(muhatbhat[pdf.config.poi_index] > mu, [0], qmu)
return qmu


def q0(mu, data, pdf, init_pars, par_bounds):
Review comment from a contributor:

why does this have a mu parameter? shouldn't that be fixed to 0?

Reply from @nikoladze (Contributor, Author):

Mhh - probably there is no real use in having this as a free parameter. On the other hand, when we want to switch to selecting the test statistic from a mapping it might be simpler if they all have the same parameters.

Reply from a contributor:

I think a way to do this is to have a kind of test statistic factory which takes kwargs, something like

def get_teststat(type, **opts):
    ...

teststat = get_teststat(type='qmu', mu=mu)
teststat(data, pdf)

or

teststat = get_teststat(type='q0')
teststat(data, pdf)

i.e. mu is not part of the test stat API (which just takes the data and the pdf), but part of the API that gives you the test stat.
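A minimal sketch of that factory idea, assuming the qmu/q0 functions from this PR's pyhf.utils; the name get_teststat and the exact signatures are illustrative, not an agreed API:

from functools import partial

# qmu(mu, data, pdf, init_pars, par_bounds) and q0(data, pdf, init_pars, par_bounds)
# are assumed to be the test-statistic functions discussed in this PR
from pyhf.utils import qmu, q0

def get_teststat(kind, **opts):
    """Return a callable with the uniform signature (data, pdf, init_pars, par_bounds).

    'kind' instead of 'type' only to avoid shadowing the builtin.
    """
    if kind == 'qmu':
        # bake the tested POI value in: mu is part of the factory API,
        # not of the test-statistic API
        return partial(qmu, opts['mu'])
    if kind == 'q0':
        # q0 is always evaluated at mu = 0, so nothing needs to be bound
        return q0
    raise ValueError(f"unknown test statistic: {kind}")

# usage:
# teststat = get_teststat('qmu', mu=1.0)
# teststat(data, pdf, init_pars, par_bounds)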

@lukasheinrich lukasheinrich self-assigned this Aug 1, 2019
@nikoladze (Contributor, Author) commented Aug 2, 2019

> One thing that would be important is to add tests that check against ROOT, like we do in test_validation.py / the validation dir (here we extracted some hardcoded numbers from ROOT via run_singlepy that we check against; at some point a more proper framework would be needed).

I added one test for single-bin, one-systematic now to have at least something working - but it involved quite some copy & pasting and some not-so-elegant if-switches for q_0. Also, at least for some numbers I get a quite high relative difference between pyhf and the values I tabulated from HistFactory (had to set the tolerance to 3e-4).

@kratsg (Contributor) commented Aug 2, 2019

> I added one test for single-bin, one-systematic now to have at least something working - but it involved quite some copy & pasting and some not-so-elegant if-switches for q_0. Also, at least for some numbers I get a quite high relative difference between pyhf and the values I tabulated from HistFactory (had to set the tolerance to 3e-4).

Not good. We should make sure we understand the difference if there is meant to be one (probably not!)

@lukasheinrich (Contributor):

I'll try to check what the root of the discrepancy is

@nikoladze (Contributor, Author):

> I'll try to check what the root of the discrepancy is

Pun intended? ;)
I also tried to dig around a bit. I can't really reproduce the steps ROOT does in the AsymptoticCalculator, but from what the logs indicate, the largest part of the discrepancy comes from different numbers in the Asimov data set. ROOT gives [76.9869, 53.0479], while I get [76.99019042, 53.05121471] from pyhf. When I propagate the ROOT Asimov values through the rest of the pyhf calculation the matching gets a bit better (up to ~2e-5 relative difference).
Interestingly, the fitted value of the nuisance parameter in the constrained fit for getting the Asimov data is pretty close between ROOT and pyhf (~1.5e-5 relative difference).
Now I'm a bit stuck following how AsymptoticCalculator::GenerateAsimovData works ...
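For scale, the quoted Asimov yields differ only at the few-times-1e-5 level:

import numpy as np

root_asimov = np.array([76.9869, 53.0479])          # ROOT AsymptoticCalculator
pyhf_asimov = np.array([76.99019042, 53.05121471])  # pyhf

print(np.abs(pyhf_asimov - root_asimov) / root_asimov)  # approx [4.3e-05, 6.2e-05]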

@lukasheinrich (Contributor):

Hi @nikoladze,

Sorry for the long silence - we're in a pretty big refactoring right now, but I'll come back to this asap.

@nikoladze (Contributor, Author):

@lukasheinrich No problem - I'm also quite busy right now, but I will have a look again when I have time.

@matthewfeickert (Member):

It would be nice to revive this PR (as @lukasheinrich, @kratsg, and I were just talking about discovery stats today). @nikoladze, I may go ahead and do some rebasing and cleaning on your branch (so please fetch if needed). Additionally, maybe it would be nicer if we could move all of these ROOT files used for validation into the scikit-hep-testdata project.

@matthewfeickert added the API (Changes the public API) and feat/enhancement (New feature or request) labels on Jun 17, 2020
@nikoladze (Contributor, Author) commented Jun 20, 2020

> Can you share with us (maybe in a Gist) the code that you're using to do this check and make the plots? Would be nice to look at as well.

I put it in a separate git repository for now. Here is a script with wrapper functions (e.g. running the HistFactory chain in a tempdir):
https://github.com/nikoladze/debug_pyhf/blob/26af92c3b72191044bf36be9d225ab5d18f2747c/run_everything.py
And here is the corresponding notebook that creates the plots I posted earlier (also containing the formula for the "exact" case):
https://github.com/nikoladze/debug_pyhf/blob/26af92c3b72191044bf36be9d225ab5d18f2747c/study_diff_hf_pyhf.ipynb

@nikoladze (Contributor, Author) commented Jun 23, 2020

I think I know more now: the strange drop in p-values for ROOT/HistFactory in the plot posted earlier [plot omitted] comes from the default parameter range in HistFactory for gamma parameters (from 0 to 1 + 5 sigma). When I manually set it to the same range as in pyhf (from 1e-10 to 10) this drop goes away. In practice I think this shouldn't matter though, since we will rarely run into situations where our observed data is so large that in the fit where mu is fixed to 1 the background is pulled up by more than 5 sigma ;)

So I would say that issue is resolved.

However, the discrepancies in the validation test I set up for this PR persist (they concern a reasonable parameter range for the gamma parameter). But I think these can be attributed to differences between fitting with scipy or minuit. When I fit with minuit in pyhf I get results much closer to ROOT (1e-6 relative difference). But - the "exact" asymptotic formula derived from the closed forms of the MLEs is closer to pyhf+scipy! The study can be found in my notebook.

Edit: Note I did all these plots after also patching in the change from #902, to avoid that issue coming in here.
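A sketch of the scipy-vs-minuit comparison described above, assuming the optimizer-switching, mle, and simplemodels API of later pyhf releases (iminuit needs to be installed for the "minuit" optimizer); the model and counts are placeholders:

import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
)
data = [65.0] + model.config.auxdata

pyhf.set_backend("numpy")            # default scipy optimizer
bestfit_scipy = pyhf.infer.mle.fit(data, model)

pyhf.set_backend("numpy", "minuit")  # MINUIT, closer to what ROOT/HistFactory uses
bestfit_minuit = pyhf.infer.mle.fit(data, model)

print(bestfit_scipy)
print(bestfit_minuit)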

@matthewfeickert (Member):

We should work on rebasing this. @nikoladze Do you have any problem with us just taking another branch at the state of master and then using git cherry-pick and hand-picking the relevant sections to get this up to date? There has been a lot of movement in the internals over the last year, so I think this will take a long time to rebase cleanly.

@lukasheinrich (Contributor):

@nikoladze I've been working on test statistic stuff in #1005, which hopefully should make it straightforward to add q0 to.

@@ -109,30 +119,38 @@ def hypotest(
)

_returns = [CLs]
if use_q0:
Review comment from @nikoladze (Contributor, Author):

what follows after here is a little bit of a mess, now with even more if-else distinctions due to q0 - one might consider disentangling the calculation of p-values and CLs?

)
sqrtqmu_v = tensorlib.sqrt(qmu_v)
def teststat_func(*args, **kwargs):
    "make function signature without poi_test to be consistent with q0"
Review comment from @nikoladze (Contributor, Author):

The reason for making a function signature without poi_test is that q0 doesn't take a poi_test argument, since it is always 0. The alternative would be to change q0 to take a poi_test argument, even though it only makes sense for it to be 0.

@@ -226,3 +226,56 @@ def tmu_tilde(mu, data, pdf, init_pars, par_bounds):
+ 'Use the tmu test statistic (pyhf.infer.test_statistics.tmu) instead.'
)
return _tmu_like(mu, data, pdf, init_pars, par_bounds)


def q0(data, pdf, init_pars, par_bounds):
Review comment from @nikoladze (Contributor, Author):

q0 still needs a separate function since it needs to clip for muhat < 0 instead of muhat > mu as done in _qmu_like
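Schematically, the two clippings differ only in the condition on the best-fit signal strength (muhat); a runnable toy illustration of that difference, not the PR's implementation:

import numpy as np

def clip_q0(q0_value, muhat):
    # q0: a downward fluctuation (muhat < 0) carries no evidence for a positive signal
    return np.where(muhat < 0.0, 0.0, q0_value)

def clip_qmu(qmu_value, muhat, mu):
    # q_mu: an upward fluctuation beyond the tested mu is not held against that mu
    return np.where(muhat > mu, 0.0, qmu_value)

print(clip_q0(2.5, -0.3))        # 0.0
print(clip_qmu(2.5, 1.4, 1.0))   # 0.0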

@nikoladze (Contributor, Author):

@matthewfeickert, @lukasheinrich I would be fine with starting over bit by bit from master, but actually it's only a few places where I had to resolve minor conflicts after merging. I added a few inline comments.

if kwargs.get('return_expected_set') or kwargs.get('return_expected'):
    n_sigma_list = [2, 1, 0, -1, -2]
    if not kwargs.get('return_expected_set'):
        n_sigma_list = [0]
Review comment from @nikoladze (Contributor, Author):

Added this to avoid having to type the code for calculating p0 and CLs (and doing the if-else) a second time for the case where only one expected value should be returned.

@matthewfeickert (Member) commented Sep 18, 2020

I'm going to start working on this again as it would be really nice to get this in v0.5.3. For the time being I'm going to start working on a new branch from origin/master and then just copy what is needed from here. We can then decide if it is easier for me to just force push that branch here (preferable in my mind as it keeps the discussion), or if we just close out this PR and then add @nikoladze as co-author on a PR from my branch.

@matthewfeickert (Member):

@nikoladze We've closed this PR as we migrated it over to PR #1232. Thanks for your contribution here, which was a big help! 🚀
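For readers arriving here later: after the migration to PR #1232, the feature surfaced in pyhf as a test_stat option on hypotest; a minimal usage sketch, with the model and counts as placeholder inputs and the exact return conventions to be checked against the pyhf documentation:

import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
)
data = [65.0] + model.config.auxdata

# background-only hypothesis with the q0 test statistic: the returned value
# is interpreted as the observed discovery p-value p0
p0 = pyhf.infer.hypotest(0.0, data, model, test_stat="q0")
print(p0)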
