feat: Add discovery p0 test statistic #520

Closed
wants to merge 23 commits into diana-hep:master from nikoladze:dev-p0

Conversation

@nikoladze (Contributor) commented Jul 28, 2019

Description

It would be nice to have the ability to calculate discovery p-values. I started implementing the q_0 test statistic for discovery of a positive signal. To avoid code duplication I moved the calculation of the profile log likelihood ratio into a separate function.
This could potentially then be included as a separate flag (maybe similar to qtilde) in pyhf.utils.hypotest, but I still have to figure out what the best way to do this will be.
Also, I don't have any tests for this yet.

ReadTheDocs build: https://pyhf.readthedocs.io/en/docs-dev-p0-trigger
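For context, the q_0 statistic targeted here follows the standard asymptotic-formulae prescription (Cowan, Cranmer, Gross, Vitells): q0 = -2 ln(lambda(0)) when the best-fit signal strength muhat >= 0 and q0 = 0 otherwise, with discovery p-value p0 = 1 - Phi(sqrt(q0)). A minimal sketch of that final p-value step, independent of the pyhf internals this PR touches:

import numpy as np
from scipy.stats import norm

def p0_from_q0(q0):
    """Discovery p-value from q0 under the asymptotic approximation: p0 = 1 - Phi(sqrt(q0))."""
    return 1.0 - norm.cdf(np.sqrt(q0))

# q0 = 9.0 corresponds to a 3 sigma excess: p0 ~ 1.35e-3
print(p0_from_q0(9.0))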

Checklist Before Requesting Reviewer

  • Tests are passing
  • "WIP" removed from the title of the pull request
  • Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR

@coveralls commented Jul 28, 2019

Coverage Status

Coverage increased (+0.004%) to 93.567% when pulling a110d96 on nikoladze:dev-p0 into 5255252 on diana-hep:master.

Review thread on src/pyhf/utils.py (outdated, resolved)
@kratsg (Contributor) commented Jul 29, 2019

Had a brief look-over. Seems fine to me. We'll need to figure out a proper way to handle switching out test statistics for these things in the future (something a bit more generic -- perhaps akin to how we handle interpolation code configuration with strings->function mapping).

We will want tests for this however, with some expected outputs to assert that it's written correctly and computes the right answer.

@nikoladze (Contributor, Author):

> Had a brief look-over. Seems fine to me. We'll need to figure out a proper way to handle switching out test statistics for these things in the future (something a bit more generic -- perhaps akin to how we handle interpolation code configuration with strings->function mapping).

That sounds reasonable. So one should then have, for each test statistic, one function that calculates the test statistic and one function that calculates the p-values? (Since this is different for e.g. qmu vs qtilde.) Should I go ahead with just another flag for q0 (I called it use_q0 now due to the name clash with the function) and make another iteration for improving the mapping later, or resolve this as well in this PR?

> We will want tests for this however, with some expected outputs to assert that it's written correctly and computes the right answer.

I would tabulate some values in test_validation.py similar to what is already there for CL_s and validate them against HistFactory?

@kratsg (Contributor) commented Jul 29, 2019

> I would tabulate some values in test_validation.py similar to what is already there for CL_s and validate them against HistFactory?

That's roughly the way to go. We'd want to make sure it isn't too slow to calculate. It might be better to add a unit test around the function itself, rather than an integration test (which test_validation does).
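As a rough sketch of what such a unit test around the statistic itself could look like — the expected value below is a placeholder rather than a tabulated HistFactory number, and the q0 signature and simplemodels helper used here are the ones from later pyhf releases, not necessarily this PR's branch:

import pytest
import pyhf
from pyhf.infer import test_statistics

def test_q0_single_bin_single_systematic():
    # one-bin counting experiment with a single uncorrelated background uncertainty
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
    )
    data = [65.0] + model.config.auxdata

    observed_q0 = test_statistics.q0(
        0.0,  # q0 is defined for mu = 0
        data,
        model,
        model.config.suggested_init(),
        model.config.suggested_bounds(),
        model.config.suggested_fixed(),
    )

    # placeholder reference value -- a real validation test would tabulate this
    # from ROOT/HistFactory as discussed in this thread
    expected_q0 = 3.94
    assert observed_q0 == pytest.approx(expected_q0, rel=3e-4)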

@lukasheinrich (Contributor):

Thanks @nikoladze for the PR. I agree this is much-needed functionality. I'll review this. At first glance it seems like it just touches utils.py and doesn't interfere with #519 (which will be quite a bit of a change on the pdf object side).

One thing that would be important is to add tests that check against ROOT, like we do in test_validation.py / the validation dir (here we extracted some hardcoded numbers from ROOT via run_singlepy that we check against; at some point a more proper framework would be needed).

Review thread on src/pyhf/utils.py (outdated, resolved)
@@ -274,18 +351,21 @@ def hypotest(

init_pars = init_pars or pdf.config.suggested_init()
par_bounds = par_bounds or pdf.config.suggested_bounds()
test_stat = qmu if not use_q0 else q0
if qtilde and use_q0:
@lukasheinrich (Contributor) commented Aug 1, 2019

probably this should become a better lookup generally now that we expand the number of test statistics (more a criticism of the original code rather than the PR)

test_stats = {
    'q_mu': ...,
    'q_tilde_mu': ...,
    'q_0': ...,
}
func = test_stats[mystat]

Reply from @nikoladze (Contributor, Author):

This is a good idea, but there are probably more places where a distinction has to be made. For example, how to calculate p-values from the test statistic (mainly different between q_mu and q_tilde) and whether one wants CLs or p-values (CLs doesn't really make sense for q_0).

Reply from a contributor:

I think the best way for now is to keep it as it is, and then clean it up after this PR.

qmu = tensorlib.where(muhatbhat[pdf.config.poi_index] > mu, [0], qmu)
return qmu


def q0(mu, data, pdf, init_pars, par_bounds):
Review comment from a contributor:

why does this have a mu parameter? shouldn't that be fixed to 0?

Reply from @nikoladze (Contributor, Author):

Mhh - probably there is no real use in having this as a free parameter. On the other hand, when we want to switch to selecting the test statistic from a mapping it might be simpler if they all have the same parameters.

Reply from a contributor:

I think a way to do this is to have a kind of test statistic factory which takes kwargs, something like

def get_teststat(type, **opts):
    ...

teststat = get_teststat(type='qmu', mu=mu)
teststat(data, pdf)

or

teststat = get_teststat(type='q0')
teststat(data, pdf)

i.e. mu is not part of the test stat API (which just takes the data and the pdf), but part of the API that gives you the test stat.
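A minimal sketch of that factory idea, assuming the qmu/q0 functions from this PR's pyhf.utils; the name get_teststat and the exact signatures are illustrative, not an agreed API:

from functools import partial

# qmu(mu, data, pdf, init_pars, par_bounds) and q0(data, pdf, init_pars, par_bounds)
# are assumed to be the test-statistic functions discussed in this PR
from pyhf.utils import qmu, q0

def get_teststat(kind, **opts):
    """Return a callable with the uniform signature (data, pdf, init_pars, par_bounds).

    'kind' instead of 'type' only to avoid shadowing the builtin.
    """
    if kind == 'qmu':
        # bake the tested POI value in: mu is part of the factory API,
        # not of the test-statistic API
        return partial(qmu, opts['mu'])
    if kind == 'q0':
        # q0 is always evaluated at mu = 0, so nothing needs to be bound
        return q0
    raise ValueError(f"unknown test statistic: {kind}")

# usage:
# teststat = get_teststat('qmu', mu=1.0)
# teststat(data, pdf, init_pars, par_bounds)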

@lukasheinrich lukasheinrich self-assigned this Aug 1, 2019
@nikoladze (Contributor, Author) commented Aug 2, 2019

> One thing that would be important is to add tests that check against ROOT, like we do in test_validation.py / the validation dir (here we extracted some hardcoded numbers from ROOT via run_singlepy that we check against; at some point a more proper framework would be needed).

I added one test for single-bin, one-systematic now to have at least something working - but it involved quite some copy & pasting and some not-so-elegant if-switches for q_0. Also, at least for some numbers I get a quite high relative difference between pyhf and the values I tabulated from HistFactory (had to set the tolerance to 3e-4).

@kratsg (Contributor) commented Aug 2, 2019

> I added one test for single-bin, one-systematic now to have at least something working - but it involved quite some copy & pasting and some not-so-elegant if-switches for q_0. Also, at least for some numbers I get a quite high relative difference between pyhf and the values I tabulated from HistFactory (had to set the tolerance to 3e-4).

Not good. We should make sure we understand the difference if there is meant to be one (probably not!)

@lukasheinrich (Contributor):

I'll try to check what the root of the discrepancy is

@nikoladze (Contributor, Author):

> I'll try to check what the root of the discrepancy is

Pun intended? ;)
I also tried to dig around a bit. I can't really reproduce the steps ROOT does in the AsymptoticCalculator, but from what the logs indicate, the largest part of the discrepancy comes from different numbers in the Asimov data set. ROOT gives [76.9869, 53.0479], while I get [76.99019042, 53.05121471] from pyhf. When I propagate the ROOT Asimov values through the rest of the pyhf calculation the matching gets a bit better (up to ~2e-5 relative difference).
Interestingly, the fitted value of the nuisance parameter in the constrained fit for getting the Asimov data is pretty close between ROOT and pyhf (~1.5e-5 relative difference).
Now I'm a bit stuck following how AsymptoticCalculator::GenerateAsimovData works ...
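For scale, the quoted Asimov yields differ only at the few-times-1e-5 level:

import numpy as np

root_asimov = np.array([76.9869, 53.0479])          # ROOT AsymptoticCalculator
pyhf_asimov = np.array([76.99019042, 53.05121471])  # pyhf

print(np.abs(pyhf_asimov - root_asimov) / root_asimov)  # approx [4.3e-05, 6.2e-05]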

@lukasheinrich (Contributor):

Hi @nikoladze,

Sorry for the long silence - we're in a pretty big refactoring right now, but I'll come back to this asap.

@nikoladze (Contributor, Author):

@lukasheinrich No problem - I'm also quite busy right now, but I will have a look again when I have time.

@matthewfeickert (Member):

It would be nice to revive this PR (as @lukasheinrich, @kratsg, and I were just talking about discovery stats today). @nikoladze, I may go ahead and do some rebasing and cleaning on your branch (so please fetch if needed). Additionally, maybe it would be nicer if we could move all of these ROOT files used for validation into the scikit-hep-testdata project.

@matthewfeickert added the API (Changes the public API) and feat/enhancement (New feature or request) labels on Jun 17, 2020
@nikoladze (Contributor, Author) commented Jun 20, 2020

> Can you share with us (maybe in a Gist) the code that you're using to do this check and make the plots? Would be nice to look at as well.

I put it in a separate git repository for now. Here is a script with wrapper functions (e.g. running the HistFactory chain in a tempdir):
https://github.com/nikoladze/debug_pyhf/blob/26af92c3b72191044bf36be9d225ab5d18f2747c/run_everything.py
And here is the corresponding notebook that creates the plots I posted earlier (also containing the formula for the "exact" case):
https://github.com/nikoladze/debug_pyhf/blob/26af92c3b72191044bf36be9d225ab5d18f2747c/study_diff_hf_pyhf.ipynb

@nikoladze (Contributor, Author) commented Jun 23, 2020

I think I know more now: the strange drop in p-values for ROOT/HistFactory in the plot posted earlier [plot omitted] comes from the default parameter range in HistFactory for gamma parameters (from 0 to 1 + 5 sigma). When I manually set it to the same range as in pyhf (from 1e-10 to 10) this drop goes away. In practice I think this shouldn't matter though, since we will rarely run into situations where our observed data is so large that in the fit where mu is fixed to 1 the background is pulled up by more than 5 sigma ;)

So I would say that issue is resolved.

However, the discrepancies in the validation test I set up for this PR persist (they concern a reasonable parameter range for the gamma parameter). But I think these can be attributed to differences between fitting with scipy or minuit. When I fit with minuit in pyhf I get results much closer to ROOT (1e-6 relative difference). But - the "exact" asymptotic formula derived from the closed forms of the MLEs is closer to pyhf+scipy! The study can be found in my notebook.

Edit: Note I did all these plots after also patching in the change from #902, to avoid that issue coming in here.
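A sketch of the scipy-vs-minuit comparison described above, assuming the optimizer-switching, mle, and simplemodels API of later pyhf releases (iminuit needs to be installed for the "minuit" optimizer); the model and counts are placeholders:

import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
)
data = [65.0] + model.config.auxdata

pyhf.set_backend("numpy")            # default scipy optimizer
bestfit_scipy = pyhf.infer.mle.fit(data, model)

pyhf.set_backend("numpy", "minuit")  # MINUIT, closer to what ROOT/HistFactory uses
bestfit_minuit = pyhf.infer.mle.fit(data, model)

print(bestfit_scipy)
print(bestfit_minuit)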

@matthewfeickert (Member):

We should work on rebasing this. @nikoladze Do you have any problem with us just taking another branch at the state of master and then using git cherry-pick and hand-picking the relevant sections to get this up to date? There has been a lot of movement in the internals over the last year, so I think this will take a long time to rebase cleanly.

@lukasheinrich (Contributor):

@nikoladze I've been working on test statistic stuff in #1005, which hopefully should make it straightforward to add q0 to.

@@ -109,30 +119,38 @@ def hypotest(
)

_returns = [CLs]
if use_q0:
Review comment from @nikoladze (Contributor, Author):

what follows after here is a little bit of a mess, now with even more if-else distinctions due to q0 - one might consider disentangling the calculation of p-values and CLs?

)
sqrtqmu_v = tensorlib.sqrt(qmu_v)
def teststat_func(*args, **kwargs):
    "make function signature without poi_test to be consistent with q0"
Review comment from @nikoladze (Contributor, Author):

The reason for making a function signature without poi_test is that q0 doesn't take a poi_test argument, since it is always 0. The alternative would be to change q0 to take a poi_test argument, even though it only makes sense for it to be 0.

@@ -226,3 +226,56 @@ def tmu_tilde(mu, data, pdf, init_pars, par_bounds):
+ 'Use the tmu test statistic (pyhf.infer.test_statistics.tmu) instead.'
)
return _tmu_like(mu, data, pdf, init_pars, par_bounds)


def q0(data, pdf, init_pars, par_bounds):
Review comment from @nikoladze (Contributor, Author):

q0 still needs a separate function since it needs to clip for muhat < 0 instead of muhat > mu as done in _qmu_like
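Schematically, the two clippings differ only in the condition on the best-fit signal strength (muhat); a runnable toy illustration of that difference, not the PR's implementation:

import numpy as np

def clip_q0(q0_value, muhat):
    # q0: a downward fluctuation (muhat < 0) carries no evidence for a positive signal
    return np.where(muhat < 0.0, 0.0, q0_value)

def clip_qmu(qmu_value, muhat, mu):
    # q_mu: an upward fluctuation beyond the tested mu is not held against that mu
    return np.where(muhat > mu, 0.0, qmu_value)

print(clip_q0(2.5, -0.3))        # 0.0
print(clip_qmu(2.5, 1.4, 1.0))   # 0.0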

@nikoladze (Contributor, Author):

@matthewfeickert, @lukasheinrich I would be fine with starting over bit by bit from master, but actually it's only a few places where I had to resolve minor conflicts after merging. I added a few inline comments.

if kwargs.get('return_expected_set') or kwargs.get('return_expected'):
    n_sigma_list = [2, 1, 0, -1, -2]
    if not kwargs.get('return_expected_set'):
        n_sigma_list = [0]
Review comment from @nikoladze (Contributor, Author):

Added this to avoid having to type the code for calculating p0 and CLs (and doing the if-else) a second time for the case where only one expected value should be returned.

@matthewfeickert (Member) commented Sep 18, 2020

I'm going to start working on this again as it would be really nice to get this in v0.5.3. For the time being I'm going to start working on a new branch from origin/master and then just copy what is needed from here. We can then decide if it is easier for me to just force push that branch here (preferable in my mind as it keeps the discussion), or if we just close out this PR and then add @nikoladze as co-author on a PR from my branch.

@matthewfeickert (Member):

@nikoladze We've closed this PR as we migrated it over to PR #1232. Thanks for your contribution here, which was a big help! 🚀
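For readers arriving here later: after the migration to PR #1232, the feature surfaced in pyhf as a test_stat option on hypotest; a minimal usage sketch, with the model and counts as placeholder inputs and the exact return conventions to be checked against the pyhf documentation:

import pyhf

model = pyhf.simplemodels.uncorrelated_background(
    signal=[12.0], bkg=[50.0], bkg_uncertainty=[3.0]
)
data = [65.0] + model.config.auxdata

# background-only hypothesis with the q0 test statistic: the returned value
# is interpreted as the observed discovery p-value p0
p0 = pyhf.infer.hypotest(0.0, data, model, test_stat="q0")
print(p0)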
