Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RF] NLL from RooSimultaneous doesn't give the expected value in multi-range fits #11447

Closed
guitargeek opened this issue Sep 27, 2022 · 6 comments · Fixed by #11455
Closed

[RF] NLL from RooSimultaneous doesn't give the expected value in multi-range fits #11447

guitargeek opened this issue Sep 27, 2022 · 6 comments · Fixed by #11455

Comments

@guitargeek
Copy link
Contributor

guitargeek commented Sep 27, 2022

Describe the bug

Following up on #11396, I noticed that a NLL created from a RooSimultaneous in a multi-ranged configuration doesn't give the expected value. In #11445, I tried to implement a unit test that was checking exactly that, but the NLL from RooSimultaneous was not consistent with a mathematically equivalent sum of NLLs per category.

It is still not clear whether the difference is only a constant factor that doesn't influence the fit result, but this is potentially a but that even has an influence on the results. I suspect that it is actually biasing the fit result, because in multi-ranged fits the NLL is computed using an integral of the top-level PDF that is created with createIntegral, and I'm not sure if the integral of a RooSimultaneous is well defined in that context.

Please @AlkaidCheng, take note of this, because this is a potential problem for your workflow.

Expected behavior

The NLL value from a RooSimultaneous in a multi-range configuration should be the same value as the normalized combination of individual NLLs created in the individual categories.

To Reproduce

Comment out the final part of the MultiRangeFitWithSplitRange unit test in testRooSimultaneous.

Setup

Arch Linux with ROOT master.

Additional context

  • The priority level of this issue can be reduced if it turns out that the difference in the NLL value is a constant that doesn't affect the fit result
@AlkaidCheng
Copy link

Hi @guitargeek ,

Thanks a lot for checking this. I wonder what is the implication of this inconsistency in NLL values though. Does it mean a multi-range fit using RooFit is fundamentally flawed and should not be used?

In our use case, we are trying to obtain the values of the parameters in our background models by fitting to the data sideband. So we are more concerned about the best fit values of the variables in the workspace than the value of the NLL itself. Though from our validation plots, our background model shape obtained from the multi-range fit seems to agree well with the data.

My question is then what is the right thing to do if we want to perform a multi-range fit using RooFit? This could potentially have a big impact on the community since a lot of analysis codes/frameworks rely on this multi-range fit method.

guitargeek added a commit to guitargeek/root that referenced this issue Sep 29, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface to suggest that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like root-project#11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit to guitargeek/root that referenced this issue Sep 29, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
guitargeek added a commit to guitargeek/root that referenced this issue Sep 29, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
@guitargeek
Copy link
Contributor Author

guitargeek commented Sep 29, 2022

Multi-range fits in RooFit used to behave much different from what the user expected, until I fixed this for 6.26 with PR #7719.

Before 6.26, the relative number of events in the different ranges was not considered as information, which lead to sub-optimal fit results because often the relative number of events in the two sidebands can constrain the shape quite a lot. And this is also what people expect: you would expect a sideband fit to be equivalent to the full fit, only with some information in a signal region left out.

While I thought I have fixed this problem for good, you revealed with one of your earlier GitHub issues that I have only solved it for non-simultaneous fits, and for 6.28 I will fix it for simultaneous PDFs too, with the PR linked in this issue.

So yes, this issue affects the best fit values for multi-range fits with RooSimultaneous and you should take all these results with a grain of salt. While the results are fortunately still well defined mathematically as I explained above, they are not optimal and not what users expect.

For illustration, here I fit the multi-range simultaneous NLL from the new unit test before and after the fix that yet has to be merged to master:

Before:

  RooFitResult: minimized FCN value: 31174.6, estimated distance to minimum: 9.97488e-07
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                muCat1    4.1762e+00 +/-  1.32e-01
                muCat2    6.2315e+00 +/-  1.33e-01
            sigma_cat1    1.0064e+00 +/-  2.54e-02
            sigma_cat2    9.6606e-01 +/-  2.48e-02

After:

  RooFitResult: minimized FCN value: 1368.32, estimated distance to minimum: 9.21927e-06
                covariance matrix quality: Full, accurate covariance matrix
                Status : MINIMIZE=0 

    Floating Parameter    FinalValue +/-  Error   
  --------------------  --------------------------
                muCat1    4.0295e+00 +/-  1.88e-02
                muCat2    5.9955e+00 +/-  1.79e-02
            sigma_cat1    1.0105e+00 +/-  2.54e-02
            sigma_cat2    9.6702e-01 +/-  2.47e-02

You see that after the fix, you can expect that the precision on some parameters that are sensitive to the relative number of events in the sidebands improves.

I hope this clarifies the situation a bit, If you want to discuss and ask more about this, feel free to do so here!

guitargeek added a commit to guitargeek/root that referenced this issue Sep 29, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface to suggest that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like root-project#11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit to guitargeek/root that referenced this issue Sep 29, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
@AlkaidCheng
Copy link

Thanks @guitargeek,

Is there a workaround for users using older root versions to get the expected behavior? Or we can only wait until the patches are made e.g. in 6.28?

@guitargeek
Copy link
Contributor Author

guitargeek commented Sep 30, 2022

In 6.26, just don't use the RooSimultaneous and build your NLL as a sum of per-channel NLLs instead, using RooAddition. This is what I also do in the updated cross-check for the simultaneous fit in this PR #11455 ( see the testRooSimultaneous file).

In 6.24 or older, you would have to further all all these multi-range normalization correction integrals to the RooAddition, like I introduced it under the hood in #7719. But I would not advice doing this unless you carefully validate that you implemented the math correctly.

Then, I also have a little hack for you that would work in all ROOT versions: the whole problem does not affect extended fits, as explained in this tutorial from ROOT 6.24: https://root.cern.ch/doc/v624/rf204b__extendedLikelihood__rangedFit_8C.html

So even if you don't care about doing an extended fit, by making your PDF an extended one and then ignoring the normalization parameter you're all set. If you already do an extended fit, you also don't have to worry at all about the multi-ranged fits,even with the RooSimultaneous.

guitargeek added a commit to guitargeek/root that referenced this issue Oct 1, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface to suggest that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like root-project#11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit to guitargeek/root that referenced this issue Oct 1, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
guitargeek added a commit to guitargeek/root that referenced this issue Oct 1, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface suggesting that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like root-project#11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit to guitargeek/root that referenced this issue Oct 1, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
@AlkaidCheng
Copy link

Thanks for the helpful information. May I ask what the recommendation is after the fix (e.g. in 6.28+)? Do you expect users to always build NLL as a sum of per-channel NLLs? I believe a lot of existing packages in the community rely on building a simultaneous pdf for multi-category models, so it may not be trivial to change this tradition.

guitargeek added a commit to guitargeek/root that referenced this issue Oct 3, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface suggesting that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like root-project#11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit to guitargeek/root that referenced this issue Oct 3, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes root-project#11447.
@guitargeek
Copy link
Contributor Author

No, the idea of the fix is that the RooSimultaneous will behave as expected without changing any user code.

The advice with the sum of NLLs is only a workaround for 6.26 to make it work, and the trick with the extended fit is a trick to make it work for all ROOT versions before the upcoming 6.28.

guitargeek added a commit that referenced this issue Oct 3, 2022
Multi-range fits in RooFit are more complicated than they should be.

In principle, all that is required is to change the normalization range
of the PDF to the union of the ranges.

There is a RooAbdPdf interface suggesting that this could be done easily
like this:
```C++
pdf.setNormRange("range1,range2")
```

But this didn't wor for RooAddPdfs, which is probably why it was chosen
to implement mulit-range fits as a sum of separate RooNLLVars. In the
old test statistics framework. But in this case, the PDFs are normalized
separately, and extra terms need to be introduced to correct for that.
This resulted in lots of complicated code, and still there are issues
like #11447, i.e. is still doesn't work for simultaneous fits.

Now that in the previous commits the `RooAddPdf` behavior for
`setNormRange()` was fixed, one can actually use comma-separated
normalization ranges for a single PDF in a multi-range fit. With this
commit, this is done in the old RooFit test statistics classes.

This has sevaral advantages:

1. Fixes issues with multi-range simultaneous fits
2. It's now in harmony with the logic in the new BatchMode
3. Some speedup because there are less nodes in the computation graph
4. Less code required
5. Logic is easier to understand

Maybe there are also new bugs now, but they can be fixed later. I am
sure that already now the commit fixes more issues that it creates.
guitargeek added a commit that referenced this issue Oct 3, 2022
Now that the multi-ranged fits for RooSimultaneous got re-implemented in
a better way with hopefully less bugs, the closure check for multi-range
fits in `testRooSimultaneous` succeeds, so it should be done in the
test.

Closes #11447.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants