Design Improved commondata fomat #1416

Closed
Zaharid opened this issue Sep 23, 2021 · 15 comments
@Zaharid
Contributor

Zaharid commented Sep 23, 2021

There has been some renewed talk about improving the data implementation technology. One of the aspects is the commondata format. Here are some more or less agreed-upon desiderata that have been discussed in the past.

  • Metadata all in one place (i.e. the PLOTTING file should become a header containing all relevant information that isn't the data itself). SYSTYPEs should probably go there as well, and it should be separate from the data itself. There is some prior art in https://github.com/NNPDF/buildmaster/pull/101 that never got merged.
  • References in the metadata that are used by vp to produce tables (including in LaTeX).
  • Support for N kinematics (this will require some changes in vp and probably retiring the cpp codepath).
  • Bins as opposed to central values. Add binning information to the commondata format #1006
  • Support variants e.g. for bugfixes Dataset variants #494. This ties in with the defaults cc @siranipour.
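As a sketch of what a single metadata header could look like once the PLOTTING, SYSTYPE, variant, and theory-composition information are merged (all field names and values below are illustrative, not an agreed format):

```yaml
# metadata.yaml -- illustrative only; no field name here is an agreed format
setname: EXAMPLE_DY_13TEV
description: "Example Drell-Yan measurement at 13 TeV"
references:
  arxiv: "1234.56789"        # used by vp to produce tables, incl. LaTeX
  hepdata: "ins1234567"
kinematics:
  variables: [y, M, sqrts]   # N kinematic variables, not fixed to 3
  bins_file: bins.yaml       # bin edges live in a separate file
data:
  central_values: data.yaml
  uncertainties: uncertainties.yaml   # absorbs the SYSTYPE information
variants:
  legacy:                    # e.g. a pre-bugfix version of the uncertainties
    uncertainties: uncertainties_legacy.yaml
theory:
  operation: ratio           # absorbs COMPOUND: how to combine the FK tables
  fktables: [[numerator], [denominator]]
```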

On top of that there are concerns about duplication of information between commondata and the theory predictions toolchain. To me the logical conclusion of that train of thought is that if we don't want duplication (e.g. with regard to the binning) then the commondata format (metadata) should contain all the information required to make a theory prediction, except for the theory parameters such as alpha_s and the PDF. The idea would be that it could then be mechanically converted into the partial input for some Monte Carlo. We should keep in mind that a Monte Carlo run is more or less in one-to-one correspondence with an FK table, but an experimental measurement could involve several of these, combined with some operation such as a ratio.

I also don't really know what other things would be required. This would certainly eat into other creative formats that have accumulated, such as the COMPOUND files, which in this picture would also be absorbed into the metadata, probably together with much of what is in the current "runcards" repository. I don't know if this is a good idea, but I certainly see a number of advantages.

@Zaharid
Contributor Author

Zaharid commented Sep 23, 2021

There is the question of how to lay out the metadata, binning, and measurement with uncertainties.

I get the impression that these should all be separate files that might be consumed independently by different tools. E.g. a Monte Carlo needs kinematics and metadata but not measurements or uncertainties. And we could have multiple variants of measurements. The validphys loader would use the metadata only, and different actions would pull other relevant files as needed.
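A minimal Python sketch of that layout, with a hypothetical `DataSetFiles` wrapper in which only the metadata is loaded eagerly and the other files are pulled on demand (all names are illustrative; the real loader would parse YAML files rather than the in-memory dict used here):

```python
from functools import cached_property


class DataSetFiles:
    """A dataset folder: metadata is loaded up front, everything else lazily."""

    def __init__(self, files):
        self.files = files                   # mapping: filename -> parsed content
        self.metadata = files["metadata"]    # always read, e.g. by the loader

    @cached_property
    def bins(self):
        return self.files["bins"]

    @cached_property
    def measurement(self):
        # A Monte Carlo needs kinematics and metadata only, so this is never
        # touched unless an action (e.g. a data/theory comparison) asks for it.
        return self.files["measurement"]


ds = DataSetFiles({
    "metadata": {"setname": "EXAMPLE", "nkin": 3},
    "bins": [(0.0, 0.5), (0.5, 1.0)],
    "measurement": {"central": [1.2, 3.4]},
})
print(ds.metadata["setname"])  # available immediately
print(ds.bins[0])              # loaded only on first access
```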

@Zaharid Zaharid self-assigned this Sep 23, 2021
@Zaharid Zaharid changed the title Improve commondata fomat Design Improved commondata fomat Sep 23, 2021
@cschwan
Contributor

cschwan commented Sep 24, 2021

Hi @Zaharid, thanks for adding me to the discussion here.

As I see it, this is how we should proceed (but feel free to disagree):

  1. Add support for .pineappl files instead of the .dat FK table files (either all files are .dat or all are .pineappl) in the theory data files. All other files, CFACTOR and COMPOUND for instance, are left untouched.
    The advantage is that we can move the theory predictions over to the new format one by one, which makes testing much easier. I've already had a look at the interfacing code with @scarlehoff, and the changes needed there should be minimal thanks to @alecandido's and @felixhekhorn's efforts with the PineAPPL Python interface, which has now been merged into PineAPPL master. The new grids will have much more metadata (see below).
  2. Once we've implemented the previous point, we should at least verify that the duplicated bin information is duplicated correctly. As I see it, duplication isn't a big problem in itself; it only becomes one when the duplicated copies disagree.
  3. As a final step (or rather steps) we can remove the duplication. How best to do this I'm not sure yet; I think I first have to implement (the experimental data of) a dataset myself to get a feeling for it. However, I think the entire aim here should be to make the implementation of new data easy.

Now let me reply to @Zaharid's points above:

  • with a PineAPPL grid we typically already have the metadata in one place, namely in the grid itself. See the description of the metadata in the https://github.com/NNPDF/runcards repository (scroll down to the description of metadata.txt). In particular, there's x1_label_tex for the label of the first observable, for use in LaTeX, and arxiv and hepdata for the arXiv ID and data storage, which we could use to produce the tables now shown in the NNPDF4.0 appendix B, etc.
    The interesting question probably is where we should store the metadata in the first place, i.e. where it should be read from when a new measurement/theory prediction is implemented. For the time being they are in the runcards repository, but I could imagine that we want to streamline the implementation of a new dataset; having to touch several repositories is something we should avoid if we can.
  • the complete bin information is also stored in a PineAPPL grid. If you perform a convolution, for instance, you can see the (upper and lower) bin limits for each dimension, where the number of dimensions is arbitrary.
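For point 2 above, the consistency check could be as simple as comparing the bin limits stored on both sides. A hedged sketch, with plain tuples standing in for the commondata and grid bin limits (hypothetical data; with a real PineAPPL grid the right-hand side would come from the grid's bin-limit accessors instead):

```python
def bins_consistent(commondata_bins, grid_bins, tol=1e-10):
    """Check that duplicated (lower, upper) bin limits agree within `tol`."""
    if len(commondata_bins) != len(grid_bins):
        return False
    return all(
        abs(cl - gl) < tol and abs(cu - gu) < tol
        for (cl, cu), (gl, gu) in zip(commondata_bins, grid_bins)
    )


cd_bins = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]
grid_bins = [(0.0, 0.5), (0.5, 1.0), (1.0, 2.0)]
print(bins_consistent(cd_bins, grid_bins))        # True
print(bins_consistent(cd_bins, grid_bins[:-1]))   # False: bin count differs
```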

@scarlehoff scarlehoff self-assigned this Sep 24, 2021
@Zaharid
Contributor Author

Zaharid commented Sep 24, 2021

My initial idea had been to try and keep this discussion separate from the PineAPPL one, but I realize there are going to be overlaps. I am also unclear on the relative timescales of the various components and how they should fit together, but it may well be sensible to do as you say and integrate PineAPPL first.

Experience shows that having the metadata separate from the data is rather advantageous (and potentially enables things like having multiple alternatives as to what the data is), so I certainly think there should be an independent metadata header somewhere. Similarly, I think that we would like to have a way of representing bins independently of both theory predictions and experimental data.

@alecandido
Member

alecandido commented Sep 24, 2021

First of all (sorry if I mention time-related information in an issue), I believe that all of this would best be discussed in a Wednesday meeting; I would propose next Wednesday.

On a longer time scale I would assign some goals to people (e.g. a few of them will be included in our October deadline to provide a first working toolchain) and I would reassess at the in-person code meeting (i.e. before the next collaboration meeting).

> Experience shows that having the metadata separate from the data is rather advantageous (and potentially enables things like having multiple alternatives as to what the data is), so I certainly think there should be an independent metadata header somewhere. Similarly, I think that we would like to have a way of representing bins independently of both theory predictions and experimental data.

Speaking to this specific point, I believe you are both suggesting meaningful options, and I would do both, in the following way:

  • for handmade objects (like the data implementation) I would keep data and metadata separate: the data are generated mechanically from what the experiments provide, while the metadata are usually handwritten by someone
  • for generated objects (like a PineAPPL grid) I would keep the relevant metadata that are consumed in producing them inside the object itself

The rationale is that commondata live in a single place, and everyone will know where to find what they need. By contrast, I would not require PineAPPL grids to be stored in a single place from the beginning, alongside their metadata (they will be for caching, but that is on top), so PineAPPL grids are more "flowing" objects, and I would make them flow in a single bundle with their metadata, in order to never decouple the two.

@alecandido alecandido self-assigned this Sep 24, 2021
@Zaharid
Contributor Author

Zaharid commented Sep 24, 2021

I think PineAPPL grids are indeed outside of this discussion: we want many versions of the grids, and these are going to be stored differently. Also, it makes sense for us to group them by "theories", which is what is going to be used in the fit (it would be nice if grids could be tracked and updated individually, though). So I think the mechanism should be somewhat similar to the current one, where FK tables know how to attach themselves to the corresponding commondata.

On the other hand, it seems that the format discussed here should be part of the input used to produce theory predictions. To that, one would attach the various theory parameters (which may include the evolution ones) and store those in the grid itself.

@alecandido
Member

I want to reiterate my proposal of packaging the commondata manager separately, but this time for a new reason:

  • we want to depend on it in runcardsrunner, because in general we do not want to duplicate commondata
  • we decided that the automated toolchain will be managed by validphys as an action, and thus validphys itself has to depend on pineappl and runcardsrunner (at least; most likely even on eko and pineko)

If commondata is an inseparable part of validphys, we'll end up with a cyclic dependency.

Of course, we could think of runcardsrunner being fed data directly by validphys, without any specific mention of commondata, but this would make it impossible to run runcardsrunner in isolation, breaking one of our design principles (i.e. we want actors like yadism, eko, and runcardsrunner to run locally, independently of anything else, fed only with data files; the principle is of course our own choice, but it looks sane).

In any case I'm open to alternative solutions.

@cschwan
Contributor

cschwan commented Sep 29, 2021

Why do we have theory predictions and experimental measurements in separate places? Wouldn't it make sense to bundle them together (that would make any kind of duplication easy to get rid of) in a single container?

@scarlehoff
Member

The experimental measurements are what they are, but the theory is arbitrary (i.e., all the alpha_s variations are different theories that are to be compared to the same measurements).

@Zaharid
Contributor Author

Zaharid commented Sep 29, 2021

> Why do we have theory predictions and experimental measurements in separate places? Wouldn't it make sense to bundle them together (that would make any kind of duplication easy to get rid of) in a single container?

I can think of a couple of reasons:

  • I don't think we expect the theory predictions that we are going to use to be produced by experimentalists, so there needs to be something separate for that.
  • We rarely want to store all the theory predictions (as in all the FK tables) for a given measurement in the same place. Experimental data is small and cheap to store (even in git), while theory grids are huge.

Also, it is how we have been doing it, and change is expensive :)

@enocera
Contributor

enocera commented Oct 6, 2021

> There has been some renewed talk about improving the data implementation technology. One of the aspects is the commondata format. Here are some more or less agreed-upon desiderata that have been discussed in the past.
>
> * Metadata all in one place (i.e. the PLOTTING file should become a header containing all relevant information that isn't the data itself). SYSTYPEs should probably go there as well, and it should be separate from the data itself. There is some prior art in [[WIP] Merging PLOTTING_*.yaml to meta/*.yaml buildmaster#101](https://github.com/NNPDF/buildmaster/pull/101) that never got merged.
>
> * References in the metadata that are used by vp to produce tables (including in LaTeX).
>
> * Support for N kinematics (this will require some changes in vp and probably retiring the cpp codepath).
>
> * Bins as opposed to central values. [Add binning information to the commondata format #1006](https://github.com/NNPDF/nnpdf/issues/1006)

I think that we should have both. Bin edges are useful for making nice plots. Central values are sometimes useful for making predictions, and it is not guaranteed that the central value (as the experimentalists intend it) sits at the centre of each bin.
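Storing both can be made cheap for consumers: a sketch of a helper that prefers the experiment-quoted central value and falls back to the bin midpoint (field names are illustrative, not a proposed schema):

```python
def kinematic_point(bin_):
    """Experiment-quoted central value if present, else the bin midpoint."""
    if "central" in bin_:
        return bin_["central"]
    return 0.5 * (bin_["low"] + bin_["high"])


print(kinematic_point({"low": 0.0, "high": 0.5, "central": 0.27}))  # 0.27
print(kinematic_point({"low": 0.5, "high": 1.0}))                   # 0.75 (midpoint fallback)
```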

> * Support variants, e.g. for bugfixes [Dataset variants #494](https://github.com/NNPDF/nnpdf/issues/494). This ties in with the defaults cc @siranipour.

Here I'd like to add two further points.

  • Do we need to store uncertainties twice in the commondata files, once as absolute values and once as percentages? I'd say no.
  • What do we do with nuclear uncertainties? Will we continue to treat them as other systematic uncertainties and append them to the other experimental uncertainties? The thing that annoys me a little is that, if the nuclear PDF set used to estimate nuclear uncertainties changes frequently, then you'll have a large number of variant commondata files. Which is perhaps fine if we have a clever way of labeling variant files that differ only in the nuclear uncertainties.
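On the first point, storing one representation is indeed enough, since the other is a one-line conversion (a sketch with hypothetical helpers; "percent" here means relative to the central value):

```python
def absolute_from_percent(central, percent):
    """Convert a percentage uncertainty into an absolute one."""
    return central * percent / 100.0


def percent_from_absolute(central, absolute):
    """Convert an absolute uncertainty into a percentage one."""
    return 100.0 * absolute / central


cv = 50.0
print(absolute_from_percent(cv, 4.0))  # 2.0
print(percent_from_absolute(cv, 2.0))  # 4.0
```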

> On top of that there are concerns about duplication of information between commondata and the theory predictions toolchain. To me the logical conclusion of that train of thought is that if we don't want duplication (e.g. with regard to the binning) then the commondata format (metadata) should contain all the information required to make a theory prediction, except for the theory parameters such as alpha_s and the PDF. The idea would be that it could then be mechanically converted into the partial input for some Monte Carlo. We should keep in mind that a Monte Carlo run is more or less in one-to-one correspondence with an FK table, but an experimental measurement could involve several of these, combined with some operation such as a ratio.

> I also don't really know what other things would be required. This would certainly eat into other creative formats that have accumulated, such as the COMPOUND files, which in this picture would also be absorbed into the metadata, probably together with much of what is in the current "runcards" repository. I don't know if this is a good idea, but I certainly see a number of advantages.

I agree on the general framework, and specifically on the fact that the cornerstone of the whole design is the metadata. I guess that the remark made by Christopher implicitly assumes that one may want to distribute PineAPPL grids (similarly to what is done with APPLgrids) and that some PineAPPL grids (including EW corrections) may work only with a specific variant of the data. But I don't see why one shouldn't distribute both the metadata and the PineAPPL grids.

@Zaharid
Contributor Author

Zaharid commented Oct 7, 2021

> What do we do with nuclear uncertainties? Will we continue to treat them as other systematic uncertainties and append them to the other experimental uncertainties? The thing that annoys me a little is that, if the nuclear PDF set used to estimate nuclear uncertainties changes frequently, then you'll have a large number of variant commondata files. Which is perhaps fine if we have a clever way of labeling variant files that differ only in the nuclear uncertainties.

I think there are two separate problems with this at the moment: one is that the way in which the nuclear uncertainties are made changes quickly and it is not so clear how to name and specify the versions; the other is that they end up requiring modifications to the whole commondata file in a way that is not so easy to grok. And here we are discussing something that is, at least conceptually, provided by the experimentalists, and that we would like to keep more or less immutable.

I think the solution to the second problem could be supporting "extrinsic" uncertainty files that would be combined at the end to give a total covmat. These would then be enabled at the level of a runcard or, better still, in some default specification.
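A sketch of how such extrinsic files could be combined: the experimental covmat stays immutable, and separately stored covmats (e.g. nuclear) are added only when enabled. Plain nested lists stand in here for matrices loaded from files (in practice these would be numpy arrays; all names are illustrative):

```python
def total_covmat(exp_covmat, extrinsic_covmats):
    """Sum the experimental covmat with any enabled extrinsic covmats."""
    n = len(exp_covmat)
    total = [row[:] for row in exp_covmat]  # never mutate the experimental one
    for extra in extrinsic_covmats:
        for i in range(n):
            for j in range(n):
                total[i][j] += extra[i][j]
    return total


exp = [[4.0, 1.0], [1.0, 9.0]]
nuclear = [[0.5, 0.0], [0.0, 0.5]]      # hypothetical nuclear covmat variant
print(total_covmat(exp, [nuclear]))     # [[4.5, 1.0], [1.0, 9.5]]
print(total_covmat(exp, []))            # no extrinsic files enabled
```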

@cschwan
Contributor

cschwan commented Oct 7, 2021

@Zaharid Sorry, only now did I realize that this issue indeed has nothing (or very little) to do with the theory predictions. But I'm still interested in this, although I'm sure I should implement a dataset in buildmaster to understand the problem space ...

@Zaharid
Contributor Author

Zaharid commented Oct 7, 2021

> @Zaharid Sorry, only now did I realize that this issue indeed has nothing (or very little) to do with the theory predictions. But I'm still interested in this, although I'm sure I should implement a dataset in buildmaster to understand the problem space ...

I suppose it has to do insofar as we would like to use this stuff as input for theory predictions.

@Zaharid
Contributor Author

Zaharid commented Oct 13, 2021

Incidentally, I get the impression that we need the concept of a correlated systematic that is more global than a dataset. I would think only in terms of completely correlated systematics, as that is somewhat more natural (i.e. the "systematics matrix" in the covmat paper), and any uncertainty can always be divided into a correlated and an uncorrelated part, treated as different systematics.

As for what to do implementation-wise, the simple solution is the same as now, namely to match systematics by (global) name but otherwise remain within the dataset; the deluxe option is a separate lookup table that might even contain some metadata. The latter might be beneficial in that it could help experimentalists provide that information cleanly.

cc @enocera
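A sketch of the fully-correlated picture, where C = diag(stat^2) + S S^T and a systematic contributes across datasets whenever the same global name appears on both points (all names and numbers below are illustrative):

```python
def covmat(stat, sys_shifts):
    """Build C = diag(stat^2) + S S^T, where column k of S holds the shift
    that globally named systematic k induces on each data point."""
    names = sorted({name for point in sys_shifts for name in point})
    n = len(stat)
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        cov[i][i] += stat[i] ** 2
        for j in range(n):
            for name in names:
                cov[i][j] += sys_shifts[i].get(name, 0.0) * sys_shifts[j].get(name, 0.0)
    return cov


# Two points from *different* datasets sharing the "LUMI_LHC" systematic:
stat = [1.0, 2.0]
shifts = [{"LUMI_LHC": 0.5}, {"LUMI_LHC": 0.4, "MODEL_X": 0.3}]
c = covmat(stat, shifts)
print(c[0][1])  # nonzero: 0.5 * 0.4 from the shared systematic
```

Points from different datasets become correlated simply by sharing a name, which is the "simple solution" above; the lookup table would then only add metadata about each named systematic.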

@scarlehoff
Member

This is superseded by #1709
