Design Improved commondata format #1416
There is the question of how to lay out metadata, binning and measurement with uncertainties. I get the impression that these should all be separate files that might be consumed independently by different tools. E.g. a Monte Carlo needs kinematics and metadata but not measurements or uncertainties. And we could have multiple variants of measurements. The validphys loader would use the metadata only, and different actions would pull other relevant files as needed.
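The file-per-concern layout could look something like the sketch below: only the metadata is parsed eagerly, and kinematics or a given measurement variant are pulled on demand. The file names, the JSON serialization, and the `Dataset` API are all illustrative assumptions, not an agreed format.

```python
# Hypothetical loader sketch: metadata is read up front; binning and
# measurement variants live in separate files and are loaded lazily.
from dataclasses import dataclass, field
from pathlib import Path
import json


@dataclass
class Dataset:
    folder: Path
    metadata: dict = field(init=False)

    def __post_init__(self):
        # Only the metadata is parsed eagerly (what validphys would use).
        self.metadata = json.loads((self.folder / "metadata.json").read_text())

    def kinematics(self):
        # Pulled on demand, e.g. by a Monte Carlo toolchain.
        return json.loads((self.folder / "kinematics.json").read_text())

    def measurement(self, variant="default"):
        # Different measurement variants live in separate files.
        return json.loads((self.folder / f"data_{variant}.json").read_text())
```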
Hi @Zaharid , thanks for adding me to the discussion here. As I see it, this is how we should proceed (but feel free to disagree):
Now let me reply to @Zaharid's points above:
My initial idea had been to try and keep this discussion separate from the PineAPPL one, but I realize there are going to be overlaps. I am also unclear on the relative timescales of the various components and how they should fit together, but it may well be sensible to do as you say and integrate PineAPPL first. Experience shows that having the metadata separate from the data is rather advantageous (and potentially enables things like having multiple alternatives for what the data is), so I certainly think there should be an independent metadata header somewhere. Similarly, I think that we would like to have a way of representing bins independently of both theory predictions and experimental data.
First of all (sorry if I mention time-related information in an issue), I believe that all of this would be properly discussed in a Wednesday meeting; I would propose next Wednesday. On a longer time scale I would assign some goals to people (e.g. a few of them will be included in our October deadline to provide a first working toolchain) and I would reassess at the in-person code meeting (i.e. before the next collaboration meeting).
Speaking to the specific point, I believe that you are both suggesting meaningful options, and I would do both, in the following way:
The rationale is that commondata live in a single place, and everyone will know where to find what they need. By contrast, I would not require PineAPPL grids to be stored in a single place from the beginning, alongside their metadata (they will be cached, but that is on top), so PineAPPL grids are more "flowing" objects, and I would make them flow in a single bundle with their metadata, in order to never decouple the two.
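The "single bundle" idea could be sketched as below: a grid is never shipped without its metadata, because the two are packed into one archive. The archive layout and member names are assumptions for illustration only.

```python
# Sketch: pack a PineAPPL grid and its metadata into one tar archive so
# they cannot be decoupled while "flowing" between tools.
# The member names ("grid.pineappl", "metadata.json") are assumptions.
import io
import json
import tarfile


def bundle_grid(grid_bytes: bytes, metadata: dict) -> bytes:
    """Return a single tar archive containing the grid and its metadata."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name, payload in (
            ("grid.pineappl", grid_bytes),
            ("metadata.json", json.dumps(metadata).encode()),
        ):
            info = tarfile.TarInfo(name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()
```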
I think PineAPPL grids are indeed outside of this discussion: we want many versions of the grids and these are going to be stored differently. Also, it makes sense for us to group them by "theories", which is what is going to be used in the fit (it would be nice if grids could be tracked and updated individually, though). So I think the mechanism should be somewhat similar to the current one, where fktables know how to attach themselves to the corresponding commondata. On the other hand, it seems that the format discussed here should be part of the input used to produce theoretical predictions. To that one would attach the various theory parameters (which may include the evolution ones) and store those in the grid itself.
I want to reiterate my proposal of packaging the commondata manager separately, but this time for a new reason:
If we are going to have commondata as an inseparable part of validphys we'll end up with a cyclic dependency. In any case, I'm open to alternative solutions.
Why do we have theory predictions and experimental measurements in separate places? Wouldn't it make sense to bundle them together in a single container (that would make any kind of duplication easy to get rid of)?
The experimental measurements are what they are, but the theory is arbitrary (i.e., all the alpha_s variations are different theories that are to be compared to the same measurements).
I can think of a couple of reasons:
Also, it is how we have been doing it, and change is expensive :)
I think that we should have both. Bin edges are useful for making nice plots. Central values are sometimes useful to make predictions, and it is not guaranteed that the central value (as the experimentalists intend it) stands at the centre of each bin.
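A tiny sketch of why storing both is worthwhile: the midpoint is derivable from the edges, but the quoted central value need not coincide with it. The numbers here are made up for illustration.

```python
# Midpoints are derivable from bin edges (fine for plotting), but the
# experiment's quoted central value can sit elsewhere in the bin, e.g.
# at a cross-section-weighted centre. Values below are hypothetical.
def midpoints(edges):
    """Geometric midpoints computed from consecutive bin edges."""
    return [(lo + hi) / 2 for lo, hi in zip(edges[:-1], edges[1:])]


edges = [100.0, 200.0]   # one pT bin, hypothetical
quoted_centre = 132.0    # central value as the experimentalists quote it

assert midpoints(edges) == [150.0]
assert quoted_centre != midpoints(edges)[0]
```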
Here I'd like to add two additional points.
I agree on the general framework, and specifically on the fact that the cornerstone of the whole design is the metadata. I guess that the remark made by Christopher implicitly assumes that one may want to distribute PineAPPL grids (similarly to what is done with APPLgrids) and that some PineAPPL grids (including EW corrections) may work only with a specific variant of the data. But I don't see why one shouldn't distribute both metadata and PineAPPL grids.
I think there are two separate problems with this at the moment. One is that the way in which the nuclear uncertainties are made changes quickly, and it is not so clear how to name and specify the versions; the other is that they end up requiring modifying the whole commondata file in a way that is not so easy to grok. And here we are discussing something that is provided by experimentalists, at least conceptually, and that we would like to keep more or less immutable. I think the solution to that second problem could be supporting "extrinsic" uncertainty files that would be combined at the end to give a total covmat. These would then be enabled at the level of a runcard, or better still in some default specification.
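The "extrinsic" uncertainty idea could be sketched as follows: the experimental covariance stays immutable, and enabled extra contributions (e.g. nuclear uncertainties) are only added at the end. The additive combination and the function name are assumptions.

```python
# Sketch: combine an immutable experimental covariance matrix with
# optional "extrinsic" contributions, enabled by a runcard-like setting.
import numpy as np


def total_covmat(experimental: np.ndarray, extrinsic: list) -> np.ndarray:
    """Add the enabled extrinsic covariance contributions on top of the
    untouched experimental covariance matrix."""
    total = experimental.copy()
    for extra in extrinsic:
        total = total + extra
    return total
```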
@Zaharid Sorry, only now did I realize that this issue has little (if anything) to do with the theory predictions. But I'm still interested in this, although I'm sure I should implement a dataset in buildmaster to understand the problem space ...
I suppose it has to do insofar we would like to use this stuff as input for theory predictions. |
Incidentally, I get the impression that we need the concept of a correlated systematic that is more global than a dataset. I would think only in terms of completely correlated systematics, as that is somewhat more natural (i.e. the "systematics matrix" in the covmat paper), and any systematic can always be divided into a correlated and an uncorrelated part with different "systematics". As for what to do implementation-wise, the simple solution is the same as now, namely to match them by (global) name but otherwise remain within the dataset; the deluxe option is a separate lookup table that might even contain some metadata. The latter might be beneficial in that it could help experimentalists provide that information cleanly. cc @enocera
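The "systematics matrix" construction from fully correlated, globally named systematics could be sketched like this; the data layout (a dict keyed by the global systematic name) is an assumption.

```python
# Sketch: build a covariance matrix from fully correlated systematics,
# matched by global name, plus a diagonal statistical part:
#   C = diag(stat^2) + sum_k s_k s_k^T
import numpy as np


def covmat(stat: np.ndarray, sys_by_name: dict) -> np.ndarray:
    """stat: per-point statistical uncertainties.
    sys_by_name: global systematic name -> per-point shifts s_k."""
    c = np.diag(stat ** 2)
    for shifts in sys_by_name.values():
        c += np.outer(shifts, shifts)  # fully correlated contribution
    return c
```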
This is superseded by #1709 |
There has been some renewed talk about improving the data implementation technology. One of the aspects is the commondata format. Here are some more or less agreed-upon desiderata that have been discussed in the past.
On top of that there are concerns about duplication of information between commondata and the theory predictions toolchain. To me the logical conclusion of that train of thought is that if we don't want duplication (e.g. with regard to the binning), then the commondata format (metadata) should contain all the information required to make a theory prediction except for the theory parameters such as alpha_s and the PDF. The idea would be that it could then be mechanically converted into the partial input for some Monte Carlo. We should keep in mind that a Monte Carlo run is more or less in one-to-one correspondence with an fktable, but an experimental measurement could involve a bunch of these, reduced with some operation, such as ratios.
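The "several fktables reduced with some operation" idea, which COMPOUND files currently encode, could be sketched as a small operation table over per-grid predictions; the operation names and the `combine` function are assumptions for illustration.

```python
# Sketch: an observable may be built from several fktable-level
# predictions reduced with a named operation (the role COMPOUND files
# play today). Operation names below are assumptions.
import numpy as np

OPERATIONS = {
    "NULL": lambda preds: preds[0],            # single-grid observable
    "RATIO": lambda preds: preds[0] / preds[1],  # e.g. W+/W- ratios
    "ADD": lambda preds: sum(preds),           # sum of channels
}


def combine(operation: str, predictions: list) -> np.ndarray:
    """Reduce per-grid predictions into the measured observable."""
    return OPERATIONS[operation](np.asarray(predictions, dtype=float))
```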
I also don't really know what other things would be required. This would certainly eat into other creative formats that have accumulated, such as the COMPOUND files, which in this picture would also be absorbed into the metadata, probably together with much of what is in the current "runcards" repository. I don't know if this is a good idea, but I certainly see a number of advantages.