Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New metadata format #1684

Conversation

t7phy
Copy link
Member

@t7phy t7phy commented Mar 1, 2023

This PR implements CMS_ttBar_8TeV dataset with the new metadata as discussed in Ams code meeting. @scarlehoff this builds upon the commondata reader branch, however, could you please tweak the reader script in this branch and test out if it works on the new format?
backward_compatbility.yaml in the buildmaster directory is the dictionary as you had proposed, please have a look at that too.

@t7phy t7phy requested a review from enocera March 1, 2023 12:53
@scarlehoff scarlehoff changed the base branch from master to final_reader_for_new_commondata_mk2 March 1, 2023 12:56
@scarlehoff
Copy link
Member

Thanks. If I'm not mistaken the only change with respect to the previous version of the commondata is the metadata.yaml file right?

@scarlehoff scarlehoff requested a review from Zaharid March 1, 2023 12:57
@t7phy
Copy link
Member Author

t7phy commented Mar 1, 2023

Thanks. If I'm not mistaken the only change with respect to the previous version of the commondata is the metadata.yaml file right?

Yes

@t7phy t7phy marked this pull request as ready for review March 8, 2023 13:08
@t7phy t7phy mentioned this pull request Mar 29, 2023
3 tasks
@scarlehoff
Copy link
Member

scarlehoff commented Mar 30, 2023

I have a few comments about the files in this PR, and some more (at the bottom) that are more specific and that should be answered I guess by @enocera

Please don't leave version empty. It should be 1 if it is the initial version. For instance

version: 1
version_comment: "initial version"

The figure_by field is missing information for all observables.

Rather than having an empty theory I would prefer if you just leave the theory out when it does not exist.
Related to this, the "theory" field must always include an operation. For instance, for the observable in which you do have information would be:

FK_tables:
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTTRAP
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTOT
operation: RATIO

The theory information can be found in this repository: https://github.com/NNPDF/theories/blob/main/data/yamldb/400/CMSTOPDIFF8TEVTTRAPNORM.yaml


For the following I would like to have also @enocera input:

There's "a problem" with the kinematics. In the past we always had three kinematic variables, however in the observables in this PR there is only two. It is of course reasonable that different datasets have a different number of kinematic variables (then I will fill in one with 0s or something since there are several points in vp in which three are assumed).

However, looking at the "old commondata" version of CMSTOPDIFF8TEVTTRAPNORM there are three:

from validphys.api import API
cd = API.dataset(dataset_input={"dataset": "CMSTOPDIFF8TEVTTRAPNORM"}, use_cuts="nocuts", theoryid=200).load_commondata()
cd.commondata_table

RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

Also, inspecting the old version of CMSTOPDIFF8TEVTTRAPNORM I am very confused at how can I compare the older version with the new ones. If I look at the uncertainties in CMSTOPDIFF8TEVTTRAPNORM I find cd.nsys=21 but cd.systematics_table has 42 columns (which happens to be double nsys). What are those 42 entries exactly? Maybe this a question more for @Zaharid ? Inspecting the 42 entries in the old version and the 21 of the new one I see some that match, some that doesn't and some for which there are a factor between them (for instance for ArtUnc_{2,3,4} there is a -1 factor between the old one and the new).

So I guess I have two requests:
1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says?
2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

@enocera
Copy link
Contributor

enocera commented Mar 30, 2023

For the following I would like to have also @enocera input:

There's "a problem" with the kinematics. In the past we always had three kinematic variables, however in the observables in this PR there is only two.

The number of kinematic variables in the new format should never be smaller than the number of kinematic variables in the old format. For plotting purposes, the three variables are all needed. Let me fix this with @t7phy .

It is of course reasonable that different datasets have a different number of kinematic variables (then I will fill in one with 0s or something since there are several points in vp in three are assumed).

However, looking at the "old commondata" version of CMSTOPDIFF8TEVTTRAPNORM there are three:

from validphys.api import API
cd = API.dataset(dataset_input={"dataset": "CMSTOPDIFF8TEVTTRAPNORM"}, use_cuts="nocuts", theoryid=200).load_commondata()
cd.commondata_table

RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

Also, inspecting the old version of CMSTOPDIFF8TEVTTRAPNORM I am very confused at how can I compare the older version with the new ones. If I look at the uncertainties in CMSTOPDIFF8TEVTTRAPNORM I find cd.nsys=21 but cd.systematics_table has 42 columns (which happens to be double nsys). What are those 42 entries exactly? Maybe this a question more for @Zaharid ? Inspecting the 42 entries in the old version and the 21 of the new one I see some that match, some that doesn't and some for which there are a factor between them (for instance for ArtUnc_{2,3,4} there is a -1 factor between the old one and the new).

So I guess I have two requests: 1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says? 2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

@scarlehoff Let me check all these issues with Tanishq (I'm going to talk to him in 2 mins) and we will come back to you.

@enocera
Copy link
Contributor

enocera commented Mar 30, 2023

So I guess I have two requests: 1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says? This applies to ALL data sets, so you should encounter similar mismatches also for the NMC and DY data sets.

@scarlehoff The old commondata format is as follows {idat, kin1, kin2, kin3, data, stat, {systematics}}, but {systematics} is twice the number of systematic uncertainties in the (old) metadata. The reason being that the first is the absolute value of the sys uncertainty, and the second is the percentage value, see https://docs.nnpdf.science/data/exp-data-files.html?highlight=commondata#commondata-file-format. Does this answer your question?

2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

Let me check carefully the implementation and I will tell you when you can proceed with the testing further.

@t7phy
Copy link
Member Author

t7phy commented Mar 30, 2023

I have a few comments about the files in this PR, and some more (at the bottom) that are more specific and that should be answered I guess by @enocera

Please don't leave version empty. It should be 0 if it is the initial version. For instance

version: 0
version_comment: "initial version"

This is done

The figure_by field is missing information for all observables.

I will do in next commits

Rather than having an empty theory I would prefer if you just leave the theory out when it does not exist. Related to this, the "theory" field must always include an operation. For instance, for the observable in which you do have information would be:

I would prefer to leave them empty, because the goal is to generate all the theories for which data is there and implemented. This emptiness will be a reminder to generate them, and eventually have this field populated, as opposed to throwing away the theory field itself.

FK_tables:
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTTRAP
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTOT
operation: RATIO

I will modify this in the next commits

RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

@scarlehoff The final convention is that we use definitions for the top level, as it is consistent with others, for e.g. bin_s_
For the 2nd level, it will be description, this is consistent in this PR and in NMC. I am not sure about FTDY (but if it is different here, please modify it. In short this is how it should be:
definitions:
error_name:
description:
treatment:
type:

@scarlehoff
Copy link
Member

scarlehoff commented Mar 30, 2023

@enocera

Does this answer your question?

Yes, and no. It's still not clear to me how the convention on the old commondatas are transported to the new ones since now we have only the "real (used) one". Does that mean upon loading a variant of the uncertainties how each uncertainty is to be treated is fixed forever? (now that I'm writing it down I'm getting flashes, I guess this is @Zaharid's point in #1691 but I need to fully remember all this if I'm to document it...)

@t7phy

I would prefer to leave them empty, because the goal is to generate all the theories for which data is there and implemented. This emptiness will be a reminder to generate them, and eventually have this field populated, as opposed to throwing away the theory field itself.

I would insist on throwing away the field itself. That way I have freedom deciding which fields are or not optional in the theory (plus the format itself) and the people implementing the data has the freedom to add new datasets without knowing what the theory format is.

I am not sure about FTDY (but if it is different

I'll update to follow the convention used here (also for the other fields if they change).

Don't worry too much about NMC since I guess that one is going to be redone (btw, if you want to cherry-pick the CMS dataset from this branch and rebase it on top of master or utils, please do so).

@scarlehoff scarlehoff mentioned this pull request Mar 30, 2023
5 tasks
@scarlehoff scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from a6c63d9 to 56fe007 Compare April 28, 2023 10:06
@scarlehoff scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch 2 times, most recently from 6410bca to 26948a2 Compare June 21, 2023 13:46
@scarlehoff scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from 26948a2 to 8725a81 Compare September 21, 2023 10:19
@scarlehoff scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch 2 times, most recently from b117b34 to 639e6c6 Compare October 11, 2023 17:11
@t7phy
Copy link
Member Author

t7phy commented Oct 24, 2023

closing this as we now have the agreed upon format for metadata with implementation examples in #1813

@t7phy t7phy closed this Oct 24, 2023
@scarlehoff scarlehoff deleted the more_efficient_metadata_for_new_commondata branch November 14, 2024 10:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants