New metadata format #1684

t7phy · 2023-03-01T12:53:13Z

This PR implements CMS_ttBar_8TeV dataset with the new metadata as discussed in Ams code meeting. @scarlehoff this builds upon the commondata reader branch, however, could you please tweak the reader script in this branch and test out if it works on the new format?
backward_compatbility.yaml in the buildmaster directory is the dictionary as you had proposed, please have a look at that too.

scarlehoff · 2023-03-01T12:57:37Z

Thanks. If I'm not mistaken the only change with respect to the previous version of the commondata is the metadata.yaml file right?

t7phy · 2023-03-01T12:59:07Z

Thanks. If I'm not mistaken the only change with respect to the previous version of the commondata is the metadata.yaml file right?

Yes

buildmaster/CMS_ttBar_lj_8TeV_dif/metadata.yaml

scarlehoff · 2023-03-30T11:57:40Z

I have a few comments about the files in this PR, and some more (at the bottom) that are more specific and that should be answered I guess by @enocera

Please don't leave version empty. It should be 1 if it is the initial version. For instance

version: 1
version_comment: "initial version"

The figure_by field is missing information for all observables.

Rather than having an empty theory I would prefer if you just leave the theory out when it does not exist.
Related to this, the "theory" field must always include an operation. For instance, for the observable in which you do have information would be:

FK_tables:
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTTRAP
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTOT
operation: RATIO

The theory information can be found in this repository: https://github.com/NNPDF/theories/blob/main/data/yamldb/400/CMSTOPDIFF8TEVTTRAPNORM.yaml

For the following I would like to have also @enocera input:

There's "a problem" with the kinematics. In the past we always had three kinematic variables, however in the observables in this PR there is only two. It is of course reasonable that different datasets have a different number of kinematic variables (then I will fill in one with 0s or something since there are several points in vp in which three are assumed).

However, looking at the "old commondata" version of CMSTOPDIFF8TEVTTRAPNORM there are three:

from validphys.api import API
cd = API.dataset(dataset_input={"dataset": "CMSTOPDIFF8TEVTTRAPNORM"}, use_cuts="nocuts", theoryid=200).load_commondata()
cd.commondata_table

RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

Also, inspecting the old version of CMSTOPDIFF8TEVTTRAPNORM I am very confused at how can I compare the older version with the new ones. If I look at the uncertainties in CMSTOPDIFF8TEVTTRAPNORM I find cd.nsys=21 but cd.systematics_table has 42 columns (which happens to be double nsys). What are those 42 entries exactly? Maybe this a question more for @Zaharid ? Inspecting the 42 entries in the old version and the 21 of the new one I see some that match, some that doesn't and some for which there are a factor between them (for instance for ArtUnc_{2,3,4} there is a -1 factor between the old one and the new).

So I guess I have two requests:
1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says?
2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

enocera · 2023-03-30T12:28:41Z

For the following I would like to have also @enocera input:

There's "a problem" with the kinematics. In the past we always had three kinematic variables, however in the observables in this PR there is only two.

The number of kinematic variables in the new format should never be smaller than the number of kinematic variables in the old format. For plotting purposes, the three variables are all needed. Let me fix this with @t7phy .

It is of course reasonable that different datasets have a different number of kinematic variables (then I will fill in one with 0s or something since there are several points in vp in three are assumed).

However, looking at the "old commondata" version of CMSTOPDIFF8TEVTTRAPNORM there are three:
from validphys.api import API
cd = API.dataset(dataset_input={"dataset": "CMSTOPDIFF8TEVTTRAPNORM"}, use_cuts="nocuts", theoryid=200).load_commondata()
cd.commondata_table
RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

Also, inspecting the old version of CMSTOPDIFF8TEVTTRAPNORM I am very confused at how can I compare the older version with the new ones. If I look at the uncertainties in CMSTOPDIFF8TEVTTRAPNORM I find cd.nsys=21 but cd.systematics_table has 42 columns (which happens to be double nsys). What are those 42 entries exactly? Maybe this a question more for @Zaharid ? Inspecting the 42 entries in the old version and the 21 of the new one I see some that match, some that doesn't and some for which there are a factor between them (for instance for ArtUnc_{2,3,4} there is a -1 factor between the old one and the new).

So I guess I have two requests: 1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says? 2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

@scarlehoff Let me check all these issues with Tanishq (I'm going to talk to him in 2 mins) and we will come back to you.

enocera · 2023-03-30T14:16:02Z

So I guess I have two requests: 1 - Explain to me how the old format was written down. If I look at the .dat file I can see the 42 entries and (I guess?) it is loaded as index process kin1, kin2, kin3, data, stat, {systematics} but why is the length of the systematics columns much longer than what the systype file says? This applies to ALL data sets, so you should encounter similar mismatches also for the NMC and DY data sets.

@scarlehoff The old commondata format is as follows {idat, kin1, kin2, kin3, data, stat, {systematics}}, but {systematics} is twice the number of systematic uncertainties in the (old) metadata. The reason being that the first is the absolute value of the sys uncertainty, and the second is the percentage value, see https://docs.nnpdf.science/data/exp-data-files.html?highlight=commondata#commondata-file-format. Does this answer your question?

2 - When bugs have been corrected, please mention that in the definition/description in some way, otherwise the task of checking old vs new will be very hard for me.

Let me check carefully the implementation and I will tell you when you can proceed with the testing further.

t7phy · 2023-03-30T14:22:18Z

I have a few comments about the files in this PR, and some more (at the bottom) that are more specific and that should be answered I guess by @enocera

Please don't leave version empty. It should be 0 if it is the initial version. For instance
version: 0
version_comment: "initial version"

This is done

The figure_by field is missing information for all observables.

I will do in next commits

Rather than having an empty theory I would prefer if you just leave the theory out when it does not exist. Related to this, the "theory" field must always include an operation. For instance, for the observable in which you do have information would be:

I would prefer to leave them empty, because the goal is to generate all the theories for which data is there and implemented. This emptiness will be a reminder to generate them, and eventually have this field populated, as opposed to throwing away the theory field itself.

FK_tables:
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTTRAP
- - CMSTOPDIFF8TEVTTRAPNORM-TOPDIFF8TEVTOT
operation: RATIO

I will modify this in the next commits

RE the uncertainties, in the previous versions (FTDY or NMC) it was called definition instead of description. Both are fine, but please stick to one. Same for the top-level name. In the previous versions it was called definition but here is definitions.

@scarlehoff The final convention is that we use definitions for the top level, as it is consistent with others, for e.g. bin_s_
For the 2nd level, it will be description, this is consistent in this PR and in NMC. I am not sure about FTDY (but if it is different here, please modify it. In short this is how it should be:
definitions:
error_name:
description:
treatment:
type:

scarlehoff · 2023-03-30T14:39:12Z

@enocera

Does this answer your question?

Yes, and no. It's still not clear to me how the convention on the old commondatas are transported to the new ones since now we have only the "real (used) one". Does that mean upon loading a variant of the uncertainties how each uncertainty is to be treated is fixed forever? (now that I'm writing it down I'm getting flashes, I guess this is @Zaharid's point in #1691 but I need to fully remember all this if I'm to document it...)

@t7phy

I would prefer to leave them empty, because the goal is to generate all the theories for which data is there and implemented. This emptiness will be a reminder to generate them, and eventually have this field populated, as opposed to throwing away the theory field itself.

I would insist on throwing away the field itself. That way I have freedom deciding which fields are or not optional in the theory (plus the format itself) and the people implementing the data has the freedom to add new datasets without knowing what the theory format is.

I am not sure about FTDY (but if it is different

I'll update to follow the convention used here (also for the other fields if they change).

Don't worry too much about NMC since I guess that one is going to be redone (btw, if you want to cherry-pick the CMS dataset from this branch and rebase it on top of master or utils, please do so).

buildmaster/CMS_TTBAR_8TEV_LJ_DIF/metadata.yaml

t7phy · 2023-10-24T13:21:05Z

closing this as we now have the agreed upon format for metadata with implementation examples in #1813

scarlehoff and others added 10 commits February 23, 2023 14:25

new commondata reader

c0b0501

improve the notebook

b901e23

remove duplicated functions

870f849

make theory optional

5290ba1

added experiment

c51c659

rawdata added

0e12066

dataset implemented

047f469

metedata corrections

558a35c

backward compatibility dict

a825d27

added systematics breakdown

9940765

t7phy requested a review from enocera March 1, 2023 12:53

scarlehoff changed the base branch from master to final_reader_for_new_commondata_mk2 March 1, 2023 12:56

scarlehoff requested a review from Zaharid March 1, 2023 12:57

t7phy added 2 commits March 1, 2023 19:42

minor utils func modification

04fba08

consistency imporvements and better utils

3876ba2

t7phy marked this pull request as ready for review March 8, 2023 13:08

more descriptive setname

0fb4ae9

felixhekhorn reviewed Mar 8, 2023

View reviewed changes

buildmaster/CMS_ttBar_lj_8TeV_dif/metadata.yaml Outdated Show resolved Hide resolved

dataset name consistency

eaaee55

t7phy mentioned this pull request Mar 13, 2023

Utils for the new commondata #1693

Closed

scarlehoff added the data toolchain label Mar 14, 2023

scarlehoff mentioned this pull request Mar 23, 2023

[New commondata format] FTDY #1679

Closed

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from 5290ba1 to 9849418 Compare March 29, 2023 15:10

updated to new naming conv

4e894af

t7phy mentioned this pull request Mar 29, 2023

New CommonData Reader #1678

Merged

3 tasks

version added

7c83a05

scarlehoff mentioned this pull request Mar 30, 2023

Docs for new commondata format #1708

Merged

5 tasks

3 kin var

e5cbec7

felixhekhorn reviewed Mar 31, 2023

View reviewed changes

buildmaster/CMS_TTBAR_8TEV_LJ_DIF/metadata.yaml Show resolved Hide resolved

Radonirinaunimi mentioned this pull request Apr 19, 2023

Validphys script to download hepdata tables #1717

Closed

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from a6c63d9 to 56fe007 Compare April 28, 2023 10:06

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch 2 times, most recently from 6410bca to 26948a2 Compare June 21, 2023 13:46

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch from 26948a2 to 8725a81 Compare September 21, 2023 10:19

scarlehoff force-pushed the final_reader_for_new_commondata_mk2 branch 2 times, most recently from b117b34 to 639e6c6 Compare October 11, 2023 17:11

t7phy closed this Oct 24, 2023

scarlehoff deleted the more_efficient_metadata_for_new_commondata branch November 14, 2024 10:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New metadata format #1684

New metadata format #1684

t7phy commented Mar 1, 2023

scarlehoff commented Mar 1, 2023

t7phy commented Mar 1, 2023

scarlehoff commented Mar 30, 2023 •

edited

Loading

enocera commented Mar 30, 2023

enocera commented Mar 30, 2023 •

edited

Loading

t7phy commented Mar 30, 2023

scarlehoff commented Mar 30, 2023 •

edited

Loading

t7phy commented Oct 24, 2023

New metadata format #1684

New metadata format #1684

Conversation

t7phy commented Mar 1, 2023

scarlehoff commented Mar 1, 2023

t7phy commented Mar 1, 2023

scarlehoff commented Mar 30, 2023 • edited Loading

enocera commented Mar 30, 2023

enocera commented Mar 30, 2023 • edited Loading

t7phy commented Mar 30, 2023

scarlehoff commented Mar 30, 2023 • edited Loading

t7phy commented Oct 24, 2023

scarlehoff commented Mar 30, 2023 •

edited

Loading

enocera commented Mar 30, 2023 •

edited

Loading

scarlehoff commented Mar 30, 2023 •

edited

Loading