
Roadmap for experiment API #483

Closed
5 tasks
scarrazza opened this issue Jun 16, 2019 · 6 comments
@scarrazza
Member

Following the discussion in #476 the next steps require at least:

Integration with validphys

  • replace CommonData object calls with the new dataframe
  • implement conversion mapping between experiments and the new data key.

New features

  • implement filtering
  • implement artificial replica generation
  • implement covariance matrix computation
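As a concrete starting point for the replica-generation item, here is a minimal sketch: pseudo-data replicas are drawn by fluctuating the experimental central values according to the experimental covariance matrix. The function name and signature here are illustrative assumptions, not the actual validphys API.

```python
import numpy as np

rng = np.random.default_rng(42)

def make_replica(central_values, covmat):
    """Draw one artificial (pseudo-data) replica from
    a multivariate normal N(central_values, covmat)."""
    return rng.multivariate_normal(central_values, covmat)

# Toy inputs with made-up numbers
central = np.array([1.0, 2.0, 3.0])
cov = np.diag([0.1, 0.2, 0.1])
replica = make_replica(central, cov)  # same shape as central
```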
@Zaharid
Contributor

Zaharid commented Jun 18, 2019

These all seem pretty sensible and uncontroversial. It would be good to have some of this done for the meeting so we can focus on the stuff that actually needs to be discussed, like e.g. a more robust defaults system.

@wilsonmr
Contributor

My understanding is that data can at least be thought of as a flat list of datasets - correct me if I'm wrong

For computing the total chi2 it is clear that sum_group chi2^(group) is not necessarily equal to the total chi2 unless the covariance matrix, arranged by groups, is block diagonal. In the most extreme case, like fits with the theory covariance matrix, the correct total chi2 is only given by taking all of data and calculating the chi2 on it. Can anyone think of a reasonable example where calculating the total chi2 on data would give the wrong answer?

If not, then surely config.py can be modified so that the experiments key is read from the runcard, the user is warned that the experiments key is deprecated, and the essentially two-level runcard list of experiments is flattened and then parsed as data; then we have backwards compatibility.

Then the grouping could be specified manually with a similar syntax to the experiments, but slightly simpler: since the dataset input was already specified in data, only names are required:

data: fromfit
grouping:
  - name: group1
    datasets:
      - datasetname 1
      - datasetname 2

alternatively we have the groupby option which then allows you to groupby a metadata key:

data: fromfit
grouping:
    groupby: nnpdf31_process

the groupby is expanded (and later stored in the lock file, #496), and the grouping is a namespace list of groups (maybe there would need to be a GroupSpec, which probably shares a lot in common with ExperimentSpec?), which in turn are namespace lists of datasets

In this way the experiments key gets nicely deprecated whilst retaining backward compatibility and the arbitrary group of datasets is easier to understand (than the current hack of using ExperimentSpec).
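To illustrate the groupby expansion described above, here is a sketch of turning a `groupby` metadata key into an explicit grouping. The metadata mapping and function name are made up for illustration; in practice the process tag would come from each dataset's plotting metadata.

```python
from collections import defaultdict

# Hypothetical metadata lookup (illustrative names only)
METADATA = {
    "datasetname 1": {"nnpdf31_process": "DIS"},
    "datasetname 2": {"nnpdf31_process": "DIS"},
    "datasetname 3": {"nnpdf31_process": "DY"},
}

def expand_groupby(datasets, key):
    """Expand a `groupby: <key>` spec into an explicit list of named groups."""
    groups = defaultdict(list)
    for ds in datasets:
        groups[METADATA[ds][key]].append(ds)
    return [{"name": name, "datasets": members}
            for name, members in groups.items()]

grouping = expand_groupby(list(METADATA), "nnpdf31_process")
# → [{'name': 'DIS', 'datasets': ['datasetname 1', 'datasetname 2']},
#    {'name': 'DY', 'datasets': ['datasetname 3']}]
```

The expanded form is exactly what the manual `grouping` syntax would contain, which is what makes storing it in the lock file straightforward.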

@Zaharid
Contributor

Zaharid commented Jul 18, 2019

> My understanding is that data can at least be thought of as a flat list of datasets - correct me if I'm wrong
>
> For computing the total chi2 it is clear that sum_group chi2^(group) is not necessarily equal to the total chi2 unless the covariance matrix, arranged by groups, is block diagonal. In the most extreme case, like fits with the theory covariance matrix, the correct total chi2 is only given by taking all of data and calculating the chi2 on it. Can anyone think of a reasonable example where calculating the total chi2 on data would give the wrong answer?

The total chi2 on data is what we typically want. If you neglect the correlations, you can get very different results (and you cannot in general approximate the sum to any reasonable precision). This is discussed a bit in the alpha_s paper.

https://arxiv.org/abs/1802.03398
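To make the point concrete, here is a toy numerical check (with made-up numbers) that the per-group chi2 values do not sum to the total chi2 when the covariance matrix has cross-group correlations:

```python
import numpy as np

# Four data points split into two groups of two, with non-zero
# cross-group correlations in the (made-up) covariance matrix.
residuals = np.array([1.0, -0.5, 0.8, 0.2])
cov = np.array([
    [1.0, 0.1, 0.3, 0.2],
    [0.1, 1.0, 0.2, 0.1],
    [0.3, 0.2, 1.0, 0.1],
    [0.2, 0.1, 0.1, 1.0],
])

def chi2(r, c):
    """chi2 = r^T C^{-1} r"""
    return r @ np.linalg.solve(c, r)

total = chi2(residuals, cov)
# Per-group chi2 discards the off-diagonal cross-group blocks:
by_group = chi2(residuals[:2], cov[:2, :2]) + chi2(residuals[2:], cov[2:, 2:])
# total != by_group: the sum over groups misses the correlations
```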

> If not, then surely config.py can be modified so that the experiments key is read from the runcard, the user is warned that the experiments key is deprecated, and the essentially two-level runcard list of experiments is flattened and then parsed as data; then we have backwards compatibility.

I agree with this, but I don't really see the relation with the previous point?

> Then the grouping could be specified manually with a similar syntax to the experiments, but slightly simpler: since the dataset input was already specified in data, only names are required:

> data: fromfit
> grouping:
>   - name: group1
>     datasets:
>       - datasetname 1
>       - datasetname 2

That is what we had for alpha_s. See
https://vp.nnpdf.science/6YsazxTxRIibUCb24zOFDA==/input/runcard.yaml
It would be good to have this for quick tests and speculative things, but once you get into "production" it really is a pain to write the same grouping each time, and even more of a pain that it has to change slightly and then you have to make sure you are copying from the right place.

> alternatively we have the groupby option which then allows you to groupby a metadata key:

> data: fromfit
> grouping:
>   groupby: nnpdf31_process

That is what we should have done from the start.

> the groupby is expanded (and later stored in the lock file, #496), and the grouping is a namespace list of groups (maybe there would need to be a GroupSpec, which probably shares a lot in common with ExperimentSpec?), which in turn are namespace lists of datasets

> In this way the experiments key gets nicely deprecated whilst retaining backward compatibility and the arbitrary group of datasets is easier to understand (than the current hack of using ExperimentSpec).

@wilsonmr
Contributor

wilsonmr commented Jul 18, 2019

> If not, then surely config.py can be modified so that the experiments key is read from the runcard, the user is warned that the experiments key is deprecated, and the essentially two-level runcard list of experiments is flattened and then parsed as data; then we have backwards compatibility.

> I agree with this, but I don't really see the relation with the previous point?

Well, I was checking I wasn't missing something, because the checklist item:

> implement conversion mapping between experiments and the new data key.

is basically trivial as def parse_experiments(self, exps): -> data
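A minimal sketch of that flattening (the function name is taken from the comment above; the deprecation warning and the runcard structure assumed here are illustrative):

```python
import warnings

def parse_experiments(experiments):
    """Flatten the deprecated two-level `experiments` runcard structure
    into the flat dataset list expected by the new `data` key (sketch)."""
    warnings.warn(
        "The 'experiments' key is deprecated; use 'data' instead.",
        DeprecationWarning,
    )
    return [ds for exp in experiments for ds in exp["datasets"]]

# Hypothetical two-level runcard input
experiments = [
    {"experiment": "exp1", "datasets": [{"dataset": "d1"}, {"dataset": "d2"}]},
    {"experiment": "exp2", "datasets": [{"dataset": "d3"}]},
]
data = parse_experiments(experiments)
# → [{'dataset': 'd1'}, {'dataset': 'd2'}, {'dataset': 'd3'}]
```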

> alternatively we have the groupby option which then allows you to groupby a metadata key:

> data: fromfit
> grouping:
>   groupby: nnpdf31_process

> That is what we should have done from the start.

Of course, both options should be available. I'm just thinking it would be nice to make some progress with this. From what I can tell, the data keyword can be done completely in parallel with #476, since that now just refers to defining what the CommonData object is and how it is loaded, rather than how we group it.

@Zaharid
Contributor

Zaharid commented Jul 18, 2019

Well yes, progress here has been slow and would be welcome. Ideally we should have had this before merging #499 but I already gave up on that.

@Zaharid
Contributor

Zaharid commented Mar 26, 2021

Going to close this as we have more up to date issues for discussing it.

@Zaharid Zaharid closed this as completed Mar 26, 2021