YODA2 writer #53

Merged
merged 22 commits into HEPData:main on Nov 3, 2023

Conversation

@20DM (Collaborator) commented Jan 30, 2023

New YODA writers to work with upcoming YODA2 release.

A new YODA1 writer is introduced, intended as a temporary legacy writer that will convert data tables to YODA1-style scatters (but without the error breakdown).

The current YODA writer is modified to convert the data tables to the new BinnedEstimateND objects. Discrete integer and string axes are now also supported.

@20DM (Collaborator Author) commented Jan 30, 2023

Dear @GraemeWatt - here's my attempt at tweaking the YODA converter to support the upcoming YODA2 release.

I'm still in the process of validating it locally: my plan was to download the YAML inputs for the Inspire IDs in the usual analyses.json that we keep around in Rivet, and then use a local installation of YODA2 and the hepdata_converter to convert these to YODA2-style outputs.

I noticed occasional crashes where parsing of the YAML input downloaded from HepData fails, e.g. similar to this:

    raise RuntimeError(
RuntimeError: Data file (yaml_inputs/HEPData-ins421984-v1-yaml/Table1.yaml) did not pass validation: yaml_inputs/HEPData-ins421984-v1-yaml/Table1.yaml: Uncertainties should not all be zero in 'dependent_variables.values[72].errors' | Uncertainties should not all be zero in 'dependent_variables.values[75].errors' | Uncertainties should not all be zero in 'dependent_variables.values[77].errors' | Uncertainties should not all be zero in 'dependent_variables.values[78].errors'

Is this expected / should I just skip these?

Thanks!

@GraemeWatt (Member) commented:

@20DM, thanks for making a start on this!

The problem with the YAML parsing is because the validation has become more strict over time, so some older records can fail validation against the most recent schema. The workaround is to validate against the original schema using an option 'validator_schema_version': '0.1.0' (see test_yodawriter.py). This is what we do for the converter in production and so you should use the same option.
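
For reference, a minimal sketch of how that option can be passed through the Python API (assuming the usual hepdata_converter.convert entry point; the paths below are placeholders):

    import hepdata_converter

    # Convert a downloaded HEPData submission to YODA, validating the input
    # against the original (0.1.0) schema rather than the latest one.
    hepdata_converter.convert(
        'yaml_inputs/HEPData-ins421984-v1-yaml',   # directory with the YAML tables
        'yoda_outputs/HEPData-ins421984-v1.yoda',  # output file (placeholder path)
        options={
            'input_format': 'yaml',
            'output_format': 'yoda',
            'validator_schema_version': '0.1.0',
        },
    )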

@20DM (Collaborator Author) commented Jan 31, 2023

> @20DM, thanks for making a start on this!
>
> The problem with the YAML parsing is because the validation has become more strict over time, so some older records can fail validation against the most recent schema. The workaround is to validate against the original schema using an option 'validator_schema_version': '0.1.0' (see test_yodawriter.py). This is what we do for the converter in production and so you should use the same option.

Thanks @GraemeWatt ! That fixed a lot of the parsing errors. There are still O(10) of the following type left:

ERROR:hepdata_converter.writers.array_writer:TypeError encountered when parsing 2.300e-003f

Are they also "fixable" with an option or shall I skip them?

@GraemeWatt (Member) commented:

That particular case came up in a previous discussion. I just fixed the corresponding table by directly editing the YAML data file stored on disk to remove the extra f at the end of the uncertainties.

We don't have other options to skip errors, but in most cases a YODA file will still be produced. In the case you mentioned (with an extra f appended to uncertainties), the YODA file was written with zero uncertainties and ErrorBreakdown: {0: {stat: {}}, 1: {stat: {}}, 2: {stat: {}}, 3: {stat: {}}, 4: {stat: {}}, 5: {stat: {}}, 6: {stat: {}}, 7: {stat: {}}, 8: {stat: {}}}.

It would be helpful if you could list the O(10) records that still give an error with the corresponding error message. These are likely caused by incorrectly formatted records, so the fix is to either update the records or to modify the converter to tolerate the incorrect formatting. Some errors might already be logged as open issues in this repo.

@20DM (Collaborator Author) commented Feb 1, 2023

Adding some more printout, I realised that all 9 messages came from the same entry, which was indeed the STAR measurement (793126) that you've just fixed. I've synced the YAML and the messages have disappeared! Apologies for the noise and thanks for fixing!

@20DM (Collaborator Author) commented Feb 1, 2023

Hi @GraemeWatt - here's another odd one I've come across, cf. second independent column of Table 18 https://www.hepdata.net/record/56218
The crash comes from the array writer (I've removed the constraint on non-numerical input) because "value" is a string ("2.1 +- 0.1") while "low" and "high" are floats, which doesn't work so well with the + and - operators when trying to calculate asymmetric errors 😅

I could try editing the array writer to give preference to the low/high edges if they exist and there's a type mismatch with the value, but perhaps fixing the yaml directly is the preferred approach here. Any preference?

@20DM (Collaborator Author) commented Feb 1, 2023

PS - this sort of patch is what I had in mind:

--- a/hepdata_converter/writers/array_writer.py
+++ b/hepdata_converter/writers/array_writer.py
@@ -203,8 +203,14 @@ class ArrayWriter(Writer, metaclass=abc.ABCMeta):
                     min_errs.append(sqrt(errors_min))
                     max_errs.append(sqrt(errors_max))
                 elif 'low' in entry and 'high' in entry:
-                    min_errs.append(entry['value'] - entry['low'])
-                    max_errs.append(entry['high'] - entry['value'])
+                    if type(entry['value']) != type(entry['low']):
+                        middle_val = (entry['high'] - entry['low']) * 0.5 + entry['low']
+                        values[-1] = float(middle_val)
+                        min_errs.append(middle_val - entry['low'])
+                        max_errs.append(entry['high'] - middle_val)
+                    else:
+                        min_errs.append(entry['value'] - entry['low'])
+                        max_errs.append(entry['high'] - entry['value'])
                 else:
                     min_errs.append(0.0)
                     max_errs.append(0.0)

@GraemeWatt (Member) commented Feb 1, 2023

That's a strange one. The combination of a string value and numeric low and high bin limits is not allowed by the current validator. Moreover, the difference high - low is half what you'd expect from the value, e.g. the first bin has value: 2.1 +- 0.1 but low: 2.05 and high: 2.15. The record was encoded in 2010 by Mike Whalley using text files supplied by ALICE matching Table 2 of the publication, which only gives 2.1 +- 0.1 (for the first bin) and not low and high bin limits. I found Mike's original files used to prepare the submission, but the low and high bin limits are missing, so they must have been (incorrectly?) added at some later stage. I think the table should be edited to correct the bin limits and give value as a number, e.g. {high: 2.2, low: 2.0, value: 2.1}. Alternatively, the second independent variable could be moved to a dependent variable with e.g. value: 2.1 and symerror: 0.1. Any preference?

@20DM (Collaborator Author) commented Feb 1, 2023

Hi @GraemeWatt - ah, I hadn't even clocked that the uncertainty intervals don't match. Even worse - but good that you found Mike's original submission! Looking at the paper now, I would argue for the second option: the <n_ch> values have been extracted from the MC simulation as a function of the (independent) n_acc bin, so they shouldn't be considered independent.

@GraemeWatt (Member) commented:

I agree it's better as a dependent variable, so I've reformatted the table. It's an open issue to allow two dependent variables to be plotted against each other, as seems to be common in heavy-ion publications. I spotted another mistake in the qualifier {name: PT, units: GeV, value: 0.15-10.0}, but the caption for Table 2 of the publication indicates a range 0.15-4.0 GeV for the spectra binned in multiplicity, so I've corrected that too.

@20DM (Collaborator Author) commented Feb 1, 2023

Thanks @GraemeWatt - just to confirm that this fixes the conversion on my side as well. 👍

@20DM (Collaborator Author) commented Feb 2, 2023

Hi @GraemeWatt - I found one more like this, cf. Table 11 here. I believe this is the last one with a pseudo-independent variable.

@GraemeWatt (Member) commented:

That record was encoded by me in 2013, with data from Table 11 of the record taken from Table 1 of the publication. The first row of the original input file has 356.1 +- 3.6 but this was translated to {high: 359.7, low: 352.5, value: 356.1 +- 3.6} in the conversion to YAML before migrating from the old HepData site to the new hepdata.net site. At least the low and high values look correct in this case. I've edited the YAML data file to give {high: 359.7, low: 352.5, value: 356.1}. I'd prefer to keep MEAN(Npart) as an independent variable, since all the dependent variables in this table are a different observable DN/DYRAP. If there are more cases like this, you might need a patch as you proposed. By the way, I get email notifications for all comments, so you don't need to tag me explicitly in every reply.

@20DM (Collaborator Author) commented May 26, 2023

Hi Graeme,

apologies for the delay on here. It took me longer than expected, but I've got Rivet up and running with YODA2 now, and I'm currently in the process of trying to get the full suite of routines to run with the reference data coming from this draft converter. It seems to me that we can get rid of a lot of the custom Python patches that we apply as part of the HepData sync (which was the aim of the game, of course), but there are a few entries that need a bit of thought.

One example is this record from AMY where nearly all the tables have a binned differential cross-section with an additional overlapping bin at the end for the average.

The averages should probably have gone into separate tables rather than being superimposed onto the differential versions. In fact, we currently use a Python patch to remove this last bin since we don't need/want a histo bin corresponding to the average; that's of course a histogram-level quantity that would normally be calculated at the very end of an analysis run. Could we just split the averages off into a dedicated table?

@GraemeWatt (Member) commented May 26, 2023

I agree that the averages, for example, the last bin of this HEPData table, would be better given in a separate HEPData table. The encoding just reflects the presentation of the journal publication (Table 5). It was not foreseen in 1990 how the data would be used in 2023. If there are only a small number of affected HEPData records relevant for Rivet, I can upload revised versions with the averages moved to separate tables (or I can give you Uploader permissions). Otherwise, it is probably possible to write code to automatically remove the inclusive bins in the YAML to YODA conversion. For example, if there are more than two bins, find the low and high values of all bins, then if the low and high values of a single bin match the low and high values of all bins, that single bin should be removed.
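
A minimal sketch of such an on-the-fly filter (a hypothetical helper working on parallel lists of lower edges, upper edges and values, not the converter's actual code):

    def strip_inclusive_bin(lows, highs, values):
        """Drop a single bin that spans the full range covered by all bins."""
        if len(values) <= 2:
            return lows, highs, values
        full_low, full_high = min(lows), max(highs)
        keep = [not (lo == full_low and hi == full_high)
                for lo, hi in zip(lows, highs)]
        # Only act if exactly one bin covers the whole range (the "average" bin).
        if keep.count(False) != 1:
            return lows, highs, values
        return ([x for x, k in zip(lows, keep) if k],
                [x for x, k in zip(highs, keep) if k],
                [x for x, k in zip(values, keep) if k])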

@20DM (Collaborator Author) commented Jul 6, 2023

Hi Graeme - just to keep you updated: I'm just on the final stretch in terms of converting the reference data files from Y1->Y2, still ironing out a few more edge cases with small tweaks to YODA and/or the converter.

You may be pleased to hear that the vast majority of our custom Python patches in Rivet become obsolete thanks to the new types, bringing it down by an order of magnitude from O(150) to O(15) currently. 😅

Some things to flag up:

  • There are about half a dozen entries with averages (cf. my last messages), but the records differ slightly in how the averages are presented. I'm just running another tweak to see if I can catch more of those on the fly. Will get back to you on this.
  • So far I haven't addressed the cases where we have reference data in Rivet but not on HepData. I would propose we come back to those after the new YODA is released, so we can get this out. I would probably then just sit down and convert those to HepData YAML, so we can get those uploaded into HepData. For now I'd just keep the converted files in Rivet.
    • Incidentally, this includes a weird one like Tasso-194774 where HepData is missing the numerical values for T1, but we have them in Rivet. To be fixed later, unless it's obvious to you what could have happened to those values.
  • There is an error in Pluto-165122: The upper edge in the first point of T5 should be 0.1 and not 1.0. Probably best if we fix this one on HepData directly.

There are a handful more that I'm looking into.

@GraemeWatt (Member) commented Jul 6, 2023

Sounds like great progress! Brief response to your comments:

  • I prefer that averages are removed on-the-fly if possible, but we can look into revising HEPData records where this is difficult.
  • For converting YODA files to HEPData YAML, it would be nice if hepdata_lib could read in YODA files. I think we discussed this at the IPPP meeting in December 2022. I've opened "Add ability to read in YODA files" (hepdata_lib#229) and maybe you could help to address it.
  • For Table 1 of ins194774, I think it's a case where the table was left empty as a placeholder since the numbers were not given in the publication. The Rivet TASSO_1984_I194774.plot file excludes the branching ratio given in Figure 3 of the publication, but the numbers in the TASSO_1984_I194774.yoda file also differ in shape compared to Figure 3. So the relation between the YODA file and Figure 3 is not obvious to me. I can give you Uploader permissions to upload a new "version 2" of the HEPData record with Table 1 filled in if you can work out the reason for the differences.
  • I've corrected the first bin of Table 5 of ins165122. Thanks for spotting the mistake.

@20DM (Collaborator Author) commented Jul 6, 2023

Another weird one that I wanted to point out:

The lower bin edges in T12 of Tasso-266893 seem to be messed up. I think the lower bin edges from T6 may have unintentionally been copied over, while the upper edges are correct - the plot makes little sense to me otherwise. If I keep the leading 0.0 as the overall lowest bin edge and then only accept the upper bin edges, I get a reasonable binning for this distribution:

[ 0.0, 0.08, 0.165, 0.273, 0.451, 1.91 ]

I almost didn't notice this one because this is how the HepData-to-YODA converter in practice constructs the edges, and hence gets it right, but we might wanna fix the record anyway. (The paper is behind a paywall unfortunately, so can't easily double check.)
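
For illustration, that edge reconstruction amounts to trusting the overall lowest edge plus the per-bin upper edges and ignoring the stored lower edges (a sketch, not the converter's actual code):

    def edges_from_upper(lowest_edge, highs):
        """Build a contiguous edge list from the overall lowest edge and the
        per-bin upper edges, ignoring the (possibly wrong) lower edges."""
        return [lowest_edge] + list(highs)

    # T12 of Tasso-266893: keep 0.0 as the lowest edge and accept the upper edges.
    print(edges_from_upper(0.0, [0.08, 0.165, 0.273, 0.451, 1.91]))
    # -> [0.0, 0.08, 0.165, 0.273, 0.451, 1.91]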

@20DM (Collaborator Author) commented Jul 6, 2023

Just to add that with the latest logic I pushed, I don't see any issues with the various records that include the average bins along with the main distribution. The logic is a little slower CPU-wise as it checks previous indices/edges with oven gloves, but it seems to work just fine as far as I can tell. Let me know if you have any suggestions for improvements, though.

I also introduced two new options for (de-)selecting specific table identifiers. This is useful for a few records that contain huge covariance matrices that would otherwise take a long time to render, but wouldn't be used in practice. They are still being included by default, but e.g. if we know in advance that we don't want to ship them with every Rivet release, we can now disable their conversion/inclusion via the analysis info file and save ourselves some time & disk space when doing the HepData syncing in the future.
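
The exact option names aren't spelled out in this thread, so purely as an illustration (the keys 'table_matches' and 'table_unmatches' below are hypothetical stand-ins for whatever yoda_writer.py actually defines), the selection could be passed alongside the other writer options:

    import hepdata_converter

    hepdata_converter.convert(
        'yaml_inputs/HEPData-ins123456-v1-yaml',  # placeholder record
        'output.yoda',
        options={
            'input_format': 'yaml',
            'output_format': 'yoda',
            # Hypothetical keys: restrict or veto specific table identifiers.
            'table_matches': ['Table 1', 'Table 2'],
            'table_unmatches': ['Covariance matrix'],
        },
    )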

@GraemeWatt (Member) commented:

Regarding Table 12 of ins266893, the mistake looks to be in the tabulated data of the original publication:
[screenshot: tabulated data from the original publication]
I agree with your interpretation that the upper bin limits are correct and the lower bin limits have erroneously been copied from the lower-energy measurement. I have made your suggested correction, which is consistent with Fig. 4c of the publication:
[screenshot: Fig. 4c of the publication]

@20DM (Collaborator Author) commented Jul 7, 2023

Could we please remove the hyperlinks from the error labels in CDF-552797? I can try and escape the extra quotes in the converter, but I don't think the error labels should allow for hyperlinks to be submitted in the first place... This feels like a security issue - I'm assuming the uploader has some sanity checks to prevent people from uploading dodgy strings with code in it?

@GraemeWatt (Member) commented:

I've shortened the error label from sys,Overall normalization error//A <A href=/spires/hepdata/misc/4570/sys.txt target="_blank"> full discussion of the other systematic errors </a> is reported in the paper and summarisedhere to just sys,Overall normalization error in Table 1 of ins552797. The linked file refers to the old HepData filesystem and I no longer have access to it. HTML tags are escaped in HEPData error labels, i.e. the hyperlink was not functional on the HEPData record page, so there is no security issue. We do permit a subset of HTML tags in the comment and description fields (see table in submission Tips). Tags not in this list (like <script>) will be escaped. The previous error label (with HTML markup) doesn't seem to cause any problems with the current YODA export, i.e. it appears in the ErrorBreakdown. In the new YODA export, any valid string (including special characters, quote marks, etc.) should be supported as an error label.

@20DM (Collaborator Author) commented Sep 14, 2023

Hi Graeme - the error breakdowns in the three tables of ins1294140 appear to have been merged with the central value...? Can we fix this one?

@20DM (Collaborator Author) commented Sep 27, 2023

Hi Graeme,

just to update you on the status:

I managed to work around the remaining issues with Python patches (will continue with the syncing after the releases are out) and got a Rivet branch with all the reference data converted to the new YODA2 formats using the HepData converter from this pull request.

I copied the original converter into a "YODA1" converter and for the time being we'll provide a YODA1-style writer in the new YODA as well. This legacy writer is marked as deprecated and the plan would be to remove it eventually (e.g. in a few years) once users have had a chance to migrate their setups to the new Rivet+YODA versions.

There's a couple of things I still wanna look into, but I reckon we'll be ready to make an alpha release for YODA2 in the coming days - of course, it would be great if we could make sure that everything is working on the HepData side once that alpha is out / before the actual production release.

I'll undraft this pull request now -- feel free to review whenever is good for you.

Cheers,
Chris

@20DM 20DM marked this pull request as ready for review September 27, 2023 12:48
@20DM 20DM changed the title Draft: Yoda2 writer YODA2 writer Sep 27, 2023
@GraemeWatt (Member) commented:

Thanks Chris! I'll look into this soon. Just one comment for now. Could you please update the testsuite to work with the renamed yoda1_writer.py file and the new yoda_writer.py file? Please check that you can run the tests locally as described in the docs. For the CI to work, I'll need to prepare a new Docker image (see Dockerfile), which will be easier after the first alpha release of YODA2. The CI calculates the Python test coverage and I don't tend to merge PRs where this is reduced. So adding new code generally requires adding new tests.

@20DM (Collaborator Author) commented Sep 27, 2023

Tests re-enabled now!

...
----------------------------------------------------------------------
Ran 35 tests in 17.636s

OK

* Upgrade actions in ci.yml and use Python 3.10 (not 3.8).
* Use hepdata/hepdata-converter:latest Docker image (until new tag).
* Python requirements now need to be installed as not in Docker image.
* Renamed 'master' branch to 'main' as with other HEPData repositories.
* Update 'language' and 'intersphinx_mapping' for newer Sphinx versions.
* Replace virtualenvwrapper commands by standard Python 'venv' module.
* Update Read the Docs configuration file '.readthedocs.yml'.
* Also add 'yoda1' to the output formats in 'convert' docstring.
@GraemeWatt (Member) commented:

Thanks for adding the tests! I prepared a Docker image with YODA v2.0.0alpha installed. If you want to upgrade the YODA version in future, you could submit a PR updating the Dockerfile in the HEPData/hepdata-converter-docker repository. I made some unrelated improvements to the CI, documentation and setup of your branch, since the hepdata-converter repository has been neglected in the last three years. I also renamed the master branch to main as for other HEPData repositories. You can see the links to the output of the CI, Coveralls and Read the Docs build above. The Python test coverage decreased from 91.194% to 90.972% in going from main to your branch. While this is only a small amount, it might be good to add more tests for the new code. Could you please look at the coverage for the new yoda_writer.py and see if you could write tests for the new lines not already in the original yoda_writer.py, e.g. for the new _pattern_check function? After that, I'd be happy to merge this PR.

Could you please also clarify the timescale for when you want the new hepdata-converter to be deployed in production on hepdata.net, e.g. should we wait for the official YODA v2.0.0 release? Did we agree that the HEPData web app should have options to download both "yoda" (i.e. YODA2) and "yoda1" (original YODA) formats? If so, it will take some time to modify the HEPData web app to allow both download options. Allowing only one YODA download format in the web app would be simpler. However, it might be confusing for users if the "yoda" option suddenly starts giving the new YODA2 format that they weren't expecting, and they don't have an option to download the original "yoda1" format.

@20DM (Collaborator Author) commented Oct 18, 2023

Hi Graeme,

thanks for the feedback! I'll look into additional test coverage in the next few days.

I agree regarding the download options. I think realistically we'll need the option to download both the new and the old version, since it will take some time for people to migrate, and ideally we don't want to hinder them from submitting routines in the old format for the time being. Would you have an estimate for how long introducing the additional download option might take? For the proper YODA 2.0.0 release, we were thinking that we should probably wait for HepData to support the new format, as otherwise there isn't too much motivation for people to make the switch.

@20DM (Collaborator Author) commented Oct 24, 2023

Added a few more unit tests to cover the pattern (un-)matching, as well as an example that includes a 0-dimensional object (i.e. one that has a dependent axis but no independent axis, such as an integrated cross-section). Should hopefully increase the coverage to an acceptable level.
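
For context, a 0-dimensional table of that kind has a dependent variable but an empty list of independent variables; sketched here as the equivalent Python structure (the header name and numbers are illustrative only):

    import yaml

    table = {
        'independent_variables': [],
        'dependent_variables': [{
            'header': {'name': 'SIG(total)', 'units': 'PB'},
            'values': [{'value': 42.0,
                        'errors': [{'symerror': 1.5, 'label': 'stat'}]}],
        }],
    }
    print(yaml.safe_dump(table, sort_keys=False))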

* Add "html_title" to avoid warning when building epub format:
  WARNING: conf value "epub_title" (or "html_title") should not be
  empty for EPUB3
* Explain how to build docs locally using Sphinx in installation.rst.
* Add "SPHINXOPTS = -W --keep-going" to Makefile for Sphinx builds.
@GraemeWatt GraemeWatt merged commit 4ba4845 into HEPData:main Nov 3, 2023
4 checks passed