-
Notifications
You must be signed in to change notification settings - Fork 86
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add new "fields"
key to RecordForm
#1773
Conversation
This PR will prevent v2 record (not tuple) forms from being read properly by v1. This is not ideal, but I think we do need to fix this. If we were keen on being able to read v2 forms in v1, then we could hold off on changing form serialisation until a later PR; allowing v2 to read this new schema so that in future we can switch painlessly. However, I don't think we are as worried about Additionally, we don't have to require that the entire-form is a single style. I just think that seems like a good way to catch accidental uses of this (safety). |
we want to move away from legacy-style `RecordArray` forms, so old tests must be changed. There is now a test to ensure that old-style forms are compatible with new-style forms.
02d4b17
to
9f09ea7
Compare
Codecov Report
Additional details and impacted files
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just one requested change (described below): we don't need to raise an error if old-style and new-style RecordArray contents
are mixed.
Overall, inspired by the point @henryiii made on this morning's call, this change in the Form-JSON format could be accompanied by the writer's Awkward version number: "version": ak.__version__ or a protocol compatibility number, which is more granular than Awkward versions (like the ABI versions in pybind11, or Python's pickle protocol numbers): "format-compatibility": "v2" (starting with
@henryiii mentioned the JSON format for hist, but while hist gets new features all the time, Forms intentionally don't. The second point about what non-Awkward writers should do comes up because Uproot is similarly at a loss to say what "version of ROOT" it is when it writes a file. It's semantically off-kilter. It's hard to make an argument that a format is going to be slowly changing, if at all, at a moment in time when you're changing it. But here's the big picture:
As you can see, after v0, this Form-JSON format has not evolved very much and it's not expected to evolve in the future. We have complete forward and backward compatibility from April 2020 until now (breaking the ability of v1 to read v2 RecordForms), because the other changes were for optional fields or peripheral, not directly affecting the JSON-Form format. An optional "format-compatibility": "v2" could be prudent, in a preparing-for-unseen-disasters sense, not an expectation of ever having to use it. However, Form-JSON doesn't have a top-level; it's self-similar. If we were to add this, it would have to be added in every node. Given that needing this is such a long shot and we do have format compatibility from v1 to v2, I would vote against adding this format-compatibility marker. |
@jpivarski you've written a good summary of the version discussion and its pros and cons. Another option is to go for a monotonic integer that is deliberately meaningless - it just carries enough information for us to define epochs in the metadata space. As you say, we really aren't likely to change our metadata in a backwards incompatible (can't read old files) way, and although this is a forward incompatible change, it's the only one in some time to resolve a longstanding bug. Formats like protobuf have schema versions because they do change all the time. |
@jpivarski if you wouldn't mind checking over these last commits I'd be much obliged. If they're good, you're welcome to merge! |
if isinstance(input["contents"], dict): | ||
# New serialisation | ||
if "fields" in input: | ||
if isinstance(input["contents"], Mapping): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: since this is coming from json.loads
or equivalent, we can make a hard assumption that JSON objects are dicts. Being more general doesn't hurt, but it's why the rest of this function is checking isinstance(_, list)
and isinstance(_, dict)
everywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This warning is different to the rest of the function body, which (as you point out) is much stricter with types.
In reality, this is a high level functions that users can call. My temptation is to relax the constraints from dict
to Mapping
, and list
to sized_iterable
. How do you feel about that?
Yes, this is good and I'll merge it now. Regarding the |
Quick, not looking at this in detail, comment:
This is what boost-histogram does. The first item in the tuple is a monotonic number and we increase it by one if we add a field to the pickle, and the |
The "format-compatibility" string and monotonic integer both decouple the format version from the code version, which I think is good. (I see that pybind11's ABI version and pickle's protocol version both do this.) To fill you in, @henryiii, on what went wrong with our pickle compatibility: it had nothing to do with the format changing, since the format didn't change, so a version marker wouldn't have helped. The problem was that the state tuple contained a Python class: awkward/src/awkward/highlevel.py Lines 1455 to 1462 in f2092ee
where What I wanted and intended to do, way back in April 2020, was to use We could release 1.x patch that fixes this, but that only fixes pickles made with the patch release, not all the ones made since the format was introduced, which is the point of long-timescale compatibility. We could also introduce dummy classes with the right names for Python to fill, then immediately convert them over to our actual classes, but this is not maintainer-friendly. I think it's a better overall decision to give up on pickle compatibility, rather than do that. It's frustrating because just one line of code and a time machine are needed to fix it. That's all! |
Yes, we could introduce shim types that just define |
We can revisit it if somebody complains. |
This PR fixes #1766 by introducing a new
RecordForm
key:fields
:fields
key indicates that Awkward should read theRecordForm
as a new-style (non-legacy)Form
.ValueError
.