Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add new "fields" key to RecordForm #1773

Merged
merged 8 commits into from
Oct 6, 2022

Conversation

agoose77
Copy link
Collaborator

@agoose77 agoose77 commented Oct 6, 2022

This PR fixes #1766 by introducing a new RecordForm key: fields:

  • If present, the fields key indicates that Awkward should read the RecordForm as a new-style (non-legacy) Form.
  • If users mix form styles, Awkward will raise a ValueError.

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 6, 2022

This PR will prevent v2 record (not tuple) forms from being read properly by v1. This is not ideal, but I think we do need to fix this. If we were keen on being able to read v2 forms in v1, then we could hold off on changing form serialisation until a later PR; allowing v2 to read this new schema so that in future we can switch painlessly. However, I don't think we are as worried about v2v1 as the other way around.

Additionally, we don't have to require that the entire-form is a single style. I just think that seems like a good way to catch accidental uses of this (safety).

we want to move away from legacy-style `RecordArray` forms, so old tests must be changed. There is now a test to ensure that old-style forms are compatible with new-style forms.
@agoose77 agoose77 force-pushed the agoose77/form-order-dependent branch from 02d4b17 to 9f09ea7 Compare October 6, 2022 09:22
@codecov
Copy link

codecov bot commented Oct 6, 2022

Codecov Report

Merging #1773 (02be6b4) into main (df377bd) will increase coverage by 0.00%.
The diff coverage is 38.70%.

Additional details and impacted files
Impacted Files Coverage Δ
src/awkward/forms/form.py 85.39% <32.14%> (+0.46%) ⬆️
src/awkward/forms/recordform.py 88.57% <100.00%> (+0.08%) ⬆️
src/awkward/contents/indexedarray.py 79.76% <0.00%> (-0.19%) ⬇️
src/awkward/contents/recordarray.py 83.72% <0.00%> (-0.18%) ⬇️
src/awkward/_util.py 81.95% <0.00%> (-0.18%) ⬇️
src/awkward/contents/indexedoptionarray.py 88.75% <0.00%> (-0.16%) ⬇️
src/awkward/_connect/avro.py 87.17% <0.00%> (-0.14%) ⬇️
src/awkward/types/_awkward_datashape_parser.py 47.72% <0.00%> (-0.01%) ⬇️
src/awkward/contents/content.py 77.17% <0.00%> (+0.10%) ⬆️
src/awkward/highlevel.py 75.95% <0.00%> (+0.16%) ⬆️

Copy link
Member

@jpivarski jpivarski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just one requested change (described below): we don't need to raise an error if old-style and new-style RecordArray contents are mixed.

src/awkward/forms/form.py Outdated Show resolved Hide resolved
tests/test_1766-record-form-fields.py Outdated Show resolved Hide resolved
@jpivarski
Copy link
Member

Overall, inspired by the point @henryiii made on this morning's call, this change in the Form-JSON format could be accompanied by the writer's Awkward version number:

"version": ak.__version__

or a protocol compatibility number, which is more granular than Awkward versions (like the ABI versions in pybind11, or Python's pickle protocol numbers):

"format-compatibility": "v2"

(starting with "v2" because this is Awkward v2). Between these two, I prefer the latter because

  • this Form-JSON format should evolve very slowly, if at all
  • non-Awkward writers of Form-JSON need to know what to put there.

@henryiii mentioned the JSON format for hist, but while hist gets new features all the time, Forms intentionally don't. The second point about what non-Awkward writers should do comes up because Uproot is similarly at a loss to say what "version of ROOT" it is when it writes a file. It's semantically off-kilter.

It's hard to make an argument that a format is going to be slowly changing, if at all, at a moment in time when you're changing it. But here's the big picture:

  • June 2018 ‒ April 2020: Awkward v0 had very active JSON metadata, with a version number. It was active in the sense that it could run arbitrary Python code: the JSON was one-to-one with LISP S-expressions that picked out functions in the codebase and ran them, so that improvements could be made in metadata that would work in old versions of Awkward. Then I started worrying about metadata, added a whitelist, and it was pretty bad.
  • April 2020 ‒ June 2021: the Form-JSON format created for Awkward v1 was first implemented (in C++) in 1c18425, and since it matches the structure of the Forms themselves (set of constructor arguments/properties defining the "dataclass"), which also hasn't changed, it's been stable. Replacing ak.from_arrayset/ak.to_arrayset with ak.from_buffers/ak.to_buffers (Replace to_arrayset/from_arrayset with to_buffers/from_buffers and deprecate the original. #592, introduced in version 1.0.1, old was removed in version 1.1.0) was a change in how pickling works because the length of the array is now recognized as necessary information to reconstruct an array, but that didn't affect the Form-JSON.
  • June 2021 ‒ now: Awkward v2 ignores the field "has_identities" and looks for an optional field "has_identifier" because the name has changed (C++ refactoring: Type and Form classes #914). Identities/Identifier hasn't actually been implemented, so if a Form has this, it raises a NotImplementedError.
  • Now, October 2022, you're adding a new way to write RecordForm as JSON, but we can still read the old way.
  • I'm thinking of finally giving up on ever implementing Identifiers, and then "has_identifier" would become an ignored field.

As you can see, after v0, this Form-JSON format has not evolved very much and it's not expected to evolve in the future. We have complete forward and backward compatibility from April 2020 until now (breaking the ability of v1 to read v2 RecordForms), because the other changes were for optional fields or peripheral, not directly affecting the JSON-Form format.

An optional

"format-compatibility": "v2"

could be prudent, in a preparing-for-unseen-disasters sense, not an expectation of ever having to use it. However, Form-JSON doesn't have a top-level; it's self-similar. If we were to add this, it would have to be added in every node.

Given that needing this is such a long shot and we do have format compatibility from v1 to v2, I would vote against adding this format-compatibility marker.

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 6, 2022

@jpivarski you've written a good summary of the version discussion and its pros and cons. Another option is to go for a monotonic integer that is deliberately meaningless - it just carries enough information for us to define epochs in the metadata space. As you say, we really aren't likely to change our metadata in a backwards incompatible (can't read old files) way, and although this is a forward incompatible change, it's the only one in some time to resolve a longstanding bug. Formats like protobuf have schema versions because they do change all the time.

@agoose77 agoose77 enabled auto-merge (squash) October 6, 2022 17:40
@agoose77 agoose77 disabled auto-merge October 6, 2022 17:40
@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 6, 2022

@jpivarski if you wouldn't mind checking over these last commits I'd be much obliged. If they're good, you're welcome to merge!

if isinstance(input["contents"], dict):
# New serialisation
if "fields" in input:
if isinstance(input["contents"], Mapping):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note: since this is coming from json.loads or equivalent, we can make a hard assumption that JSON objects are dicts. Being more general doesn't hurt, but it's why the rest of this function is checking isinstance(_, list) and isinstance(_, dict) everywhere.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This warning is different to the rest of the function body, which (as you point out) is much stricter with types.

In reality, this is a high level functions that users can call. My temptation is to relax the constraints from dict to Mapping, and list to sized_iterable. How do you feel about that?

@jpivarski
Copy link
Member

Yes, this is good and I'll merge it now.

Regarding the format-compatibility marker, I'd rather not include it for the reasons given, and if there's no objection, then there's no action to take.

@jpivarski jpivarski merged commit 3f7a8da into main Oct 6, 2022
@jpivarski jpivarski deleted the agoose77/form-order-dependent branch October 6, 2022 18:22
@henryiii
Copy link
Member

henryiii commented Oct 6, 2022

Quick, not looking at this in detail, comment:

Another option is to go for a monotonic integer that is deliberately meaningless - it just carries enough information for us to define epochs in the metadata space.

This is what boost-histogram does. The first item in the tuple is a monotonic number and we increase it by one if we add a field to the pickle, and the __setstate__ checks that number and knows how to read all the past versions of pickles. The original "0" version of pickles was a dict, so if it's a dict, we know it's the old style.

https://github.com/scikit-hep/boost-histogram/blob/00105a785f0a27c3718539078f9db35c92537e54/src/boost_histogram/_internal/hist.py#L612-L645

@jpivarski
Copy link
Member

The "format-compatibility" string and monotonic integer both decouple the format version from the code version, which I think is good. (I see that pybind11's ABI version and pickle's protocol version both do this.)

To fill you in, @henryiii, on what went wrong with our pickle compatibility: it had nothing to do with the format changing, since the format didn't change, so a version marker wouldn't have helped. The problem was that the state tuple contained a Python class:

def __getstate__(self):
packed = ak.operations.structure.packed(self.layout, highlevel=False)
form, length, container = ak.operations.convert.to_buffers(packed)
if self._behavior is ak.behavior:
behavior = None
else:
behavior = self._behavior
return form, length, container, behavior

where form is an instance of ak._ext.forms.Form, which no longer exists. When Python unpickles, it tries to create instances of the classes (compiled extension classes, actually), and that's where it fails.

What I wanted and intended to do, way back in April 2020, was to use form.to_json() there (or at least form.to_dict()), and convert back in __setstate__. That way, any pickles made from any version of 1.x would be readable in 2.x because the pickle state is made entirely out of Python builtins. But I somehow goofed that up and forgot.

We could release 1.x patch that fixes this, but that only fixes pickles made with the patch release, not all the ones made since the format was introduced, which is the point of long-timescale compatibility.

We could also introduce dummy classes with the right names for Python to fill, then immediately convert them over to our actual classes, but this is not maintainer-friendly. I think it's a better overall decision to give up on pickle compatibility, rather than do that.

It's frustrating because just one line of code and a time machine are needed to fix it. That's all!

@agoose77
Copy link
Collaborator Author

agoose77 commented Oct 6, 2022

We could also introduce dummy classes with the right names for Python to fill, then immediately convert them over to our actual classes, but this is not maintainer-friendly. I think it's a better overall decision to give up on pickle compatibility, rather than do that.

Yes, we could introduce shim types that just define __setstate__ according to main-v1. It would actually not be too much code, but I'll let you be the guide on whether any code is too much for something people might not need.

@jpivarski
Copy link
Member

We can revisit it if somebody complains. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Form is not guaranteed to preserve record field ordering under serialisation
3 participants