Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace to_arrayset/from_arrayset with to_buffers/from_buffers and deprecate the original. #592

Merged
merged 8 commits into from
Dec 11, 2020

Conversation

jpivarski
Copy link
Member

@jpivarski jpivarski commented Dec 11, 2020

  • Write ak.to_buffers
  • Get ak.to_arrayset to pass through ak.to_buffers and all tests to work again
  • Write ak.from_buffers
  • Get ak.from_arrayset to pass through ak.from_buffers and get all tests to work again
  • Set up all deprecation messages
  • Test the documentation examples
  • Convert the tests from arrayset functions to buffers functions
  • Convert pickle-handling from arrayset functions to buffers functions
  • Check Parquet-handling functions; does anything need to be done here? Uniformity of arguments?
  • Update the docs-src Jupytext notebook examples
  • Ensure that there are no more warnings

@jpivarski
Copy link
Member Author

Why replace arrayset with buffers? There are some special case arrays that could not be preserved without explicit knowledge of their length, so length is now a necessary part of the interface (not just for lazy-reading).

The case I noticed was regular, zero-sized dimensions (either as RegularArrays or as the shape of a NumpyArray):

>>> ak.from_buffers(*ak.to_buffers(ak.Array(np.empty((5, 0, 3)))))
<Array [[], [], [], [], []] type='5 * 0 * 3 * float64'>

>>> ak.from_buffers(*ak.to_buffers(ak.from_numpy(np.empty((5, 0, 3)), regulararray=True)))
<Array [[], [], [], [], []] type='5 * 0 * 3 * float64'>

But I had previously overlooked RecordArrays with no fields (or equivalently, zero-field tuples):

>>> ak.from_buffers(*ak.to_buffers(ak.Array([{}, {}, {}, {}, {}])))
<Array [{}, {}, {}, {}, {}] type='5 * {}'>
>>> ak.from_buffers(*ak.to_buffers(ak.Array([(), (), (), (), ()])))
<Array [(), (), (), (), ()] type='5 * ()'>

The old arrayset interface, without an explicit length, either raises exceptions or gets this wrong. It's an exception for NumpyArrays:

>>> ak.from_arrayset(*ak.to_arrayset(ak.Array(np.empty((5, 0, 3)))))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jpivarski/irishep/awkward-1.0/awkward/operations/convert.py", line 4171, in from_arrayset
    return from_buffers(
  File "/home/jpivarski/irishep/awkward-1.0/awkward/operations/convert.py", line 3727, in from_buffers
    out = _form_to_layout(*(args + (None, None)))
  File "/home/jpivarski/irishep/awkward-1.0/awkward/operations/convert.py", line 3417, in _form_to_layout
    array = raw_array.view(dtype).reshape(shape)
ValueError: cannot reshape array of size 0 into shape (0,3)

the wrong answer for RegularArrays (this is supposed to have type "5 * 0 * 3 * float64"):

>>> ak.from_arrayset(*ak.to_arrayset(ak.from_numpy(np.empty((5, 0, 3)), regulararray=True)))
<Array [] type='0 * 0 * 3 * float64'>

and RecordArrays with no fields lose their lengths:

>>> ak.from_arrayset(*ak.to_arrayset(ak.Array([{}, {}, {}, {}, {}])))
<Array [] type='0 * {}'>
>>> ak.from_arrayset(*ak.to_arrayset(ak.Array([(), (), (), (), ()])))
<Array [] type='0 * ()'>

On top of that, the system for setting the format of key names in arrayset was convoluted—several arguments that, when concatenated with or without separators, produced the key—and the buffers interface just has a format string to fill (that can be a function returning a string, which has access to the layout node for super-customization).

Since we're living in 1.x times, ak.to_arrayset/ak.from_arrayset have a formal deprecation process. In 1.0.1rc2 and 1.0.1, they'll start showing warnings that point to ak.to_buffers/ak.from_buffers. The arrayset functions will be removed in the first monthly minor release, 1.1.0 on February 1, 2021. (This one has a short fuse because it's not widely used.)

Data pickled with old releases (which goes through arrayset) will be supported forever, and they're already being processed by ak.from_buffers with the appropriate arguments. If any of those old pickle files fall into the above corner-cases, they'll be misreconstructed just as of old (because the length was not saved), but newly pickled data will be correct for all known cases.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant