
Revamping the to_json/from_json interface. #1449

Merged: 13 commits merged into main on May 3, 2022

Conversation

jpivarski
Member

@jpivarski jpivarski commented Apr 27, 2022

Reading and writing are not symmetric. to_json goes through Python's json module because Arrays/Records can have lots of overloaded behaviors and users will expect it to behave like to_list. (Fortunately, to_list has recently been optimized.) from_json will go through RapidJSON because it's faster and can handle streaming data, for file-like objects representing remote files. This is possible because the input is just text, not Python objects with possible behaviors.

Thus, reading may be much faster than writing, but there are good reasons for that. The arguments for working around non-JSON serializable data should match in both directions, however, and permit round-trips, especially when a JSONSchema is provided.

  • Consolidate ak._v2.to_json and ak._v2.to_json_file into a single function that can:
    • send data to a string or a file (including fsspec if it has a URI scheme)
    • be line-delimited or not
    • handle Arrays and Records
    • convert non-JSON serializable data with composable options (e.g. both complex and NaN/infinite)
    • go through the same code as ak._v2.to_list (so more maintainable)
  • Add ak._v2.to_json_schema to make a JSONSchema from a type or an array's type. (Completely new function.)
  • Temporarily turn off ak._v2.from_json and ak._v2.from_json_file.
  • (Re)-implement a reader from RapidJSON that can:
    • take a string or a Python object with a read(num_bytes) method
    • be line-delimited or not
    • handle Arrays and Records
    • apply the non-JSON deserialization options in a way that mirrors the above (i.e. allow composable round-trips) as a post-processing step (so that the same transformation can be applied to data read with a JSONSchema)
  • Update the JSONSchema-enabled reader to be usable in the same way, namely
    • take a string or a Python object with a read(num_bytes) method
    • be line-delimited or not
    • handle Arrays and Records
  • Reenable ak._v2.from_json as a single function that
    • takes data as a string, a pathlib.Path filename, or a file-like object (URIs are passed through fsspec)
    • has the same line_delimited option as ak._v2.to_json
    • has a schema option, which is passed through json.loads if it's a string
    • and all the non-JSON deserialization workaround arguments
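As a rough illustration of the composable round-trip these options aim for, here is a plain-Python sketch using only the standard json module. The helper names and the chosen stand-in encodings are hypothetical for illustration, not the PR's actual API:

```python
import json
import math

def to_jsonable(obj):
    """Recursively replace non-JSON values with JSON-friendly stand-ins.

    Hypothetical encodings: complex -> {"real": ..., "imag": ...} record,
    NaN/infinity -> sentinel strings.  The point is that each workaround
    is an independent, composable transformation.
    """
    if isinstance(obj, complex):
        return {"real": obj.real, "imag": obj.imag}
    if isinstance(obj, float) and math.isnan(obj):
        return "nan"
    if isinstance(obj, float) and math.isinf(obj):
        return "inf" if obj > 0 else "-inf"
    if isinstance(obj, list):
        return [to_jsonable(x) for x in obj]
    return obj

def from_jsonable(obj):
    """Inverse post-processing step, applied after deserialization, so
    the two directions compose into a round-trip."""
    if isinstance(obj, dict) and set(obj) == {"real", "imag"}:
        return complex(obj["real"], obj["imag"])
    if obj == "nan":
        return float("nan")
    if obj == "inf":
        return float("inf")
    if obj == "-inf":
        return float("-inf")
    if isinstance(obj, list):
        return [from_jsonable(x) for x in obj]
    if isinstance(obj, dict):
        return {k: from_jsonable(v) for k, v in obj.items()}
    return obj

data = [1.0, float("nan"), complex(3, 4), [float("inf")]]
text = json.dumps(to_jsonable(data), allow_nan=False)  # strictly valid JSON
round_tripped = from_jsonable(json.loads(text))
```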

@codecov

codecov bot commented Apr 27, 2022

Codecov Report

Merging #1449 (2f86ae5) into main (edfce38) will increase coverage by 0.42%.
The diff coverage is 54.62%.

| Impacted Files | Coverage Δ |
| --- | --- |
| src/awkward/_v2/_connect/cuda/__init__.py | 0.00% <0.00%> (ø) |
| src/awkward/_v2/_connect/jax/__init__.py | 0.00% <0.00%> (ø) |
| ...c/awkward/_v2/_connect/rdataframe/to_rdataframe.py | 0.00% <0.00%> (ø) |
| src/awkward/_v2/operations/convert/ak_from_cupy.py | 50.00% <0.00%> (+23.33%) ⬆️ |
| src/awkward/_v2/operations/convert/ak_from_iter.py | 93.75% <ø> (ø) |
| src/awkward/_v2/operations/convert/ak_from_jax.py | 50.00% <0.00%> (-25.00%) ⬇️ |
| ...wkward/_v2/operations/convert/ak_from_json_file.py | 76.59% <ø> (ø) |
| ...awkward/_v2/operations/convert/ak_from_json_new.py | 0.00% <0.00%> (ø) |
| src/awkward/_v2/operations/convert/ak_to_cupy.py | 33.33% <0.00%> (+23.95%) ⬆️ |
| src/awkward/_v2/operations/convert/ak_to_jax.py | 33.33% <0.00%> (-41.67%) ⬇️ |

... and 108 more

@agoose77
Collaborator

@jpivarski just dropping in after reading the main description.

to_json_schema is a fantastic idea for an addition!

Is there much precedent / use case for customising to_list / to_json via behaviors? I assume so, if it's important enough to handle during writing, but I'm used to dealing with (non-Awkward) serialisation cases where a symmetric to/from pre/post-processor is required, rather than a one-way transformation.

Also, is there a particular reason to choose RapidJSON? I recall seeing a number of benchmarks suggesting orjson was considerably faster for both serialization and deserialization - here are their own numbers, for whatever they're worth: https://github.com/ijl/orjson#performance
FWIW I haven't benchmarked how much of the from_json process is consumed by JSON parsing vs Awkward Array building, so performance here might not be the most important factor in choosing a JSON library.

@jpivarski
Member Author

One important behavioral override is __array__: "string":

>>> array = ak._v2.Array(["one", "two", "three", "four", "five"])
>>> array.layout
<ListOffsetArray len='5'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='6'>[ 0  3  6 11 15 19]</Index></offsets>
    <content><NumpyArray dtype='uint8' len='19'>
        <parameter name='__array__'>'char'</parameter>
        [111 110 101 116 119 111 116 104 114 101 101 102 111 117 114 102 105
         118 101]
    </NumpyArray></content>
</ListOffsetArray>
>>> ak._v2.to_json(array)
'["one","two","three","four","five"]'

However, that one was so important that it now has a special implementation path: you can't opt out of __array__: "string" being JSON-encoded as strings anymore. But in principle, it's a behavioral override, with an __array__ parameter name and an ak.Array subclass. It would be strange if other behavioral overrides were skipped.

Also worth mentioning: the to_list implementation has changed in such a way that checking the __array__ parameter for overrides only happens once, not with every array element. So if you don't have any overrides, it's not an O(n) performance cost.

There probably ought to be a way to pass in replacement callables for json.dump and json.dumps, though not orjson because it doesn't provide a json.dump function. (I suppose one could be wrapped.) Whether one needs a faster JSON serialization library (and therefore the dependency) would depend on whether it's the bottleneck. The new to_list works by creating a whole content list before dividing it up into sublists, instead of iterating over the data (because __getitem__ is very slow for Awkward Arrays: you have to check a lot of things with every invocation that only need to be checked once if you do a tree-node/array at a time). Skipping the Python list intermediate would introduce a much larger bottleneck, and yet making the Python lists might dominate over any JSON serialization anyway. It's not clear that much can be done here, except maybe some special cases.
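To make the idea of replacement callables concrete, here is a minimal sketch of how a dumps-only serializer (such as orjson, which provides no json.dump equivalent) could be wrapped to supply one. The wrapper name is made up for illustration:

```python
import io
import json

def wrap_dumps_as_dump(dumps):
    """Given a dumps-style callable (object -> str or bytes), return a
    dump-style callable (object, file) that writes the serialized text
    to the file.  This is the kind of shim that would let a dumps-only
    library stand in wherever a json.dump replacement is expected."""
    def dump(obj, fp):
        text = dumps(obj)
        if isinstance(text, bytes):  # e.g. orjson.dumps returns bytes
            text = text.decode("utf-8")
        fp.write(text)
    return dump

# Demonstrate with the stdlib's own dumps as the wrapped serializer.
dump = wrap_dumps_as_dump(json.dumps)
buffer = io.StringIO()
dump([1, 2, 3], buffer)
```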

The orjson documentation reminded me that we need to do something about serializing dates. That's a good to-do.

The other direction, reading and deserializing JSON, is a completely different story. (One of my favorite realizations is that there's so much asymmetry between reading and writing!) All of the JSON workarounds—nan, infinity, complex numbers, raw bytestrings, and now dates/time differences—can be applied to the Awkward Array after deserialization, so they don't need to slow down the deserialization process. They're also optional, whereas for serialization, something has to be done or the data won't fit the format. We'll have to use ArrayBuilder, but it can be ArrayBuilder in C++ because we don't have to think about behavioral overrides.
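A small stdlib-only sketch of that post-processing idea: parse first with the fast path, then optionally walk the result to restore non-JSON values (here, ISO-8601 date strings, one of the workarounds mentioned above). The helper name is hypothetical:

```python
import json
from datetime import datetime

def restore_datetimes(obj):
    """Optional post-processing pass: after the (fast) parse, walk the
    result and turn ISO-8601 strings back into datetime objects.  The
    parser itself stays oblivious to the workaround, mirroring how this
    PR applies non-JSON deserialization options after parsing."""
    if isinstance(obj, str):
        try:
            return datetime.fromisoformat(obj)
        except ValueError:
            return obj
    if isinstance(obj, list):
        return [restore_datetimes(x) for x in obj]
    if isinstance(obj, dict):
        return {k: restore_datetimes(v) for k, v in obj.items()}
    return obj

parsed = json.loads('["2022-04-27T00:00:00", "not a date"]')
restored = restore_datetimes(parsed)
```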

I don't know how the orjson plots compare RapidJSON, a C++ library, with libraries that produce Python data. We won't be using it that way: the strings will go directly into ArrayBuilder, without touching Python (and therefore, we'll release the GIL, too). This is also true of the case that's guided by a JSONSchema, which also uses RapidJSON but skips ArrayBuilder's type discovery for some extra speed.

At one point, I did a comparison of C++ JSON libraries. The one that I started with and expected to use was simdjson, but I immediately ran into portability problems because of its use of SIMD instructions. It turned out that RapidJSON was within a factor of 2 or so, but was very portable, owing to its age/maturity. And then the ArrayBuilder overhead dominated over the actual JSON parsing, so even that didn't matter. In the performance studies that skip ArrayBuilder with a JSONSchema (#1165 (comment)), RapidJSON is still not the bottleneck. We can't do the other work fast enough to need a faster parser than RapidJSON.

Also, RapidJSON supports incremental reading, which is important for very large datasets. A JSON document of floating point values with ~16 digits uses twice as much data as the corresponding Awkward Array that we're reading it into.
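For a feel of what incremental reading buys, here is a pure-Python stand-in for a streaming line-delimited reader: it only needs an object with a read(num_bytes) method and never holds the whole document in memory. (RapidJSON does the real version of this in C++; the function here is illustrative only.)

```python
import io
import json

def iter_line_delimited(source, chunk_size=64 * 1024):
    """Incrementally parse line-delimited JSON from any object with a
    read(num_bytes) method, keeping at most one chunk plus one partial
    line in memory at a time."""
    buffer = ""
    while True:
        chunk = source.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Everything before the last newline is complete; keep the rest.
        *lines, buffer = buffer.split("\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():
        yield json.loads(buffer)

source = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')
# A tiny chunk_size forces records to straddle chunk boundaries.
records = list(iter_line_delimited(source, chunk_size=8))
```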

@jpivarski
Member Author

I suppose it might be possible, if there are no behavioral overrides other than strings (a common case), to do a to_buffers and pass those buffers into a C++ function that reimplements enough of __getitem__ to do a serialization in C++. That's a possible future enhancement. In v1, the __getitem__ logic was in C++, so it made sense to do that; now it isn't.

For that matter, Array.__iter__ is painfully slow, and the same C++-side __getitem__ could be used to generate Python objects from a pybind11-wrapped iterator that maintains state (the buffers from to_buffers), if yielding from pybind11 is fast enough to be worthwhile. nanobind might make the difference between whether that's worthwhile or not. (Our use of pybind11 is probably simple enough to switch to nanobind after the v1 code is dropped in December.)

@agoose77
Collaborator

It sounds like you've given it a lot of thought, and there's a lot to reply to! In the interim, I can comment on nanobind - I used it temporarily for some fairly simple bindings and found it really nice to use. I was surprised that we were considering it, because when I looked ~ 1mo ago, it didn't support NumPy, but it seems like that's already no longer the case!

@jpivarski
Member Author

We had to do workarounds for pybind11's NumPy support, so going forward, I would rather access data buffers on both the C++ and the Python side via borrowed pointers, anyway. But any consideration of nanobind would have to be after we've dropped v1, so that gives us lots of time to think about it.

@jpivarski
Member Author

I'm going to cut it off here, just so that the new to_json gets in. from_json is currently untouched but accompanied by a from_json_new, which will definitely be swapped in to replace it: from_json_new is not permanent.

@jpivarski jpivarski marked this pull request as ready for review May 3, 2022 16:57
@jpivarski jpivarski merged commit 3b225f9 into main May 3, 2022
@jpivarski jpivarski deleted the jpivarski/revamp-v2-to_json-from_json-functions branch May 3, 2022 16:57