Revamping the to_json/from_json interface. #1449
@jpivarski just dropping in after reading the main description.
Is there much precedent / use case for customising the conversion behavior? Also, is there a particular reason to choose RapidJSON? I recalled seeing a number of benchmarks suggesting otherwise.
One important behavioral override is strings:

```python
>>> array = ak._v2.Array(["one", "two", "three", "four", "five"])
>>> array.layout
<ListOffsetArray len='5'>
    <parameter name='__array__'>'string'</parameter>
    <offsets><Index dtype='int64' len='6'>[ 0  3  6 11 15 19]</Index></offsets>
    <content><NumpyArray dtype='uint8' len='19'>
        <parameter name='__array__'>'char'</parameter>
        [111 110 101 116 119 111 116 104 114 101 101 102 111 117 114 102 105
         118 101]
    </NumpyArray></content>
</ListOffsetArray>
>>> ak._v2.to_json(array)
'["one","two","three","four","five"]'
```

However, that one was so important that now it has a special implementation path: you can't opt out of it. Also worth mentioning: there probably ought to be a way to pass in replacement callables for these conversions. The orjson documentation reminded me that we need to do something about serializing dates. That's a good to-do.

The other direction, reading and deserializing JSON, is a completely different story. (One of my favorite realizations is that there's so much asymmetry between reading and writing!) All of the JSON workarounds—nan, infinity, complex numbers, raw bytestrings, and now dates/time differences—can be applied to the Awkward Array after deserialization, so they don't need to slow down the deserialization process. They're also optional, whereas for serialization, something has to be done or we won't fit the format. We'll have to use ArrayBuilder, but it can be ArrayBuilder in C++ because we don't have to think about behavioral overrides.

I don't know how the orjson plots compare RapidJSON, a C++ library, with libraries that produce Python data. We won't be using it that way: the strings will go directly into ArrayBuilder, without touching Python (and therefore, we'll release the GIL, too). This is also true of the case that's guided by a JSONSchema, which also uses RapidJSON but skips ArrayBuilder's type discovery for some extra speed.

At one point, I did a comparison of C++ JSON libraries. The one that I started with and expected to use was simdjson, but I immediately ran into portability problems because of its use of SIMD. It turned out that RapidJSON was within a factor of 2 or so, but was very portable, owing to its age/maturity. And then the ArrayBuilder overhead dominated over the actual JSON parsing, so even that didn't matter. In the performance studies that skip ArrayBuilder with a JSONSchema (#1165 (comment)), RapidJSON is still not the bottleneck.
We can't do the other work fast enough to need a faster parser than RapidJSON. Also, RapidJSON supports incremental reading, which is important for very large datasets. A JSON document of floating-point values with ~16 digits uses twice as much data as the corresponding Awkward Array that we're reading it into.
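The serialization workarounds mentioned above (nan, infinity, complex numbers) can be sketched with the standard-library `json` module. The replacement strings and the recursive `sanitize` helper here are hypothetical illustrations of the "replacement callables" idea, not Awkward's actual API:

```python
import json
import math

def sanitize(obj, nan_string="nan", inf_string="inf"):
    """Recursively replace values that JSON cannot represent with
    substitutes. Illustrative only: the parameter names and the
    complex-number convention are assumptions, not Awkward's API."""
    if isinstance(obj, float):
        if math.isnan(obj):
            return nan_string
        if math.isinf(obj):
            return inf_string if obj > 0 else "-" + inf_string
        return obj
    if isinstance(obj, complex):
        # One possible convention: a two-field record for real/imag parts.
        return {"real": obj.real, "imag": obj.imag}
    if isinstance(obj, list):
        return [sanitize(x) for x in obj]
    if isinstance(obj, dict):
        return {k: sanitize(v) for k, v in obj.items()}
    return obj

data = [1.0, float("nan"), float("inf"), 3 + 4j]
text = json.dumps(sanitize(data))
# text == '[1.0, "nan", "inf", {"real": 3.0, "imag": 4.0}]'
```

On the reading side, as the comment notes, the inverse substitutions can be applied to the array after deserialization, so they stay optional and off the hot path.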
I suppose it might be possible, if there are no behavioral overrides other than strings (a common case), to take a faster serialization path that skips Python entirely.
It sounds like you've given it a lot of thought, and there's a lot to reply to! In the interim, I can comment on nanobind: I used it temporarily for some fairly simple bindings and found it really nice to use. I was surprised that we were considering it, because when I looked ~1 month ago, it didn't support NumPy, but it seems like that's already no longer the case!
We had to do workarounds for pybind11's NumPy support, so going forward, I would rather access data buffers on both the C++ and the Python side via borrowed pointers, anyway. But any consideration of nanobind would have to be after we've dropped v1, so that gives us lots of time to think about it.
I'm going to cut it off here, just so that the new interface can land.
Reading and writing are not symmetric. `to_json` goes through Python's `json` module because Arrays/Records can have lots of overloaded behaviors and users will expect it to behave like `to_list`. (Fortunately, `to_list` has recently been optimized.) `from_json` will go through RapidJSON because it's faster and can handle streaming data, for file-like objects representing remote files. This is possible because the input is just text, not Python objects with possible behaviors.

Thus, reading may be much faster than writing, but there are good reasons for that. The arguments for circumventing non-JSON-serializable data should match, however, and permit round-trip data, especially when a JSONSchema is provided.
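The streaming case described above, reading from a file-like object that only exposes a `read(num_bytes)` method, can be sketched in pure Python for line-delimited JSON. This is an illustration of the buffering technique only; Awkward's `from_json` does the equivalent in C++ via RapidJSON, and the helper name and chunk size are arbitrary:

```python
import io
import json

def iter_line_delimited(fileobj, chunk_size=4096):
    """Incrementally parse line-delimited JSON from any object with a
    read(num_bytes) method, without loading the whole stream at once."""
    buffer = ""
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        buffer += chunk
        # Emit every complete line accumulated so far.
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            if line.strip():
                yield json.loads(line)
    # A final record may not be newline-terminated.
    if buffer.strip():
        yield json.loads(buffer)

stream = io.StringIO('{"x": 1}\n{"x": 2}\n{"x": 3}\n')
records = list(iter_line_delimited(stream, chunk_size=8))
# records == [{"x": 1}, {"x": 2}, {"x": 3}]
```

Because only a bounded buffer is held in memory at a time, the same pattern works for remote files far larger than RAM, which is the motivation for streaming in the first place.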
- Merge `ak._v2.to_json` and `ak._v2.to_json_file` into a single function that can return a string, write to a named file (passed through `fsspec` if it has a URI scheme), or write to a file-like object.
- Base it on `ak._v2.to_list` (so more maintainable).
- `ak._v2.to_json_schema` to make a JSONSchema from a type or an array's type. (Completely new function.)
- Merge `ak._v2.from_json` and `ak._v2.from_json_file` into `ak._v2.from_json` as a single function that interprets `data` as a string, a `pathlib.Path` filename, a URI passed through `fsspec`, or a file-like object with a `read(num_bytes)` method.
- Same `line_delimited` option as `ak._v2.to_json`.
- A `schema` option, which is passed through `json.loads` if it's a string.
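The single-function dispatch proposed above (string vs `pathlib.Path` vs file-like object) could look roughly like this. It is a sketch of the dispatch logic only, using the stdlib `json` parser in place of RapidJSON; the function name is hypothetical and the `fsspec`/URI branch is omitted:

```python
import io
import json
import pathlib

def from_json_sketch(data):
    """Dispatch on input type, as the unified from_json is proposed to do.

    Hypothetical illustration: a pathlib.Path is opened and read, anything
    with a .read method is treated as a file-like object, and a plain
    string is parsed directly. (URI/fsspec handling is omitted here.)"""
    if isinstance(data, pathlib.Path):
        with open(data) as f:
            return json.load(f)
    if hasattr(data, "read"):
        return json.load(data)
    if isinstance(data, str):
        return json.loads(data)
    raise TypeError(f"cannot interpret {type(data).__name__} as JSON input")

parsed_from_str = from_json_sketch('[1, 2, 3]')
parsed_from_file = from_json_sketch(io.StringIO('{"x": 1}'))
```

Checking `pathlib.Path` before `str` matters, since both must be distinguished from generic file-like objects; the real implementation would also need the `line_delimited` and `schema` options from the checklist.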