Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Getting arrays of objects #13

Open
mwylde opened this issue May 31, 2024 · 2 comments
Open

Getting arrays of objects #13

mwylde opened this issue May 31, 2024 · 2 comments

Comments

@mwylde
Copy link

mwylde commented May 31, 2024

Thanks for putting this together, this is a great improvement on the state of JSON handling in datafusion! I've been trying this out in Arroyo (ArroyoSystems/arroyo#640), and it's definitely a big perf improvement on what we've been doing.

One limitation I've encountered is dealing with arrays of objects. For example, something like

{
  "a": [
    {"b": 5},   
    {"b": 9},
    {"b": 3}
  ]
}

A common pattern in Arroyo is to extract an array from JSON data, then call our unnest operator on it (which unrolls a list into individual rows). For that, we need a way to get an Arrow list out of the data. But there's no function that's capable of returning an array.

Two options I've considered:

  • Some syntax to "map" over a list in the json path (like json_get_int(s, 'a', *, 'b')), where the * represents an array that we're mapping over; this would return a SQL array/arrow list [5, 9, 3]
  • A get_json_array function, which returns an array of JsonUnion objects, which could then be queried without needing to be deserialized again (like get_json_int(unnest(get_json_array(s, 'a')), 'b)).

I've got most of a working prototype of the second one, but I'd be interested in thoughts on what approach makes sense (or whether it's within the interest of this project to support more complex JSON structures like this).

@ahirner
Copy link

ahirner commented Jun 2, 2024

what approach

IMO get_json_array is the more composable and preferred way, especially since unnest in datafusion received a significant improvement in v38: apache/datafusion#10044.

@samuelcolvin
Copy link
Collaborator

Sorry for the slow reply, I saw this then forgot to reply.

json_get_int is definitely easist.

But I see get_json_array is more powerful. I don't think it's a JsonUnion you'd reply, but rather RecordBatch or similar, basically a str->vec mapping. The thing is, that's basically reimplementing arrow-json, but probably much faster, I'll have a play and see how I get on.

PRs also welcome for either approach.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants