[WIP] JSON host tree algorithms #16545

shrshi · 2024-08-13T15:42:45Z

Description

Depends on #16836
This change adds new host tree building algorithms for the JSON reader.

This constructs the device_column_tree using an adjacency list created from parent information.
The adjacency list is pruned based on the input schema, and types are enforced as per the schema (mark_is_pruned).
The tree is then constructed from the pruned adjacency list, with mixed-type handling (construct_tree).
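
The steps above can be sketched on the host side roughly as follows. This is a minimal illustration in Python, not the libcudf implementation: the flattened `names`/`parent_ids` layout, the nested-dict schema, and the helpers `build_adjacency`, `mark_pruned`, and `construct` are all hypothetical stand-ins (loosely modeled on the `mark_is_pruned` and `construct_tree` steps named above), and type enforcement and mixed-type handling are left out.

```python
from collections import defaultdict

# Hypothetical flattened column tree: parent_ids[i] is the parent of column i,
# and -1 marks the root. The names and layout are illustrative only.
names = ["root", "a", "b", "c", "d"]
parent_ids = [-1, 0, 0, 2, 2]


def build_adjacency(parent_ids):
    """Group child column ids under their parent id."""
    adj = defaultdict(list)
    for child, parent in enumerate(parent_ids):
        if parent >= 0:
            adj[parent].append(child)
    return adj


def mark_pruned(node, schema, adj, names, pruned):
    """Un-prune every child the schema asks for, recursing into nested schemas."""
    for child in adj.get(node, []):
        if names[child] in schema:
            pruned[child] = False
            mark_pruned(child, schema[names[child]], adj, names, pruned)
        # Otherwise the child and its whole subtree stay pruned.


def construct(node, adj, names, pruned):
    """Build a nested dict of the columns that survived pruning."""
    return {
        names[child]: construct(child, adj, names, pruned)
        for child in adj.get(node, [])
        if not pruned[child]
    }


# Hypothetical schema: keep only column "b" and, inside it, column "c".
schema = {"b": {"c": {}}}

adj = build_adjacency(parent_ids)
pruned = [True] * len(names)
pruned[0] = False  # the root is always kept
mark_pruned(0, schema, adj, names, pruned)
tree = construct(0, adj, names, pruned)
```

Separating the marking pass from the construction pass means the second pass only ever visits columns that will appear in the output.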

All unit tests pass. One unit test was added for a case where the old algorithm fails.

Until #16836 is merged, use karthikeyann#12 for viewing code diff.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


revans2 · 2024-09-19T16:33:17Z

I just ran our full set of tests with this new code and I am seeing some failures. Note that I do not have column pruning enabled yet. I will rerun the tests with column pruning enabled shortly.

The first error I am seeing is

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /.../cudf/cpp/src/io/json/host_tree_algorithms.cu:1367: struct child column insertion failed, duplicate column name in the parent

This appears to happen every time we try to read https://github.com/NVIDIA/spark-rapids/blob/branch-24.10/integration_tests/src/test/resources/escaped_strings.json as the input. The schema we are passing to CUDF is a single "data" column that is a STRING. I suspect that https://github.com/NVIDIA/spark-rapids/blob/7c13383d190bd28f69098e7b8abb15f899010c23/integration_tests/src/test/resources/escaped_strings.json#L40 is the line causing the issue, as it has "data" as the key encoded in an alternative way.
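
To illustrate how an alternatively encoded key could collide with the plain one, here is a small Python sketch. The input line is hypothetical (it is not copied from escaped_strings.json), and the helper `decoded_keys` is only for illustration; it shows that two keys spelled differently in the raw text can decode to the same name.

```python
import json

# Hypothetical input: the second key spells "data" using a \u escape.
line = r'{"data": 1, "\u0064ata": 2}'


def decoded_keys(text):
    """Return the object's keys after JSON unescaping, keeping duplicates."""
    pairs = json.loads(text, object_pairs_hook=lambda p: p)
    return [key for key, _ in pairs]


keys = decoded_keys(line)
# Both spellings decode to the same column name.
has_duplicate = len(set(keys)) < len(keys)
```

If a reader matches one spelling by its raw bytes and the other by its decoded form, both could end up as children named "data" under the same parent, which would be consistent with the duplicate column name failure above.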

I also get a lot of errors like

Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/.../cudf/cpp/src/io/json/json_column.cu:480: Unsupported column type

These show up when the schema I passed in does not match the schema of the data, for example when I asked for a list of strings but the data is a struct, or vice versa. In those cases I would expect a null to be returned, because the list cannot be coerced into a struct and a struct cannot be coerced into a list.
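
The expectation described above can be sketched as follows. This is a simplified Python model, not CUDF code: the schema representation (dicts with a "type" key) and the function `coerce` are hypothetical, and leaf handling is reduced to strings only.

```python
def coerce(value, schema):
    """Return value shaped to schema, or None where the shapes cannot match."""
    kind = schema["type"]
    if kind == "list":
        if not isinstance(value, list):
            return None  # e.g. a struct where a list was requested
        return [coerce(item, schema["child"]) for item in value]
    if kind == "struct":
        if not isinstance(value, dict):
            return None  # e.g. a list where a struct was requested
        return {
            name: coerce(value.get(name), child)
            for name, child in schema["children"].items()
        }
    if kind == "string":
        return value if isinstance(value, str) else None
    raise ValueError(f"unsupported schema type: {kind}")


list_of_string = {"type": "list", "child": {"type": "string"}}
struct_of_a = {"type": "struct", "children": {"a": {"type": "string"}}}

mismatch_1 = coerce({"a": "x"}, list_of_string)  # struct data, list schema
mismatch_2 = coerce(["x", "y"], struct_of_a)     # list data, struct schema
match_1 = coerce(["x", "y"], list_of_string)
match_2 = coerce({"a": "x"}, struct_of_a)
```

In this model a mismatch produces a null for that value only, rather than failing the whole read.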

karthikeyann · 2024-09-19T17:17:29Z

Commit c68c259 fixed the coerced type mismatch error.

revans2 · 2024-09-19T19:25:59Z

I just tested with pruning enabled (I hacked it up), and with your latest fixes it is looking fairly good. I am seeing 3 errors. One of them is related to the escaped key not being processed properly. And the second one is

CUDF failure at: /.../cudf/cpp/src/io/json/json_column.cu:562: Input needs to be an array of arrays or an array of (nested) objects

It happens when the input data is [{"a": 1}] (So the top level object is an array) and the schema I am requesting is for a struct with a single "a" column in it.

I have not yet gone through all of the tests that we expect to fail and verified that they are passing. I am going to do that next.

karthikeyann · 2024-09-20T05:33:17Z

Thank you @revans2. I fixed the array of arrays error for the example [{"a": 1}] in 4efa820.

revans2 · 2024-09-20T13:54:23Z

I have some performance numbers for the new patch.

For one particular benchmark in Spark-Rapids I am getting about 16.7 seconds with pruning enabled, which is a lot better than the hundreds of ms that I was getting before without pruning enabled. It is still not as good as the 13.8 seconds that I can get with an equivalent query using only get_json_object.

The only regressions I am seeing with pruning are the escape processing on a key I was seeing before. I will now retest with pruning disabled too.

revans2 · 2024-09-20T14:33:06Z

I did the same tests without pruning enabled, and the only regressions are still errors when the key is escaped. The performance is also still similar to before without pruning, about 160 seconds to complete a run. I'll mark my patch for column pruning #16796 to depend on this going in, as it has a lot of other issues without it.

shrshi added 2 commits August 13, 2024 15:37

impl

9eaacb3

formatting

27f1cb6

github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 13, 2024

shrshi added 2 commits August 13, 2024 23:23

added mixed type support

4987f74

formatting

65e147f

shrshi added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. labels Aug 13, 2024

shrshi changed the title ~~JSON host tree algorithms~~ [WIP] JSON host tree algorithms Aug 14, 2024

karthikeyann and others added 9 commits September 10, 2024 20:14

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into host-tre…

32e8619

…e-algorithms

comments - unfinished

b30c43f

very partial work; some comments

38819f2

struct column first try, basic tests pass

08cf338

add support for array_of_arrays

85983be

fix vector of dtypes in struct json

e3fd1d5

mixed type as string support added

dc25011

forced nested type in mixed type data

d1ec9c7

style fixes

ccfc6f6

karthikeyann mentioned this pull request Sep 18, 2024

[BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request #15260

Open

karthikeyann and others added 8 commits September 18, 2024 10:21

Merge branch 'branch-24.10' into host-tree-algorithms

8fbb1d0

cleanup

ed0b354

fix name for list child element as not element

c3fcf8a

reuse code

a700865

reorg code build_tree

217c4d8

pulled relevant changes from rapidsai#16759

7437653

code reorg: split to 3 functions

4eff9fc

split host functions to separate file

400df4b

karthikeyann mentioned this pull request Sep 19, 2024

JSON tree algorithm code reorg #16836

Open

3 tasks

karthikeyann added 8 commits September 19, 2024 05:52

split new host algorithm to functions

7f5fdf4

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into enh-json…

10bddb8

…_code_reorg1

move code

3762477

revert to old call

6c3b681

prepare for merge with reorg

ac9fa76

Merge branch 'enh-json_code_reorg1' of github.com:karthikeyann/cudf i…

638cb24

…nto host-tree-algorithms

fix merge issue

583c576

use experimental build_tree

62085a8

karthikeyann self-assigned this Sep 19, 2024

same code for both make_device_json_column

eab13b3

karthikeyann added 2 commits September 19, 2024 16:35

add profiling

1f855b5

fix for missmatched forced type left uninitialized

c68c259

karthikeyann and others added 2 commits September 20, 2024 05:27

unprune base list in array of arrays when prune is enabled

4efa820

Merge branch 'branch-24.10' into host-tree-algorithms

69459bd

revans2 mentioned this pull request Sep 20, 2024

Add in option for Java JSON APIs to do column pruning in CUDF #16796

Draft

3 tasks
