Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] JSON host tree algorithms #16545

Draft
wants to merge 34 commits into
base: branch-24.10
Choose a base branch
from

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Aug 13, 2024

Description

Depends on #16836
This change adds a new host tree building algorithms for JSON reader.

This constructs the device_column_tree using an adjacency list created from parent information.
This adjacency list is pruned based on input schema, and also types are enforced as per schema. mark_is_pruned
Tree is constructed from pruned adjacency list, (with mixed types handling). construct_tree

All unit tests passes, 1 unit test added where old algorithm fails.

Until #16836 is merged, use karthikeyann#12 for viewing code diff.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue labels Aug 13, 2024
@shrshi shrshi added cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change improvement Improvement / enhancement to an existing function and removed libcudf Affects libcudf (C++/CUDA) code. labels Aug 13, 2024
@shrshi shrshi changed the title JSON host tree algorithms [WIP] JSON host tree algorithms Aug 14, 2024
@karthikeyann karthikeyann self-assigned this Sep 19, 2024
@revans2
Copy link
Contributor

revans2 commented Sep 19, 2024

I just ran our full set of tests with this new code and I am seeing some failures. Note that I do not have column pruning enabled yet. I will rerun the tests with column pruning enabled shortly.

The first error I am seeing is

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /.../cudf/cpp/src/io/json/host_tree_algorithms.cu:1367: struct child column insertion failed, duplicate column name in the parent

The appears to be happening every time we try to read https://github.com/NVIDIA/spark-rapids/blob/branch-24.10/integration_tests/src/test/resources/escaped_strings.json as the input. The schema we are passing to CUDF is a single "data" column that is a STRING. I suspect that https://github.com/NVIDIA/spark-rapids/blob/7c13383d190bd28f69098e7b8abb15f899010c23/integration_tests/src/test/resources/escaped_strings.json#L40 is the line causing the issues as it has "data" as the key encoded in an alternative way.

I also get a lot of errors like

Caused by: ai.rapids.cudf.CudfException: CUDF failure at:/.../cudf/cpp/src/io/json/json_column.cu:480: Unsupported column type

These show up when the schema I passed in does not match the schema of the data. Like I asked for a list of string but the data is a struct. or vise versa. In those cases I would expect a null to be returned because the list cannot be coerced into a struct and a struct cannot be coerced into a list.

@karthikeyann
Copy link
Contributor

c68c259 commit fixed the coerced type mismatch error.

@revans2
Copy link
Contributor

revans2 commented Sep 19, 2024

I just tested with pruning enabled (I hacked it up), and with your latest fixes it is looking fairly good. I am seeing 3 errors. One of them is related to the escaped key not being processed properly. And the second one is

CUDF failure at: /.../cudf/cpp/src/io/json/json_column.cu:562: Input needs to be an array of arrays or an array of (nested) objects

It happens when the input data is [{"a": 1}] (So the top level object is an array) and the schema I am requesting is for a struct with a single "a" column in it.

I have not gone through all of the tests that we expect to fail yet and verified that they are passing. I am going to do that next

@karthikeyann
Copy link
Contributor

karthikeyann commented Sep 20, 2024

Thank you @revans2 . I fixed the array of arrays error for example [{"a": 1}] 4efa820

@revans2
Copy link
Contributor

revans2 commented Sep 20, 2024

I have some performance numbers for the new patch.

For one particular benchmark in Spark-Rapids I am get about 16.7 seconds with pruning enabled. Which is a lot better than the hundreds if ms that I was getting before without pruning enabled. It is still not as good as the 13.8 seconds that I can get with an equivalent query using only get_json_object.

The only regressions I am seeing with pruning are the escape processing on a key I was seeing before. I will now retest with pruning disabled too.

@revans2
Copy link
Contributor

revans2 commented Sep 20, 2024

I did the same tests without pruning enabled, and the only regressions are still errors when the key is escaped. The performance is also still similar to before without pruning, about 160 seconds to complete a run. I'll mark my patch for column pruning #16796 to depend on this going in, as it has a lot of other issues without it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
Status: Burndown
Development

Successfully merging this pull request may close these issues.

3 participants