[WIP] JSON host tree algorithms #16545
base: branch-24.10
Conversation
I just ran our full set of tests with this new code and I am seeing some failures. Note that I do not have column pruning enabled yet. I will rerun the tests with column pruning enabled shortly. The first error I am seeing is
This appears to be happening every time we try to read https://github.com/NVIDIA/spark-rapids/blob/branch-24.10/integration_tests/src/test/resources/escaped_strings.json as the input. The schema we are passing to CUDF is a single "data" column that is a STRING. I suspect that https://github.com/NVIDIA/spark-rapids/blob/7c13383d190bd28f69098e7b8abb15f899010c23/integration_tests/src/test/resources/escaped_strings.json#L40 is the line causing the issues, as it has "data" as the key encoded in an alternative way. I also get a lot of errors like
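To illustrate what "encoded in an alternative way" means here (the exact line in the test file may differ), JSON allows a key to be spelled with Unicode escapes, so a spec-compliant parser must treat the escaped spelling and the plain spelling as the same column name:

```python
import json

# "\u0064" is the JSON Unicode escape for the letter "d", so this key is an
# alternative encoding of the plain key "data". (Illustrative example; the
# actual line in escaped_strings.json may use a different escape.)
escaped = '{"\\u0064ata": "value"}'
plain = '{"data": "value"}'

# A spec-compliant parser decodes the escape, so both records carry the
# same single column name.
assert json.loads(escaped) == json.loads(plain)
print(list(json.loads(escaped).keys()))  # → ['data']
```

If the reader matches schema column names against the raw (undecoded) key bytes, the escaped spelling would not match the "data" column, which is consistent with the failure described above.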
These show up when the schema I passed in does not match the schema of the data, e.g., I asked for a list of strings but the data is a struct, or vice versa. In those cases I would expect a null to be returned, because a list cannot be coerced into a struct and a struct cannot be coerced into a list.
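The expected behavior described above can be sketched as a simple coercion rule: when the parsed value's shape does not match the requested schema type, yield null instead of raising an error. This is a hypothetical illustration of the rule, not the actual cuDF implementation:

```python
# Hypothetical sketch of the coercion rule described above: if the parsed
# JSON value's shape does not match the requested schema type, return null
# (None) rather than erroring out. Not the actual cuDF code.

def coerce(value, schema_type):
    """schema_type is 'list' or 'struct' for this sketch."""
    if schema_type == "list":
        return value if isinstance(value, list) else None
    if schema_type == "struct":
        return value if isinstance(value, dict) else None
    return value

# A struct requested as a list (or vice versa) cannot be coerced, so null:
assert coerce({"a": 1}, "list") is None
assert coerce([1, 2], "struct") is None
# Matching shapes pass through unchanged:
assert coerce([1, 2], "list") == [1, 2]
```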
Commit c68c259 fixed the coerced type mismatch error.
I just tested with pruning enabled (I hacked it up), and with your latest fixes it is looking fairly good. I am seeing 3 errors. One of them is related to the escaped key not being processed properly. And the second one is
It happens when the input data is
I have not gone through all of the tests that we expect to fail yet and verified that they are passing. I am going to do that next.
I have some performance numbers for the new patch. For one particular benchmark in Spark-Rapids I am getting about 16.7 seconds with pruning enabled, which is a lot better than the hundreds of seconds I was getting before without pruning enabled. It is still not as good as the 13.8 seconds that I can get with an equivalent query using only get_json_object. The only regressions I am seeing with pruning are the escaped-key processing errors I was seeing before. I will now retest with pruning disabled too.
I did the same tests without pruning enabled, and the only regressions are still errors when the key is escaped. The performance is also still similar to before without pruning, about 160 seconds to complete a run. I'll mark my patch for column pruning #16796 as depending on this going in, as it has a lot of other issues without it.
Description
Depends on #16836
This change adds new host tree-building algorithms for the JSON reader.
It constructs the device_column_tree using an adjacency list created from parent information. The adjacency list is pruned based on the input schema, and types are enforced as per the schema (mark_is_pruned).
The tree is then constructed from the pruned adjacency list, with mixed-type handling (construct_tree).
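The two steps above can be sketched on the host side as follows. Only the step names (mark_is_pruned, construct_tree) come from this PR; the data layout, the -1 root convention, and the schema handling here are simplifying assumptions for illustration, not the actual cuDF code:

```python
# Illustrative host-side sketch: adjacency list from per-node parent ids,
# schema-based pruning, then tree construction from the surviving nodes.

def build_adjacency(parent_ids):
    """Adjacency list (parent -> children) from per-node parent ids; -1 = root."""
    children = {i: [] for i in range(len(parent_ids))}
    roots = []
    for node, parent in enumerate(parent_ids):
        if parent == -1:
            roots.append(node)
        else:
            children[parent].append(node)
    return roots, children

def mark_is_pruned(roots, children, names, schema):
    """Keep only nodes reachable through field names in the schema (roots kept)."""
    keep = [False] * len(names)
    stack = list(roots)
    for r in roots:
        keep[r] = True
    while stack:
        node = stack.pop()
        for child in children[node]:
            if names[child] in schema:
                keep[child] = True
                stack.append(child)
    return keep

def construct_tree(roots, children, names, keep):
    """Nested dict tree built from the pruned adjacency list."""
    def build(node):
        return {names[c]: build(c) for c in children[node] if keep[c]}
    return {names[r]: build(r) for r in roots}

# Example: a root with fields "data" (in the schema) and "extra" (pruned).
parents = [-1, 0, 0]
names = ["root", "data", "extra"]
roots, adj = build_adjacency(parents)
keep = mark_is_pruned(roots, adj, names, {"data"})
print(construct_tree(roots, adj, names, keep))  # → {'root': {'data': {}}}
```

Pruning before tree construction is what keeps the host-side work proportional to the requested schema rather than to the full nesting present in the data.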
All unit tests pass; one unit test was added for a case where the old algorithm fails.
Until #16836 is merged, use karthikeyann#12 for viewing code diff.
Checklist