JSON Column creation in GPU #5

karthikeyann · 2022-09-21T13:13:38Z

Review PR for JSON Column creation in GPU rapidsai#11714


Adds JSON tree traversal algorithm in host and device. It generates column indices for the _record_ orient JSON format: a list of structs at the root, where each struct is a row.

- [x] column indices generation
- [x] row offset

Depends on PR rapidsai#11518

### Tree Traversal
This algorithm assigns a unique column id to each node in the tree. The row offset is the row index of the node in that column id.

Algorithm:
1. Convert node_category+fieldname to node_type.
   a. Create a hashmap to hash field name and assign unique node id as values.
   b. Convert the node categories to node types. Node type is defined as node category enum value if it is not a field node, otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
2. Preprocessing: Translate parent node ids after sorting by level.
   a. sort by level
   b. get gather map of sorted indices
   c. translate parent_node_ids to new sorted indices
3. Find level boundaries. copy_if index of first unique values of sorted levels.
4. Per-Level Processing: Propagate parent node ids for each level. For each level,
   a. gather col_id from previous level results. input=col_id, gather_map is parent_indices.
   b. stable sort by {parent_col_id, node_type}
   c. scan sum of unique {parent_col_id, node_type}
   d. scatter the col_id back to stable node_level order (using scatter_indices)
   Restore original node_id order
5. Generate row_offset.
   a. stable_sort by parent_col_id.
   b. scan_by_key {parent_col_id} (required only on nodes whose parent is a list)
   c. propagate to non-list leaves from parent list node by recursion

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Elias Stehle (https://github.com/elstehle)
  - Tobias Ribizel (https://github.com/upsj)
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: rapidsai#11610
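To make the traversal steps above concrete, here is a minimal host-side sketch in plain C++. It is not the device implementation from the PR: `host_tree`, `traversal_result` and `traverse` are hypothetical names, the level-by-level sort/scan/scatter passes are replaced by a single sequential pass, and it assumes nodes are stored so that every parent precedes its children. Column ids are numbered by first appearance, which may differ from the numbering the device algorithm produces.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Node categories named in the description; NUM_CATEGORY is the shift
// applied to field-name ids when building node types.
enum node_category : int { NC_STRUCT, NC_LIST, NC_FIELD, NC_VALUE, NC_STRING, NUM_CATEGORY };

// Hypothetical host-side tree. Assumption: a parent always precedes its
// children (e.g. pre-order) and the root has parent -1.
struct host_tree {
  std::vector<int> categories;
  std::vector<int> parents;
  std::vector<std::string> field_names;  // empty for non-field nodes
};

struct traversal_result {
  std::vector<int> col_ids;
  std::vector<int> row_offsets;
};

traversal_result traverse(host_tree const& tree)
{
  auto const n = tree.categories.size();

  // Step 1: node_type is the category for non-field nodes, otherwise a
  // unique id per field name shifted by NUM_CATEGORY.
  std::map<std::string, int> field_ids;
  std::vector<int> node_types(n);
  for (std::size_t i = 0; i < n; ++i) {
    if (tree.categories[i] == NC_FIELD) {
      auto const next_id = static_cast<int>(field_ids.size());
      auto const it      = field_ids.emplace(tree.field_names[i], next_id).first;
      node_types[i]      = NUM_CATEGORY + it->second;
    } else {
      node_types[i] = tree.categories[i];
    }
  }

  // Steps 2-4 (collapsed): each unique {parent_col_id, node_type} pair
  // identifies one column.
  std::map<std::pair<int, int>, int> col_of_key;
  std::vector<int> col_ids(n);
  for (std::size_t i = 0; i < n; ++i) {
    int const parent     = tree.parents[i];
    int const parent_col = parent < 0 ? -1 : col_ids[parent];
    auto const next_col  = static_cast<int>(col_of_key.size());
    auto const key       = std::make_pair(parent_col, node_types[i]);
    col_ids[i]           = col_of_key.emplace(key, next_col).first->second;
  }

  // Step 5: children of a list get a running index per parent column;
  // every other node inherits the row offset of its parent.
  std::map<int, int> next_row;
  std::vector<int> row_offsets(n, 0);
  for (std::size_t i = 0; i < n; ++i) {
    int const parent = tree.parents[i];
    if (parent < 0) { continue; }
    if (tree.categories[parent] == NC_LIST) {
      row_offsets[i] = next_row[col_ids[parent]]++;
    } else {
      row_offsets[i] = row_offsets[parent];
    }
  }
  return {std::move(col_ids), std::move(row_offsets)};
}
```

For the records-orient input `[{"a":1,"b":"x"},{"a":2}]`, both struct nodes land in the same column and receive row offsets 0 and 1, and their descendants inherit those offsets.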


This PR creates JSON columns from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - Reduces the node tree into a column tree by aggregating the properties of each column and the number of rows in each column.
2. `make_json_column2` - creates the GPU json column tree structure from tree and column info
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Depends on PR #11518 #11610
For code-review, use PR karthikeyann#5, which contains only this PR's changes.

### Overview
- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, STRING.
STRUCT and LIST are nested types.
FIELD nodes are struct columns' keys.
VALUE node is similar to STRING column but without double quotes. Actual datatype conversion happens in `json_column_to_cudf_column2`

Tree Representation `tree_meta_t` has 4 data members.
1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)

Currently supported JSON formats are records orient, and JSON lines.

### This PR - Detailed explanation
This PR has 3 steps, followed by a wrapper that combines them with the earlier stages.
1. `reduce_to_column_tree`
   - Required to compute total number of columns, column type, nested column structure, and number of rows in each column.
   - Generates `tree_meta_t` data members for column.
     - Sort node tree by col_id (stable sort)
     - reduce_by_key custom_op on node_categories, collapses to column category
     - unique_by_key_copy by col_id, copies first parent_node_id, string_ranges. This parent_node_id will be transformed to parent_column_id.
     - reduce_by_key max on row_offsets gives maximum row offset in each column. Propagate list column children's max row offset to their children because sometimes structs may miss entries, so parent list gives correct count.
2. `make_json_column2`
   - Converts nodes to GPU json columns in tree structure
     - get column tree, transfer column names to host.
     - Create `d_json_column` for non-field columns.
     - if 2 columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
     - For STRUCT, LIST, VALUE, STRING nodes, set the validity bits, and copy string {begin, end} range to string_offsets and string length.
     - Compute list offset
     - Perform scan max operation on offsets. (to fill 0's with previous offset value).
   - Now the `d_json_column` is nested, and contains offsets, validity bits, unparsed unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU json column to cudf column.
   - Recursively goes over each `d_json_column` and converts to `cudf::column` by inferring the type, parsing the string to type, and setting validity bits further.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)
  - Yunsong Wang (https://github.com/PointKernel)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11714
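Two of the primitives named above, "reduce_by_key max on row_offsets" and "scan max operation on offsets", can be sketched on the host with standard algorithms. This is an illustration only, not the device code from the PR: `max_row_offset_per_column` and `fill_offsets` are hypothetical names, and the propagation of a list column's row count to its children is not shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Stable-sorts node indices by col_id, then reduces each run of equal
// col_ids with max. Returns (col_id, max row_offset) pairs ordered by col_id.
std::vector<std::pair<int, int>> max_row_offset_per_column(std::vector<int> const& col_ids,
                                                           std::vector<int> const& row_offsets)
{
  std::vector<std::size_t> order(col_ids.size());
  std::iota(order.begin(), order.end(), std::size_t{0});
  std::stable_sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
    return col_ids[a] < col_ids[b];
  });

  std::vector<std::pair<int, int>> result;
  for (auto const idx : order) {
    if (result.empty() || result.back().first != col_ids[idx]) {
      result.emplace_back(col_ids[idx], row_offsets[idx]);  // first node of a new column
    } else {
      result.back().second = std::max(result.back().second, row_offsets[idx]);
    }
  }
  return result;
}

// Inclusive scan with max: every 0 left by a missing entry is replaced by
// the largest offset seen so far.
std::vector<int> fill_offsets(std::vector<int> offsets)
{
  std::partial_sum(offsets.begin(), offsets.end(), offsets.begin(), [](int a, int b) {
    return std::max(a, b);
  });
  return offsets;
}
```

The number of rows in a column is then the maximum row offset plus one, before any propagation from a parent list column.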

This implements stacktrace and adds a stacktrace string into any exception thrown by cudf. By doing so, the exception carries information about where it originated, allowing the downstream application to trace back with much less effort.

Closes rapidsai#12422.

### Example:
```
#0: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::sorted_order<false>(cudf::table_view, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x446
#1: cudf/cpp/build/libcudf.so : cudf::detail::sorted_order(cudf::table_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x113
#2: cudf/cpp/build/libcudf.so : std::unique_ptr<cudf::column, std::default_delete<cudf::column> > cudf::detail::segmented_sorted_order_common<(cudf::detail::sort_method)1>(cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x66e
#3: cudf/cpp/build/libcudf.so : cudf::detail::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::cuda_stream_view, rmm::mr::device_memory_resource*)+0x88
#4: cudf/cpp/build/libcudf.so : cudf::segmented_sort_by_key(cudf::table_view const&, cudf::table_view const&, cudf::column_view const&, std::vector<cudf::order, std::allocator<cudf::order> > const&, std::vector<cudf::null_order, std::allocator<cudf::null_order> > const&, rmm::mr::device_memory_resource*)+0xb9
#5: cudf/cpp/build/gtests/SORT_TEST : ()+0xe3027
#6: cudf/cpp/build/lib/libgtest.so.1.13.0 : void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*)+0x8f
#7: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::Test::Run()+0xd6
#8: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestInfo::Run()+0x195
#9: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::TestSuite::Run()+0x109
#10: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::internal::UnitTestImpl::RunAllTests()+0x44f
#11: cudf/cpp/build/lib/libgtest.so.1.13.0 : bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*)+0x87
#12: cudf/cpp/build/lib/libgtest.so.1.13.0 : testing::UnitTest::Run()+0x95
#13: cudf/cpp/build/gtests/SORT_TEST : ()+0xdb08c
#14: /lib/x86_64-linux-gnu/libc.so.6 : ()+0x29d90
#15: /lib/x86_64-linux-gnu/libc.so.6 : __libc_start_main()+0x80
#16: cudf/cpp/build/gtests/SORT_TEST : ()+0xdf3d5
```

### Usage
In order to retrieve a stacktrace with fully human-readable symbols, some compile options must be adjusted. To make this adjustment convenient, a new cmake option (`CUDF_BUILD_STACKTRACE_DEBUG`) has been added. Set this option to `ON` before building cudf and it will be ready to use.

For downstream applications, whenever a cudf-type exception is thrown, the application can retrieve the stored stacktrace and do whatever it wants with it. For example:
```
try {
  // cudf API calls
} catch (cudf::logic_error const& e) {
  std::cout << e.what() << std::endl;
  std::cout << e.stacktrace() << std::endl;
  throw;  // rethrow the original exception without copying it
}
// similar with catching other exception types
```

### Follow-up work
The next step would be patching `rmm` to attach stacktrace into `rmm::` exceptions. Doing so will allow debugging various memory exceptions thrown from libcudf using their stacktrace.

### Note:
* This feature doesn't require libcudf to be built in Debug mode.
* The flag `CUDF_BUILD_STACKTRACE_DEBUG` should not be turned on in production as it may affect code optimization. Instead, libcudf compiled with that flag turned on should be used only when needed, when debugging exceptions thrown by cudf.
* This flag removes the current optimization flag from compilation (such as `-O2` or `-O3`, if in Release mode) and replaces it with `-Og` (optimize for debugging).
* If this option is not set to `ON`, the stacktrace will not be available. This is to avoid expensive stacktrace retrieval if the thrown exception is expected.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - AJ Schmidt (https://github.com/ajschmidt8)
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Jason Lowe (https://github.com/jlowe)

URL: rapidsai#13298
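As a rough illustration of the idea described above (an exception that carries a stacktrace string which a downstream handler can read before rethrowing), here is a self-contained sketch. It does not use cudf: `traced_error`, `derived_error` and `call_and_rethrow` are hypothetical names, and the stacktrace is a plain string passed in by the caller rather than one captured from the real call stack.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception type that stores a stacktrace string next to the
// usual what() message.
class traced_error : public std::logic_error {
 public:
  traced_error(std::string const& msg, std::string trace)
    : std::logic_error(msg), trace_(std::move(trace))
  {
  }
  std::string const& stacktrace() const noexcept { return trace_; }

 private:
  std::string trace_;
};

// A more specific error, used to show that rethrowing keeps the dynamic type.
struct derived_error : traced_error {
  using traced_error::traced_error;
};

// Runs f, appends the message and stacktrace of any traced_error to log,
// then rethrows the same exception object with a bare `throw;`.
template <typename F>
void call_and_rethrow(F&& f, std::string& log)
{
  try {
    f();
  } catch (traced_error const& e) {
    log += e.what();
    log += '\n';
    log += e.stacktrace();
    throw;
  }
}
```

A bare `throw;` is used because `throw e;` would throw a copy with the static type of the handler's parameter, so a `derived_error` would be sliced to `traced_error`.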


vuule and others added 30 commits August 16, 2022 11:59

style

a6d5ab7

Merge remote-tracking branch 'upstream/pull-request/11364' into featu…

397e00f

…re/nested-json-lines

integrates upstream interface changes

a0bd229

Merge remote-tracking branch 'fea-json-col-cast' into feature/json-pr…

c822942

…imitive-type-handling

Merge remote-tracking branch 'upstream/branch-22.10' into feature/nes…

46a3c44

…ted-json-lines

enables lines option in the nested reader

f3bba9d

migrates test from details api to reader api

21b4023

improves code comment

cdc4441

Merge branch 'feature/nested-json-lines' into feature/json-primitive-…

836e0d1

…type-handling

adds inference and type conversion

7479b63

Merge remote-tracking branch 'upstream/branch-22.10' into type-inference

29c6525

Move type inference to utilities

a659817

Resolve conflicts + relocate type inference test file

4410488

Get rid of narrow conversion + add string handling

6409a5f

Updates: make column string iter compatible with zip iterator

ec07bca

Minor updates

640eb00

Add missing header

b0fac83

Fix the infinite loop bug with while

67fcaf5

Update cpp/src/io/utilities/type_inference.cuh

51997be

Co-authored-by: Elias Stehle <3958403+elstehle@users.noreply.github.com>

patches data casting for escape handling

a5e50d6

Merge remote-tracking branch 'upstream/pull-request/11121' into featu…

b2805a5

…re/json-primitive-type-handling

resolves downstream inference conflicts

fe70ac2

removes debug prints from casting

779b638

removes local test

05b506f

adds new logic for inferring nested columns

bcf4b86

fixes issue for two subsequent non-UTF-16 unicode esc sequences

ff87f3b

Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…

123cb69

…f-column-type-conversion

Merge remote-tracking branch 'upstream/branch-22.10' into feature/lea…

052fdfb

…f-column-type-conversion

resolves merge conflicts

e2fae02

fixes nullable behaviour to match nested json reader

872c332

karthikeyann and others added 23 commits September 23, 2022 20:54

add CUDF_FUNC_RANGE to parse_data and infer_data_type

d670c0a

Merge branch 'fea-json-tree-traversal' of github.com:karthikeyann/cud…

c852cc9

…f into fea-json-column-gpu

Merge branch 'fea-json-column-gpu' of github.com:karthikeyann/cudf in…

1ef60cb

…to fea-json-integration

Merge branch 'fea-json-tree-traversal' of github.com:karthikeyann/cud…

a62dff8

…f into fea-json-column-gpu

Merge branch 'fea-json-column-gpu' of github.com:karthikeyann/cudf in…

335aedc

…to fea-json-integration

performance changes, replace sort with scatter

28cc195

Merge branch 'fea-json-tree-traversal' of github.com:karthikeyann/cud…

0b398fe

…f into fea-json-column-gpu

Merge branch 'fea-json-column-gpu' of github.com:karthikeyann/cudf in…

e1643db

…to fea-json-integration

Merge branch 'fea-json-tree-traversal' of github.com:karthikeyann/cud…

18fa2fb

…f into fea-json-column-gpu

Merge branch 'fea-json-integration' of github.com:karthikeyann/cudf i…

04c7553

…nto fea-json-column-gpu

clean up debug prints

f7efff7

cleanup

840db1e

fix typo

b664e36

Merge branch 'fea-json-tree-traversal' of github.com:karthikeyann/cud…

adc38ae

…f into fea-json-column-gpu

address review comments

1ffe587

Merge branch 'branch-22.10' of github.com:rapidsai/cudf into fea-json…

6a061ad

…-column-gpu

reduce memory usage, speedup unique_copy_by_key, cleanup

13a61f9

address review comments (upsj)

55b2f24

input empty string, or empty array - fix boundary cases in array acce…

b1e5c76

…sses

dispatch_dfa initialization_pass_kernel config 0 error - fix min grid…

b0f61d6

…size

zero rows test cases in unit tests

6280780

enable lines true, false test for experimental parser

27edbf7

github-actions bot added the cuDF (Python) label Sep 26, 2022

karthikeyann and others added 2 commits September 27, 2022 03:45

address review comments (vuule)

9fb6425

Update cpp/src/io/json/experimental/read_json.cpp

6d71b6b

elstehle removed their request for review October 28, 2022 10:43

karthikeyann pushed a commit that referenced this pull request Nov 10, 2023

Merge pull request #5 from mroeschke/bug/df/column_mismatch

2869181

Raise ValueError if DataFrame column length does not match data
