Adds GPU implementation of JSON-token-stream to JSON-tree #11518

Merged: 36 commits, Sep 19, 2022

@karthikeyann karthikeyann commented Aug 11, 2022

Description

Adds GPU implementation of JSON-token-stream to JSON-tree
Depends on PR Adds JSON-token-stream to JSON-tree #11291


This PR adds the stage of converting a JSON input into a tree representation, where each node represents either a struct, a list, a field name, a string value, a value, or an error node.

The PR is part of a multi-part PR-chain. Specifically, this PR builds on the JSON tokenizer PR.

This PR depends on:
⛓️ #11264
⛓️ #11242
⛓️ #11078

Each node has one of the following categories:

/// A node representing a struct
NC_STRUCT,
/// A node representing a list
NC_LIST,
/// A node representing a field name
NC_FN,
/// A node representing a string value
NC_STR,
/// A node representing a numeric or literal value (e.g., true, false, null)
NC_VAL,
/// A node representing a parser error
NC_ERR

For each node, the tree representation stores the following information:

  • node category
  • node level
  • node range begin (index of the first character from the original JSON input that this node demarcates)
  • node range end (index one past the last character from the original JSON input that this node demarcates)
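The per-node information above can be pictured as a struct of arrays. The following is a minimal host-side sketch, not the PR's device code: `host_tree`, `node_text`, and `make_example_tree` are hypothetical names, and the level values assume that a node's level is its depth in the tree. It shows how the half-open range [begin, end) maps a node back to the input characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Node categories as listed in the PR description.
enum node_category : std::uint8_t { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR };

// Hypothetical host-side mirror of the per-node information (struct of arrays).
struct host_tree {
  std::vector<node_category> categories;
  std::vector<int> levels;               // assumed: depth of the node in the tree
  std::vector<std::size_t> range_begin;  // index of the first character
  std::vector<std::size_t> range_end;    // index one past the last character
};

// Returns the input characters that a node demarcates: the half-open range [begin, end).
std::string node_text(host_tree const& tree, std::string const& json, std::size_t node_id)
{
  std::size_t const begin = tree.range_begin[node_id];
  std::size_t const end   = tree.range_end[node_id];
  return json.substr(begin, end - begin);
}

// Builds, by hand, the first four nodes of a tree for the input
//   `  [{"category": "reference"}]`  (note the two leading spaces).
host_tree make_example_tree()
{
  host_tree t;
  t.categories  = {NC_LIST, NC_STRUCT, NC_FN, NC_STR};
  t.levels      = {0, 1, 2, 3};
  t.range_begin = {2, 3, 5, 17};
  t.range_end   = {3, 4, 13, 26};
  return t;
}
```

Calling `node_text(make_example_tree(), json, 2)` on that input yields the field name without its quotes, because the range of an `NC_FN` node excludes the surrounding double quotes.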

An example tree:
The following is just an example print of the information represented in the tree generated by the algorithm.

  • Each line prints the full path from the root to the next node in the tree.
  • For each node along the path we have the following format: <[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN],[RANGE_END]) '[STRING_FROM_RANGE]'>

The original JSON for this tree:

  [{"category": "reference","index:": [4,12,42],"author": "Nigel Rees","title": "[Sayings of the Century]","price": 8.95},  {"category": "reference","index": [4,{},null,{"a":[{ }, {}] } ],"author": "Nigel Rees","title": "{}[], <=semantic-symbols-string","price": 8.95}] 

The tree:

<0:LIST:[2, 3) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <2:FN:[5, 13) 'category'> -> <3:STR:[17, 26) 'reference'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <6:VAL:[39, 40) '4'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <7:VAL:[41, 43) '12'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <4:FN:[29, 35) 'index:'> -> <5:LIST:[38, 39) '['> -> <8:VAL:[44, 46) '42'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <9:FN:[49, 55) 'author'> -> <10:STR:[59, 69) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <11:FN:[72, 77) 'title'> -> <12:STR:[81, 105) '[Sayings of the Century]'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'>
<0:LIST:[2, 3) '['> -> <1:STRUCT:[3, 4) '{'> -> <13:FN:[108, 113) 'price'> -> <14:VAL:[116, 120) '8.95'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <16:FN:[126, 134) 'category'> -> <17:STR:[138, 147) 'reference'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <20:VAL:[159, 160) '4'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <21:STRUCT:[161, 162) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <22:VAL:[164, 168) 'null'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <26:STRUCT:[175, 176) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <18:FN:[150, 155) 'index'> -> <19:LIST:[158, 159) '['> -> <23:STRUCT:[169, 170) '{'> -> <24:FN:[171, 172) 'a'> -> <25:LIST:[174, 175) '['> -> <27:STRUCT:[180, 181) '{'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <28:FN:[189, 195) 'author'> -> <29:STR:[199, 209) 'Nigel Rees'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <30:FN:[212, 217) 'title'> -> <31:STR:[221, 252) '{}[], <=semantic-symbols-string'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'>
<0:LIST:[2, 3) '['> -> <15:STRUCT:[124, 125) '{'> -> <32:FN:[255, 260) 'price'> -> <33:VAL:[263, 267) '8.95'>
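A print like the one above can be produced by walking from a node up to the root and formatting each node along the way. The sketch below is illustrative only and is not the PR's debug-print utility: the `node` struct, the `-1` root sentinel, and the function names are assumptions, and it presumes that each node knows its parent's id.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical host-side node record used only for printing.
struct node {
  std::string category;  // e.g. "LIST", "STRUCT", "FN", "STR", "VAL"
  int parent;            // parent node id; -1 marks the root (assumed sentinel)
  std::size_t begin;     // index of the first character
  std::size_t end;       // index one past the last character
};

// Formats one node as <[NODE_ID]:[NODE_CATEGORY]:[[RANGE_BEGIN], [RANGE_END]) '[STRING_FROM_RANGE]'>.
std::string format_node(std::vector<node> const& nodes, std::string const& json, int id)
{
  node const& n = nodes[id];
  std::ostringstream os;
  os << '<' << id << ':' << n.category << ":[" << n.begin << ", " << n.end << ") '"
     << json.substr(n.begin, n.end - n.begin) << "'>";
  return os.str();
}

// Formats the full path from the root down to the node with the given id.
std::string format_path(std::vector<node> const& nodes, std::string const& json, int id)
{
  // Collect the chain of ancestors from the node up to the root.
  std::vector<int> chain;
  for (int cur = id; cur != -1; cur = nodes[cur].parent) { chain.push_back(cur); }

  // Emit the chain in root-to-node order.
  std::string out;
  for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
    if (!out.empty()) { out += " -> "; }
    out += format_node(nodes, json, *it);
  }
  return out;
}
```

Calling `format_path` once per node id, in increasing order, gives one line per node in the same layout as the listing above.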

The original JSON pretty-printed for this tree:

[
    {
        "category": "reference",
        "index:": [
            4,
            12,
            42
        ],
        "author": "Nigel Rees",
        "title": "[Sayings of the Century]",
        "price": 8.95
    },
    {
        "category": "reference",
        "index": [
            4,
            {},
            null,
            {
                "a": [
                    {},
                    {}
                ]
            }
        ],
        "author": "Nigel Rees",
        "title": "{}[], <=semantic-symbols-string",
        "price": 8.95
    }
]

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

Squashed commit of the following:

commit 6e1bc75
Author: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Date:   Fri Aug 12 03:06:30 2022 +0530

    remove debug print in logical stack

commit 8e75645
Author: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Date:   Fri Aug 12 03:01:34 2022 +0530

    remove duplicate renamed header

commit 3b2acb2
Merge: 2b59b04 a67b718
Author: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Date:   Fri Aug 12 02:59:01 2022 +0530

    Merge branch 'branch-22.10' of https://github.com/rapidsai/cudf into json-tree

commit 2b59b04
Merge: 12cf0be 2d214ea
Author: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Date:   Tue Jul 26 13:40:41 2022 +0530

    Merge branch 'branch-22.08' of https://github.com/rapidsai/cudf into json-tree

commit 12cf0be
Author: Karthikeyan Natarajan <karthikeyann@users.noreply.github.com>
Date:   Tue Jul 26 13:29:55 2022 +0530

    fix clang-format style fix

commit 3e756bb
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Jul 18 08:17:03 2022 -0700

    replaces tree return type from tuple to struct

commit bef4fb1
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon May 16 22:10:08 2022 -0700

    moved debug print to detail ns

commit ff90528
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Fri May 13 09:52:20 2022 -0700

    squash & rebase on latest tokenizer version

commit 987699f
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jun 2 05:19:53 2022 -0700

    fixes sg-count & uses rmm stream in fst tests

commit 00a95eb
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 12:17:08 2022 -0700

    put lookup tables into their own cudf file

commit a8ac5fa
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 09:59:37 2022 -0700

    refactored lookup tables

commit f996ce9
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 11 12:17:55 2022 -0700

    squashed with bracket/brace test

commit 671ce41
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Tue Apr 12 22:55:00 2022 -0700

    minor style changes addressing review comments

commit f4ec994
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 4 07:35:33 2022 -0700

    device_span

commit d18238f
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 4 02:28:30 2022 -0700

    renaming key-value store op to stack_op

commit 62ddf66
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Mar 31 05:28:17 2022 -0700

    switched to using rmm also inside algorithm

commit 2f7b254
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Mar 31 04:11:44 2022 -0700

    Added utility to debug print & instrumented code to use it

commit 67f609d
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 04:15:11 2022 -0700

    renames enums & moving from device_span to ptr params

commit 01aef44
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed Jul 13 07:22:52 2022 -0700

    wraps if with stream params into detail ns

commit 4aaf595
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed Jul 13 05:45:49 2022 -0700

    fixes for breaking downstream interface changes

commit 237456d
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jun 2 08:19:37 2022 -0700

    fixes breaking changes from dependent-FST-PR

commit 7fc8619
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Tue May 3 07:05:44 2022 -0700

    rebase on latest FST

commit 6d3eff2
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jun 2 05:19:53 2022 -0700

    fixes sg-count & uses rmm stream in fst tests

commit 6548836
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 12:17:08 2022 -0700

    put lookup tables into their own cudf file

commit 9dfd4ad
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 09:59:37 2022 -0700

    refactored lookup tables

commit fe06f0b
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 11 12:17:55 2022 -0700

    squashed with bracket/brace test

commit 36c8296
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Tue Apr 12 22:55:00 2022 -0700

    minor style changes addressing review comments

commit 24dab9e
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 4 07:35:33 2022 -0700

    device_span

commit 49fa996
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 4 02:28:30 2022 -0700

    renaming key-value store op to stack_op

commit b260610
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Mar 31 05:28:17 2022 -0700

    switched to using rmm also inside algorithm

commit 9b20d16
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Mar 31 04:11:44 2022 -0700

    Added utility to debug print & instrumented code to use it

commit 78dd893
Merge: 8a184e9 9627091
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Fri Jul 15 23:06:55 2022 -0700

    Merge remote-tracking branch 'upstream/branch-22.08' into feature/finite-state-transducer-trimmed

commit 8a184e9
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Fri Jul 15 22:51:18 2022 -0700

    rephrases documentation on in-reg array

commit bea2a02
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Fri Jul 15 01:54:20 2022 -0700

    replaces vanilla loop with iota

commit cba1619
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 09:31:12 2022 -0700

    fixes style in dispatch dfa

commit 3f47952
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 09:22:03 2022 -0700

    replaces gtest asserts with expects

commit d351e5c
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 09:17:59 2022 -0700

    addresses style review comments & fixes a todo

commit 3038058
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 09:17:09 2022 -0700

    adds excplitis error checking

commit f52e614
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 14 09:16:18 2022 -0700

    replaces enum with typed constexpr

commit eb24962
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Tue Jul 12 04:52:36 2022 -0700

    fixes logical stack test includes

commit a798852
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Jul 11 11:00:22 2022 -0700

    adds check for state transition narrowing conversion

commit e6f8def
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Jul 11 09:06:01 2022 -0700

    some west-const remainders & unifies StateIndexT

commit 5f1c4b5
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Jul 11 06:26:47 2022 -0700

    removes state vector-wrapper in favor of vanilla array

commit 485a1c6
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Fri Jul 8 22:49:57 2022 -0700

    adopts c++17 namespaces declarations

commit f656f49
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jul 7 02:41:16 2022 -0700

    adopts device-side test data gen

commit 694a365
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed Jun 15 04:28:51 2022 -0700

    adopts suggested fst test changes

commit 9fe8e4b
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Tue Jun 14 03:12:35 2022 -0700

    minor doxygen fix

commit eccf970
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Thu Jun 2 05:19:53 2022 -0700

    fixes sg-count & uses rmm stream in fst tests

commit 6fdd24a
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon May 9 12:17:34 2022 -0700

    refactor lut sanity check

commit 17dcbfd
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon May 9 10:33:00 2022 -0700

    making const vars const

commit ea79a81
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon May 9 10:32:17 2022 -0700

    Adding hostdevice macros to in-reg array

commit caf6195
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon May 9 10:24:51 2022 -0700

    unified usage of pragma unrolls

commit e24a133
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed May 4 07:29:00 2022 -0700

    removing unused var post-cleanup

commit 39cff80
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed Apr 27 04:42:31 2022 -0700

    Change interface for FST to not need temp storage

commit 239f138
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 12:17:08 2022 -0700

    put lookup tables into their own cudf file

commit 39a6b65
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 25 09:59:37 2022 -0700

    refactored lookup tables

commit 355d1e4
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Wed Apr 20 05:11:32 2022 -0700

    clean up & addressing review comments

commit 0557d41
Author: Elias Stehle <3958403+elstehle@users.noreply.github.com>
Date:   Mon Apr 11 12:17:55 2022 -0700

    squashed with bracket/brace test
@karthikeyann karthikeyann added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Aug 11, 2022
@karthikeyann karthikeyann added this to the Nested JSON reader milestone Aug 11, 2022
@karthikeyann karthikeyann self-assigned this Aug 11, 2022

codecov bot commented Aug 12, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@68746ae). Click here to learn what that means.
Patch has no changes to coverable lines.

❗ Current head 55f3d68 differs from pull request most recent head 2f34d3a. Consider uploading reports for the commit 2f34d3a to get more accurate results

Additional details and impacted files
@@               Coverage Diff               @@
##             branch-22.10   #11518   +/-   ##
===============================================
  Coverage                ?   86.39%           
===============================================
  Files                   ?      145           
  Lines                   ?    23014           
  Branches                ?        0           
===============================================
  Hits                    ?    19883           
  Misses                  ?     3131           
  Partials                ?        0           


@github-actions github-actions bot added the CMake CMake build issue label Aug 24, 2022
@karthikeyann karthikeyann added 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond and removed 2 - In Progress Currently a work in progress labels Aug 26, 2022
@karthikeyann karthikeyann marked this pull request as ready for review August 26, 2022 08:55
@karthikeyann karthikeyann requested review from a team as code owners August 26, 2022 08:55
@karthikeyann karthikeyann mentioned this pull request Sep 19, 2022

@elstehle elstehle left a comment

Flushing a few more comments. Still digging into a few more algorithmic details, following up with another review pass shortly

@elstehle elstehle dismissed their stale review September 19, 2022 13:59

Submitted review on wrong PR

@karthikeyann karthikeyann added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond labels Sep 19, 2022
@karthikeyann

@gpucibot merge

@rapids-bot rapids-bot bot merged commit bf2c751 into rapidsai:branch-22.10 Sep 19, 2022
rapids-bot bot pushed a commit that referenced this pull request Sep 24, 2022
Adds JSON tree traversal algorithm in host and device.

It generates column indices for the _records_ orient JSON format: a list of structs at the root, where each struct is a row.
- [x] column indices generation 
- [x] row offset

Depends on PR #11518

### Tree Traversal

  This algorithm assigns a unique column id to each node in the tree.
  The row offset is the row index of the node in that column id.
  Algorithm:
  1. Convert node_category+fieldname to node_type.
	      a. Create a hashmap to hash field name and assign unique node id as values.
	      b. Convert the node categories to node types.
	         Node type is defined as node category enum value if it is not a field node,
	         otherwise it is the unique node id assigned by the hashmap (value shifted by #NUM_CATEGORY).
  2. Preprocessing: Translate parent node ids after sorting by level.
	      a. sort by level
	      b. get gather map of sorted indices
	      c. translate parent_node_ids to new sorted indices
  3. Find level boundaries.
     copy_if index of first unique values of sorted levels.
  4. Per-Level Processing: Propagate parent node ids for each level.
	      For each level,
	        a. gather col_id from previous level results. input=col_id, gather_map is parent_indices.
	        b. stable sort by {parent_col_id, node_type}
	        c. scan sum of unique {parent_col_id, node_type}
	        d. scatter the col_id back to stable node_level order (using scatter_indices)
    Restore original node_id order
  5. Generate row_offset.
	      a. stable_sort by parent_col_id.
	      b. scan_by_key {parent_col_id} (required only on nodes whose parent is a list)
	      c. propagate to non-list leaves from parent list node by recursion
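The column id assignment in the steps above can be illustrated with a sequential host-side sketch. This is not the parallel sort/scan formulation the commit describes; it only shows the underlying idea that a column is identified by the pair {parent_col_id, node_type}. The struct `tree_node`, the function `assign_column_ids`, the `-1` root sentinel, and the assumption that parents appear before their children are all illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

enum node_category : int { NC_STRUCT, NC_LIST, NC_FN, NC_STR, NC_VAL, NC_ERR, NUM_CATEGORY };

// Hypothetical host-side node: category, parent id (-1 for root), and field name (for NC_FN only).
struct tree_node {
  node_category category;
  int parent;
  std::string field_name;
};

// Assigns one column id per distinct {parent_col_id, node_type} pair.
// Assumes nodes are ordered so that every parent precedes its children.
std::vector<int> assign_column_ids(std::vector<tree_node> const& nodes)
{
  std::unordered_map<std::string, int> field_ids;  // field name -> unique id
  std::map<std::pair<int, int>, int> col_of;       // {parent_col_id, node_type} -> col_id
  std::vector<int> col_id(nodes.size());

  for (std::size_t i = 0; i < nodes.size(); ++i) {
    tree_node const& n = nodes[i];

    // Step 1: node type is the category, except for field nodes, which use
    // their unique field-name id shifted by NUM_CATEGORY.
    int node_type = static_cast<int>(n.category);
    if (n.category == NC_FN) {
      auto fit = field_ids.find(n.field_name);
      if (fit == field_ids.end()) {
        fit = field_ids.emplace(n.field_name, static_cast<int>(field_ids.size())).first;
      }
      node_type = static_cast<int>(NUM_CATEGORY) + fit->second;
    }

    // Step 4 (sequential analogue): look up or create the column for this key.
    int const parent_col = n.parent < 0 ? -1 : col_id[n.parent];
    std::pair<int, int> const key(parent_col, node_type);
    auto cit = col_of.find(key);
    if (cit == col_of.end()) {
      cit = col_of.emplace(key, static_cast<int>(col_of.size())).first;
    }
    col_id[i] = cit->second;
  }
  return col_id;
}
```

For the input `[{"a": 1, "b": "x"}, {"a": 2}]`, both structs map to the same column, as do both `a` field nodes and their values, so rows from different records line up in shared columns.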

Authors:
  - Karthikeyan (https://github.com/karthikeyann)

Approvers:
  - Elias Stehle (https://github.com/elstehle)
  - Tobias Ribizel (https://github.com/upsj)
  - Yunsong Wang (https://github.com/PointKernel)
  - David Wendt (https://github.com/davidwendt)

URL: #11610
rapids-bot bot pushed a commit that referenced this pull request Sep 27, 2022
This PR creates JSON columns from the traversed JSON tree. It has the following parts:
1. `reduce_to_column_tree` - Reduces the node tree to a column tree by aggregating each property of each column and the number of rows in each column.
2. `make_json_column2` - creates the GPU json column tree structure from tree and column info
3. `json_column_to_cudf_column2` -  converts this GPU json column to cudf column.
4. `parse_nested_json2` - combines all json tokenizer, json tree generation, traversal, json column creation, cudf column conversion together. All steps run on device.

Depends on PR #11518 #11610 
For code review, use PR karthikeyann#5, which contains only these tree changes.

### Overview

- PR #11264 Tokenizes the JSON string to Tokens
- PR #11518 Converts Tokens to Nodes (tree representation)
- PR #11610 Traverses this node tree --> assigns column id and row index to each node.
- This PR #11714 Converts this traversed tree into JSON Column, which in turn is translated to `cudf::column`

JSON has 5 categories of nodes: STRUCT, LIST, FIELD, VALUE, and STRING.
STRUCT and LIST are nested types.
FIELD nodes are struct columns' keys.
A VALUE node is similar to STRING but without double quotes. The actual datatype conversion happens in `json_column_to_cudf_column2`.

Tree Representation `tree_meta_t` has 4 data members.
1. node categories
2. node parents' id
3. node level
4. node's string range {begin, end} (as 2 vectors)
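The relationship between the parent ids and the levels can be sketched on the host as follows. This is an illustrative mirror, not the real `tree_meta_t`: the struct name `host_tree_meta`, its member names, the `-1` root sentinel, and the premise that a node's level equals its parent's level plus one are all assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side counterpart of the tree representation.
struct host_tree_meta {
  std::vector<int> node_categories;
  std::vector<int> parent_node_ids;           // -1 marks the root (assumed sentinel)
  std::vector<int> node_levels;
  std::vector<std::size_t> node_range_begin;  // string range begin, one entry per node
  std::vector<std::size_t> node_range_end;    // string range end, one entry per node
};

// Checks that every node sits exactly one level below its parent and that
// the root sits at level 0. Assumes parents appear before their children.
bool levels_consistent(host_tree_meta const& tree)
{
  for (std::size_t i = 0; i < tree.parent_node_ids.size(); ++i) {
    int const parent   = tree.parent_node_ids[i];
    int const expected = parent < 0 ? 0 : tree.node_levels[parent] + 1;
    if (tree.node_levels[i] != expected) { return false; }
  }
  return true;
}
```

A check of this kind is mainly useful in tests, for example to compare a device-generated tree against a host reference.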

The currently supported JSON formats are records orient and JSON lines.

### This PR - Detailed explanation
This PR has 3 steps, followed by a function that combines the full pipeline.
1. `reduce_to_column_tree`
    - Required to compute the total number of columns, column type, nested column structure, and number of rows in each column.
    - Generates `tree_meta_t` data members for columns.
        - Sort the node tree by col_id (stable sort).
        - reduce_by_key custom_op on node_categories collapses them to the column category.
        - unique_by_key_copy by col_id copies the first parent_node_id and string_ranges. This parent_node_id will be transformed to parent_column_id.
        - reduce_by_key max on row_offsets gives the maximum row offset in each column. Propagate list column children's max row offset to their children, because structs may sometimes miss entries, so the parent list gives the correct count.
2. `make_json_column2`
    - Converts nodes to GPU JSON columns in a tree structure.
        - Get the column tree and transfer column names to host.
        - Create `d_json_column` for non-field columns.
        - If two columns occur on the same path, and one of them is nested and the other is a string column, discard the string column.
        - For STRUCT, LIST, VALUE, and STRING nodes, set the validity bits, and copy the string {begin, end} range to string_offsets and string length.
        - Compute list offsets.
        - Perform a scan max operation on offsets (to fill 0's with the previous offset value).
    - Now the `d_json_column` is nested, and contains offsets, validity bits, and unparsed, unconverted string information.
3. `json_column_to_cudf_column2` - converts this GPU JSON column to a cudf column.
    - Recursively goes over each `d_json_column` and converts it to `cudf::column` by inferring the type, parsing the string to that type, and setting validity bits further.
4. `parse_nested_json2` - combines the JSON tokenizer, JSON tree generation, traversal, JSON column creation, and cudf column conversion. All steps run on device.
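Two of the primitives named above, the max reduction by key and the max scan over offsets, can be sketched sequentially on the host. The code below is only an illustration of what those operations compute, not the device implementation: the function names are hypothetical, `std::map` stands in for a sort followed by reduce_by_key, and `std::partial_sum` with a max operator stands in for an inclusive scan.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <numeric>
#include <vector>

// Sequential analogue of reduce_by_key(max): the largest row offset seen in each column.
std::map<int, int> max_row_offset_per_column(std::vector<int> const& col_ids,
                                             std::vector<int> const& row_offsets)
{
  std::map<int, int> result;
  for (std::size_t i = 0; i < col_ids.size(); ++i) {
    auto it = result.find(col_ids[i]);
    if (it == result.end()) {
      result.emplace(col_ids[i], row_offsets[i]);
    } else {
      it->second = std::max(it->second, row_offsets[i]);
    }
  }
  return result;
}

// Sequential analogue of an inclusive max scan: entries left at 0 take over
// the most recent larger offset to their left.
std::vector<int> fill_offsets(std::vector<int> const& offsets)
{
  std::vector<int> out(offsets.size());
  std::partial_sum(offsets.begin(), offsets.end(), out.begin(),
                   [](int a, int b) { return std::max(a, b); });
  return out;
}
```

For example, `fill_offsets` turns `{0, 2, 0, 0, 5, 0}` into `{0, 2, 2, 2, 5, 5}`, which is the "fill 0's with the previous offset value" behavior the list-offset step relies on.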

Authors:
  - Karthikeyan (https://github.com/karthikeyann)
  - Elias Stehle (https://github.com/elstehle)
  - Yunsong Wang (https://github.com/PointKernel)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Tobias Ribizel (https://github.com/upsj)
  - https://github.com/nvdbaranec
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #11714