JSON tree algorithm code reorg #16836

karthikeyann · 2024-09-19T05:28:13Z

Description

This PR moves the JSON host tree algorithms to a separate file.
This code movement will make #16545 easier to review.

The code is moved to a new file and reorganized for code reuse.
The very long function make_device_json_column is split into three parts:

code block with reduce_to_column_tree call
code moved to function build_tree
code moved to function scatter_offsets
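
The shape of such a split can be sketched as follows. This is only an illustration of the refactoring pattern: every type, signature, and function body here (toy_column_tree, toy_columns, reduce_step, build_tree_step, scatter_offsets_step, make_columns) is a hypothetical placeholder and does not mirror the actual libcudf code. Only the idea of three consecutive steps comes from the description above.

```cpp
#include <cstddef>
#include <vector>

// All types and signatures below are hypothetical stand-ins for illustration.
struct toy_column_tree {
  std::vector<int> parent_ids;  // parent index per column, -1 for the root
};

struct toy_columns {
  std::vector<std::size_t> child_counts;  // number of children per column
  std::vector<int> offsets;               // filled in by the last step
};

// Step 1: stand-in for the code block that calls reduce_to_column_tree.
inline toy_column_tree reduce_step(std::vector<int> const& parent_ids)
{
  return toy_column_tree{parent_ids};
}

// Step 2: stand-in for build_tree. Counts the children of each column.
inline toy_columns build_tree_step(toy_column_tree const& tree)
{
  toy_columns cols;
  cols.child_counts.assign(tree.parent_ids.size(), 0);
  for (int parent : tree.parent_ids) {
    if (parent >= 0) { ++cols.child_counts[static_cast<std::size_t>(parent)]; }
  }
  return cols;
}

// Step 3: stand-in for scatter_offsets. Writes prefix-sum offsets.
inline void scatter_offsets_step(toy_columns& cols)
{
  cols.offsets.assign(cols.child_counts.size() + 1, 0);
  for (std::size_t i = 0; i < cols.child_counts.size(); ++i) {
    cols.offsets[i + 1] = cols.offsets[i] + static_cast<int>(cols.child_counts[i]);
  }
}

// The formerly monolithic function becomes a short driver over the steps.
inline toy_columns make_columns(std::vector<int> const& parent_ids)
{
  auto tree = reduce_step(parent_ids);
  auto cols = build_tree_step(tree);
  scatter_offsets_step(cols);
  return cols;
}
```

With this layout each step can be reviewed, tested, and reused on its own, which is the stated motivation for the reorganization.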

No new functionality is added in this PR.

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.


KyleFromNVIDIA

Approved trivial CMake changes

shrshi · 2024-09-19T16:44:24Z

cpp/src/io/json/host_tree_algorithms.cu

+  bitmask_type* validity;
+};
+
+std::pair<cudf::detail::host_vector<uint8_t>,


Can we change ignore_vals returned here to a bool vector? We can also move this change to #16545
EDIT: Similar suggestion for is_pruned and is_str_column_all_nulls

IIRC, I've seen concerns raised in the past regarding the use of bool vectors, and recommendations to use int8_t/uint8_t instead. I didn't think host_vector<bool> would be plagued with the same problems as std::vector<bool>, but I could be wrong.

Watching this conversation for @karthikeyann's opinion.

@mythrocks That's right. 💯
std::vector<bool> stores each element as a bit (to save memory), so iterating over it and writing through a data pointer cause issues.
cudf::detail::host_vector is based on thrust::host_vector, which is a simple vector without any specialization for bool. So, that should be suitable.

Yes, thrust::host_vector<bool> stores one byte per element (not packed bits).
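
To make the difference discussed above concrete, here is a minimal standard-library sketch. It uses std::vector<uint8_t> as a stand-in for a byte-per-element vector; it does not use thrust or cudf and says nothing about their internals. count_set and make_even_flags are hypothetical helpers invented for the illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// std::vector<bool> is a packed specialization: operator[] returns a proxy
// object rather than bool&, so there is no contiguous bool array to point into.
static_assert(!std::is_same_v<std::vector<bool>::reference, bool&>,
              "vector<bool> hands out proxy references");

// std::vector<uint8_t> is an ordinary vector: one byte per element,
// contiguous storage, and real references to its elements.
static_assert(std::is_same_v<std::vector<std::uint8_t>::reference, std::uint8_t&>,
              "vector<uint8_t> hands out real references");

// Hypothetical helper: counts set flags through a raw pointer, the way a
// bulk copy or a C-style API would consume the buffer.
inline std::size_t count_set(std::uint8_t const* flags, std::size_t n)
{
  std::size_t total = 0;
  for (std::size_t i = 0; i < n; ++i) { total += (flags[i] != 0); }
  return total;
}

// Hypothetical helper: builds a 0/1 flag vector that marks even indices.
inline std::vector<std::uint8_t> make_even_flags(std::size_t n)
{
  std::vector<std::uint8_t> flags(n, 0);
  for (std::size_t i = 0; i < n; i += 2) { flags[i] = 1; }
  return flags;
}
```

Passing flags.data() to count_set works for the uint8_t vector because its storage is a plain byte array; std::vector<bool> has no equivalent pointer to a bool array.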

Another reason I used uint8_t is that 0 and 1 are easier to read. (😄)

Updating ignore_vals to host_vector is fine since it needs to be copied to device, so pinned memory will help. I will update it in the next PR, [WIP] JSON host tree algorithms #16545.

is_pruned and is_str_column_all_nulls are accessed only on the host and are not copied to device. So there is no benefit in using pinned memory for them, and pinned allocation is also slower. (I am not sure whether CPU read or write time is still slower as well.)

The only real drawbacks to pinning memory are the reduction in available physical ram to the host demand-paging system, and the time it takes to pin memory (which is significantly longer than an ordinary malloc of the same size).
https://forums.developer.nvidia.com/t/advantages-disadvantages-of-using-pinned-memory/34422/2

https://forums.developer.nvidia.com/t/pinned-memory-slower-than-pageable-memory/18821
Note: These posts are old.

mythrocks

The re-org looks good to me, FWIW. But I'm curious about @shrshi's suggestion regarding bool vectors.

shrshi

Looks good to me!

karthikeyann added 3 commits September 19, 2024 03:09

pulled relevant changes from rapidsai#16759

7437653

code reorg: split to 3 functions

4eff9fc

split host functions to separate file

400df4b

karthikeyann added the 3 - Ready for Review, libcudf, cuIO, improvement, and non-breaking labels Sep 19, 2024

karthikeyann self-assigned this Sep 19, 2024

karthikeyann requested review from a team as code owners September 19, 2024 05:28

karthikeyann requested review from hyperbolic2346 and lamarrr September 19, 2024 05:28

github-actions bot added the CMake label Sep 19, 2024

Merge branch 'branch-24.10' of github.com:rapidsai/cudf into enh-json…

10bddb8

…_code_reorg1

karthikeyann mentioned this pull request Sep 19, 2024

[WIP] JSON host tree algorithms #16545

Draft

3 tasks

add missing include during merge

a370a45

KyleFromNVIDIA approved these changes Sep 19, 2024


shrshi self-requested a review September 19, 2024 16:41

shrshi reviewed Sep 19, 2024


mythrocks reviewed Sep 19, 2024


shrshi approved these changes Sep 20, 2024

