Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JSON tree algorithms refactor I: CSR data structure for column tree #15979

Open
wants to merge 44 commits into
base: branch-24.10
Choose a base branch
from

Conversation

shrshi
Copy link
Contributor

@shrshi shrshi commented Jun 11, 2024

Description

Part of #15903.

  1. Introduces the Compressed Sparse Row (CSR) format to store the adjacency information of the column tree.
  2. Analogous to reduce_to_column_tree, reduce_to_column_tree_csr reduces node tree representation to column tree stored in CSR format.

TODO:

  • Correctness test

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Jun 11, 2024
@shrshi shrshi added cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 11, 2024
@github-actions github-actions bot added the CMake CMake build issue label Jun 25, 2024
@shrshi shrshi marked this pull request as ready for review June 25, 2024 01:00
@shrshi shrshi requested a review from a team as a code owner June 25, 2024 01:00
@ttnghia
Copy link
Contributor

ttnghia commented Jun 25, 2024

Question: Is this ready for review? Or still WIP?

@shrshi
Copy link
Contributor Author

shrshi commented Jun 25, 2024

Question: Is this ready for review? Or still WIP?

Sorry, forgot to update the title. It's ready for review.

@shrshi shrshi changed the title [WIP] JSON tree algorithms refactor I: CSR data structure for column tree JSON tree algorithms refactor I: CSR data structure for column tree Jun 25, 2024
@shrshi shrshi requested a review from a team as a code owner June 28, 2024 17:21
Copy link

copy-pr-bot bot commented Jun 28, 2024

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shrshi
Copy link
Contributor Author

shrshi commented Jun 28, 2024

/ok to test

@shrshi shrshi requested a review from vuule September 6, 2024 23:23
@shrshi
Copy link
Contributor Author

shrshi commented Sep 6, 2024

/ok to test

Copy link
Contributor

@revans2 revans2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine. I didn't see any test or performance regressions or improvements with this patch.

@vuule
Copy link
Contributor

vuule commented Sep 10, 2024

This looks fine. I didn't see any test or performance regressions or improvements with this patch.

AFAIK the new path is not (yet) use by default.

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last batch, nothing major :)

};

/*
* @brief Unvalidated column tree stored in Compressed Sparse Row (CSR) format. The device json
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* @brief Unvalidated column tree stored in Compressed Sparse Row (CSR) format. The device json
* @brief Invalidated column tree stored in Compressed Sparse Row (CSR) format. The device json

Comment on lines +215 to +216
// position of nnzs
csr adjacency;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment got separated from the data members it documents.
Also, I would not mind the full name of whatever nnz is :) (I assume something like non-zero)

Suggested change
// position of nnzs
csr adjacency;
csr adjacency;
// positions of nnzs

Comment on lines +217 to +218
rmm::device_uvector<NodeIndexT> rowidx;
rmm::device_uvector<NodeIndexT> colidx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like these were left by accident, as they're unused.

Suggested change
rmm::device_uvector<NodeIndexT> rowidx;
rmm::device_uvector<NodeIndexT> colidx;

typename OutputIterator1,
typename OutputIterator2>
void max_row_offsets_col_categories(InputIterator1 keys_first,
InputIterator1 keys_last,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this actually last? I expect it to be one past last, so we should name these begin/end.

Comment on lines +97 to +98
} else
ctg = NC_ERR;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
} else
ctg = NC_ERR;
} else {
ctg = NC_ERR;
}

// *+*=*, v+v=v
if (type_a == type_b) {
ctg = type_a;
} else if (is_a_leaf) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the function is seemingly not commutative w.r.t. is_a_leaf/is_b_leaf. We check is_a_leaf first, so I'm not sure if we're getting the right result when both is_a_leaf and is_b_leaf are true.

auto const num_columns = thrust::unique_count(
rmm::exec_policy_nosync(stream), level_ordered_col_ids.begin(), level_ordered_col_ids.end());
rmm::device_uvector<NodeIndexT> level_ordered_unique_node_ids(num_columns, stream);
rmm::device_uvector<NodeIndexT> mapped_col_ids(num_columns, stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there are several vectors in this function that live way past their last use (e.g. mapped_col_ids_copy, level_ordered_unique_node_ids, level_ordered_unique_node_ids). We can reset them, but it's probably cleaner to break the function into smaller ones and reduce the lifetime "organically".

thrust::make_discard_iterator(),
thrust::make_zip_iterator(max_row_offsets.begin(), column_categories.begin()),
stream);
// 4. construct parent_col_ids using permutation iterator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// 4. construct parent_col_ids using permutation iterator
// 4. construct parent_col_ids using permutation iterator

parent_col_ids = parent_col_ids.begin()] __device__(auto i) {
auto parent_col_id = parent_col_ids[i];
if (parent_col_id == 0)
map[i - 1]--;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
map[i - 1]--;
--map[i - 1];

/*
* @brief Sparse graph adjacency matrix stored in Compressed Sparse Row (CSR) format.
*/
struct csr {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO this abbreviation is very confusing and cryptic. It is better if we call by full name like:

Suggested change
struct csr {
struct compress_sparse_row {

Comment on lines +193 to +194
rmm::device_uvector<NodeIndexT> rowidx;
rmm::device_uvector<NodeIndexT> colidx;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
rmm::device_uvector<NodeIndexT> rowidx;
rmm::device_uvector<NodeIndexT> colidx;
rmm::device_uvector<NodeIndexT> row_idx;
rmm::device_uvector<NodeIndexT> col_idx;

* annotated with information required to construct cuDF columns.
*/
struct column_tree {
// position of nnzs
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is nnzs?

Comment on lines 248 to 249
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use const consistently.

Suggested change
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
bool const is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend not using const for parameter types unless they are pass by reference. The const provides no help to the caller since the pass-by-copy ensures the parameter value is not changed from the caller's perspective. It also limits the implementation which may now require a separate variable (register) to manage a value that otherwise would not be const.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for sharing the insight @davidwendt . +1 on the above comment.
I do have one doubt: if the pass by value argument is const, will it help compiler to do optimizations better? (or make compiler work easier(faster) because it already knows it's const).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The const hint to the compiler is a bit outdated. The compiler usually knows when a variable value is not changed and automatically provides optimizations regardless of the const decorator. So I'd rather make the interface clear without hampering the implementation details.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a middle ground I've heard of: avoid const by-value in the definition (visible to user), but use const in the definition to prevent the parameter value changing somewhere in the code.
I personally don't use const here, but can understand the (mild) usefulness in the definition.

It also limits the implementation which may now require a separate variable (register) to manage a value that otherwise would not be const.

@davidwendt could you explain this, I'm not following?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just that the caller does not need to know your implementation details regarding the parameter. If you want to use the parameter as non-const in your implementation that should be up to you.
Using const in the definition and not in the declaration seems like a hack (to me) though I know it is supported by the C++ spec.

Comment on lines +391 to +392
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, use const consistently.

Suggested change
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
bool const is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,

Comment on lines +105 to +106
template <typename T>
void print(device_span<T const> d_vec, std::string name, rmm::cuda_stream_view stream)
Copy link
Contributor

@ttnghia ttnghia Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Function for debugging should not be compiled by default. Typically, they are wrapped in a macro such as:

#ifdef JSON_DEBUG
template <typename T>
void print(device_span<T const> d_vec, std::string name, rmm::cuda_stream_view stream)
{...}
#endif // JSON_DEBUG

Comment on lines +120 to +122
device_span<TreeDepthT> node_levels;
device_span<NodeIndexT const> col_ids;
device_span<NodeIndexT const> parent_node_ids;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, use const consistenly.

Suggested change
device_span<TreeDepthT> node_levels;
device_span<NodeIndexT const> col_ids;
device_span<NodeIndexT const> parent_node_ids;
device_span<TreeDepthT const> node_levels;
device_span<NodeIndexT const> col_ids;
device_span<NodeIndexT const> parent_node_ids;


struct parent_nodeids_to_colids {
device_span<NodeIndexT const> col_ids;
device_span<NodeIndexT> rev_mapped_col_ids;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
device_span<NodeIndexT> rev_mapped_col_ids;
device_span<NodeIndexT const> rev_mapped_col_ids;

Comment on lines 162 to 163
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
bool is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,
bool const is_array_of_arrays,
NodeIndexT const row_array_parent_col_id,

Comment on lines 181 to 185
/*
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream);
print<NodeIndexT>(col_ids, "h_col_ids", stream);
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream);
*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
/*
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream);
print<NodeIndexT>(col_ids, "h_col_ids", stream);
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream);
*/
#ifdef JSON_DEBUG
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream);
print<NodeIndexT>(col_ids, "h_col_ids", stream);
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream);
#ifdef // JSON_DEBUG

Comment on lines +212 to +214
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream);
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream);
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream);
Copy link
Contributor

@ttnghia ttnghia Sep 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream);
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream);
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream);
#ifdef JSON_DEBUG
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream);
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream);
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream);
#endif // JSON_DEBUG

// Note that the first element of csr_parent_col_ids is -1 (parent_node_sentinel)
// children adjacency

print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream);
#ifdef JSON_DEBUG
print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream);
#endif // JSON_DEBUG

non_leaf_nodes.begin(),
rowidx.begin() + 1);

print<NodeIndexT>(rowidx, "h_rowidx", stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap in JSON_DEBUG.

}),
thrust::plus<NodeIndexT>{});

print<NodeIndexT>(rowidx, "h_rowidx", stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap in JSON_DEBUG.

map.begin(),
colidx.begin());

print<size_type>(max_row_offsets, "h_max_row_offsets", stream);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wrap in JSON_DEBUG.

Comment on lines 330 to 331
{
auto max_row_offsets_it =
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of wrapping code in multiple local blocks so the temp arrays can be freed as soon as possible. If possible, please consider applying this throughout this PR.

@shrshi
Copy link
Contributor Author

shrshi commented Sep 17, 2024

/ok to test

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
Status: In Progress
Status: Burndown
Development

Successfully merging this pull request may close these issues.

7 participants