-
Notifications
You must be signed in to change notification settings - Fork 885
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
JSON tree algorithms refactor I: CSR data structure for column tree #15979
base: branch-24.10
Are you sure you want to change the base?
Conversation
Question: Is this ready for review? Or still WIP? |
Sorry, forgot to update the title. It's ready for review. |
/ok to test |
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks fine. I didn't see any test or performance regressions or improvements with this patch.
AFAIK the new path is not (yet) use by default. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Last batch, nothing major :)
}; | ||
|
||
/* | ||
* @brief Unvalidated column tree stored in Compressed Sparse Row (CSR) format. The device json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* @brief Unvalidated column tree stored in Compressed Sparse Row (CSR) format. The device json | |
* @brief Invalidated column tree stored in Compressed Sparse Row (CSR) format. The device json |
// position of nnzs | ||
csr adjacency; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the comment got separated from the data members it documents.
Also, I would not mind the full name of whatever nnz is :) (I assume something like non-zero)
// position of nnzs | |
csr adjacency; | |
csr adjacency; | |
// positions of nnzs |
rmm::device_uvector<NodeIndexT> rowidx; | ||
rmm::device_uvector<NodeIndexT> colidx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like these were left by accident, as they're unused.
rmm::device_uvector<NodeIndexT> rowidx; | |
rmm::device_uvector<NodeIndexT> colidx; |
typename OutputIterator1, | ||
typename OutputIterator2> | ||
void max_row_offsets_col_categories(InputIterator1 keys_first, | ||
InputIterator1 keys_last, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this actually last? I expect it to be one past last, so we should name these begin/end.
} else | ||
ctg = NC_ERR; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
} else | |
ctg = NC_ERR; | |
} else { | |
ctg = NC_ERR; | |
} |
// *+*=*, v+v=v | ||
if (type_a == type_b) { | ||
ctg = type_a; | ||
} else if (is_a_leaf) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the function is seemingly not commutative w.r.t. is_a_leaf/is_b_leaf. We check is_a_leaf first, so I'm not sure if we're getting the right result when both is_a_leaf and is_b_leaf are true.
auto const num_columns = thrust::unique_count( | ||
rmm::exec_policy_nosync(stream), level_ordered_col_ids.begin(), level_ordered_col_ids.end()); | ||
rmm::device_uvector<NodeIndexT> level_ordered_unique_node_ids(num_columns, stream); | ||
rmm::device_uvector<NodeIndexT> mapped_col_ids(num_columns, stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there are several vectors in this function that live way past their last use (e.g. mapped_col_ids_copy, level_ordered_unique_node_ids, level_ordered_unique_node_ids). We can reset them, but it's probably cleaner to break the function into smaller ones and reduce the lifetime "organically".
thrust::make_discard_iterator(), | ||
thrust::make_zip_iterator(max_row_offsets.begin(), column_categories.begin()), | ||
stream); | ||
// 4. construct parent_col_ids using permutation iterator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// 4. construct parent_col_ids using permutation iterator | |
// 4. construct parent_col_ids using permutation iterator |
parent_col_ids = parent_col_ids.begin()] __device__(auto i) { | ||
auto parent_col_id = parent_col_ids[i]; | ||
if (parent_col_id == 0) | ||
map[i - 1]--; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
map[i - 1]--; | |
--map[i - 1]; |
/* | ||
* @brief Sparse graph adjacency matrix stored in Compressed Sparse Row (CSR) format. | ||
*/ | ||
struct csr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO this abbreviation is very confusing and cryptic. It is better if we call by full name like:
struct csr { | |
struct compress_sparse_row { |
rmm::device_uvector<NodeIndexT> rowidx; | ||
rmm::device_uvector<NodeIndexT> colidx; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rmm::device_uvector<NodeIndexT> rowidx; | |
rmm::device_uvector<NodeIndexT> colidx; | |
rmm::device_uvector<NodeIndexT> row_idx; | |
rmm::device_uvector<NodeIndexT> col_idx; |
* annotated with information required to construct cuDF columns. | ||
*/ | ||
struct column_tree { | ||
// position of nnzs |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is nnzs
?
cpp/src/io/json/nested_json.hpp
Outdated
bool is_array_of_arrays, | ||
NodeIndexT const row_array_parent_col_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use const
consistently.
bool is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, | |
bool const is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would recommend not using const
for parameter types unless they are pass by reference. The const
provides no help to the caller since the pass-by-copy ensures the parameter value is not changed from the caller's perspective. It also limits the implementation which may now require a separate variable (register) to manage a value that otherwise would not be const
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for sharing the insight @davidwendt . +1 on the above comment.
I do have one doubt: if the pass by value argument is const, will it help compiler to do optimizations better? (or make compiler work easier(faster) because it already knows it's const).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The const
hint to the compiler is a bit outdated. The compiler usually knows when a variable value is not changed and automatically provides optimizations regardless of the const
decorator. So I'd rather make the interface clear without hampering the implementation details.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a middle ground I've heard of: avoid const by-value in the definition (visible to user), but use const in the definition to prevent the parameter value changing somewhere in the code.
I personally don't use const here, but can understand the (mild) usefulness in the definition.
It also limits the implementation which may now require a separate variable (register) to manage a value that otherwise would not be const.
@davidwendt could you explain this, I'm not following?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just that the caller does not need to know your implementation details regarding the parameter. If you want to use the parameter as non-const in your implementation that should be up to you.
Using const
in the definition and not in the declaration seems like a hack (to me) though I know it is supported by the C++ spec.
bool is_array_of_arrays, | ||
NodeIndexT const row_array_parent_col_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly, use const
consistently.
bool is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, | |
bool const is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, |
template <typename T> | ||
void print(device_span<T const> d_vec, std::string name, rmm::cuda_stream_view stream) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Function for debugging should not be compiled by default. Typically, they are wrapped in a macro such as:
#ifdef JSON_DEBUG
template <typename T>
void print(device_span<T const> d_vec, std::string name, rmm::cuda_stream_view stream)
{...}
#endif // JSON_DEBUG
device_span<TreeDepthT> node_levels; | ||
device_span<NodeIndexT const> col_ids; | ||
device_span<NodeIndexT const> parent_node_ids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again, use const
consistenly.
device_span<TreeDepthT> node_levels; | |
device_span<NodeIndexT const> col_ids; | |
device_span<NodeIndexT const> parent_node_ids; | |
device_span<TreeDepthT const> node_levels; | |
device_span<NodeIndexT const> col_ids; | |
device_span<NodeIndexT const> parent_node_ids; |
|
||
struct parent_nodeids_to_colids { | ||
device_span<NodeIndexT const> col_ids; | ||
device_span<NodeIndexT> rev_mapped_col_ids; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
device_span<NodeIndexT> rev_mapped_col_ids; | |
device_span<NodeIndexT const> rev_mapped_col_ids; |
bool is_array_of_arrays, | ||
NodeIndexT const row_array_parent_col_id, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, | |
bool const is_array_of_arrays, | |
NodeIndexT const row_array_parent_col_id, |
/* | ||
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream); | ||
print<NodeIndexT>(col_ids, "h_col_ids", stream); | ||
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream); | ||
*/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/* | |
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream); | |
print<NodeIndexT>(col_ids, "h_col_ids", stream); | |
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream); | |
*/ | |
#ifdef JSON_DEBUG | |
print<NodeIndexT>(level_ordered_node_ids, "h_level_ordered_node_ids", stream); | |
print<NodeIndexT>(col_ids, "h_col_ids", stream); | |
print<NodeIndexT>(level_ordered_col_ids, "h_level_ordered_col_ids", stream); | |
#ifdef // JSON_DEBUG |
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream); | ||
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream); | ||
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream); | |
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream); | |
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream); | |
#ifdef JSON_DEBUG | |
print<NodeIndexT>(mapped_col_ids, "h_mapped_col_ids", stream); | |
print<NodeIndexT>(level_ordered_unique_node_ids, "h_level_ordered_unique_node_ids", stream); | |
print<NodeIndexT>(rev_mapped_col_ids, "h_rev_mapped_col_ids", stream); | |
#endif // JSON_DEBUG |
// Note that the first element of csr_parent_col_ids is -1 (parent_node_sentinel) | ||
// children adjacency | ||
|
||
print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream); | |
#ifdef JSON_DEBUG | |
print<NodeIndexT>(parent_col_ids, "h_parent_col_ids", stream); | |
#endif // JSON_DEBUG |
non_leaf_nodes.begin(), | ||
rowidx.begin() + 1); | ||
|
||
print<NodeIndexT>(rowidx, "h_rowidx", stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap in JSON_DEBUG
.
}), | ||
thrust::plus<NodeIndexT>{}); | ||
|
||
print<NodeIndexT>(rowidx, "h_rowidx", stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap in JSON_DEBUG
.
map.begin(), | ||
colidx.begin()); | ||
|
||
print<size_type>(max_row_offsets, "h_max_row_offsets", stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wrap in JSON_DEBUG
.
{ | ||
auto max_row_offsets_it = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of wrapping code in multiple local blocks so the temp arrays can be freed as soon as possible. If possible, please consider applying this throughout this PR.
/ok to test |
Description
Part of #15903.
reduce_to_column_tree
,reduce_to_column_tree_csr
reduces node tree representation to column tree stored in CSR format.TODO:
Checklist