-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Optimize K-Truss #4742
Optimize K-Truss #4742
Conversation
cpp/src/community/k_truss_impl.cuh
Outdated
auto has_edge = thrust::binary_search( | ||
thrust::seq, edgelist_first, weak_edgelist_first, edge_p_r); | ||
|
||
if (!has_edge) { // FIXME: Do binary search instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this FIXME still relevant? It looks like you're doing binary search
cpp/src/community/k_truss_impl.cuh
Outdated
|
||
|
||
template <typename vertex_t, typename edge_t> | ||
struct extract_edges { // FIXME: ******************************Remove this functor. For testing purposes only******************* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should these be left in here?
cpp/src/community/k_truss_impl.cuh
Outdated
std::optional<rmm::device_uvector<weight_t>> edgelist_wgts{std::nullopt}; | ||
|
||
|
||
//#if 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete the commented out lines in this section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right I still had a lot of debug prints which are now removed
…2_optimize-ktruss
…2_optimize-ktruss
@jnke2016 Is this PR ready for review? |
cpp/src/community/k_truss_impl.cuh
Outdated
|
||
{ | ||
// FIXME: Use global comm for debugging purposes | ||
// then replace it by minor comm once the accuracy is verified |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Haven't you verified this? Using global_comm vs minor_comm has a pretty significant scalability impact in large scale (even if you're sending to/receiving from the same set of ranks). All-to-all on a subcommunicator performs better than P2P on a subset of ranks using a global communicator.
cpp/src/community/k_truss_impl.cuh
Outdated
std::vector<size_t> h_tx_counts(d_tx_counts.size()); | ||
|
||
raft::update_host( | ||
h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
handle.sync_stream() is necessary before using h_tx_counts.
No guarantee that h_tx_counts will hold valid values before handle.sync_stream().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to address this.
cpp/src/community/k_truss_impl.cuh
Outdated
std::tie(dsts, std::ignore) = | ||
shuffle_values(handle.get_comms(), edgelist_dsts.begin(), h_tx_counts, handle.get_stream()); | ||
|
||
// rmm::device_uvector<bool> edge_exists(0, handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Delete.
cpp/src/community/k_truss_impl.cuh
Outdated
shuffle_values(handle.get_comms(), edge_exists.begin(), rx_counts, handle.get_stream()); | ||
|
||
// The 'edge_exists' array is ordered based on 'cp_edgelist_srcs' where the edges where group, | ||
// hoever it needs to match 'edgelist_srcs', hence re-order 'edge_exists' accordingly. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this true? I guess edge_exists
is ordered based on edgelist_srcs
not cp_edgelist_srcs
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes other way around. It is cp_edgelist_srcs
that is supposed to be groupby_and_count. The documentation is correct but made a mistake in the group_by_and_count call. I don't want to mess up original edges in the triangles when grouping that's why I make a copy.
cpp/src/community/k_truss_impl.cuh
Outdated
void order_edge_based_on_dodg(raft::handle_t const& handle, | ||
graph_view_t<vertex_t, edge_t, false, multi_gpu>& graph_view, | ||
raft::device_span<vertex_t> edgelist_srcs, | ||
raft::device_span<vertex_t> edgelist_dsts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What are you doing in this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Essentially this function is ordering the edges obtained from nbr_intersection (On the symmetric graph) based on the DODG edges
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Once we identify weak edges, the next step is to retrieve the triangles incident to those weak edges and this is done with nbr_intersection (on the symmetric graph). Once we have the three endpoints of the triangle, the next step is to find the direction that matches the edges in the DODG and this is what that function does
cpp/src/community/k_truss_impl.cuh
Outdated
std::vector<size_t> h_tx_counts(d_tx_counts.size()); | ||
|
||
raft::update_host( | ||
h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't forget to address this.
cpp/src/community/k_truss_impl.cuh
Outdated
std::optional<rmm::device_uvector<vertex_t>> cp_edgelist_dsts{std::nullopt}; | ||
|
||
// FIXME: Minor comm is not working for all cases so I believe some edges a beyong | ||
// the partitioning range |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a beyong
here sounds like a typo.
And have you confirmed that some edge sources/destinations are outside the expected range? Or maybe the code to use minor_comm has a bug? Whichever is the case, we need to really dig into this and make sure everything is working as expected. This optimization is for large scale K-Truss benchmarking. Any testing/debugging/tuning gets way more difficult in larger scales. We need to verify everything we can in small scales before going to a larger scale. If we have doubts in our code even in this scale, things will get way worse in larger scales.
cpp/src/community/k_truss_impl.cuh
Outdated
|
||
std::vector<size_t> h_tx_counts(d_tx_counts.size()); | ||
|
||
handle.sync_stream(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be after raft::update_host.
cpp/src/community/k_truss_impl.cuh
Outdated
h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream()); | ||
|
||
std::tie(srcs, rx_counts) = shuffle_values( | ||
handle.get_comms(), cp_edgelist_srcs->begin(), h_tx_counts, handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You have already created an alias for handle.get_comms()
in line 66. Why use handle.get_comms()
instead of just comm
?
cpp/src/community/k_truss_impl.cuh
Outdated
handle.get_comms(), cp_edgelist_srcs->begin(), h_tx_counts, handle.get_stream()); | ||
|
||
std::tie(dsts, std::ignore) = shuffle_values( | ||
handle.get_comms(), cp_edgelist_dsts->begin(), h_tx_counts, handle.get_stream()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here.
cpp/src/community/k_truss_impl.cuh
Outdated
raft::device_span<vertex_t> edgelist_srcs, | ||
raft::device_span<vertex_t> edgelist_dsts) | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, here, what we are actually doing is this.
For (src, dst) pairs in edgelist_srcs & edgelist_dsts, return (src, dst) if (src, dst) exists in graph_view, and return (dst, src) if not.
I think this code is overly complex.
If single-GPU, this code looks fine.
In multi-GPU, you first need to find unique (src, dst) pairs, shuffle, call has_edge, shuffle_back. Now you have (src, dst, exist) triplets. Sort them for binary search. Then, you can iterate over the input pairs and flip or not based on whether the edge exists or not.
In the current code, you are finding unique edges after shuffling. But better do this at the beginning to reduce computation/communication. And for temporary variables, try to minimize their scopes. Some variables' scopes are too large and this makes code harder to understand.
cpp/src/community/k_truss_impl.cuh
Outdated
auto src = thrust::get<0>(edgelist_first[idx]); | ||
auto dst = thrust::get<1>(edgelist_first[idx]); | ||
|
||
auto itr_pair = thrust::find( // FIXME: replace by lower bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a simple fix. Don't defer simple fixes for later updates.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am reviewing the entire K-Truss code; more reviews coming.
@@ -214,121 +409,501 @@ k_truss(raft::handle_t const& handle, | |||
|
|||
// 3. Keep only the edges from a low-degree vertex to a high-degree vertex. | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto [srcs, dsts, wgts] = k_core(handle,
cur_graph_view,
edge_weight_view,
k - 1,
std::make_optional(k_core_degree_type_t::OUT),
std::make_optional(core_number_span));
I guess this code won't work if cur_graph_view != graph_view; edge_weight_view is invalid if cur_graph_view == modified_graph_view.
cpp/src/community/k_truss_impl.cuh
Outdated
@@ -121,6 +314,8 @@ k_truss(raft::handle_t const& handle, | |||
edge_weight{std::nullopt}; | |||
std::optional<rmm::device_uvector<weight_t>> wgts{std::nullopt}; | |||
|
|||
cugraph::edge_bucket_t<vertex_t, void, true, multi_gpu, true> edgelist_dodg(handle); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better minimize the scope of edgelist_dodg
. This is used only for the transform_e
in line 460. No reason to define this variable way ahead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And I think maintaining both cur_graph_view
and modified_graph_view
is redundant.
Won't
auto cur_graph_view = modified_graph ? modified_graph.view() : graph_view
be enough?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. Now I only maintain cur_graph_view
and unmasked_cur_graph_view
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better minimize the scope of edgelist_dodg
After updating my script, this variable was no longer need and removed
} | ||
renumber_map = std::move(tmp_renumber_map); | ||
} | ||
edgelist_dodg.clear(); | ||
|
||
// 4. Compute triangle count using nbr_intersection and unroll weak edges | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the purpose of this curly bracket? This closes at the end of this function definition and doesn't help in reducing the scope of any variable.
@@ -214,121 +409,501 @@ k_truss(raft::handle_t const& handle, | |||
|
|||
// 3. Keep only the edges from a low-degree vertex to a high-degree vertex. | |||
|
|||
{ | |||
auto cur_graph_view = modified_graph_view ? *modified_graph_view : graph_view; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We may place this DODG extraction part inside { ... }
. We just need to define dodg_mask
outside the bracket and I assume all the other variables aren't used besides extracting a DODG graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was also addressed
cpp/src/community/k_truss_impl.cuh
Outdated
@@ -121,6 +314,8 @@ k_truss(raft::handle_t const& handle, | |||
edge_weight{std::nullopt}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we maintain edge_weight
from here? We can treat the graph as an unweighted graph till we finish finding K-Truss edge sources & destinations. Once we have K-Truss edge sources & destinations, we can extract edge (src, dst, weight) triplets from the input graph.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I treated the graph as unweighted and extracted the edges with weights(if exist) at the end.
Moving this to 25.02... hopefully merging very early. |
cpp/src/community/k_truss_impl.cuh
Outdated
@@ -1,5 +1,5 @@ | |||
/* | |||
* Copyright (c) 2024-2025, NVIDIA CORPORATION. | |||
* Copyright (c) 2024, NVIDIA CORPORATION. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like your copyright reverted... should include 2025.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
resolved
cpp/src/community/k_truss_impl.cuh
Outdated
@@ -41,8 +42,68 @@ | |||
#include <thrust/transform.h> | |||
#include <thrust/tuple.h> | |||
|
|||
#include <chrono> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this used?
Rather not have dangling using statements like this in a header. If we're actually using std::chrono
, I think it's better to prefix the function calls/variable references (e.g. std::chrono::duration
) rather than a using which loses the context of where duration
comes from.
If this is just from some debugging/benchmarking that's not in use, best not to include it in the final code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is just from some debugging/benchmarking that's not in use, best not to include it in the final code.
Yes those were only for benchmarking purposes and I removed them
cpp/src/community/k_truss_impl.cuh
Outdated
template <typename vertex_t, typename edge_t> | ||
struct extract_weak_edges { | ||
edge_t k{}; | ||
__device__ thrust::optional<thrust::tuple<vertex_t, vertex_t>> operator()( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thrust::optional will be replaced with cuda::std::optional (see #4891).
While the above PR will replace thrust::optional in the old code, it won't replace thrust::optional in the new code. We should use cuda::std::optional in the new code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. I updated my PR. As a matter of fact, I couldn't even build my branch once I merged the latest changes
cpp/src/community/k_truss_impl.cuh
Outdated
__device__ thrust::optional<thrust::tuple<vertex_t, vertex_t>> operator()( | ||
vertex_t src, vertex_t dst, thrust::nullopt_t, thrust::nullopt_t, edge_t count) const | ||
{ | ||
return ((count < k - 2) && (count > 0)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we check count > 0
? A comment explaining this will be helpful.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why should we check count > 0? A comment explaining this will be helpful.
count
can only be >= 0 if we unrolled the edge count properly (without overcompensating). We don't need to process edges with count == 0
that's why I check count > 0
. I just replaced that check by ((count < k - 2) && (count != 0))
which makes more sense.
Once I am done processing all the edges in the k-truss, I simply mask all the edges with count ==0
as you can see below and then extract the remaining edges with is the final result.
cugraph::transform_e(
handle,
cur_graph_view,
cugraph::edge_src_dummy_property_t{}.view(),
cugraph::edge_dst_dummy_property_t{}.view(),
edge_triangle_counts.view(),
[] __device__(auto src, auto dst, cuda::std::nullopt_t, cuda::std::nullopt_t, auto count) {
return count == 0 ? false : true;
},
dodg_mask.mutable_view(),
false);
cpp/src/community/k_truss_impl.cuh
Outdated
}; | ||
|
||
template <typename edge_t> | ||
struct is_k_minus_1_or_greater_t { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't this a misnomer?
If core number is k - 1, this function will return false.
Should we rename to "is_k_or_greater_t"?
cpp/src/community/k_truss_impl.cuh
Outdated
auto endpoints = thrust::make_tuple(weak_srcs[chunk_start + idx], // p | ||
weak_dsts[chunk_start + idx], // q | ||
intersection_indices[i] // r | ||
); | ||
|
||
// Re-order the endpoints such that p < q < r in order to identify duplicate triangles | ||
// which will cause overcompensation. comparing the vertex IDs is cheaper than comparing the | ||
// degrees (d(p) < d(q) < d(r)) which will be done once in the latter stage to retrieve the | ||
// direction of the edges once the triplet dependency is broken. | ||
if (thrust::get<0>(endpoints) > thrust::get<2>(endpoints)) { // (a > c) | ||
endpoints = thrust::make_tuple(thrust::get<2>(endpoints), | ||
thrust::get<1>(endpoints), | ||
thrust::get<0>(endpoints)); // swap(a, c) | ||
} | ||
|
||
if (thrust::get<0>(endpoints) > thrust::get<1>(endpoints)) { // (a > b) | ||
endpoints = thrust::make_tuple(thrust::get<1>(endpoints), | ||
thrust::get<0>(endpoints), | ||
thrust::get<2>(endpoints)); // swap(a, b); | ||
} | ||
|
||
if (thrust::get<1>(endpoints) > thrust::get<2>(endpoints)) { // (b > c) | ||
endpoints = thrust::make_tuple(thrust::get<0>(endpoints), | ||
thrust::get<2>(endpoints), | ||
thrust::get<1>(endpoints)); // swap(b, c); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think you can simplify this.
#include <cuda/std/utility>
...
auto p = weak_srcs[chunk_start + idx];
auto q = weak_dsts[chunk_start + idx];
auto r = intersectoin_indices[i];
// Re-order the endpoints such that p < q < r in order to identify duplicate triangles
// which will cause overcompensation. comparing the vertex IDs is cheaper than comparing the
// degrees (d(p) < d(q) < d(r)) which will be done once in the latter stage to retrieve the
// direction of the edges once the triplet dependency is broken.
if (p > q) cuda::std::swap(p, q);
if (p > r) cuda::std::swap(p, r);
if (q > r) cuda::std::swap(q, r);
return std::make_tuple(p, q, r);
cpp/src/community/k_truss_impl.cuh
Outdated
} | ||
// 2.1 Exclude self-loops | ||
|
||
if (graph_view.count_self_loops(handle) > edge_t{0}) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better not mix graph_view & cur_graph_view (even though they are same at this point).
if (cur_graph_view.count_self_loops(handle) > edge_t{0}) {
could be less confusing.
cpp/src/community/k_truss_impl.cuh
Outdated
cugraph::edge_property_t<decltype(cur_graph_view), bool> in_k_minus_1_core_edge_mask( | ||
handle, cur_graph_view); | ||
cugraph::fill_edge_property( | ||
handle, unmasked_cur_graph_view, in_k_minus_1_core_edge_mask.mutable_view(), false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is desirable to minimize the scope of a variable. Better move this right before the transform_e call to update in_k_minus_1_core_edge_mask
.
cpp/src/community/k_truss_impl.cuh
Outdated
dodg_mask.mutable_view(), | ||
false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dodg_mask.mutable_view());
No need to explicitly set do_expensive_check to false here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right. But since we have the expensive check exposed, I can just pass it to every function that support it.
cpp/src/community/k_truss_impl.cuh
Outdated
edge_src_dummy_property_t{}.view(), | ||
edge_dst_dummy_property_t{}.view(), | ||
edge_mask.view(), | ||
[] __device__(auto src, auto dst, auto src_out_degree, auto dst_out_degree, auto e_mask) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
auto src_out_degree, auto dst_out_degree,
=>
auto, auto,
src_out_degree
& dst_out_degree
are misnomers... They are dummy values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, I am not passing anything for the src and dat property.
cpp/src/community/k_truss_impl.cuh
Outdated
return e_mask ? true : false; | ||
}, | ||
weak_edges_mask.mutable_view(), | ||
false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to explicitly set do_expensive_check to false here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Won't weak_edge_mask = std::move(edge_mask) work here? (Or may just rename edge_mask to something like undirected_mask? So, we just maintain undirected_mask & dodg_mask.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
weak_edge_mask = std::move(edge_mask)
Good catch. No need to repeat the same work in edge_mask
with transform_e
cpp/src/community/k_truss_impl.cuh
Outdated
return count == 0 ? false : true; | ||
}, | ||
dodg_mask.mutable_view(), | ||
false); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to explicitly set do_expensive_check = false.
cpp/src/community/k_truss_impl.cuh
Outdated
minor_comm_size}] __device__(auto val) { | ||
return key_func(thrust::get<0>(val), thrust::get<1>(val)); | ||
}, | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for an empty line here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
/merge |
This PR introduces several optimization to speed up K-Truss. In fact, our K-Truss implementation computes the intersection of all edges regardless they are weak or not which can be very expensive if only few edges need to be invalidated. By running
nbr_intersection
on the weak edges, this considerably improves the runtime.