Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize K-Truss #4742

Merged

Conversation

jnke2016
Copy link
Contributor

@jnke2016 jnke2016 commented Nov 1, 2024

This PR introduces several optimization to speed up K-Truss. In fact, our K-Truss implementation computes the intersection of all edges regardless they are weak or not which can be very expensive if only few edges need to be invalidated. By running nbr_intersection on the weak edges, this considerably improves the runtime.

@jnke2016 jnke2016 self-assigned this Nov 15, 2024
@jnke2016 jnke2016 added this to the 24.12 milestone Nov 15, 2024
@jnke2016 jnke2016 marked this pull request as ready for review November 18, 2024 15:32
@jnke2016 jnke2016 requested a review from a team as a code owner November 18, 2024 15:32
auto has_edge = thrust::binary_search(
thrust::seq, edgelist_first, weak_edgelist_first, edge_p_r);

if (!has_edge) { // FIXME: Do binary search instead
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this FIXME still relevant? It looks like you're doing binary search



template <typename vertex_t, typename edge_t>
struct extract_edges { // FIXME: ******************************Remove this functor. For testing purposes only*******************
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should these be left in here?

std::optional<rmm::device_uvector<weight_t>> edgelist_wgts{std::nullopt};


//#if 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete the commented out lines in this section

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right I still had a lot of debug prints which are now removed

@seunghwak
Copy link
Contributor

@jnke2016 Is this PR ready for review?


{
// FIXME: Use global comm for debugging purposes
// then replace it by minor comm once the accuracy is verified
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haven't you verified this? Using global_comm vs minor_comm has a pretty significant scalability impact in large scale (even if you're sending to/receiving from the same set of ranks). All-to-all on a subcommunicator performs better than P2P on a subset of ranks using a global communicator.

std::vector<size_t> h_tx_counts(d_tx_counts.size());

raft::update_host(
h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

handle.sync_stream() is necessary before using h_tx_counts.

No guarantee that h_tx_counts will hold valid values before handle.sync_stream().

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to address this.

std::tie(dsts, std::ignore) =
shuffle_values(handle.get_comms(), edgelist_dsts.begin(), h_tx_counts, handle.get_stream());

// rmm::device_uvector<bool> edge_exists(0, handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delete.

shuffle_values(handle.get_comms(), edge_exists.begin(), rx_counts, handle.get_stream());

// The 'edge_exists' array is ordered based on 'cp_edgelist_srcs' where the edges where group,
// hoever it needs to match 'edgelist_srcs', hence re-order 'edge_exists' accordingly.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this true? I guess edge_exists is ordered based on edgelist_srcs not cp_edgelist_srcs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes other way around. It is cp_edgelist_srcs that is supposed to be groupby_and_count. The documentation is correct but made a mistake in the group_by_and_count call. I don't want to mess up original edges in the triangles when grouping that's why I make a copy.

void order_edge_based_on_dodg(raft::handle_t const& handle,
graph_view_t<vertex_t, edge_t, false, multi_gpu>& graph_view,
raft::device_span<vertex_t> edgelist_srcs,
raft::device_span<vertex_t> edgelist_dsts)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are you doing in this function?

Copy link
Contributor Author

@jnke2016 jnke2016 Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Essentially this function is ordering the edges obtained from nbr_intersection (On the symmetric graph) based on the DODG edges

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Once we identify weak edges, the next step is to retrieve the triangles incident to those weak edges and this is done with nbr_intersection (on the symmetric graph). Once we have the three endpoints of the triangle, the next step is to find the direction that matches the edges in the DODG and this is what that function does

std::vector<size_t> h_tx_counts(d_tx_counts.size());

raft::update_host(
h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't forget to address this.

std::optional<rmm::device_uvector<vertex_t>> cp_edgelist_dsts{std::nullopt};

// FIXME: Minor comm is not working for all cases so I believe some edges a beyong
// the partitioning range
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a beyong here sounds like a typo.

And have you confirmed that some edge sources/destinations are outside the expected range? Or maybe the code to use minor_comm has a bug? Whichever is the case, we need to really dig into this and make sure everything is working as expected. This optimization is for large scale K-Truss benchmarking. Any testing/debugging/tuning gets way more difficult in larger scales. We need to verify everything we can in small scales before going to a larger scale. If we have doubts in our code even in this scale, things will get way worse in larger scales.


std::vector<size_t> h_tx_counts(d_tx_counts.size());

handle.sync_stream();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be after raft::update_host.

h_tx_counts.data(), d_tx_counts.data(), d_tx_counts.size(), handle.get_stream());

std::tie(srcs, rx_counts) = shuffle_values(
handle.get_comms(), cp_edgelist_srcs->begin(), h_tx_counts, handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You have already created an alias for handle.get_comms() in line 66. Why use handle.get_comms() instead of just comm?

handle.get_comms(), cp_edgelist_srcs->begin(), h_tx_counts, handle.get_stream());

std::tie(dsts, std::ignore) = shuffle_values(
handle.get_comms(), cp_edgelist_dsts->begin(), h_tx_counts, handle.get_stream());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here.

raft::device_span<vertex_t> edgelist_srcs,
raft::device_span<vertex_t> edgelist_dsts)

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, here, what we are actually doing is this.

For (src, dst) pairs in edgelist_srcs & edgelist_dsts, return (src, dst) if (src, dst) exists in graph_view, and return (dst, src) if not.

I think this code is overly complex.

If single-GPU, this code looks fine.

In multi-GPU, you first need to find unique (src, dst) pairs, shuffle, call has_edge, shuffle_back. Now you have (src, dst, exist) triplets. Sort them for binary search. Then, you can iterate over the input pairs and flip or not based on whether the edge exists or not.

In the current code, you are finding unique edges after shuffling. But better do this at the beginning to reduce computation/communication. And for temporary variables, try to minimize their scopes. Some variables' scopes are too large and this makes code harder to understand.

auto src = thrust::get<0>(edgelist_first[idx]);
auto dst = thrust::get<1>(edgelist_first[idx]);

auto itr_pair = thrust::find( // FIXME: replace by lower bound
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a simple fix. Don't defer simple fixes for later updates.

Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am reviewing the entire K-Truss code; more reviews coming.

@@ -214,121 +409,501 @@ k_truss(raft::handle_t const& handle,

// 3. Keep only the edges from a low-degree vertex to a high-degree vertex.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

    auto [srcs, dsts, wgts] = k_core(handle,
                                     cur_graph_view,
                                     edge_weight_view,
                                     k - 1,
                                     std::make_optional(k_core_degree_type_t::OUT),
                                     std::make_optional(core_number_span));

I guess this code won't work if cur_graph_view != graph_view; edge_weight_view is invalid if cur_graph_view == modified_graph_view.

@@ -121,6 +314,8 @@ k_truss(raft::handle_t const& handle,
edge_weight{std::nullopt};
std::optional<rmm::device_uvector<weight_t>> wgts{std::nullopt};

cugraph::edge_bucket_t<vertex_t, void, true, multi_gpu, true> edgelist_dodg(handle);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better minimize the scope of edgelist_dodg. This is used only for the transform_e in line 460. No reason to define this variable way ahead.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And I think maintaining both cur_graph_view and modified_graph_view is redundant.

Won't
auto cur_graph_view = modified_graph ? modified_graph.view() : graph_view be enough?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. Now I only maintain cur_graph_view and unmasked_cur_graph_view

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better minimize the scope of edgelist_dodg

After updating my script, this variable was no longer need and removed

}
renumber_map = std::move(tmp_renumber_map);
}
edgelist_dodg.clear();

// 4. Compute triangle count using nbr_intersection and unroll weak edges

{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the purpose of this curly bracket? This closes at the end of this function definition and doesn't help in reducing the scope of any variable.

@@ -214,121 +409,501 @@ k_truss(raft::handle_t const& handle,

// 3. Keep only the edges from a low-degree vertex to a high-degree vertex.

{
auto cur_graph_view = modified_graph_view ? *modified_graph_view : graph_view;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may place this DODG extraction part inside { ... }. We just need to define dodg_mask outside the bracket and I assume all the other variables aren't used besides extracting a DODG graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was also addressed

@@ -121,6 +314,8 @@ k_truss(raft::handle_t const& handle,
edge_weight{std::nullopt};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we maintain edge_weight from here? We can treat the graph as an unweighted graph till we finish finding K-Truss edge sources & destinations. Once we have K-Truss edge sources & destinations, we can extract edge (src, dst, weight) triplets from the input graph.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I treated the graph as unweighted and extracted the edges with weights(if exist) at the end.

@ChuckHastings
Copy link
Collaborator

Moving this to 25.02... hopefully merging very early.

@ChuckHastings ChuckHastings modified the milestones: 24.12, 25.02 Dec 4, 2024
@ChuckHastings ChuckHastings changed the base branch from branch-24.12 to branch-25.02 January 22, 2025 20:27
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2024-2025, NVIDIA CORPORATION.
* Copyright (c) 2024, NVIDIA CORPORATION.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like your copyright reverted... should include 2025.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

resolved

@@ -41,8 +42,68 @@
#include <thrust/transform.h>
#include <thrust/tuple.h>

#include <chrono>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this used?

Rather not have dangling using statements like this in a header. If we're actually using std::chrono, I think it's better to prefix the function calls/variable references (e.g. std::chrono::duration) rather than a using which loses the context of where duration comes from.

If this is just from some debugging/benchmarking that's not in use, best not to include it in the final code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this is just from some debugging/benchmarking that's not in use, best not to include it in the final code.

Yes those were only for benchmarking purposes and I removed them

template <typename vertex_t, typename edge_t>
struct extract_weak_edges {
edge_t k{};
__device__ thrust::optional<thrust::tuple<vertex_t, vertex_t>> operator()(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thrust::optional will be replaced with cuda::std::optional (see #4891).

While the above PR will replace thrust::optional in the old code, it won't replace thrust::optional in the new code. We should use cuda::std::optional in the new code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. I updated my PR. As a matter of fact, I couldn't even build my branch once I merged the latest changes

__device__ thrust::optional<thrust::tuple<vertex_t, vertex_t>> operator()(
vertex_t src, vertex_t dst, thrust::nullopt_t, thrust::nullopt_t, edge_t count) const
{
return ((count < k - 2) && (count > 0))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we check count > 0? A comment explaining this will be helpful.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why should we check count > 0? A comment explaining this will be helpful.

count can only be >= 0 if we unrolled the edge count properly (without overcompensating). We don't need to process edges with count == 0 that's why I check count > 0. I just replaced that check by ((count < k - 2) && (count != 0)) which makes more sense.

Once I am done processing all the edges in the k-truss, I simply mask all the edges with count ==0 as you can see below and then extract the remaining edges with is the final result.

cugraph::transform_e(
      handle,
      cur_graph_view,
      cugraph::edge_src_dummy_property_t{}.view(),
      cugraph::edge_dst_dummy_property_t{}.view(),
      edge_triangle_counts.view(),
      [] __device__(auto src, auto dst, cuda::std::nullopt_t, cuda::std::nullopt_t, auto count) {
        return count == 0 ? false : true;
      },
      dodg_mask.mutable_view(),
      false);

};

template <typename edge_t>
struct is_k_minus_1_or_greater_t {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this a misnomer?

If core number is k - 1, this function will return false.

Should we rename to "is_k_or_greater_t"?

Comment on lines 79 to 104
auto endpoints = thrust::make_tuple(weak_srcs[chunk_start + idx], // p
weak_dsts[chunk_start + idx], // q
intersection_indices[i] // r
);

// Re-order the endpoints such that p < q < r in order to identify duplicate triangles
// which will cause overcompensation. comparing the vertex IDs is cheaper than comparing the
// degrees (d(p) < d(q) < d(r)) which will be done once in the latter stage to retrieve the
// direction of the edges once the triplet dependency is broken.
if (thrust::get<0>(endpoints) > thrust::get<2>(endpoints)) { // (a > c)
endpoints = thrust::make_tuple(thrust::get<2>(endpoints),
thrust::get<1>(endpoints),
thrust::get<0>(endpoints)); // swap(a, c)
}

if (thrust::get<0>(endpoints) > thrust::get<1>(endpoints)) { // (a > b)
endpoints = thrust::make_tuple(thrust::get<1>(endpoints),
thrust::get<0>(endpoints),
thrust::get<2>(endpoints)); // swap(a, b);
}

if (thrust::get<1>(endpoints) > thrust::get<2>(endpoints)) { // (b > c)
endpoints = thrust::make_tuple(thrust::get<0>(endpoints),
thrust::get<2>(endpoints),
thrust::get<1>(endpoints)); // swap(b, c);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can simplify this.

#include <cuda/std/utility>
...
auto p = weak_srcs[chunk_start + idx];
auto q = weak_dsts[chunk_start + idx];
auto r = intersectoin_indices[i];
// Re-order the endpoints such that p < q < r in order to identify duplicate triangles
// which will cause overcompensation. comparing the vertex IDs is cheaper than comparing the
// degrees (d(p) < d(q) < d(r)) which will be done once in the latter stage to retrieve the
// direction of the edges once the triplet dependency is broken.
if (p > q) cuda::std::swap(p, q);
if (p > r) cuda::std::swap(p, r);
if (q > r) cuda::std::swap(q, r);
return std::make_tuple(p, q, r);

}
// 2.1 Exclude self-loops

if (graph_view.count_self_loops(handle) > edge_t{0}) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better not mix graph_view & cur_graph_view (even though they are same at this point).

if (cur_graph_view.count_self_loops(handle) > edge_t{0}) {

could be less confusing.

Comment on lines 226 to 229
cugraph::edge_property_t<decltype(cur_graph_view), bool> in_k_minus_1_core_edge_mask(
handle, cur_graph_view);
cugraph::fill_edge_property(
handle, unmasked_cur_graph_view, in_k_minus_1_core_edge_mask.mutable_view(), false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is desirable to minimize the scope of a variable. Better move this right before the transform_e call to update in_k_minus_1_core_edge_mask.

Comment on lines 309 to 310
dodg_mask.mutable_view(),
false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dodg_mask.mutable_view());

No need to explicitly set do_expensive_check to false here.

Copy link
Contributor Author

@jnke2016 jnke2016 Jan 29, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right. But since we have the expensive check exposed, I can just pass it to every function that support it.

edge_src_dummy_property_t{}.view(),
edge_dst_dummy_property_t{}.view(),
edge_mask.view(),
[] __device__(auto src, auto dst, auto src_out_degree, auto dst_out_degree, auto e_mask) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

auto src_out_degree, auto dst_out_degree,

=>

auto, auto,

src_out_degree & dst_out_degree are misnomers... They are dummy values.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, I am not passing anything for the src and dat property.

return e_mask ? true : false;
},
weak_edges_mask.mutable_view(),
false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to explicitly set do_expensive_check to false here.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Won't weak_edge_mask = std::move(edge_mask) work here? (Or may just rename edge_mask to something like undirected_mask? So, we just maintain undirected_mask & dodg_mask.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

weak_edge_mask = std::move(edge_mask)

Good catch. No need to repeat the same work in edge_mask with transform_e

return count == 0 ? false : true;
},
dodg_mask.mutable_view(),
false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to explicitly set do_expensive_check = false.

minor_comm_size}] __device__(auto val) {
return key_func(thrust::get<0>(val), thrust::get<1>(val));
},

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need for an empty line here.

@ChuckHastings ChuckHastings removed request for a team and gforsyth January 28, 2025 16:35
Copy link
Contributor

@seunghwak seunghwak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ChuckHastings ChuckHastings added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jan 30, 2025
@ChuckHastings
Copy link
Collaborator

/merge

@rapids-bot rapids-bot bot merged commit 4baef58 into rapidsai:branch-25.02 Jan 30, 2025
81 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cuGraph improvement Improvement / enhancement to an existing function non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants