MFG creation optimization #3780

Closed
72 changes: 72 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,75 @@
# cuGraph 23.08.00 (9 Aug 2023)

## 🚨 Breaking Changes

- Change the renumber_sampled_edgelist function behavior. ([#3762](https://github.com/rapidsai/cugraph/pull/3762)) [@seunghwak](https://github.com/seunghwak)
- PLC and Python Support for Sample-Side MFG Creation ([#3734](https://github.com/rapidsai/cugraph/pull/3734)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- Stop using setup.py in build.sh ([#3704](https://github.com/rapidsai/cugraph/pull/3704)) [@vyasr](https://github.com/vyasr)
- Refactor edge betweenness centrality ([#3672](https://github.com/rapidsai/cugraph/pull/3672)) [@jnke2016](https://github.com/jnke2016)
- [FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler ([#3669](https://github.com/rapidsai/cugraph/pull/3669)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)

## 🐛 Bug Fixes

- Change the renumber_sampled_edgelist function behavior. ([#3762](https://github.com/rapidsai/cugraph/pull/3762)) [@seunghwak](https://github.com/seunghwak)
- Fix bug discovered in Jaccard testing ([#3758](https://github.com/rapidsai/cugraph/pull/3758)) [@ChuckHastings](https://github.com/ChuckHastings)
- fix inconsistent graph properties between the SG and the MG API ([#3757](https://github.com/rapidsai/cugraph/pull/3757)) [@jnke2016](https://github.com/jnke2016)
- Fixes options for `--pydevelop` to remove unneeded CWD path ("."), restores use of `setup.py` temporarily for develop builds ([#3747](https://github.com/rapidsai/cugraph/pull/3747)) [@rlratzel](https://github.com/rlratzel)
- Fix sampling call parameters if compiled with -DNO_CUGRAPH_OPS ([#3729](https://github.com/rapidsai/cugraph/pull/3729)) [@ChuckHastings](https://github.com/ChuckHastings)
- Fix primitive bug discovered in MG edge betweenness centrality testing ([#3723](https://github.com/rapidsai/cugraph/pull/3723)) [@ChuckHastings](https://github.com/ChuckHastings)
- Reorder dependencies.yaml channels ([#3721](https://github.com/rapidsai/cugraph/pull/3721)) [@raydouglass](https://github.com/raydouglass)
- [BUG] Fix namespace to default_hash and hash_functions ([#3711](https://github.com/rapidsai/cugraph/pull/3711)) [@naimnv](https://github.com/naimnv)
- [BUG] Fix Bulk Sampling Test Issue ([#3701](https://github.com/rapidsai/cugraph/pull/3701)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- Make `pylibcugraphops` optional imports in `cugraph-dgl` and `-pyg` ([#3693](https://github.com/rapidsai/cugraph/pull/3693)) [@tingyu66](https://github.com/tingyu66)
- [FIX] Rename `cugraph-ops` symbols (refactoring) and update GHA workflows to call pytest via `python -m pytest` ([#3688](https://github.com/rapidsai/cugraph/pull/3688)) [@naimnv](https://github.com/naimnv)
- [FIX] Fix the hang in cuGraph Python Uniform Neighbor Sample, Add Logging to Bulk Sampler ([#3669](https://github.com/rapidsai/cugraph/pull/3669)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- force atlas notebook changes to run in cugraph 23.08 container. ([#3656](https://github.com/rapidsai/cugraph/pull/3656)) [@acostadon](https://github.com/acostadon)

## 📖 Documentation

- this fixes github links in cugraph, cugraph-dgl and cugraph-pyg ([#3650](https://github.com/rapidsai/cugraph/pull/3650)) [@acostadon](https://github.com/acostadon)
- Fix minor typo in README.md ([#3636](https://github.com/rapidsai/cugraph/pull/3636)) [@akasper](https://github.com/akasper)
- Created landing spot for centrality and similarity algorithms ([#3620](https://github.com/rapidsai/cugraph/pull/3620)) [@acostadon](https://github.com/acostadon)

## 🚀 New Features

- Compute shortest distances between given sets of origins and destinations for large diameter graphs ([#3741](https://github.com/rapidsai/cugraph/pull/3741)) [@seunghwak](https://github.com/seunghwak)
- Update primitive to compute weighted Jaccard, Sorensen and Overlap similarity ([#3728](https://github.com/rapidsai/cugraph/pull/3728)) [@naimnv](https://github.com/naimnv)
- Add CUDA 12.0 conda environment. ([#3725](https://github.com/rapidsai/cugraph/pull/3725)) [@bdice](https://github.com/bdice)
- Renumber utility function for sampling output ([#3707](https://github.com/rapidsai/cugraph/pull/3707)) [@seunghwak](https://github.com/seunghwak)
- Integrate C++ Sampling Source Behavior Updates ([#3699](https://github.com/rapidsai/cugraph/pull/3699)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- Adds `fail_on_nonconvergence` option to `pagerank` to provide pagerank results even on non-convergence ([#3639](https://github.com/rapidsai/cugraph/pull/3639)) [@rlratzel](https://github.com/rlratzel)
- Add Benchmark for Bulk Sampling ([#3628](https://github.com/rapidsai/cugraph/pull/3628)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- cugraph: Build CUDA 12 packages ([#3456](https://github.com/rapidsai/cugraph/pull/3456)) [@vyasr](https://github.com/vyasr)

## 🛠️ Improvements

- Pin `dask` and `distributed` for `23.08` release ([#3761](https://github.com/rapidsai/cugraph/pull/3761)) [@galipremsagar](https://github.com/galipremsagar)
- Fix `build.yaml` workflow ([#3756](https://github.com/rapidsai/cugraph/pull/3756)) [@ajschmidt8](https://github.com/ajschmidt8)
- Support MFG creation on sampling gpus for cugraph dgl ([#3742](https://github.com/rapidsai/cugraph/pull/3742)) [@VibhuJawa](https://github.com/VibhuJawa)
- PLC and Python Support for Sample-Side MFG Creation ([#3734](https://github.com/rapidsai/cugraph/pull/3734)) [@alexbarghi-nv](https://github.com/alexbarghi-nv)
- Switch to new wheel building pipeline ([#3731](https://github.com/rapidsai/cugraph/pull/3731)) [@vyasr](https://github.com/vyasr)
- Remove RAFT specialization. ([#3727](https://github.com/rapidsai/cugraph/pull/3727)) [@bdice](https://github.com/bdice)
- C API for renumbering the samples ([#3724](https://github.com/rapidsai/cugraph/pull/3724)) [@ChuckHastings](https://github.com/ChuckHastings)
- Only run cugraph conda CI for CUDA 11. ([#3713](https://github.com/rapidsai/cugraph/pull/3713)) [@bdice](https://github.com/bdice)
- Promote `Datasets` to stable and clean-up unit tests ([#3712](https://github.com/rapidsai/cugraph/pull/3712)) [@nv-rliu](https://github.com/nv-rliu)
- [BUG] Unsupported graph for similarity algos ([#3710](https://github.com/rapidsai/cugraph/pull/3710)) [@jnke2016](https://github.com/jnke2016)
- Stop using setup.py in build.sh ([#3704](https://github.com/rapidsai/cugraph/pull/3704)) [@vyasr](https://github.com/vyasr)
- [WIP] Make edge ids optional ([#3702](https://github.com/rapidsai/cugraph/pull/3702)) [@VibhuJawa](https://github.com/VibhuJawa)
- Use rapids-cmake testing to run tests in parallel ([#3697](https://github.com/rapidsai/cugraph/pull/3697)) [@robertmaynard](https://github.com/robertmaynard)
- Sampling modifications to support PyG and DGL options ([#3696](https://github.com/rapidsai/cugraph/pull/3696)) [@ChuckHastings](https://github.com/ChuckHastings)
- Include cuCollection public header for hash functions ([#3694](https://github.com/rapidsai/cugraph/pull/3694)) [@seunghwak](https://github.com/seunghwak)
- Refactor edge betweenness centrality ([#3672](https://github.com/rapidsai/cugraph/pull/3672)) [@jnke2016](https://github.com/jnke2016)
- Refactor RMAT ([#3662](https://github.com/rapidsai/cugraph/pull/3662)) [@jnke2016](https://github.com/jnke2016)
- [REVIEW] Optimize bulk sampling ([#3661](https://github.com/rapidsai/cugraph/pull/3661)) [@VibhuJawa](https://github.com/VibhuJawa)
- Update to CMake 3.26.4 ([#3648](https://github.com/rapidsai/cugraph/pull/3648)) [@vyasr](https://github.com/vyasr)
- Optimize cugraph-dgl MFG creation ([#3646](https://github.com/rapidsai/cugraph/pull/3646)) [@VibhuJawa](https://github.com/VibhuJawa)
- use rapids-upload-docs script ([#3640](https://github.com/rapidsai/cugraph/pull/3640)) [@AyodeAwe](https://github.com/AyodeAwe)
- Fix dependency versions for `23.08` ([#3638](https://github.com/rapidsai/cugraph/pull/3638)) [@ajschmidt8](https://github.com/ajschmidt8)
- Unpin `dask` and `distributed` for development ([#3634](https://github.com/rapidsai/cugraph/pull/3634)) [@galipremsagar](https://github.com/galipremsagar)
- Remove documentation build scripts for Jenkins ([#3627](https://github.com/rapidsai/cugraph/pull/3627)) [@ajschmidt8](https://github.com/ajschmidt8)
- Unpin scikit-build upper bound ([#3609](https://github.com/rapidsai/cugraph/pull/3609)) [@vyasr](https://github.com/vyasr)
- Implement C++ Edge Betweenness Centrality ([#3602](https://github.com/rapidsai/cugraph/pull/3602)) [@ChuckHastings](https://github.com/ChuckHastings)

# cuGraph 23.06.00 (7 Jun 2023)

## 🚨 Breaking Changes
108 changes: 84 additions & 24 deletions python/cugraph-dgl/cugraph_dgl/dataloading/utils/sampling_helpers.py
@@ -29,6 +29,18 @@ def cast_to_tensor(ser: cudf.Series):
return torch.as_tensor(ser.values, device="cuda")


def _get_source_destination_range(sampled_df):
o = sampled_df.groupby(["batch_id", "hop_id"], as_index=True).agg(
{"sources": "max", "destinations": "max"}
)
o.rename(
columns={"sources": "sources_range", "destinations": "destinations_range"},
inplace=True,
)
d = o.to_pandas().to_dict(orient="index")
return d
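
To illustrate the shape of the dictionary this helper returns, the following is a minimal plain-Python sketch of the same per-`(batch_id, hop_id)` maximum, with no cuDF dependency. The function name `source_destination_range` and the tuple-based row layout are assumptions made for this example and are not part of the PR.

```python
def source_destination_range(rows):
    """Return the largest source and destination id seen per (batch_id, hop_id).

    rows: iterable of (batch_id, hop_id, source, destination) tuples
    (a hypothetical stand-in for the sampled dataframe).
    """
    ranges = {}
    for batch_id, hop_id, src, dst in rows:
        # Create the entry on first sight, then keep the running maxima.
        entry = ranges.setdefault(
            (batch_id, hop_id),
            {"sources_range": src, "destinations_range": dst},
        )
        entry["sources_range"] = max(entry["sources_range"], src)
        entry["destinations_range"] = max(entry["destinations_range"], dst)
    return ranges


rows = [
    (0, 0, 3, 1),
    (0, 0, 5, 0),
    (0, 1, 7, 5),
    (1, 0, 2, 2),
]
ranges = source_destination_range(rows)
```

The result is keyed by `(batch_id, hop_id)` tuples, which matches how the diff later looks values up with `range_d[(batch_id, hid)]`.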


def _split_tensor(t, split_indices):
"""
Split a tensor into a list of tensors based on split_indices.
@@ -65,6 +77,7 @@ def _get_tensor_d_from_sampled_df(df):
Returns:
dict: A dictionary of tensors, keyed by batch_id and hop_id.
"""
range_d = _get_source_destination_range(df)
df, renumber_map, renumber_map_batch_indices = _get_renumber_map(df)
batch_id_tensor = cast_to_tensor(df["batch_id"])
batch_id_min = batch_id_tensor.min()
@@ -110,10 +123,16 @@ def _get_tensor_d_from_sampled_df(df):
split_t = _split_tensor(t, hop_indices)
for hid, ht in zip(hop_split_d.keys(), split_t):
hop_split_d[hid][column] = ht
for hid in hop_split_d.keys():
hop_split_d[hid]["sources_range"] = range_d[(batch_id, hid)][
"sources_range"
]
hop_split_d[hid]["destinations_range"] = range_d[(batch_id, hid)][
"destinations_range"
]

result_tensor_d[batch_id] = hop_split_d
if "map" in batch_d:
result_tensor_d[batch_id]["map"] = batch_d["map"]
result_tensor_d[batch_id]["map"] = batch_d["map"]
return result_tensor_d


@@ -138,17 +157,20 @@ def create_homogeneous_sampled_graphs_from_dataframe(
"""
result_tensor_d = _get_tensor_d_from_sampled_df(sampled_df)
del sampled_df
metagraph = dgl.convert.graph_index.from_coo(2, [0], [1], True)
result_mfgs = [
_create_homogeneous_sampled_graphs_from_tensors_perhop(
tensors_batch_d, edge_dir
tensors_batch_d, edge_dir, metagraph
)
for tensors_batch_d in result_tensor_d.values()
]
del result_tensor_d
return result_mfgs


def _create_homogeneous_sampled_graphs_from_tensors_perhop(tensors_batch_d, edge_dir):
def _create_homogeneous_sampled_graphs_from_tensors_perhop(
tensors_batch_d, edge_dir, metagraph
):
"""
This helper function creates sampled DGL MFGS for
homogeneous graphs from tensors per hop for a single
@@ -157,6 +179,7 @@ def _create_homogeneous_sampled_graphs_from_tensors_perhop(tensors_batch_d, edge
Args:
tensors_batch_d (dict): A dictionary of tensors, keyed by hop_id.
edge_dir (str): Direction of edges from samples
metagraph (dgl.metagraph): The metagraph for the sampled graph
Returns:
tuple: A tuple of three elements:
- input_nodes: The input nodes for the batch.
@@ -168,14 +191,15 @@ def _create_homogeneous_sampled_graphs_from_tensors_perhop(tensors_batch_d, edge
if edge_dir == "out":
raise ValueError("Outwards edges not supported yet")
graph_per_hop_ls = []
seednodes = None
seednodes_range = None
for hop_id, tensor_per_hop_d in tensors_batch_d.items():
if hop_id != "map":
block = _create_homogeneous_dgl_block_from_tensor_d(
tensor_per_hop_d, tensors_batch_d["map"], seednodes
tensor_per_hop_d, tensors_batch_d["map"], seednodes_range, metagraph
)
seednodes = torch.concat(
[tensor_per_hop_d["sources"], tensor_per_hop_d["destinations"]]
seednodes_range = max(
tensor_per_hop_d["sources_range"],
tensor_per_hop_d["destinations_range"],
)
graph_per_hop_ls.append(block)

@@ -188,30 +212,66 @@ def _create_homogeneous_sampled_graphs_from_tensors_perhop(tensors_batch_d, edge
return input_nodes, output_nodes, graph_per_hop_ls
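
The change in this hunk replaces a concatenated `seednodes` tensor with a single scalar, `seednodes_range`, carried from one hop to the next. A minimal sketch of that bookkeeping, using plain integers instead of tensors, could look like the following. The function name `block_sizes_per_hop` and the list-of-dicts input are assumptions for illustration only.

```python
def block_sizes_per_hop(hop_ranges):
    """Compute (num_src_nodes, num_dst_nodes) for each hop.

    hop_ranges: list of dicts with 'sources_range' and 'destinations_range',
    ordered by hop (a hypothetical stand-in for the per-hop tensor dicts).
    """
    sizes = []
    seednodes_range = None
    for r in hop_ranges:
        max_src = r["sources_range"]
        max_dst = r["destinations_range"]
        if seednodes_range is not None:
            # Widen the destination side so it covers every node id
            # seen in the previous hop, keeping the blocks aligned.
            max_dst = max(max_dst, seednodes_range)
        sizes.append((max_src + 1, max_dst + 1))
        # The next hop only needs the largest id seen in this hop.
        seednodes_range = max(r["sources_range"], r["destinations_range"])
    return sizes


hop_ranges = [
    {"sources_range": 4, "destinations_range": 2},
    {"sources_range": 9, "destinations_range": 3},
]
sizes = block_sizes_per_hop(hop_ranges)
```

Carrying one integer per hop avoids building and reducing a concatenated tensor each time, which appears to be the intent of the optimization.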


def _create_homogeneous_dgl_block_from_tensor_d(tensor_d, renumber_map, seednodes=None):
def _create_homogeneous_dgl_block_from_tensor_d(
tensor_d,
renumber_map,
seednodes_range=None,
metagraph=None,
):
rs = tensor_d["sources"]
rd = tensor_d["destinations"]

max_src_nodes = rs.max()
max_dst_nodes = rd.max()
if seednodes is not None:
# If we have isolated vertices
max_src_nodes = tensor_d["sources_range"]
max_dst_nodes = tensor_d["destinations_range"]
if seednodes_range is not None:
# If we have vertices without outgoing edges, then
# sources can be missing from seednodes
# so we add them
# to ensure all the blocks are
# linedup correctly
max_dst_nodes = max(max_dst_nodes, seednodes.max())

data_dict = {("_N", "_E", "_N"): (rs, rd)}
num_src_nodes = {"_N": max_src_nodes.item() + 1}
num_dst_nodes = {"_N": max_dst_nodes.item() + 1}
block = dgl.create_block(
data_dict=data_dict, num_src_nodes=num_src_nodes, num_dst_nodes=num_dst_nodes
# lined up correctly
max_dst_nodes = max(max_dst_nodes, seednodes_range)

block = _create_homogeneous_dgl_block_from_tensor_arrays(
rs, rd, max_src_nodes + 1, max_dst_nodes + 1, metagraph
)
# data_dict = {("_N", "_E", "_N"): (rs, rd)}
# num_src_nodes = {"_N": max_src_nodes + 1}
# num_dst_nodes = {"_N": max_dst_nodes + 1}
# block = dgl.create_block(
# data_dict=data_dict, num_src_nodes=num_src_nodes, num_dst_nodes=num_dst_nodes
# )
if "edge_id" in tensor_d:
block.edata[dgl.EID] = tensor_d["edge_id"]
block.srcdata[dgl.NID] = renumber_map[block.srcnodes()]
block.dstdata[dgl.NID] = renumber_map[block.dstnodes()]
# Below adds too much run time overhead
block.srcdata[dgl.NID] = renumber_map[0 : max_src_nodes + 1]
block.dstdata[dgl.NID] = renumber_map[0 : max_dst_nodes + 1]
return block
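
The new `renumber_map[0 : max_src_nodes + 1]` lines replace an index-based lookup with a slice. Assuming the node ids of a block are the contiguous range `0..n-1` (an assumption of this sketch, stated here rather than taken from the PR), the two forms select the same values. The plain-Python lists below stand in for tensors.

```python
# position = local (renumbered) id, value = original node id
renumber_map = [40, 17, 93, 8, 61]
max_src_nodes = 2  # largest local id present in this hop

# Index-based lookup: gather one element per local id.
local_ids = list(range(max_src_nodes + 1))
gathered = [renumber_map[i] for i in local_ids]

# Slice-based lookup: take the leading max_src_nodes + 1 entries.
sliced = renumber_map[0 : max_src_nodes + 1]
```

The two agree only when the ids are contiguous and start at zero. With any gap or reordering in the ids, the slice would return different values than the gather.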


def _create_homogeneous_dgl_block_from_tensor_arrays(
src, dst, num_src_nodes, num_dst_nodes, metagraph
):
srctype = "_N"
etype = "_E"
dsttype = "_N"

num_nodes_per_type = dgl.convert.utils.toindex(
[num_src_nodes, num_dst_nodes], "int64"
)
arrays = (src, dst)
rel_graph = dgl.convert.create_from_edges(
"coo",
arrays,
"SRC/" + srctype,
etype,
"DST/" + dsttype,
num_src_nodes,
num_dst_nodes,
)
rel_graphs = [rel_graph._graph]
hgidx = dgl.convert.heterograph_index.create_heterograph_from_relations(
metagraph, rel_graphs, num_nodes_per_type
)
block = dgl.convert.DGLBlock(hgidx, ([srctype], [dsttype]), [etype])
return block

