Resultset and Dataset Refactors #3957

Merged on Nov 19, 2023 (49 commits)
Changes from 36 commits
Commits
db99094
Added tests to verify load_resultsets by refactoring
Aug 30, 2023
5c37396
Added benchmark datasets yaml files for use with Dataset
betochimas Sep 5, 2023
93fb3e7
Introduce way to download larger datasets
betochimas Sep 7, 2023
b236d85
Merge branch 'rapidsai:branch-23.10' into branch-23.10-large-datasets
betochimas Sep 7, 2023
e94baa3
Testing gpg verification
betochimas Sep 7, 2023
aafd5d0
Use create_using pattern to give user choice for dataframe lib
betochimas Sep 7, 2023
23edacf
Merge branch 'branch-23.10' into branch-23.10-large-datasets
betochimas Sep 7, 2023
a4ed8ef
Insert benchmarking Dataset instances
betochimas Sep 8, 2023
ddb8c35
Merge branch 'branch-23.10' into branch-23.10-large-datasets
betochimas Sep 8, 2023
87eea53
Merge branch 'rapidsai:branch-23.10' into resultset-tests
betochimas Sep 11, 2023
95bf161
Refactored getters and setters for download_dir
betochimas Sep 12, 2023
be5ff11
Update generate_resultsets.py to align with DefaultDownloadDir API ch…
betochimas Sep 13, 2023
795ab55
Merge branch 'branch-23.10' into resultset-tests
betochimas Sep 13, 2023
31ca5ea
Changes to testing and fixing urls
betochimas Sep 19, 2023
8e0cfac
Testing changes to pass CI, added and removed comments
betochimas Sep 20, 2023
d15f601
Merge branch 'branch-23.10' into branch-23.10-large-datasets
betochimas Sep 20, 2023
045597a
Merge branch 'branch-23.10' into branch-23.10-large-datasets
betochimas Sep 22, 2023
6546034
Merge branch 'branch-23.10' into resultset-tests
betochimas Sep 22, 2023
a85cb7e
Merge branch 'branch-23.10' into branch-23.10-large-datasets
betochimas Sep 22, 2023
f816f78
Add FIXMEs for CI failure points
betochimas Sep 22, 2023
12c3d66
Merge branch 'branch-23.10-large-datasets' of https://github.com/beto…
nv-rliu Oct 9, 2023
7c0db5d
Merge branch 'rapidsai:branch-23.12' into branch-23.12-large-datasets
nv-rliu Oct 16, 2023
abda461
Merge branch 'rapidsai:branch-23.12' into branch-23.12-large-datasets
nv-rliu Oct 17, 2023
3fb602b
update large dataset work. primarily unit tests
nv-rliu Oct 17, 2023
858e155
Merge branch 'branch-23.12-large-datasets' of github.com:nv-rliu/cugr…
nv-rliu Oct 17, 2023
875011d
remove twitter
nv-rliu Oct 17, 2023
424bc51
fix bug in test fixture for unweighted graphs
nv-rliu Oct 18, 2023
56fad47
Merge branch 'branch-23.12' into resultset-tests
nv-rliu Oct 18, 2023
d8fe7b6
Merge branch 'branch-23.12-rs-ds-refactor' into resultset-tests
nv-rliu Oct 18, 2023
0d555b1
Merge pull request #14 from betochimas/resultset-tests
nv-rliu Oct 18, 2023
13ed8d6
Merge branch 'rapidsai:branch-23.12' into branch-23.12-rs-ds-refactor
nv-rliu Oct 20, 2023
90d9dcf
Merge branch 'rapidsai:branch-23.12' into branch-23.12-rs-ds-refactor
nv-rliu Oct 26, 2023
89c0117
Merge branch 'branch-23.12' into branch-23.12-rs-ds-refactor
nv-rliu Nov 1, 2023
65300ef
Merge remote-tracking branch 'upstream/branch-23.12' into branch-23.1…
rlratzel Nov 15, 2023
7086af7
Removes experimental datasets module after having been promoted for >…
rlratzel Nov 16, 2023
684521b
Replaces old get_download_dir() from before the reactor with the new …
rlratzel Nov 16, 2023
f7d4e02
Black formatting
rlratzel Nov 16, 2023
c609d57
Adds workaround to PR workflow to force an upgrade of pip.
rlratzel Nov 16, 2023
ccead3e
Removes workaround to upgrade pip in devcontainers CI job since this …
rlratzel Nov 16, 2023
8a5faa9
update reader
nv-rliu Nov 16, 2023
f249470
update unit tests for dataset reader:
nv-rliu Nov 16, 2023
28198ac
Merge branch 'branch-23.12' into branch-23.12-rs-ds-refactor
nv-rliu Nov 17, 2023
476f70c
Merge branch 'branch-23.12' into branch-23.12-rs-ds-refactor
nv-rliu Nov 17, 2023
16ba391
Merge branch 'branch-23.12' into branch-23.12-rs-ds-refactor
BradReesWork Nov 18, 2023
aea91f1
Removes remaining references to experimental.datasets.
rlratzel Nov 18, 2023
7067cb0
Merge branch 'branch-23.12-rs-ds-refactor' of https://github.com/nv-r…
rlratzel Nov 18, 2023
82d69e8
Replaces DATASETS_UNDIRECTED list with list of specific dataset objs.
rlratzel Nov 19, 2023
ed3b481
Removes imports from experimental.datasets
rlratzel Nov 19, 2023
062062b
Fixes typo in import
rlratzel Nov 19, 2023
4 changes: 4 additions & 0 deletions datasets/README.md
@@ -120,9 +120,13 @@ The benchmark datasets are described below:
 | soc-twitter-2010 | 21,297,772 | 265,025,809 | No | No |

 **cit-Patents** : A citation graph that includes all citations made by patents granted between 1975 and 1999, totaling 16,522,438 citations.
+
 **soc-LiveJournal** : A graph of the LiveJournal social network.
+
 **europe_osm** : A graph of OpenStreetMap data for Europe.
+
 **hollywood** : A graph of movie actors where vertices are actors, and two actors are joined by an edge whenever they appeared in a movie together.
+
 **soc-twitter-2010** : A network of follower relationships from a snapshot of Twitter in 2010, where an edge from i to j indicates that j is a follower of i.

 _NOTE: the benchmark datasets were converted to a CSV format from their original format described in the reference URL below, and in doing so had edge weights and isolated vertices discarded._
10 changes: 10 additions & 0 deletions python/cugraph/cugraph/datasets/__init__.py
@@ -39,3 +39,13 @@
 small_tree = Dataset(meta_path / "small_tree.yaml")
 toy_graph = Dataset(meta_path / "toy_graph.yaml")
 toy_graph_undirected = Dataset(meta_path / "toy_graph_undirected.yaml")
+
+# Benchmarking datasets: be mindful of memory usage
+# 250 MB
+soc_livejournal = Dataset(meta_path / "soc-livejournal1.yaml")
+# 965 MB
+cit_patents = Dataset(meta_path / "cit-patents.yaml")
+# 1.8 GB
+europe_osm = Dataset(meta_path / "europe_osm.yaml")
+# 1.5 GB
+hollywood = Dataset(meta_path / "hollywood.yaml")
69 changes: 37 additions & 32 deletions python/cugraph/cugraph/datasets/dataset.py
@@ -20,38 +20,36 @@

 class DefaultDownloadDir:
     """
-    Maintains the path to the download directory used by Dataset instances.
+    Maintains a path to be used as a default download directory.
+
+    All DefaultDownloadDir instances are based on RAPIDS_DATASET_ROOT_DIR if
+    set, or _default_base_dir if not set.
+
     Instances of this class are typically shared by several Dataset instances
     in order to allow for the download directory to be defined and updated by
     a single object.
     """
+    _default_base_dir = Path.home() / ".cugraph/datasets"

-    def __init__(self):
-        self._path = Path(
-            os.environ.get("RAPIDS_DATASET_ROOT_DIR", Path.home() / ".cugraph/datasets")
-        )
+    def __init__(self, *, subdir=""):
+        """
+        subdir can be specified to provide a specialized dir under the base dir.
+        """
+        self._subdir = Path(subdir)
+        self.reset()

     @property
     def path(self):
-        """
-        If `path` is not set, set it to the environment variable
-        RAPIDS_DATASET_ROOT_DIR. If the variable is not set, default to the
-        user's home directory.
-        """
-        if self._path is None:
-            self._path = Path(
-                os.environ.get(
-                    "RAPIDS_DATASET_ROOT_DIR", Path.home() / ".cugraph/datasets"
-                )
-            )
-        return self._path
+        return self._path.absolute()

     @path.setter
     def path(self, new):
         self._path = Path(new)

-    def clear(self):
-        self._path = None
+    def reset(self):
+        self._basedir = Path(os.environ.get("RAPIDS_DATASET_ROOT_DIR",
+                                            self._default_base_dir))
+        self._path = self._basedir / self._subdir


 default_download_dir = DefaultDownloadDir()
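
Reviewer note: the following is a minimal standalone sketch of how the refactored class appears to behave, assembled only from the added lines in this hunk. The environment value `/tmp/rapids_data` and the `benchmark` subdirectory are hypothetical and chosen purely for illustration. It shows `reset()` re-reading `RAPIDS_DATASET_ROOT_DIR` instead of leaving `_path` as `None` the way the old `clear()` did.

```python
import os
from pathlib import Path


class DefaultDownloadDir:
    """Standalone copy of the refactored class, for illustration only."""

    _default_base_dir = Path.home() / ".cugraph/datasets"

    def __init__(self, *, subdir=""):
        self._subdir = Path(subdir)
        self.reset()

    @property
    def path(self):
        # Always hand back an absolute path, even after a relative override.
        return self._path.absolute()

    @path.setter
    def path(self, new):
        self._path = Path(new)

    def reset(self):
        # Re-read the environment on every reset so later changes are honored.
        self._basedir = Path(
            os.environ.get("RAPIDS_DATASET_ROOT_DIR", self._default_base_dir)
        )
        self._path = self._basedir / self._subdir


# Hypothetical values, not taken from the PR.
os.environ["RAPIDS_DATASET_ROOT_DIR"] = "/tmp/rapids_data"
bench_dir = DefaultDownloadDir(subdir="benchmark")
first_path = bench_dir.path          # <base dir>/benchmark

bench_dir.path = "some/relative/dir"  # explicit override
override_path = bench_dir.path        # made absolute by the property

bench_dir.reset()                     # back to <base dir>/benchmark
reset_path = bench_dir.path
print(first_path, override_path, reset_path, sep="\n")
```

Because the property can no longer return `None`, callers such as `get_download_dir()` do not need their own `.absolute()` call or a `None` check.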
@@ -69,7 +67,6 @@ class Dataset:
         information on the name, type, url link, data loading format, graph
         properties
     """
-
     def __init__(
         self,
         metadata_yaml_file=None,
@@ -159,15 +156,20 @@ def unload(self):
         """
         self._edgelist = None

-    def get_edgelist(self, download=False):
+    def get_edgelist(self, download=False, reader=cudf.read_csv):
         """
-        Return an Edgelist
+        Return a DataFrame that represents a graph edgelist.

         Parameters
         ----------
         download : Boolean (default=False)
             Automatically download the dataset from the 'url' location within
             the YAML file.
+
+        reader : callable (default=cudf.read_csv)
+            A callable to use to read the dataset. The callable must be
+            compatible with the pandas/cudf read_csv() function and return a
+            compatible DataFrame.
         """
         if self._edgelist is None:
             full_path = self.get_path()
@@ -183,7 +185,7 @@ def get_edgelist(self, download=False):
             header = None
             if isinstance(self.metadata["header"], int):
                 header = self.metadata["header"]
-            self._edgelist = cudf.read_csv(
+            self._edgelist = reader(
                 full_path,
                 delimiter=self.metadata["delim"],
                 names=self.metadata["col_names"],
@@ -219,6 +221,10 @@ def get_graph(
             dataset -if present- will be applied to the Graph. If the
             dataset does not contain weights, the Graph returned will
             be unweighted regardless of ignore_weights.
+
+        store_transposed: Boolean (default=False)
+            If True, stores the transpose of the adjacency matrix. Required
+            for certain algorithms, such as pagerank.
         """
         if self._edgelist is None:
             self.get_edgelist(download)
@@ -237,20 +243,19 @@ def get_graph(
                 "(or subclass) type or instance, got: "
                 f"{type(create_using)}"
             )
-
         if len(self.metadata["col_names"]) > 2 and not (ignore_weights):
             G.from_cudf_edgelist(
                 self._edgelist,
-                source="src",
-                destination="dst",
-                edge_attr="wgt",
+                source=self.metadata["col_names"][0],
+                destination=self.metadata["col_names"][1],
+                edge_attr=self.metadata["col_names"][2],
                 store_transposed=store_transposed,
             )
         else:
             G.from_cudf_edgelist(
                 self._edgelist,
-                source="src",
-                destination="dst",
+                source=self.metadata["col_names"][0],
+                destination=self.metadata["col_names"][1],
                 store_transposed=store_transposed,
             )
         return G
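
Reviewer note: the change above stops hard-coding `"src"`, `"dst"` and `"wgt"` and takes the column names positionally from the metadata. A rough sketch of that selection logic follows. The helper name `edgelist_kwargs` and the `from`/`to`/`weight` column names are hypothetical and do not appear in the PR; the sketch only builds the keyword arguments and does not call cugraph.

```python
def edgelist_kwargs(metadata, ignore_weights=False, store_transposed=False):
    """Build from_cudf_edgelist keyword arguments from dataset metadata.

    Hypothetical helper: it mirrors the branch in get_graph(), where the
    first two column names are source and destination and an optional
    third column is the edge attribute.
    """
    col_names = metadata["col_names"]
    kwargs = {
        "source": col_names[0],
        "destination": col_names[1],
        "store_transposed": store_transposed,
    }
    # Only pass edge_attr when a third column exists and weights are wanted.
    if len(col_names) > 2 and not ignore_weights:
        kwargs["edge_attr"] = col_names[2]
    return kwargs


# Illustrative metadata dicts; the non-default column names are made up.
unweighted = {"col_names": ["src", "dst"]}
weighted = {"col_names": ["from", "to", "weight"]}

print(edgelist_kwargs(unweighted))
print(edgelist_kwargs(weighted))
print(edgelist_kwargs(weighted, ignore_weights=True))
```

With the old hard-coded names, a metadata file using columns such as `from`/`to` would presumably have failed at graph construction. With the positional lookup the names only need to agree with what the reader assigned.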
@@ -331,18 +336,18 @@ def download_all(force=False):

 def set_download_dir(path):
     """
-    Set the download location fors datasets
+    Set the download location for datasets

     Parameters
     ----------
     path : String
         Location used to store datafiles
     """
     if path is None:
-        default_download_dir.clear()
+        default_download_dir.reset()
     else:
         default_download_dir.path = path


 def get_download_dir():
-    return default_download_dir.path.absolute()
+    return default_download_dir.path
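
Reviewer note: the new `reader` parameter on `get_edgelist()` is a plain callable injection. The sketch below imitates the visible part of that code path with a stand-in reader, so it runs without cudf or pandas. `read_edgelist` and `recording_reader` are hypothetical names, and forwarding `header=` to the reader is an assumption, since the rest of the call is elided in the diff.

```python
def read_edgelist(full_path, metadata, reader):
    """Imitates the visible part of Dataset.get_edgelist()."""
    # A non-integer "header" entry means "no header row". The metadata files
    # say `header: None`, which a YAML 1.1 loader such as PyYAML reads as the
    # string "None" rather than a null, so this branch is what handles it.
    header = None
    if isinstance(metadata["header"], int):
        header = metadata["header"]
    return reader(
        full_path,
        delimiter=metadata["delim"],
        names=metadata["col_names"],
        header=header,  # assumption: the elided arguments include this
    )


def recording_reader(path, **kwargs):
    """Stand-in for cudf.read_csv / pandas.read_csv that records its call."""
    return {"path": path, **kwargs}


# Values copied from the shape of the metadata files; the path is made up.
metadata = {"header": "None", "delim": " ", "col_names": ["src", "dst"]}
call = read_edgelist("/tmp/example.csv", metadata, recording_reader)
print(call)
```

Per the new docstring, the intended callables are ones compatible with the pandas/cudf `read_csv()` function, for example `pandas.read_csv` when a host-memory DataFrame is wanted.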
22 changes: 22 additions & 0 deletions python/cugraph/cugraph/datasets/metadata/cit-patents.yaml
@@ -0,0 +1,22 @@
+name: cit-Patents
+file_type: .csv
+description: A citation graph that includes all citations made by patents granted between 1975 and 1999, totaling 16,522,438 citations.
+author: NBER
+refs:
+  J. Leskovec, J. Kleinberg and C. Faloutsos. Graphs over Time Densification Laws, Shrinking Diameters and Possible Explanations.
+  ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD), 2005.
+delim: " "
+header: None
+col_names:
+  - src
+  - dst
+col_types:
+  - int32
+  - int32
+has_loop: true
+is_directed: true
+is_multigraph: false
+is_symmetric: false
+number_of_edges: 16518948
+number_of_nodes: 3774768
+url: https://data.rapids.ai/cugraph/datasets/cit-Patents.csv
21 changes: 21 additions & 0 deletions python/cugraph/cugraph/datasets/metadata/europe_osm.yaml
@@ -0,0 +1,21 @@
+name: europe_osm
+file_type: .csv
+description: A graph of OpenStreetMap data for Europe.
+author: M. Kobitzsh / Geofabrik GmbH
+refs:
+  Rossi, Ryan. Ahmed, Nesreen. The Network Data Respoistory with Interactive Graph Analytics and Visualization.
+delim: " "
+header: None
+col_names:
+  - src
+  - dst
+col_types:
+  - int32
+  - int32
+has_loop: false
+is_directed: false
+is_multigraph: false
+is_symmetric: true
+number_of_edges: 54054660
+number_of_nodes: 50912018
+url: https://data.rapids.ai/cugraph/datasets/europe_osm.csv
26 changes: 26 additions & 0 deletions python/cugraph/cugraph/datasets/metadata/hollywood.yaml
@@ -0,0 +1,26 @@
+name: hollywood
+file_type: .csv
+description:
+  A graph of movie actors where vertices are actors, and two actors are
+  joined by an edge whenever they appeared in a movie together.
+author: Laboratory for Web Algorithmics (LAW)
+refs:
+  The WebGraph Framework I Compression Techniques, Paolo Boldi
+  and Sebastiano Vigna, Proc. of the Thirteenth International
+  World Wide Web Conference (WWW 2004), 2004, Manhattan, USA,
+  pp. 595--601, ACM Press.
+delim: " "
+header: None
+col_names:
+  - src
+  - dst
+col_types:
+  - int32
+  - int32
+has_loop: false
+is_directed: false
+is_multigraph: false
+is_symmetric: true
+number_of_edges: 57515616
+number_of_nodes: 1139905
+url: https://data.rapids.ai/cugraph/datasets/hollywood.csv
22 changes: 22 additions & 0 deletions python/cugraph/cugraph/datasets/metadata/soc-livejournal1.yaml
@@ -0,0 +1,22 @@
+name: soc-LiveJournal1
+file_type: .csv
+description: A graph of the LiveJournal social network.
+author: L. Backstrom, D. Huttenlocher, J. Kleinberg, X. Lan
+refs:
+  L. Backstrom, D. Huttenlocher, J. Kleinberg, X. Lan. Group Formation in
+  Large Social Networks Membership, Growth, and Evolution. KDD, 2006.
+delim: " "
+header: None
+col_names:
+  - src
+  - dst
+col_types:
+  - int32
+  - int32
+has_loop: true
+is_directed: true
+is_multigraph: false
+is_symmetric: false
+number_of_edges: 68993773
+number_of_nodes: 4847571
+url: https://data.rapids.ai/cugraph/datasets/soc-LiveJournal1.csv
22 changes: 22 additions & 0 deletions python/cugraph/cugraph/datasets/metadata/soc-twitter-2010.yaml
@@ -0,0 +1,22 @@
+name: soc-twitter-2010
+file_type: .csv
+description: A network of follower relationships from a snapshot of Twitter in 2010, where an edge from i to j indicates that j is a follower of i.
+author: H. Kwak, C. Lee, H. Park, S. Moon
+refs:
+  J. Yang, J. Leskovec. Temporal Variation in Online Media. ACM Intl.
+  Conf. on Web Search and Data Mining (WSDM '11), 2011.
+delim: " "
+header: None
+col_names:
+  - src
+  - dst
+col_types:
+  - int32
+  - int32
+has_loop: false
+is_directed: false
+is_multigraph: false
+is_symmetric: false
+number_of_edges: 530051354
+number_of_nodes: 21297772
+url: https://data.rapids.ai/cugraph/datasets/soc-twitter-2010.csv
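
Reviewer note: since the `__init__.py` comment asks users to be mindful of memory usage, here is a back-of-the-envelope sketch that estimates the raw column storage of a loaded edgelist from the `col_types` and `number_of_edges` fields in the metadata files above. The helper `estimated_edgelist_bytes` and the `DTYPE_BYTES` table are hypothetical. The result is only a lower bound on the DataFrame's column data and says nothing about the extra memory a graph built from it would need.

```python
# Bytes per element. Only int32 appears in the metadata files in this PR;
# the other entries are included for illustration.
DTYPE_BYTES = {"int32": 4, "int64": 8, "float32": 4, "float64": 8}


def estimated_edgelist_bytes(metadata):
    """Lower-bound size of the edgelist columns: rows times bytes per row."""
    row_bytes = sum(DTYPE_BYTES[t] for t in metadata["col_types"])
    return metadata["number_of_edges"] * row_bytes


# Edge counts copied from the metadata files in this PR.
benchmarks = {
    "cit-Patents": {"col_types": ["int32", "int32"], "number_of_edges": 16518948},
    "europe_osm": {"col_types": ["int32", "int32"], "number_of_edges": 54054660},
    "hollywood": {"col_types": ["int32", "int32"], "number_of_edges": 57515616},
    "soc-LiveJournal1": {"col_types": ["int32", "int32"], "number_of_edges": 68993773},
}

for name, md in benchmarks.items():
    print(f"{name}: about {estimated_edgelist_bytes(md) / 1e6:.0f} MB of column data")
```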
79 changes: 0 additions & 79 deletions python/cugraph/cugraph/experimental/datasets/__init__.py

This file was deleted.
