[FEA]: Add MG DASK edgelist and graph support to Datasets API #3218

Closed
Tracked by #3357
acostadon opened this issue Feb 1, 2023 · 2 comments · Fixed by #4035
Labels: feature request (New feature or request)

Comments

acostadon (Contributor) commented Feb 1, 2023

The DataSets class should include some datasets that require MG.

  • Identify datasets
  • Add to Datasets
acostadon added the feature request label on Feb 1, 2023
BradReesWork added this to the 23.04 milestone on Feb 2, 2023
BradReesWork removed this from the 23.04 milestone on Mar 1, 2023
huiyuxie (Contributor) commented Nov 24, 2023

Hi! I'm interested in implementing this feature. It has not yet been assigned, and I'd like to take it on.

Here is what I plan to do:

  1. Add a dask_get_edgelist() function to the Dataset API that returns an edgelist by invoking dask_cudf.read_csv().
  2. Add a dask_get_graph() function to the Dataset API that returns a Graph object by invoking Graph.from_dask_cudf_edgelist().
  3. Test the new functions above by adding test_dask_get_edgelist() and test_dask_get_graph() to the test_dataset file.

The use case would then be:

```python
from cugraph.datasets import karate

E = karate.dask_get_edgelist()
G = karate.dask_get_graph()
```
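
For concreteness, here is a simplified sketch of what the two wrappers might do. These are hypothetical free functions with hard-coded column names and dtypes; in the actual Dataset API they would be methods that pull the path and column metadata from the dataset's .yaml file:

```python
# Hypothetical sketch of the proposed wrappers; in the real Dataset API
# these would be methods, with path/names/dtypes taken from metadata.
import dask_cudf
import cugraph


def dask_get_edgelist(csv_path):
    # Read the edge list as a distributed (multi-GPU) dask_cudf DataFrame.
    return dask_cudf.read_csv(
        csv_path,
        delimiter=" ",
        names=["src", "dst", "wgt"],
        dtype=["int32", "int32", "float32"],
    )


def dask_get_graph(csv_path):
    # Build an MG graph from the distributed edge list.
    ddf = dask_get_edgelist(csv_path)
    G = cugraph.Graph()
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="wgt")
    return G
```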

I'm not quite sure whether the above would satisfy the two requirements: (1) identify datasets and (2) add to Datasets. Should I add more functions? For example, one that determines whether a dataset requires MG handling based on its size, or that tags the dataset metadata (.yaml file) to indicate the need for MG processing, and then proceeds with MG?

I will set up a multi-GPU EC2 instance on AWS to implement this feature and will probably create a PR within the next two weeks. Any feedback or suggestions would be appreciated! I will check back here in my spare time :)

huiyuxie (Contributor) commented Nov 26, 2023

There are more tasks to consider:
4. Add __dask_download_csv() and dask_download_all() to the Dataset API to fetch and read .csv files via MG (a sketch of the download step follows below).
5. Test the new functions above in the test_dataset file.
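
A minimal sketch of what the download step might look like; the function name, URL handling, and download directory here are all hypothetical, since the real Dataset API keeps these in its metadata:

```python
# Hypothetical download helper: fetch the CSV once, then dask_cudf can
# read the local file across GPUs. Names and paths are illustrative.
import os
import urllib.request


def _dask_download_csv(url, download_dir):
    os.makedirs(download_dir, exist_ok=True)
    local_path = os.path.join(download_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Download only if the file is not already cached locally.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```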

rapids-bot closed this as completed in #4035 on Jan 9, 2024
rapids-bot pushed a commit that referenced this issue on Jan 9, 2024
Hi! I chose to go further with some simple work beyond the docs. This PR closes #3218.

Here is what I have done in this PR:
1. Added `get_dask_edgelist()` and `get_dask_graph()` (plus an internal helper function `__download_dask_csv()`) to the Dataset API; a usage sketch follows this list.
2. Executed all necessary tests for these new functions.
3. Improved existing functions in the Dataset API and conducted tests to verify improvements.
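
For reference, a minimal usage sketch of the new methods; the Dask cluster and Comms setup is the standard MG cuGraph pattern and is not specific to this PR:

```python
# Standard single-node multi-GPU setup, plus the new Dataset API calls.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cugraph.dask.comms import comms as Comms
from cugraph.datasets import karate

cluster = LocalCUDACluster()
client = Client(cluster)
Comms.initialize(p2p=True)

ddf = karate.get_dask_edgelist()  # distributed dask_cudf edge list
G = karate.get_dask_graph()       # MG cugraph.Graph

# ... run MG algorithms on G ...

Comms.destroy()
client.close()
cluster.close()
```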

Here are some additional details regarding this PR:
1. Building and testing were done with version 23.12 instead of the default 24.02. Since the cugraph-ops library is no longer open, I was unable to build from source with 24.02, so I built and tested the code on 23.12 and then transferred the updated file to 24.02 before creating this PR. (I would appreciate any guidance on how external contributors can build from 24.02.)
2. All tests from the test file have passed, but some warnings remain, as shown below:
```bash
============================================================ warnings summary ============================================================
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
  /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/cudf/core/index.py:3284: FutureWarning: cudf.StringIndex is deprecated and will be removed from cudf in a future version. Use cudf.Index with the appropriate dtype instead.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
```
I think the warnings above come from the `from_dask_cudf_edgelist` call, but I currently have no idea how to remove them. I will do my best to address them if anyone has ideas.

3. The `get_edgelist()` function returns a deep copy of the object, but this is not supported for `get_dask_edgelist()` since only shallow copies are allowed for Dask cuDF DataFrames (see [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.DataFrame.copy)). This leads to a problem: if a user modifies the returned dataframe, the changes are reflected in the internal `self._edgelist` object, so a later call to `get_dask_graph()` builds a graph that differs from one constructed directly from the data file.
4. I am uncertain about the requirements for (1) identifying datasets and (2) adding them to Datasets. If there is a need for another function that determines whether a dataset requires MG handling based on its size, or for tagging the dataset metadata (.yaml file) to indicate the necessity of MG processing, please let me know. Also, I welcome any suggestions for further features.
5. When I ran pytest on other test files, the most common warning was:
```bash
/home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/dask_cudf/io/csv.py:79: FutureWarning: `chunksize` is deprecated and will be removed in the future. Please use `blocksize` instead.
```
The `chunksize` keyword is no longer in use (see the [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.read_csv)). I checked all related functions in the repository and found that they still use `chunksize`; the one-keyword fix is sketched below. If there is a need to change them to `blocksize`, I will create another PR to address this.
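
The rename is a one-keyword change at each call site, for example (the path here is illustrative):

```python
import dask_cudf

# Deprecated spelling -- emits the FutureWarning shown above:
# ddf = dask_cudf.read_csv("datasets/karate.csv", chunksize="32 MiB")

# Preferred spelling:
ddf = dask_cudf.read_csv("datasets/karate.csv", blocksize="32 MiB")
```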

Any comments and suggestions are welcome!

Authors:
  - Huiyu Xie (https://github.com/huiyuxie)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4035