[FEA]: Add MG DASK edgelist and graph support to Datasets API #3218

Closed
Tracked by #3357
acostadon opened this issue Feb 1, 2023 · 2 comments · Fixed by #4035
Labels: feature request (New feature or request)

Comments

acostadon (Contributor) commented Feb 1, 2023

The DataSets class should include some datasets that require MG.

  • Identify datasets
  • Add to Datasets
acostadon added the feature request label on Feb 1, 2023
BradReesWork added this to the 23.04 milestone on Feb 2, 2023
BradReesWork removed this from the 23.04 milestone on Mar 1, 2023
huiyuxie (Contributor) commented Nov 24, 2023

Hi! I'm interested in implementing this feature. It has not yet been assigned, and I'd like to take it on.

Here is what I plan to do:

  1. Add a dask_get_edgelist() function to the Dataset API that returns an edgelist by invoking dask_cudf.read_csv().
  2. Add a dask_get_graph() function to the Dataset API that returns a Graph object by invoking Graph.from_dask_cudf_edgelist().
  3. Test the new functions above by adding test_dask_get_edgelist() and test_dask_get_graph() to the test_dataset file.

The use case would then be:

```python
from cugraph.datasets import karate

E = karate.dask_get_edgelist()
G = karate.dask_get_graph()
```
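
For concreteness, here is a simplified sketch of what the two wrappers might do. These are hypothetical free functions with hard-coded column names and dtypes; in the actual Dataset API they would be methods that pull the path and column metadata from the dataset's .yaml file:

```python
# Hypothetical sketch of the proposed wrappers; in the real Dataset API
# these would be methods, with path/names/dtypes taken from metadata.
import dask_cudf
import cugraph


def dask_get_edgelist(csv_path):
    # Read the edge list as a distributed (multi-GPU) dask_cudf DataFrame.
    return dask_cudf.read_csv(
        csv_path,
        delimiter=" ",
        names=["src", "dst", "wgt"],
        dtype=["int32", "int32", "float32"],
    )


def dask_get_graph(csv_path):
    # Build an MG graph from the distributed edge list.
    ddf = dask_get_edgelist(csv_path)
    G = cugraph.Graph()
    G.from_dask_cudf_edgelist(ddf, source="src", destination="dst", edge_attr="wgt")
    return G
```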

I'm not quite sure whether the above would satisfy the two requirements: (1) identify datasets and (2) add to Datasets. Should I add more functions? For example, one that determines whether a dataset requires MG handling based on its size, or that tags the dataset metadata (.yaml file) to indicate the need for MG processing, and then proceeds with MG?

I will set up a multi-GPU EC2 instance on AWS to implement this feature and will probably create a PR within the next two weeks. Any feedback or suggestions would be appreciated! I will check back here in my spare time :)

huiyuxie (Contributor) commented Nov 26, 2023

There are more tasks to consider:
4. Add __dask_download_csv() and dask_download_all() to the Dataset API to fetch and read .csv files via MG (a sketch of the download step follows below).
5. Test the new functions above in the test_dataset file.
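
A minimal sketch of what the download step might look like; the function name, URL handling, and download directory here are all hypothetical, since the real Dataset API keeps these in its metadata:

```python
# Hypothetical download helper: fetch the CSV once, then dask_cudf can
# read the local file across GPUs. Names and paths are illustrative.
import os
import urllib.request


def _dask_download_csv(url, download_dir):
    os.makedirs(download_dir, exist_ok=True)
    local_path = os.path.join(download_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        # Download only if the file is not already cached locally.
        urllib.request.urlretrieve(url, local_path)
    return local_path
```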

rapids-bot closed this as completed in #4035 on Jan 9, 2024
rapids-bot pushed a commit that referenced this issue on Jan 9, 2024
Hi! I chose to go further with some simple work beyond the docs. This PR closes #3218.

Here is what I have done in this PR:
1. Added `get_dask_edgelist()` and `get_dask_graph()` (plus an internal helper function `__download_dask_csv()`) to the Dataset API; a usage sketch follows this list.
2. Executed all necessary tests for these new functions.
3. Improved existing functions in the Dataset API and conducted tests to verify improvements.
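
For reference, a minimal usage sketch of the new methods; the Dask cluster and Comms setup is the standard MG cuGraph pattern and is not specific to this PR:

```python
# Standard single-node multi-GPU setup, plus the new Dataset API calls.
from dask_cuda import LocalCUDACluster
from dask.distributed import Client
from cugraph.dask.comms import comms as Comms
from cugraph.datasets import karate

cluster = LocalCUDACluster()
client = Client(cluster)
Comms.initialize(p2p=True)

ddf = karate.get_dask_edgelist()  # distributed dask_cudf edge list
G = karate.get_dask_graph()       # MG cugraph.Graph

# ... run MG algorithms on G ...

Comms.destroy()
client.close()
cluster.close()
```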

Here are some additional details regarding this PR:
1. Building and testing were done with version 23.12 instead of the default 24.02. Since the cugraph-ops library is no longer open, I was unable to build from source with 24.02, so I built and tested the code on 23.12 and then transferred the updated file to 24.02 before creating this PR. (I would appreciate any guidance on how external contributors can build from 24.02.)
2. All tests from the test file have passed, but some warnings remain, as shown below:
```bash
============================================================ warnings summary ============================================================
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_get_dask_graph[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
cugraph/tests/utils/test_dataset.py::test_weights_dask[dataset0]
  /home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/cudf/core/index.py:3284: FutureWarning: cudf.StringIndex is deprecated and will be removed from cudf in a future version. Use cudf.Index with the appropriate dtype instead.
    warnings.warn(

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
```
I think the warnings above come from the `from_dask_cudf_edgelist` call, but I currently have no idea how to remove them. I will do my best to address them if anyone has ideas.

3. The `get_edgelist()` function returns a deep copy of the object, but this is not supported for `get_dask_edgelist()` since only shallow copies are allowed for Dask cuDF DataFrames (see [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.DataFrame.copy)). This leads to a problem: if a user modifies the returned dataframe, the changes are reflected in the internal `self._edgelist` object, so a later call to `get_dask_graph()` builds a graph that differs from one constructed directly from the data file.
4. I am uncertain about the requirements for (1) identifying datasets and (2) adding them to Datasets. If there is a need for another function that determines whether a dataset requires MG handling based on its size, or for tagging the dataset metadata (.yaml file) to indicate the necessity of MG processing, please let me know. Also, I welcome any suggestions for further features.
5. When I ran pytest on other test files, the most common warning was:
```bash
/home/ubuntu/miniconda3/envs/cugraph_dev/lib/python3.10/site-packages/dask_cudf/io/csv.py:79: FutureWarning: `chunksize` is deprecated and will be removed in the future. Please use `blocksize` instead.
```
The `chunksize` keyword is no longer in use (see the [docs](https://docs.rapids.ai/api/dask-cudf/legacy/api/#dask_cudf.read_csv)). I checked all related functions in the repository and found that they still use `chunksize`; the one-keyword fix is sketched below. If there is a need to change them to `blocksize`, I will create another PR to address this.
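
The rename is a one-keyword change at each call site, for example (the path here is illustrative):

```python
import dask_cudf

# Deprecated spelling -- emits the FutureWarning shown above:
# ddf = dask_cudf.read_csv("datasets/karate.csv", chunksize="32 MiB")

# Preferred spelling:
ddf = dask_cudf.read_csv("datasets/karate.csv", blocksize="32 MiB")
```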

Any comments and suggestions are welcome!

Authors:
  - Huiyu Xie (https://github.com/huiyuxie)
  - Rick Ratzel (https://github.com/rlratzel)

Approvers:
  - Rick Ratzel (https://github.com/rlratzel)

URL: #4035