Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update documentation for Dask cuDF #16671

Merged
merged 10 commits into from
Aug 28, 2024

Conversation

rjzamora
Copy link
Member

Description

General documentation update for Dask cuDF:

  • Adds README.md file to dask_cudf (this is currently a symlink to cudf's README, which isn't terribly helpful)
  • Emphasizes direct usage of the dask.dataframe API (rather than the explicit dask_cudf API)
    • Including the to_backend API
  • Advertises query-planning support
  • Includes a simple Dask CUDA example (and best-practices link)

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added 2 - In Progress Currently a work in progress doc Documentation dask Dask issue non-breaking Non-breaking change cuDF (Python) labels Aug 27, 2024
@rjzamora rjzamora self-assigned this Aug 27, 2024
Copy link

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@github-actions github-actions bot added the Python Affects Python cuDF API. label Aug 27, 2024
@rjzamora
Copy link
Member Author

cc @quasiben @pentschev @wence- @charlesbluca @madsbk (happy to get feedback from any of you with time and interest)

Copy link
Member

@pentschev pentschev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall LGTM, thanks @rjzamora !

I've left some suggestions on a single comment, for whatever reason GH is not letting me comment on each line individually.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason GH is not allowing me to comment on each line individually, so I'm doing it all here.

Line 1: the link to the image is wrong, it should be ../../img/rapids_logo.png, I think that should be allowed but if not you may need to symlink it instead.

Line 73: Unoptimzed -> Unoptimized.

Line 108: # Use memory pool for faster allocations -> # Use 90% of GPU memory as a pool for faster allocations.

Line 116: result.compute() # ... is it intentional that you're calling compute()? That is generally a suboptimal practice and having that in official docs tend to encourage that practice.

Copy link
Member Author

@rjzamora rjzamora Aug 27, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For some reason GH is not allowing me to comment on each line individually, so I'm doing it all here

Weird!

Line 116: result.compute() # ... is it intentional that you're calling compute()? That is generally a suboptimal practice and having that in official docs tend to encourage that practice.

Good catch! That was supposed to be computing a groupby aggregation, but it seems I copied over the wrong example.

Copy link
Member Author

@rjzamora rjzamora Aug 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I changed the example to a groupby aggregation (which is typically safe to compute), and added this note:

"This example uses compute to materialize a concrete cudf.DataFrame object in local memory. Never call compute on a large collection that cannot fit comfortably in the memory of a single GPU! See Dask's documentation on managing computation for more details."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here, I can't comment on the diff lines. Replacing a symlink with a separate file is probably an edge case for the GitHub UI.

General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.

Line 105: Just creating the client variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.

Good point - I'll address this in a follow up.

Line 105: Just creating the client variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.

Yes, this is the common/suggested pattern in dask (unless you are working with multiple clusters at the same time. Which is extremely rare)

Copy link
Member

@madsbk madsbk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@rjzamora rjzamora marked this pull request as ready for review August 28, 2024 14:37
@rjzamora rjzamora requested a review from a team as a code owner August 28, 2024 14:37
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Aug 28, 2024
@rjzamora
Copy link
Member Author

This is a huge improvement over the README we currently have in dask_cudf, so I will mark this as ready to merge now.
That said, I'm happy to follow up with further revisions if others find problems or have further suggestions.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 28, 2024
@rjzamora
Copy link
Member Author

/merge

@rjzamora
Copy link
Member Author

Oops, just realized I never got a proper cudf-dask-codeowners approval. @pentschev - Does this seem good to go?

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for writing this!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here, I can't comment on the diff lines. Replacing a symlink with a separate file is probably an edge case for the GitHub UI.

General comment: sometimes this says "The user" and sometimes it says "You." Let's use a consistent tense/audience.

Line 105: Just creating the client variable is enough? You don't have to pass it in or register it somewhere? I am not used to this pattern.

@rapids-bot rapids-bot bot merged commit c600a65 into rapidsai:branch-24.10 Aug 28, 2024
71 checks passed
@bdice
Copy link
Contributor

bdice commented Aug 28, 2024

Oops. I didn't realize my approval would merge this PR. @rjzamora Can you apply my suggestions here? #16671 (comment)

Also we need to make sure this information is all in sync with the dask-cudf docs, which don't mention query planning and other important topics. https://docs.rapids.ai/api/dask-cudf/stable/

@rjzamora rjzamora deleted the update-dask-cudf-docs branch August 28, 2024 16:57
@rjzamora
Copy link
Member Author

Oops. I didn't realize my approval would merge this PR

Yeah, my fault! No worries - I will submit a follow-up today to update the API docs and I'll include your suggestion to use consistent language.

rapids-bot bot pushed a commit that referenced this pull request Sep 16, 2024
Follow up to #16671

- Moves most of the information recently added to the dask-cudf README into the dask-cudf Sphinx documentation
- Adds a "Quick-start" example to the simplified dask-cudf README

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Benjamin Zaitlen (https://github.com/quasiben)
  - Peter Andreas Entschev (https://github.com/pentschev)

URL: #16765
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge dask Dask issue doc Documentation non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
Status: Done
Development

Successfully merging this pull request may close these issues.

4 participants