GPU CI #138

Closed
quasiben opened this issue Mar 24, 2021 · 26 comments

Comments

@quasiben
Member

quasiben commented Mar 24, 2021

We've been chatting with folks from the ops teams within RAPIDS about getting access to the gpuCI infrastructure. gpuCI is the GPU-based CI platform used for testing throughout the RAPIDS ecosystem. We've been asking for access for a couple of reasons:

  1. We currently test GPU portions of Distributed only, and the testing occurs in an out-of-band manner. That is, we test the GPU and UCX bits of Distributed in ucx-py and dask-cuda. This is better than no testing; however, it is limited to Distributed and only runs when developers push changes to dask-cuda/ucx-py.
  2. The lack of GPU testing infrastructure for Dask has resulted in, and can continue to result in, breakages. Additionally, without GPU CI, developers will be unaware that something is broken until a user raises an issue. This occurred somewhat recently within Dask: Failing CuPy tests dask#7324 and Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from dask#6997. These issues are currently being fixed, but we'd like to improve this cycle moving forward.

Gaining access to gpuCI resolves both of these problems and will allow us to test incoming PRs to Dask, ensuring GPU support is maintained without breakages or undue burden.
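To make the failure mode concrete, the breakages above came from CuPy-backed Dask array code paths. Here is a minimal, hypothetical sketch of the kind of test gpuCI would run on every PR (it assumes CuPy is installed on the runner and is not copied from the Dask test suite):

```python
import pytest

# Skip cleanly on CPU-only machines; gpuCI runners would have CuPy available.
cupy = pytest.importorskip("cupy")

import dask.array as da


def test_cupy_backed_array_sum():
    # Build a Dask array backed by CuPy chunks.
    x = cupy.arange(100, dtype="float64").reshape(10, 10)
    d = da.from_array(x, chunks=(5, 5))

    # The collection should stay GPU-backed end to end.
    assert isinstance(d._meta, cupy.ndarray)
    assert float(d.sum().compute()) == float(x.sum())
```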

While we've been talking with the ops folks, we've suggested that the testing matrix be a single row:

  • latest OS (ubuntu)
  • latest cudatoolkit (11.0/11.2)
  • latest stable CuPy in RAPIDS
  • latest stable cuDF in RAPIDS
  • latest NumPy (need NEP-35)

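The NumPy requirement is there because NEP-35 adds the `like=` argument to array-creation functions, which is what lets Dask create CuPy-backed chunks through the NumPy API. A rough illustration of that dispatch (assuming a CUDA machine with CuPy installed; not code from Dask itself):

```python
import numpy as np
import cupy

# A GPU array used as the "like" template.
template = cupy.zeros(3)

# With NEP-35 (NumPy >= 1.20), array creation dispatches on the template,
# so this returns a CuPy array rather than a NumPy one.
a = np.arange(5, like=template)
assert isinstance(a, cupy.ndarray)
```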
This service will start off as something maintainers can ping if they think a PR might need GPU testing. This might include changes to array/dataframe functions or new functionality. While this is not the ideal solution, it is a step towards better GPU testing for Dask without much effort on the part of the maintainers.

For this to work, a bot (gputester) from gpuCI will need at least “triage” rights to monitor comments and respond with pass/fail notifications on the PR in question.

cc @pentschev @jrbourbeau

@jrbourbeau
Member

This sounds good to me 👍

Do we also want to include this functionality in distributed? Or just dask to start, since, as you mentioned, there is some distributed + GPU coverage in dask-cuda/ucx-py?

@jakirkham
Member

I think we will want this for both. There are serialization functions, UCX-Py (as you mentioned), etc. in Distributed that would be good to test
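For example, the CUDA serialization path is the sort of thing worth exercising. A rough sketch of such a round trip (assuming a GPU runner with CuPy installed; illustrative only, not taken from the Distributed test suite):

```python
import pytest

cupy = pytest.importorskip("cupy")

from distributed.protocol import deserialize, serialize


def test_cupy_serialization_roundtrip():
    x = cupy.arange(10)
    # Prefer the CUDA serializer family, falling back if it is unavailable.
    header, frames = serialize(x, serializers=("cuda", "dask", "pickle"))
    y = deserialize(header, frames)
    assert isinstance(y, cupy.ndarray)
    assert bool((x == y).all())
```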

@jrbourbeau
Member

Also, thanks for raising this issue and coordinating with the rest of the RAPIDS team. Let me know if there's anything I can do to help out with this effort

@GenevieveBuckley
Collaborator

It'd be good if we could put something like this in place for dask-image, too. Ad hoc local testing isn't going very well.

@aktech

aktech commented May 24, 2021

It seems like you already have a solution; FWIW, I'll share my two cents. I created a service for problems like these, which basically runs custom machines (including GPUs) in GitHub Actions: https://cirun.io/

We are about to use it in the sgkit project here: https://github.com/pystatgen/sgkit/pull/567/checks?check_run_id=2618833216

It is fairly simple to set up: all you need is a cloud account (AWS or GCP) and a simple YAML file describing what kind of machines you need, and Cirun will spin up ephemeral machines on your cloud for GitHub Actions to run on. It's native to the GitHub ecosystem, which means you can see logs and triggers in GitHub's interface itself, just like any GitHub Actions run.

Notes

  • Cirun is free for open source projects (you only pay your cloud provider for machine usage).
  • I can help set it up if you like

@charlesbluca
Member

Bumping this, as a similar issue happened recently, resulting in rapidsai/dask-cuda#634; do folks have a preference between gpuCI and Cirun?

@quasiben
Member Author

I think it would be good to experiment with Cirun. gpuCI is going to require us to get some time from the NVIDIA ops folks, which we don't have quite yet.

@charlesbluca
Member

Great! @aktech would you be willing to help set this up on Distributed? We should probably sync offline to discuss account/cloud provider setup.

@aktech

aktech commented Jun 29, 2021

@charlesbluca Sure, we can catch up offline about the account. Meanwhile, I can get it working with a personal AWS account.

@aktech

aktech commented Jun 29, 2021

Here is a run of the Dask Distributed CI (for Python 3.7) on GPU via Cirun.io:
https://github.com/aktech/distributed/runs/2945849888?check_suite_focus=true

Here is the branch

@aktech

aktech commented Jun 30, 2021

Should I go ahead and create a PR for the full matrix on distributed?

@quasiben
Member Author

I would suggest we stick with one version of Python (maybe 3.8) and the latest available versions of everything else (CuPy/CUDA/etc.).

@charlesbluca
Member

After some internal conversation with NVIDIA ops folks, we can confirm that gpuCI does have the capacity to run tests for Dask and Distributed - the tests would be triggered both on commits to the main branch and on PRs opened by a set of approved users (this would probably start out as the members of the Dask org and could be expanded later on).

Currently, we are working on getting this set up on my own forks of Dask/Distributed; here are PRs adding the relevant gpuCI scripts:

Once testing is working on these forks, we can manage the required permissions/webhooks and merge these branches into upstream Dask/Distributed.

With this option available, we probably don't need to use Cirun for the time being, though it still seems like a good option if we intend to expand GPU testing far beyond the current repos in question.

@aktech

aktech commented Jul 14, 2021

@charlesbluca That's some excellent news! Looking forward to it.

I'll close the Cirun PR; feel free to let me know if I can help with anything, or if you need help setting up Cirun for any other project.

@jrbourbeau
Member

Thanks @charlesbluca @aktech for your continued work on this. @charlesbluca FWIW I don't have a strong opinion on gpuCI vs. Cirun. I suspect you and folks around you have the most context/expertise to make that decision

> tests would be triggered both on commits to the main branch and on PRs opened by a set of approved users (this would probably start out as the members of the Dask org and could be expanded later on)

Is it possible to also trigger CI runs on PRs from non-Dask org members through some other mechanism like having test-gpu in the commit message, adding a gpu GitHub label, or something else? This would be useful for situations when a non-Dask org member opens a PR that touches GPU-related code

@quasiben
Member Author

quasiben commented Jul 30, 2021

Yes, we will have a mechanism to test non-Dask org member PRs.

We'll see the comment "can one of the admins please verify the patch", and an admin would then respond with ok to test.

In order to enable gpuCI, please do the following to each repo:

I'm planning on making these changes later this afternoon unless there are objections

@jrbourbeau
Member

Thanks for the update @quasiben -- that sounds good to me

@quasiben
Member Author

Note that we are using a Docker image with pre-installed dependencies here:
https://github.com/rapidsai/dask-build-environment

@GenevieveBuckley
Collaborator

gpuCI seems to have been running pretty well on the main dask repository recently - perhaps it's a good time to open a conversation about whether we could do the same on the dask-image repository?

cc @jakirkham & @quasiben

@quasiben
Member Author

Thanks for the ping @GenevieveBuckley -- we can bring this up with the ops folks, see what capacity they have, and try to find out about dask/distributed usage. Can you characterize the activity on dask-image? My impression is that PRs are in the weekly to monthly range.

@GenevieveBuckley
Collaborator

Thanks Ben!

> Can you characterize the activity on dask-image? My impression is that PRs are in the weekly to monthly range

This impression is relatively accurate; it's pretty low traffic. Here's the code frequency graph: https://github.com/dask/dask-image/graphs/code-frequency

@charlesbluca
Member

Bumping this issue to say that we've also been considering setting up gpuCI for dask-ml, as there has been recent work getting cuML integrated there (dask/dask-ml#862); this was discussed earlier in the Dask monthly meeting.

cc @TomAugspurger hoping that we can continue that conversation here?

@TomAugspurger
Member

SGTM. Just let me know if there's anything I need to enable on the project settings side of things.

@charlesbluca
Member

Sure! In general, #138 (comment) breaks down the admin tasks that need to happen on the repo, but that can wait until we:

  • Have GPU tests in the repo
  • Have gpuCI scripts that we can verify are working properly on a fork of the repo

Could you give an idea of the PR frequency on the repo? Trying to gauge whether we would prefer having gpuCI run on all PRs, or just those where a trigger phrase is commented (e.g. run tests).

@jsignell
Member

Should we leave this open or close it now that we have GPU CI up and running?

@charlesbluca
Member

I think this should be good to close at this point. We've generally handled maintenance in separate issues, and requests to add gpuCI to additional repos can be opened in follow-up issues here or in the relevant repos.
