GPU CI #138

Closed
quasiben opened this issue Mar 24, 2021 · 26 comments

Comments

@quasiben
Member

quasiben commented Mar 24, 2021

We've been chatting with folks from the ops teams within RAPIDS about getting access to the gpuCI infrastructure. gpuCI is the GPU-based CI platform used for testing throughout the RAPIDS ecosystem. We've been asking for access for a couple of reasons:

  1. We currently test GPU portions of Distributed only, and the testing occurs in an out-of-band manner. That is, we test the GPU and UCX bits of Distributed in ucx-py and dask-cuda. This is better than no testing; however, it is limited to Distributed and only runs when developers push changes to dask-cuda/ucx-py.
  2. The lack of GPU testing infrastructure for Dask has resulted in, and can continue to result in, breakages. Additionally, without GPU CI, developers will be unaware that something is broken until a user raises an issue. This occurred somewhat recently within Dask: Failing CuPy tests dask#7324 and Add numpy functions tri, triu_indices, triu_indices_from, tril_indices, tril_indices_from dask#6997. These issues are currently being fixed, but we'd like to improve this cycle moving forward.

Gaining access to gpuCI resolves both of these problems and will allow us to test incoming PRs to Dask, ensuring GPU support is maintained without breakages or undue burden.
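To make the failure mode concrete, the breakages above came from CuPy-backed Dask array code paths. Here is a minimal, hypothetical sketch of the kind of test gpuCI would run on every PR (it assumes CuPy is installed on the runner and is not copied from the Dask test suite):

```python
import pytest

# Skip cleanly on CPU-only machines; gpuCI runners would have CuPy available.
cupy = pytest.importorskip("cupy")

import dask.array as da


def test_cupy_backed_array_sum():
    # Build a Dask array backed by CuPy chunks.
    x = cupy.arange(100, dtype="float64").reshape(10, 10)
    d = da.from_array(x, chunks=(5, 5))

    # The collection should stay GPU-backed end to end.
    assert isinstance(d._meta, cupy.ndarray)
    assert float(d.sum().compute()) == float(x.sum())
```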

While we've been talking with the ops folks, we've suggested that the testing matrix be a single row:

  • latest OS (ubuntu)
  • latest cudatoolkit (11.0/11.2)
  • latest stable CuPy in RAPIDS
  • latest stable cuDF in RAPIDS
  • latest NumPy (need NEP-35)

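The NumPy requirement is there because NEP-35 adds the `like=` argument to array-creation functions, which is what lets Dask create CuPy-backed chunks through the NumPy API. A rough illustration of that dispatch (assuming a CUDA machine with CuPy installed; not code from Dask itself):

```python
import numpy as np
import cupy

# A GPU array used as the "like" template.
template = cupy.zeros(3)

# With NEP-35 (NumPy >= 1.20), array creation dispatches on the template,
# so this returns a CuPy array rather than a NumPy one.
a = np.arange(5, like=template)
assert isinstance(a, cupy.ndarray)
```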
This service will start off as something maintainers can ping if they think a PR might need GPU testing. This might include changes to array/dataframe functions or new functionality. While this is not the ideal solution, it is a step towards better GPU testing for Dask without much effort on the part of the maintainers.

For this to work, a bot (gputester) from gpuCI will need at least “triage” rights to monitor comments and respond with pass/fail notifications on the PR in question.

cc @pentschev @jrbourbeau

@jrbourbeau
Member

This sounds good to me 👍

Do we also want to include this functionality in distributed? Or just dask to start, since, as you mentioned, there is some distributed + GPU coverage in dask-cuda/ucx-py?

@jakirkham
Member

I think we will want this for both. There are serialization functions, UCX-Py (as you mentioned), etc. in Distributed that would be good to test
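For example, the CUDA serialization path is the sort of thing worth exercising. A rough sketch of such a round trip (assuming a GPU runner with CuPy installed; illustrative only, not taken from the Distributed test suite):

```python
import pytest

cupy = pytest.importorskip("cupy")

from distributed.protocol import deserialize, serialize


def test_cupy_serialization_roundtrip():
    x = cupy.arange(10)
    # Prefer the CUDA serializer family, falling back if it is unavailable.
    header, frames = serialize(x, serializers=("cuda", "dask", "pickle"))
    y = deserialize(header, frames)
    assert isinstance(y, cupy.ndarray)
    assert bool((x == y).all())
```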

@jrbourbeau
Member

Also, thanks for raising this issue and coordinating with the rest of the RAPIDS team. Let me know if there's anything I can do to help out with this effort

@GenevieveBuckley
Collaborator

It'd be good if we could put something like this in place for dask-image, too. Ad hoc local testing isn't going very well.

@aktech

aktech commented May 24, 2021

It seems like you already have a solution; FWIW, I'll share my two cents. I created a service for problems like these, which basically runs custom machines (including GPUs) in GitHub Actions: https://cirun.io/

We are about to use it in the sgkit project here: https://github.com/pystatgen/sgkit/pull/567/checks?check_run_id=2618833216

It is fairly simple to set up: all you need is a cloud account (AWS or GCP) and a simple YAML file describing what kind of machines you need, and Cirun will spin up ephemeral machines on your cloud for GitHub Actions to run on. It's native to the GitHub ecosystem, which means you can see logs and triggers in GitHub's interface itself, just like any GitHub Actions run.

Notes

  • Cirun is free for open source projects (you only pay your cloud provider for machine usage).
  • I can help set it up if you like

@charlesbluca
Member

Bumping this, as a similar issue happened recently, resulting in rapidsai/dask-cuda#634; do folks have a preference between gpuCI and Cirun?

@quasiben
Member Author

I think it would be good to experiment with Cirun. gpuCI is going to require us to get some time from the NVIDIA ops folks, which we don't have quite yet.

@charlesbluca
Member

Great! @aktech would you be willing to help set this up on Distributed? We should probably sync offline to discuss account/cloud provider setup.

@aktech

aktech commented Jun 29, 2021

@charlesbluca Sure, we can catch up offline about the account. Meanwhile, I can get it working with a personal AWS account.

@aktech

aktech commented Jun 29, 2021

Here is a run of the Dask Distributed CI (for Python 3.7) on GPU via Cirun.io:
https://github.com/aktech/distributed/runs/2945849888?check_suite_focus=true

Here is the branch

@aktech

aktech commented Jun 30, 2021

Should I go ahead and create a PR for the full matrix on distributed?

@quasiben
Member Author

I would suggest we stick with one version of Python (maybe 3.8) and the latest available versions of everything else (CuPy/CUDA/etc.).

@charlesbluca
Member

After some internal conversation with NVIDIA ops folks, we can confirm that gpuCI does have the capacity to run tests for Dask and Distributed - the tests would be triggered both on commits to the main branch and on PRs opened by a set of approved users (this would probably start out as the members of the Dask org and could be expanded later on).

Currently, we are working on getting this set up on my own forks of Dask/Distributed; here are PRs adding the relevant gpuCI scripts:

Once testing is working on these forks, we can manage the required permissions/webhooks and merge these branches into upstream Dask/Distributed.

With this option available, we probably don't need to use Cirun for the time being, though it still seems like a good option if we intend to expand GPU testing far beyond the current repos in question.

@aktech

aktech commented Jul 14, 2021

@charlesbluca That's some excellent news! Looking forward to it.

I'll close the Cirun PR; feel free to let me know if I can help with anything, or if you need help setting up Cirun for any other project.

@jrbourbeau
Member

Thanks @charlesbluca @aktech for your continued work on this. @charlesbluca FWIW I don't have a strong opinion on gpuCI vs. Cirun. I suspect you and folks around you have the most context/expertise to make that decision

> tests would be triggered both on commits to the main branch and on PRs opened by a set of approved users (this would probably start out as the members of the Dask org and could be expanded later on)

Is it possible to also trigger CI runs on PRs from non-Dask org members through some other mechanism like having test-gpu in the commit message, adding a gpu GitHub label, or something else? This would be useful for situations when a non-Dask org member opens a PR that touches GPU-related code

@quasiben
Member Author

quasiben commented Jul 30, 2021

Yes, we will have a mechanism to test non-Dask org member PRs.

We'll see the comment "can one of the admins please verify the patch", and an admin would then respond with ok to test.

In order to enable gpuCI, please do the following to each repo:

I'm planning on making these changes later this afternoon unless there are objections

@jrbourbeau
Member

Thanks for the update @quasiben -- that sounds good to me

@quasiben
Member Author

Note that we are using a Docker image with pre-installed dependencies here:
https://github.com/rapidsai/dask-build-environment

@GenevieveBuckley
Collaborator

gpuCI seems to have been running pretty well on the main dask repository recently - perhaps it's a good time to open a conversation about whether we could do the same on the dask-image repository?

cc @jakirkham & @quasiben

@quasiben
Member Author

Thanks for the ping @GenevieveBuckley -- we can bring this up with the ops folks, see what capacity they have, and try to find out about dask/distributed usage. Can you characterize the activity on dask-image? My impression is that PRs are in the weekly to monthly range.

@GenevieveBuckley
Collaborator

Thanks Ben!

> Can you characterize the activity on dask-image? My impression is that PRs are in the weekly to monthly range

This impression is relatively accurate; it's pretty low traffic. Here's the code frequency graph: https://github.com/dask/dask-image/graphs/code-frequency

@charlesbluca
Member

Bumping this issue to say that we've also been considering setting up gpuCI for dask-ml, as there has been recent work getting cuML integrated there (dask/dask-ml#862); this was discussed earlier in the Dask monthly meeting.

cc @TomAugspurger hoping that we can continue that conversation here?

@TomAugspurger
Member

SGTM. Just let me know if there's anything I need to enable on the project settings side of things.

@charlesbluca
Member

Sure! In general, #138 (comment) breaks down the admin tasks that need to happen on the repo, but that can wait until we:

  • Have GPU tests in the repo
  • Have gpuCI scripts that we can verify are working properly on a fork of the repo

Could you give an idea of the PR frequency on the repo? Trying to gauge whether we would prefer having gpuCI run on all PRs, or just those where a trigger phrase is commented (e.g. run tests).

@jsignell
Member

Should we leave this open or close it now that we have GPU CI up and running?

@charlesbluca
Member

I think this should be good to close at this point. We've generally handled maintenance in separate issues, and requests to add gpuCI to additional repos can be opened in follow-up issues here or in the relevant repos.
