[Meta] Reduce CI runtimes #95

vyasr · 2024-08-21T16:29:51Z

CI runtimes are increasingly becoming a bottleneck for development in RAPIDS. There are numerous reasons for this, including (but not limited to):

An increasingly large matrix as we aim to support more platforms, installation mechanisms, etc.
Additional new libraries requiring new test frameworks.
More tests being added.

In the past, our primary focus has been on reducing the load on our GPU runners, since those are in the shortest supply. That in turn has meant carefully pruning the test matrix, because only test jobs require GPU runners. While this has relieved pressure in the short term, it is clear that we need broader steps to address the problem comprehensively. Some notes that should guide our thinking:

When thinking through solutions to this problem, we need to consider both the throughput of CI for a single PR and the global throughput across all jobs running at all times. Historically we have been reluctant to consider any solution that slows down a single PR (such as reducing the number of parallel jobs running), even if it would reduce global load. I strongly believe we need to reconsider this. With the current approach, global load means a given PR often ends up waiting for test runners anyway, so by maximizing parallelism on each PR we have in fact slowed down every PR.
While the focus on test jobs makes sense globally, because we are almost never bottlenecked on spinning up CPU runners, build jobs are also important to accelerate on a per-PR basis because they substantially affect the end-to-end runtime of each PR's CI pipeline.

This meta-issue aims to catalog a number of the different efforts we could undertake going forward. I have organized solutions into a few different classes.

Tooling

These improvements have a cost to implement, but once implemented will have only positive impacts since they do not involve making any compromises in testing coverage or frequency.

Evaluate replacing conda-build with rattler-build #47: Will immediately improve the runtime of all conda build jobs.
- Currently blocked by Switch to using strict channel priority during RAPIDS builds #84 (which is not blocked)
Investigate using uv instead of pip in CI #86: Will immediately improve the runtime of both build and test jobs for wheels.
- Currently blocked by Consider publishing dask/distributed nightlies to our nightly pip index #85 (which is not blocked)
sccache-dist improvements are being worked on to further improve build times.
- In progress?
Use GHA's caching mechanism to save package manager caches between runs #51: Package manager caching would improve all build and test jobs for both conda and pip
- So far we have considered this blocked on the NVKS migration. Perhaps we should consider ways to get this working sooner given the difficulties with the migration.

More judicious selection of jobs

These improvements have a cost to implement and will also have nonzero ongoing maintenance cost to ensure that test coverage remains correct. If implemented correctly, there will be no loss in coverage, but correct implementation will require some care.

Implement judicious skipping of CI jobs #94: Would reduce test loads (we still always need to build everything)
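One way the skipping logic could look is a mapping from changed paths to the test jobs they can affect. This is only a minimal sketch: the job names and path patterns in `JOB_PATTERNS` are hypothetical and not taken from any RAPIDS repository, and builds are assumed to always run outside this function. To protect coverage, any changed file that matches no known pattern triggers every test job.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from test job name to the path patterns that affect it.
JOB_PATTERNS = {
    "cpp-tests": ["cpp/*", "cmake/*"],
    "python-tests": ["cpp/*", "python/*"],
    "docs-build": ["docs/*", "python/*"],
}


def select_test_jobs(changed_files, job_patterns=JOB_PATTERNS):
    """Return the sorted names of the test jobs that should run."""
    all_patterns = [p for pats in job_patterns.values() for p in pats]
    for path in changed_files:
        # Conservative fallback: an unrecognized file runs every job.
        if not any(fnmatchcase(path, p) for p in all_patterns):
            return sorted(job_patterns)

    selected = set()
    for job, patterns in job_patterns.items():
        if any(fnmatchcase(path, p) for path in changed_files for p in patterns):
            selected.add(job)
    return sorted(selected)
```

The conservative fallback is the part that keeps coverage correct as the repository layout changes; the ongoing maintenance cost is keeping the pattern table accurate.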

Running more jobs only in nightlies

These changes are easy to implement, but unless nightly results are monitored carefully, they could carry significant costs when issues are only uncovered later.

Only test one Python version in PRs
Only test one architecture (arm or x86) per PR
- Given that we have both arm and x86 runners, we could consider using a round-robin in PRs to get both better coverage and better utilization of available GPU resources
Some other ideas in Explore ways to maximize coverage while minimizing cost of the CI test matrix #5
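The round-robin idea for architectures could be sketched as a deterministic rotation. This is an illustration under assumptions: the architecture labels, the function name, and the `push_index` parameter (a count of pushes to the PR) are hypothetical, and the real selection would live in the workflow configuration.

```python
# Architecture labels are assumptions for illustration.
ARCHITECTURES = ("x86_64", "aarch64")


def pick_architecture(pr_number, push_index=0, architectures=ARCHITECTURES):
    """Rotate through architectures by PR number and push count.

    Consecutive PRs start on different architectures, and successive
    pushes to the same PR alternate, so a PR with several pushes is
    eventually tested on every architecture.
    """
    index = (pr_number + push_index) % len(architectures)
    return architectures[index]
```

Because the choice is a pure function of the PR number and push count, a rerun of the same push lands on the same architecture, which keeps failures reproducible.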

Other

Miscellaneous other improvements that will help without being directly focused on improving build times.

Support dynamic linking between RAPIDS wheels #33: Will reduce total build time across all wheel build jobs in a repo since it will remove duplicate compilation
- In progress
Working with library teams to reduce compile times and build sizes. This is going to be an important ongoing task, and likely one the build team should pursue together with library teams rather than dedicating solo effort to it.
We recently discovered that pytest's traceback handling for xfailed tests is quite expensive. Switching to the native traceback mode with --tb=native therefore shaves substantial time (10-20%) off total test suite runs, since many of our repos have a large number of xfailed tests (Switch to using native traceback cudf#16851). Similarly, in the past we have observed significant improvements from changing the mode pytest-xdist uses to distribute tests across workers so that workers do not sit idle. There may be other similar optimizations in our pytest usage to consider.
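As a small sketch of how these pytest options might be assembled in a CI script: `--tb=native` is the flag named above, while the choice of `--dist=worksteal` is an assumption, since the text does not say which pytest-xdist distribution mode was adopted. The helper name and its parameters are hypothetical.

```python
def build_pytest_args(num_workers="auto", dist_mode="worksteal", extra=()):
    """Build a pytest argument list with cheaper tracebacks.

    num_workers: value for pytest-xdist's -n option, or None to run serially.
    dist_mode:   pytest-xdist distribution mode (assumed, not from the source).
    """
    args = ["--tb=native"]  # avoid the expensive default traceback formatting
    if num_workers is not None:
        # These two options require the pytest-xdist plugin.
        args += ["-n", str(num_workers), f"--dist={dist_mode}"]
    args += list(extra)
    return args
```

A CI script could then pass the result to pytest, for example with `subprocess.run([sys.executable, "-m", "pytest", *build_pytest_args()])`.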

vyasr · 2024-09-25T16:40:40Z

I updated the issue description above to reflect that we may also be able to optimize our usage of pytest for better performance (see e.g. rapidsai/cudf#16851).

betatim · 2025-01-08T08:08:44Z

Some other thoughts to explore:

Reduce the time between all dependencies being installed and the first test running ("build time"?). It would help reduce CI load and make people more productive while developing. I am not sure how easy or feasible this is, and it probably requires work with each project's team, but it seems worth investigating.
As a PR reviewer I don't really care whether it takes 1 h, 2 h, or 4 h to run the tests for a PR. Anything above maybe 10 min means I will park the PR and come back to it later. What is annoying, though, is coming back and not having "all the information" needed to take an action on the PR. I think this means running a PR's jobs in parallel is maybe less important (you mentioned per-PR sequential vs. parallel jobs above).
If I were to start a new Python-only project today, I'd probably set up the following minimal CI jobs and not feel terrible about it:
1. oldest supported Python, oldest version of all my dependencies (for PRs)
2. newest supported Python, newest version of all my dependencies (for PRs)
3. a weekly job that uses the nightly version of my dependencies to get early warning about upcoming breakage
4. if the project gained compiled code, run (1) and (2) on the different supported platforms (macOS, linux, windows)
5. expand matrix as needed based on where we end up having problems
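The minimal matrix described in items 1-4 could be generated along these lines. This is a hedged sketch: the function name, the dictionary keys, and the choice to run the weekly nightly-dependency job on the newest Python are assumptions, and `python_versions` is assumed to be ordered oldest to newest.

```python
def minimal_ci_matrix(python_versions, platforms=("linux",), has_compiled_code=False):
    """Build the PR and weekly job lists for a small CI matrix.

    python_versions must be ordered from oldest to newest supported.
    """
    oldest, newest = python_versions[0], python_versions[-1]
    # Only fan out across platforms once the project has compiled code.
    pr_platforms = platforms if has_compiled_code else platforms[:1]

    pr_jobs = []
    for platform in pr_platforms:
        pr_jobs.append({"python": oldest, "deps": "oldest", "platform": platform})
        pr_jobs.append({"python": newest, "deps": "newest", "platform": platform})

    # Weekly early-warning job against nightly dependencies
    # (using the newest Python here is an assumption).
    weekly_jobs = [{"python": newest, "deps": "nightly", "platform": platforms[0]}]
    return {"pull_request": pr_jobs, "weekly": weekly_jobs}
```

Item 5 (expanding the matrix where problems appear) is left out on purpose, since it depends on observed failures rather than on a fixed rule.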

vyasr mentioned this issue Aug 21, 2024

Explore ways to maximize coverage while minimizing cost of the CI test matrix #5

Closed

bdice mentioned this issue Aug 21, 2024

Reduce CI jobs. rapidsai/shared-workflows#236

Merged

vyasr closed this as completed Sep 25, 2024

vyasr reopened this Sep 25, 2024

jameslamb mentioned this issue Oct 14, 2024

Remove build isolation in wheel builds #108

Closed
