[Meta] Reduce CI runtimes #95

Open
vyasr opened this issue Aug 21, 2024 · 2 comments
Comments

@vyasr
Contributor

vyasr commented Aug 21, 2024

CI runtimes are increasingly becoming a bottleneck for development in RAPIDS. There are numerous reasons for this, including (but not limited to):

  • An increasingly large matrix as we aim to support more platforms, installation mechanisms, etc.
  • Additional new libraries requiring new test frameworks.
  • More tests being added.

In the past, our primary focus has been on reducing the load on our GPU runners because those are in the shortest supply, which in turn has meant a focus on more carefully pruning the test matrix (since only test jobs require GPU runners). While this has helped alleviate pressure in the short term, it is clear that we need broader steps to address the problem comprehensively. Some notes that should guide our thinking:

  • When thinking through solutions to this problem, we need to consider both the throughput of CI for a single PR and global throughput across all jobs running at all times. Historically we have been reluctant to consider any solutions that slow down a single PR (such as reducing the number of parallel jobs running) even if it would reduce global load. I strongly think we need to reconsider this notion. With the current approach, due to global load, a given PR often ends up waiting for test runners anyway, so by maximizing parallelism on each PR we have in fact slowed down every PR.
  • While the focus on test jobs makes sense globally because we are almost never bottlenecked on spinning up CPU runners, on a per-PR basis build jobs are also important to accelerate because they substantially affect the end-to-end runtime of each PR's CI pipeline.
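
To make the first point concrete, here is a toy model, not a measurement of our CI. It assumes a saturated pool of identical runners, unit-length jobs, and PRs that all submit at the same time; the function names and numbers are made up for illustration. It compares a shared queue that interleaves jobs from every PR with one that drains one PR's jobs before starting the next.

```python
from statistics import mean


def finish_times(queue, runners):
    """Return {pr: finish time} when unit-length jobs are served in queue order."""
    done = {}
    for position, pr in enumerate(queue):
        # With unit-length jobs, the job at this queue position ends with its round.
        finished_at = position // runners + 1
        done[pr] = max(done.get(pr, 0), finished_at)
    return done


def interleaved(prs, jobs_per_pr):
    # Assumption: every PR submits all of its jobs at once and the pool
    # serves the PRs round-robin.
    return [pr for _ in range(jobs_per_pr) for pr in range(prs)]


def grouped(prs, jobs_per_pr):
    # Assumption: the pool drains one PR's jobs before starting the next PR.
    return [pr for pr in range(prs) for _ in range(jobs_per_pr)]


RUNNERS, PRS, JOBS = 4, 4, 4  # hypothetical sizes
for name, policy in (("interleaved", interleaved), ("grouped", grouped)):
    times = list(finish_times(policy(PRS, JOBS), RUNNERS).values())
    print(name, sorted(times), "mean:", mean(times))
```

In this model the last PR finishes at the same time under both orderings, but with interleaving every PR finishes at that time, while with grouping the earlier PRs finish sooner. The model ignores job dependencies, uneven job lengths, and staggered arrivals, so it only illustrates why per-PR parallelism stops helping once the runner pool is saturated; it does not evaluate any specific scheduling change.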

This meta-issue aims to catalog a number of the different efforts we could undertake going forward. I have organized solutions into a few different classes.

Tooling

These improvements have a cost to implement, but once implemented will have only positive impacts since they do not involve making any compromises in testing coverage or frequency.

More judicious selection of jobs

These improvements have a cost to implement and will also have nonzero ongoing maintenance cost to ensure that test coverage remains correct. If implemented correctly, there will be no loss in coverage, but correct implementation will require some care.

Running more jobs only in nightlies

These are easy to implement, but without careful monitoring of nightly results could have significant costs if issues are only uncovered later.

Other

Miscellaneous other improvements that will help without being directly focused on improving build times.

  • Support dynamic linking between RAPIDS wheels #33: Will reduce total build time across all wheel build jobs in a repo since it will remove duplicate compilation
    • In progress
  • Working with library teams to reduce compile times and build sizes. This is going to be an important ongoing task, and likely one the build team should pursue jointly with library teams rather than spend much solo effort on.
  • We recently discovered that pytest's traceback handling for xfailed tests is quite expensive. Switching to the native traceback mode with --tb=native therefore shaves substantial time (10-20%) off total test suite runs, since many of our repos have a large number of xfailed tests (Switch to using native traceback cudf#16851). Similarly, in the past we've observed significant improvements from changing the mode pytest-xdist uses to distribute tests so that workers don't sit idle. There may be other similar optimizations in our pytest usage to consider.
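
For reference, a minimal sketch of what such a pytest invocation might look like. The helper name and the test path are hypothetical. The issue does not say which pytest-xdist distribution mode was used, so `worksteal` below is an assumption; it is one of the modes pytest-xdist offers through `--dist`.

```python
import shlex


def pytest_command(test_path, workers="auto", dist_mode="worksteal"):
    """Build a pytest command line with native tracebacks and xdist options."""
    return [
        "pytest",
        "--tb=native",        # plain Python tracebacks instead of pytest's own formatting
        "-n", str(workers),   # pytest-xdist: number of worker processes
        "--dist", dist_mode,  # pytest-xdist: how tests are handed out to workers (assumed mode)
        test_path,
    ]


# "tests" is a placeholder path, not a real RAPIDS test directory.
print(shlex.join(pytest_command("tests")))
```

Which distribution mode helps most probably depends on how uneven the test durations are in each repo, so it seems worth measuring per repo rather than picking one default.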
@vyasr
Contributor Author

vyasr commented Sep 25, 2024

I updated the table above to reflect that we may also be able to optimize our usage of pytest for better performance (see e.g. rapidsai/cudf#16851).

@betatim
Member

betatim commented Jan 8, 2025

Some other thoughts to explore:

  • reduce time between all dependencies being installed and first test running ("build time"?) - it would help reduce CI load and make people more productive while developing. Not sure how easy/feasible this is, probably requires work with each project's team, but it seems like something worth investigating
  • as a PR reviewer I don't really care if it takes 1h, 2h or 4 hours to run the tests for a PR. Anything above maybe 10min means I will park the PR and have to come back to it later. What is annoying though is when you come back and don't have "all the information" you need to then take an action on the PR. I think this means running jobs for a PR in parallel is maybe less important (you mentioned per PR sequential vs parallel jobs above).
  • if I was to start a new Python-only project today I'd probably set up the following minimal CI jobs and not feel terrible about it:
    1. oldest supported Python, oldest version of all my dependencies (for PRs)
    2. newest supported Python, newest version of all my dependencies (for PRs)
    3. a weekly job that uses the nightly version of my dependencies to get early warning about upcoming breakage
    4. if the project gained compiled code, run (1) and (2) on the different supported platforms (macOS, linux, windows)
    5. expand matrix as needed based on where we end up having problems
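
A rough sketch of how items 1 to 4 could be written down as a job list. Everything here is hypothetical (the function, the dictionary keys, the version strings), and two choices are assumptions the list above does not make: Linux as the default platform for a pure-Python project, and the newest Python for the weekly job.

```python
def minimal_ci_jobs(pythons, has_compiled_code=False):
    """Return the minimal job list; `pythons` must be sorted oldest to newest."""
    oldest, newest = pythons[0], pythons[-1]
    # Item 4: only fan out across platforms once the project has compiled code.
    platforms = ("linux", "macos", "windows") if has_compiled_code else ("linux",)

    jobs = []
    for platform in platforms:
        # Items 1 and 2: the two extremes of the support range, run on every PR.
        jobs.append({"trigger": "pull_request", "platform": platform,
                     "python": oldest, "deps": "oldest"})
        jobs.append({"trigger": "pull_request", "platform": platform,
                     "python": newest, "deps": "newest"})
    # Item 3: a weekly early-warning job against nightly dependencies.
    jobs.append({"trigger": "weekly", "platform": "linux",
                 "python": newest, "deps": "nightly"})
    return jobs


for job in minimal_ci_jobs(["3.10", "3.13"]):
    print(job)
```

Item 5 then amounts to appending entries to this list only where problems actually show up.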
