
A GitHub Action to perform automated performance regression testing on pull requests #6065

Closed
Anirban166 opened this issue Apr 9, 2024 · 14 comments · Fixed by #6078

@Anirban166
Member

In an effort to potentially address #4687, I created a GitHub Action to facilitate performance testing of the incoming changes that are introduced via pull requests to data.table. My motivation in doing so was to help ensure that data.table maintains its code efficiency and high-performance standards (core values of the project!) as PRs keep coming in and getting merged frequently, which means they need to be monitored for performance regressions, and an automated way to do that would be ideal.

Onto more details:

  • My action utilizes atime to run predefined tests (as specified in inst/atime/tests.R) on different versions of data.table. These 'tests' are based on historical regressions that have been documented/discussed via issues and PRs here. (A sketch of what one such test definition might look like follows this list.)
  • It uses cml to publish information in a comment on the pull request thread via the GitHub bot (which operates using the GITHUB_TOKEN I provide to authenticate itself and interact with the GitHub API within the scope of this step that I defined in my GHA workflow).
  • Contents of the comment include a plot consisting of subplots for each test case depicting the time and memory trends across different data.table versions, the commit (SHA) that generated it, the link to download the artifact containing all of the atime-generated results for that run, the time elapsed from the job's start to the completion of the installation steps, and the time it took to run all the tests.
  • To avoid clutter, only one comment will exist in the PR thread at all times (after the initial comment is generated from the first commit on the PR branch, subsequent pushes simply update that existing comment's body).
  • The different versions of data.table (as defined within atime) include:
    • HEAD (PR source)
    • base (PR target)
    • merge-base (common ancestor between base and HEAD)
    • CRAN (latest version on the platform)
    • Before (a commit predating the onset of the performance regression)
    • Regression (a commit that is affected by the regression)
    • Fixed (a commit where the regression has been fixed; this should be the same as or better than 'Before' in terms of performance)
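
For context, here is a minimal sketch of what one entry in inst/atime/tests.R could look like, assuming atime's convention of a named list per test containing a setup expression, a timed expression, and commit SHAs for the Before/Regression/Fixed versions. The test name, expressions, and SHAs below are placeholders rather than the actual test definitions used in these examples, and the exact form atime expects (e.g., explicit data.table:: prefixes so that each installed version can be substituted) may differ.

```r
# Hypothetical sketch of one test entry (all names, expressions, and SHAs are
# placeholders, not the real definitions).
test.list <- list(
  "Test regression fixed in #NNNN" = list(
    # setup is evaluated for each data size N before the timings start
    setup = quote({
      set.seed(1)
      DT <- data.table::data.table(id  = sample(N, N, replace = TRUE),
                                    val = rnorm(N))
    }),
    # expr is the operation whose time/memory is measured on every version
    expr = quote(DT[, .(mean.val = mean(val)), by = id]),
    # commit SHAs that bracket the historical regression (placeholders)
    Before     = "1111111111111111111111111111111111111111",
    Regression = "2222222222222222222222222222222222222222",
    Fixed      = "3333333333333333333333333333333333333333"
  )
)
```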

To demonstrate, I've created two examples on my fork of data.table that can be compared against each other:

Both of them replicate PRs that are associated with historical data.table regressions (introducing/fixing one) and showcase the expected results in their respective plots:

  • For the test case titled 'Test regression fixed in #5463': in the first PR, which replicates the regression-introducing PR, one can see that base, merge-base, CRAN, Fixed, and Before are fast while Regression and HEAD are slow; in the second PR, which should not be affected, all commits are fast with the expected exception of Regression.

  • For 'Test regression fixed in #4440', the first PR illustrates the reverse case: all commits are fast except the one associated with Regression (as expected, since that PR should not affect this test case); in the second PR, which replicates the regression-fixing PR, we see that Fixed, HEAD, CRAN, and Before are fast while Regression, merge-base, and base are slow, as we would expect.

Note that the first PR's GHA workflow uses my action from the Marketplace, while the second one runs a self-contained version of my workflow, i.e., with the entire logic in the file under .github/workflows/ on the target branch. (I made them different on purpose to illustrate the two ways!)

So a question that I have here is: Which among these two approaches would be preferable if I make a PR to integrate my workflow?

The second approach allows for more direct customization, while the first requires comparatively less maintenance (as I update things for my action, only the version number needs to be changed here).

Another question I have is regarding the duration.

The total time taken to run the CI job can be roughly seen as an aggregation of two fairly time-consuming tasks:

  • The time it takes to set up everything that is required. A large chunk of this goes into installing atime and its dependencies: roughly 12 minutes on average.
  • The time it takes to run the tests, which varies with their number and complexity. The two tests used in the examples above take roughly 3 minutes on average.

Is this acceptable in terms of how long it takes to run the workflow? (And I presume we don't need to run the action on every PR; e.g., changes that only touch documentation could skip it.)

Finally, I want to mention that I made this action specifically for data.table (again, in hopes that this helps to detect regressions and maintain the quality of code contributions performance-wise!), but it also works on any other GitHub repository for an R package. For example, last month I made a PR between branches on my fork of binsegRcpp that introduces a performance regression (an intentionally placed Sys.sleep()) in one of the R files, and the regression showed up in the plot posted in the comment. (Note that it was run using an older version of my action's code.)

Ready for feedback!

@tdhock
Member

tdhock commented Apr 9, 2024

This is a great contribution, thanks for the detailed explanation. For the time that it takes, I believe ~15 minutes per run should be acceptable (for reference, other data.table GHA runs take 5-10 minutes). If it ends up being too much time, I think we can reduce the time it takes later by creating a docker container with all of the required packages (which should substantially reduce the installation time). I wonder if that sounds reasonable to @jangorecki who is most familiar with our CI setup?

@MichaelChirico
Member

MichaelChirico commented Apr 9, 2024

15 mins is fine; it needn't be a binding check. That way, simple PRs can be merged quickly without waiting for the slower check to finish.

If anything, I wonder if it should take longer to really draw out the types of performance regressions we're worried about. E.g. the duckdb benchmarks take at least that long to run a few queries.

@jangorecki
Member

yes, agree, 15 min is fine

@tdhock
Copy link
Member

tdhock commented Apr 9, 2024

If anything, I wonder if it should take longer to really draw out the types of performance regressions we're worried about. E.g. the duckdb benchmarks take at least that long to run a few queries.

Agree that really large data/timings can be used to see some of the performance regressions we're worried about, but actually the really remarkable thing about atime is that it allows us to detect a large number of relevant performance regressions, by looking at a range of small N. The trick is that atime keeps increasing N until we get to a time limit, for example the default is 0.01 seconds, and actually that is sufficient to see differences in time/memory in historical regressions. And it keeps the overall check time very small (0.01 seconds per timing * 10 timings per version * 7 versions = 0.7 seconds per performance test).

If there are any performance regressions that require more time to test/see, we can increase the time limit on a per-test basis (say from 0.01 to 0.1 or 1 seconds for some tests). And if we need even more time than that, we can try to think of other ways of testing (but that probably wouldn't be feasible on GitHub Actions).
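
As a rough illustration of that per-test adjustment, here is a sketch assuming atime's seconds.limit argument (whose default is the 0.01 seconds mentioned above) can be passed when running a version comparison; the package path, expressions, SHAs, and the 0.1-second limit below are placeholders.

```r
# Hypothetical sketch: raising the time limit for one slow-to-manifest test.
# Assumes atime::atime_versions() accepts seconds.limit; everything else
# (paths, expressions, SHAs) is a placeholder.
result <- atime::atime_versions(
  pkg.path = "~/R/data.table",   # local clone of the package under test
  N = 10^seq(1, 7),              # data sizes to try, increasing
  seconds.limit = 0.1,           # stop growing N once a version exceeds 0.1 s
  setup = {
    DT <- data.table::data.table(x = rnorm(N))
  },
  # explicit data.table::: prefix so each installed version can be substituted
  expr = data.table:::`[.data.table`(DT, , sum(x)),
  Before     = "1111111111111111111111111111111111111111",
  Regression = "2222222222222222222222222222222222222222",
  Fixed      = "3333333333333333333333333333333333333333"
)
plot(result)  # time/memory vs N, one curve per version
```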

@tdhock
Member

tdhock commented Apr 9, 2024

Which among these two approaches would be preferable if I make a PR to integrate my workflow?

I would suggest the marketplace version.

@tdhock
Member

tdhock commented Apr 9, 2024

And when you make the PR, please make it from the Rdatatable repo (not your fork); you should have permission to create a new branch.

@Anirban166
Member Author

And when you make the PR, please make it from the Rdatatable repo (not your fork); you should have permission to create a new branch.

Yup, I will create a new one by branching off from master here (I already did that once before)

Now that you mentioned this, I'd like to add here that since this workflow is going to be on the master branch, it will only get triggered for pull requests to the master branch from any other branch.

@MichaelChirico
Member

MichaelChirico commented Apr 10, 2024

Got to read everything fully now, thanks so much for the extremely well-documented issue+examples!

I think we are well into MVP stage for this and look forward to the PR! We can continue to improve more as we go along.

I can't say I understand well the tradeoffs of Marketplace vs. local version, so I defer to @tdhock.

@tdhock are those confidence bands on the {atime} output graphs? One thing I have in mind is whether we'll be able to easily discern statistical noise from these plots.

@tdhock
Member

tdhock commented Apr 10, 2024

Yes, they are confidence bands; in fact, the y axis says so ("median line, quartiles band"), and that is over 10 timings.
Most of the time the confidence bands are pretty tight around the median, so any two lines whose bands do not completely overlap probably differ significantly.
Also, we do a t-test and report the significance/p-value in the panel title (smaller p-values, i.e. more likely to be a regression, appear on the left).
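
To make the noise question concrete, here is a minimal sketch (not atime's actual code) of the kind of summary behind each panel: roughly 10 timings per version, a median with a quartile band, and a t-test comparing two versions; the timing vectors below are simulated placeholders.

```r
# Simulated placeholder timings (seconds) for two versions at one data size N.
set.seed(1)
timings.HEAD <- rexp(10, rate = 50)
timings.base <- rexp(10, rate = 100)

# Median line and quartile band, matching the y-axis label
# "median line, quartiles band".
summarize <- function(x) c(q25    = unname(quantile(x, 0.25)),
                           median = median(x),
                           q75    = unname(quantile(x, 0.75)))
rbind(HEAD = summarize(timings.HEAD), base = summarize(timings.base))

# Smaller p-value => stronger evidence that HEAD is slower than base.
t.test(timings.HEAD, timings.base, alternative = "greater")$p.value
```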

@tdhock
Member

tdhock commented Apr 10, 2024

@Anirban166 can you explain the benefits/drawbacks of the marketplace/local approaches?

@tdhock
Member

tdhock commented Apr 10, 2024

Also, thanks to @DorisAmoakohene, who has created a database of historical regressions that we can use to seed the initial performance tests: https://github.com/DorisAmoakohene/PerformanceTest_data.table

@Anirban166
Member Author

Got to read everything fully now, thanks so much for the extremely well-documented issue+examples!

Thanks Michael! Happy to hear that you found it well-documented.

I think we are well into MVP stage for this and look forward to the PR!

Perfect! I'll be making the PR soon.

We can continue to improve more as we go along.

Agreed, and updates to the action aside, we can keep adding to the test cases via separate PRs.

@Anirban166
Member Author

@Anirban166 can you explain the benefits/drawbacks of the marketplace/local approaches?

Using a version of my action from the Marketplace means that we don't have to edit the workflow here to integrate any changes to the code (be it bug fixes or features extending this MVP), aside from updating the version number for new releases as and when necessary. Maintenance would thus be easier. Another slight benefit that I can think of is the small size and narrowed-down scope of the workflow code, which may be helpful if we were to integrate this into or make it part of another workflow.

On the other hand, using a self-contained/local version might be better if someone wants to customize the logic here directly (it becomes independent of me maintaining the repository for my action). Updating the action would then be restricted solely to this repository. (And I would presume anyone who modifies it in the future will first run tests in their own fork of data.table instead of testing here directly, to avoid CI failures.)

IMO the difference between the two approaches is quite subtle, and in the end it's more a matter of preference than of concrete pros and cons. (I for one have no strong preference towards either, which is why I asked for opinions; both options work, so it's hard to choose between them.)

@MichaelChirico
Member

Thanks, that's pretty clear. I agree with Toby that the marketplace is the correct approach, especially given you've already built generality beyond data.table into the design:

Finally, I want to mention that I made this action specifically for data.table (again, in hopes that this helps to detect regressions and maintain the quality of code contributions performance-wise!), but it also works on any other GitHub repository for an R package.
