
CI: validate: don't fail-fast #10291

Merged: 1 commit into master on Sep 1, 2024

Conversation

ulysses4ever
Collaborator

See discussion at #10263

@geekosaur do we want to backport to 3.12?


Template B: This PR does not modify behaviour or interface

E.g. the PR only touches documentation or tests, does refactorings, etc.

Include the following checklist in your PR:

  • Patches conform to the coding conventions.
  • Is this a PR that fixes CI? If so, it will need to be backported to older cabal release branches (ask maintainers for directions).

Member

@Mikolaj Mikolaj left a comment


Let's turn it on and see how it goes.

@Mikolaj Mikolaj added the merge me Tell Mergify Bot to merge label Aug 29, 2024
@mergify mergify bot added the ready and waiting Mergify is waiting out the cooldown period label Aug 29, 2024
@geekosaur
Collaborator

I do have a bit of a worry about this: while odd GitHub Actions failures are common, so are (especially, early in development) actual problems where, if one job fails, the others will as well (possibly after another 45 minutes, as with the Mac jobs).

@mpickering
Collaborator

I think this PR implements

If one CI job fails (for example a windows job), then all other jobs are cancelled, which can hide further failures on other platforms or versions.

rather than

Tests are run on CI sequentially, some tests are not run if earlier tests fail (for example, unit tests run before the package tests). In a bad situation you may do multiple rounds if a unit test fails and then a package test for ./Setup and then a package test for cabal-install.

Perhaps the commit can be updated with the description about what the commit is intended to achieve?

@ulysses4ever
Collaborator Author

@mpickering

Perhaps the commit can be updated with the description about what the commit is intended to achieve?

Indeed, thank you for catching it. I updated the commit message to say:

CI: validate: the matrix won't fail-fast

Which means that if a Windows job fails, all other jobs in the matrix
will be allowed to finish (other platforms, as well as other compilers on Windows, etc.)

Inspired by the discussion at #10263

Now, the sequential failures need another approach. Quick googling revealed that we'll have to add several

if: success() || failure()

to every step that we want to run irrespective of the success of the previous steps.
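For context, here is a minimal sketch of how these two mechanisms could look in a GitHub Actions workflow. This is not cabal's actual `validate.yml`; the job name, matrix entries, and the `./validate.sh` flags are illustrative placeholders:

```yaml
jobs:
  validate:
    strategy:
      # Let the rest of the matrix keep running when one job fails,
      # e.g. a flaky Windows job no longer cancels the Linux/macOS jobs.
      fail-fast: false
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - name: Unit tests
        run: ./validate.sh --unit-tests     # illustrative flag
      - name: Package tests
        # Run even if an earlier step failed (but not if the run was
        # cancelled), addressing the sequential-failure point above.
        if: success() || failure()
        run: ./validate.sh --package-tests  # illustrative flag
```

Note that `success() || failure()` evaluates to true in every case except cancellation; a bare `if: failure()` would instead skip the step whenever the previous steps succeeded.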

@geekosaur I don't understand your concern, could you, perhaps, elaborate? You can always cancel the job manually if you feel like it. Or push an update, and GitHub restarts the whole thing for you. I fail to see how that 45-minute macOS job hurts anything. It may be pointless, I agree, but if it doesn't hurt, we have a net gain, because of the other type of failures (spurious Windows failures, as you say).

@mpickering
Collaborator

I think the point is that there will be a lot of redundant work performed on each MR if there is a failure on one of the jobs. I don't know if this will cause any issues or not, I don't know how much compute time each project is allowed.

Overall the CI workload for cabal seems quite high to me (compared to GHC). On GHC there are 5 validation jobs (5 different platforms) which run on each MR, then there is a more stringent pipeline which runs on each batch of MRs before they are merged together. I imagine this might be quite difficult to engineer with github.

@ulysses4ever Perhaps you can combine all the different steps together and just have one "test" step, which works by one call to the validation script?

@ulysses4ever
Collaborator Author

Perhaps you can combine all the different steps together and just have one "test" step, which works by one call to the validation script?

I don't think it's a good idea. There's a reason they were split, maybe several. For instance, some steps occasionally have to be blocked on specific platforms or compiler versions: this happened several times during my involvement with this (2-3 years), with macOS and Windows.

Another reason is pure UI. I think it's much easier to see where the problem lies when you have some level of granularity in your test process. We have a bunch of ugly bash to print headers and such, but I like GitHub's UI for "steps" much more. Granted, this is subjective.

I don't know how much compute time each project is allowed.

AFAIK we use GitHub cloud and nothing else. I don't know why we should care about their resources. But if you think that this change is not helpful, we can close this PR — after all, you started this discussion, so if this PR doesn't implement what you intended, perhaps, it doesn't achieve anything. The reasons I did it are:

  1. @andreasabel suggested this particular change,
  2. it was very easy to implement, and
  3. seemed like a net gain to me because Windows jobs fail nondeterministically quite often. In that case, other jobs will get canceled for no good reason.

@geekosaur
Collaborator

I don't know why we should care about their resources.

If nothing else, because it slows down any other jobs we (and anyone else) have running: GitHub doesn't give unlimited resources to free users.

@ulysses4ever
Collaborator Author

@geekosaur can I get a reference to GitHub documentation explaining the alleged slowdown?

@geekosaur
Collaborator

The basics are here. The rest should be obvious: GitHub does not have an infinite number of machines, therefore the more concurrent actions that are running, the more load is on those machines and the longer it takes for actions to finish.

@fgaz
Member

fgaz commented Aug 31, 2024

@geekosaur we were granted a higher limit than that, I think it's enough to sustain this change

@mergify mergify bot added the merge delay passed Applied (usually by Mergify) when PR approved and received no updates for 2 days label Sep 1, 2024
@mergify mergify bot merged commit dceba0f into master Sep 1, 2024
51 checks passed
@mergify mergify bot deleted the ulysses4ever-patch-1 branch September 1, 2024 15:13
@ulysses4ever
Collaborator Author

Oh, I didn't notice that a merge label was applied earlier... I didn't mean to merge it before discussions are fully resolved. Sorry everyone!

@geekosaur
Collaborator

I think we determined that Mergify doesn't consider discussion to be changes to the PR, likely because GitHub considers PR comments to be distinct from the PR itself. We need to be careful about that.

@fgaz
Member

fgaz commented Sep 3, 2024

Negative reviews will block instead, so don't hesitate to write one if necessary.

@geekosaur
Collaborator

@mergify backport 3.12

Contributor

mergify bot commented Sep 14, 2024

backport 3.12

✅ Backports have been created

mergify bot added a commit that referenced this pull request Sep 14, 2024
CI: validate: don't fail-fast (backport #10291)
Labels

  • continuous-integration
  • merge delay passed (Applied, usually by Mergify, when PR approved and received no updates for 2 days)
  • merge me (Tell Mergify Bot to merge)
  • re: devx (Improving the cabal developer experience, internal issue)
  • ready and waiting (Mergify is waiting out the cooldown period)
6 participants