Let's talk about the CI situation #1614

aduh95 · 2024-09-08T10:01:41Z

Recently, especially with nodejs/build#3887, dealing with the CI when preparing a release or trying to land a PR has become very frustrating. I tried to list all the pain points; let me know if I forgot something:

The CI is so flaky it systematically takes several attempts to get a passing one.
Most folks do not bother to check the failures.
Even fewer folks are reporting flaky tests (I plead guilty).
Even fewer folks are investigating those flakes (and I get it, the very nature of flaky tests makes them very hard to reproduce, let alone debug).
Opening a PR to mark test(s) as flaky often receives objections that the flakiness should be solved instead of ignored (which I totally get, those flakes are most of the time due to a bug in node; but it also makes the situation more frustrating for contributors, and also for other projects that build their own node and expect tests not marked as flaky to not be flaky).

I think we need to discuss:

Is there a way to better detect flaky tests before landing them?
How can we handle the immediate situation?

I'm opening this in the TSC repo, because it's kind of a meta discussion that I'm guessing is not going to be of much interest to folks following nodejs/node, but of course anyone is welcome to participate in the discussion.

aduh95 · 2024-09-08T10:21:05Z

Is there a way to better detect flaky tests before landing them?

IMO, the correct response to that would be automation. Our previous attempt was nodejs/node#43954, which I don't think has helped a lot, unfortunately. Maybe we should reconsider nodejs/node#40817: we could have a bot that checks the Jenkins CI result when a resume-ci label is added to a PR:

First it checks whether any of the failing tests is one that's being changed in this PR. If it is, it adds a comment on the PR explaining why it won't resume CI.
Then it checks if all the failing tests have a corresponding issue. If not, it adds a comment on the PR explaining why it won't resume CI.
Otherwise, it resumes the CI.

It should solve the "no one bothers to check the failures" and "no one reports the flaky tests" – it might not help with the "no one is investigating the flakes" though.
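The three checks above could be sketched roughly as follows. This is only an illustration of the decision order, not an existing bot: `Decision`, `decide_resume`, and the three input sets are hypothetical names, and fetching the Jenkins results, the PR diff, and the open issues is assumed to happen elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decision:
    resume: bool
    comment: str  # empty when the CI is resumed


def decide_resume(failing_tests, changed_files, tests_with_open_issue):
    """Apply the three checks in order (all arguments are sets of test paths)."""
    # Step 1: refuse if the PR itself modifies one of the failing tests.
    touched = sorted(t for t in failing_tests if t in changed_files)
    if touched:
        return Decision(
            False,
            "Not resuming CI: this PR changes failing test(s): " + ", ".join(touched),
        )
    # Step 2: refuse if any failing test has no corresponding issue.
    unreported = sorted(t for t in failing_tests if t not in tests_with_open_issue)
    if unreported:
        return Decision(
            False,
            "Not resuming CI: no issue found for: " + ", ".join(unreported),
        )
    # Step 3: every failure is a known, reported flake, so resume.
    return Decision(True, "")
```

Returning the comment text together with the decision keeps the logic free of any GitHub or Jenkins calls, so it can be checked without network access.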

aduh95 · 2024-09-08T10:24:59Z

Another thing that would help would be to mark as flaky, on the Current release line, all tests which have a corresponding open issue. That would greatly help with getting release proposals ready, and would also help external teams/projects that run our tests. That way, main can still have a high bar for when a test can be marked as flaky; only the release line (on which we don't do development) would be more forgiving.
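As a rough sketch of that idea, the entries for a status file could be generated from the list of tests that have an open issue. The `name: PASS, FLAKY` entry shape is an assumption about the status file format, `flaky_status_lines` is a hypothetical helper, and choosing the right platform section of the file is left out.

```python
from pathlib import PurePosixPath


def flaky_status_lines(test_paths):
    """Turn test file paths into sorted, de-duplicated status entries."""
    # The entry uses the file name without its extension (assumed convention).
    names = sorted({PurePosixPath(p).stem for p in test_paths})
    return [f"{name}: PASS, FLAKY" for name in names]
```

For example, `flaky_status_lines(["test/parallel/test-foo.js"])` yields a single entry for `test-foo`.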

marco-ippolito · 2024-09-08T10:42:06Z

In the past I tried to automate marking tests as flaky in nodejs/node-core-utils#746, but I closed it for lack of time; it's there if someone wants to give it a shot.
The idea is that if a test has failed on x different PRs, the tool creates a PR to mark it as flaky and opens an issue.
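The threshold rule can be illustrated with a small sketch. It is not the code from the PR mentioned above: `flaky_candidates` and the `(pr_number, test_name)` input shape are assumptions made for the example, and creating the PR and the issue is not shown.

```python
from collections import defaultdict


def flaky_candidates(failures, threshold):
    """Return tests that failed on at least `threshold` distinct PRs.

    `failures` is an iterable of (pr_number, test_name) pairs.
    """
    prs_by_test = defaultdict(set)
    for pr_number, test_name in failures:
        # A set ensures repeated CI runs on one PR count only once.
        prs_by_test[test_name].add(pr_number)
    return sorted(
        test for test, prs in prs_by_test.items() if len(prs) >= threshold
    )
```

Counting distinct PRs rather than raw failures avoids flagging a test that a single PR broke and then re-ran many times.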

joyeecheung · 2024-09-08T14:13:26Z

I think we should also discuss what happens after a test is marked flaky - is it just going to rot in the status file? I think that has effectively been happening and could actually be contributing to the flakes. For example, some of the flakes might be caused by the same V8 task runner bug (nodejs/node#47297); some of those tests have been marked flaky, but if the bug never gets dealt with, it could just continue to make other tests flake under certain circumstances.

gireeshpunathil · 2024-09-09T01:09:34Z

Is it reasonable to say that all the current problems stem from the fact that there are fewer folks looking at the CI? Or is there a new pattern? (IIRC, our CI was very healthy when there were many folks tending it.)

yharaskrik · 2024-09-09T01:43:50Z

@juristr this is probably a much larger discussion with the Node team but it seems like Node could benefit from the flaky target detection that Nx has no? I'm not too familiar with Node's CI setup right now and I would assume that Nx Cloud doesn't have all of the OS targets Node would need but maybe there's a discussion to be had here?

anonrig · 2024-09-09T02:42:57Z

If you look into the jenkins-alerts repository, we have a warning/incident almost every day. Most of the time, it is related to host machine issues. https://github.com/nodejs/jenkins-alerts/issues

richardlau · 2024-09-09T15:08:05Z

If you look into the jenkins-alerts repository, we have a warning/incident almost every day. Most of the time, it is related to host machine issues. https://github.com/nodejs/jenkins-alerts/issues

Every issue in that repository will be a host machine issue as the whole point of the repository is to monitor for machine related issues -- it only looks to see if the machine is either running out of disk space or offline in Jenkins.

jasnell · 2024-09-09T15:14:37Z

The CI is so flaky it systematically takes several attempts to get a passing one.

I've had PRs that have required 10-16 CI runs to get even a flaky success. Now the OSX runner has been jammed up for days with no clear indication how to fix it or who can fix it.

mhdawson · 2024-09-09T15:31:24Z

I think @aduh95's concrete suggestion of a bot which:

First it checks whether any of the failing tests is one that's being changed in this PR. If it is, it adds a comment on the PR explaining why it won't resume CI.
Then it checks if all the failing tests have a corresponding issue. If not, it adds a comment on the PR explaining why it won't resume CI.
Otherwise, it resumes the CI.

Makes a lot of sense to me. I'm +1 on that and I think it would help get tests marked as flaky.

In terms of actually investigating/fixing issues, I really wish we could find some way to get people to volunteer/focus on that, but I don't have any new ideas on that front. If we had a number of people who would commit to spending X amount of time each week/month on doing that, either separately or together, then I would join that effort. In the absence of a critical mass of people committing to contribute, though, I think any one individual sees an insurmountable task.

thomasrockhu-codecov · 2024-09-10T20:40:08Z

Hey, Tom from @codecov here.

I know this may come off as me being a shill (I am), but as Node already uses Codecov, I thought I'd mention that we are building a flaky test product to help identify and highlight flakes. This is still a pretty new feature, but I'd love to see it actually be useful for a large codebase.

Here is a screenshot from the blog post that shows what a flaky test looks like

Here is a link to the source of the screenshot on GitHub.

richardlau · 2024-09-10T23:21:32Z

Finding/highlighting flakes is not the problem (which is not to say more could be done). We are already detecting test failures that happen for CIs across multiple PRs in https://github.com/nodejs/reliability. BuildPulse was added as an experiment for Windows builds in nodejs/build#3653 and I have no idea if anyone has been looking at the results.

This is fundamentally a people problem -- we need to somehow motivate people to look at the flakes and decide whether the tests can be fixed or should be removed. Keeping long-term flaky tests in Node.js is just building up problems for later.

And/or be more proactive in detecting when a PR introduces a flake into the system.

As a warning we also need to not become too reliant on the Resume Build feature of the Multijob Plugin in Jenkins as that plugin is deprecated -- if a future Jenkins update ever breaks that plugin we'd need to migrate (most likely to Jenkins pipelines) and we'd lose the Resume Build feature since it's part of that plugin (I don't know if Pipelines has an equivalent).

thomasrockhu-codecov · 2024-09-11T18:38:06Z

Got it, thanks for the info @richardlau! I suppose then that our flaky test product is not useful right now for this case.

lpinca · 2024-09-13T08:14:49Z

There is a common issue (deadlock during process shutdown) behind many (all?) timeout failures in our CI. I've opened a specific issue for this: nodejs/node#54918.

pmarchini · 2024-09-29T10:24:47Z

Hi everyone, while working on some flaky tests, I implemented this very basic tool (https://github.com/pmarchini/giogo) that leverages cgroups to limit resources (memory, CPU, IO).

Using this approach, I was able to reproduce the flakiness of some of the tests on my local machine.
Note: in my case, I was working on test runner tests, and limiting the processor allowed me to reproduce the issue most of the time.
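A rough sketch of the same idea, for anyone who wants to try it without the tool: run a test repeatedly under a CPU limit and count how often it fails. This swaps in `systemd-run` with the `CPUQuota=` property instead of giogo, so it assumes a Linux host with systemd (and may need root or cgroup delegation for the quota to apply); `with_cpu_quota` and `count_failures` are hypothetical helper names.

```python
import subprocess


def with_cpu_quota(cmd, percent):
    """Wrap `cmd` so it runs in a transient systemd scope with a CPU quota."""
    return [
        "systemd-run", "--user", "--scope",
        "-p", f"CPUQuota={percent}%",
        *cmd,
    ]


def count_failures(cmd, runs, run=subprocess.run):
    """Run `cmd` `runs` times and count the runs with a non-zero exit code."""
    failures = 0
    for _ in range(runs):
        if run(cmd).returncode != 0:
            failures += 1
    return failures
```

Something like `count_failures(with_cpu_quota(["node", "path/to/test.js"], 20), 50)` would then give a failure count out of 50 runs; the test path and the 20% quota are placeholders to adjust.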

I hope this could be helpful to someone else.
If you have any ideas or want to contribute, feel free to open a PR

bnoordhuis · 2024-10-30T11:17:04Z

Instead of technical solutions, there's a simple social one: pay someone to investigate and fix flaky tests. It's soul crushing work, no one is going to do that for fun.

The foundation is still swimming in money, right? Might as well put it to better use than marketing and lawyers.

richardlau · 2024-10-30T12:17:17Z

Instead of technical solutions, there's a simple social one: pay someone to investigate and fix flaky tests. It's soul crushing work, no one is going to do that for fun.

FYI That is being planned. The statement of work is being drafted in #1629.

bnoordhuis · 2024-10-30T13:37:08Z

I just saw that, nice! The fact it's a paid position got kind of buried but it's there.
