
Initial discussion around CI failure handling #1

Open
joyeecheung opened this issue Jun 6, 2018 · 15 comments

@joyeecheung
Member

joyeecheung commented Jun 6, 2018

Specifically

After we settle on those, I can start implementing the actions in ncu-ci to automate the CI status reporting and flake tracking process.

@joyeecheung
Member Author

BTW, on why I chose to include the full URL in flakes.draft.json: the Microsoft acquisition thing (it's unlikely that we will want to move, but just in case).
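To illustrate the point about full URLs: the draft format was never finalized, so this is only a hypothetical sketch of what an entry could look like, with made-up field names and NNNNN placeholders instead of real issue/build numbers.

```json
{
  "name": "test-some-flaky-case",
  "issue": "https://github.com/nodejs/node/issues/NNNNN",
  "failures": [
    "https://ci.nodejs.org/job/node-test-pull-request/NNNNN/"
  ]
}
```

Storing the absolute ci.nodejs.org and github.com URLs, rather than bare issue or build numbers, keeps the data meaningful even if the project ever moves to another host.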

@sam-github

sam-github commented Jun 7, 2019

@joyeecheung @nodejs/tsc Should this/could this be a strategic initiative?

Also, what is the current state? Flakes seem to be tracked as issues in nodejs/node. I'd never seen this repo until moments ago!

FYI, I've contacted some folks who have offered to look at Windows platform issues in the past, and got some promises that some of them will look at the Windows flakiness.

https://ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master/ is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.

I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, it's not helpful to have unrelated test failures on PR builds.

P.S. RSS of Daily Master failures

@refack

refack commented Jun 7, 2019

> ci.nodejs.org/view/Node.js%20Daily/job/node-daily-master is discouraging, and helps explain why I've had so much trouble getting trivial PRs green in CI recently.

Currently @Trott and myself get emails when "Daily Master" fails. I wanted to set up a mailing list or an RSS/ATOM feed for that so anyone can easily track its status.

> I've started adding issues and PRing flaky status to tests. We "should" fix them, but while they are not fixed, it's not helpful to have unrelated test failures on PR builds.

IMHO that's currently our best way to structure the known-flaky DB as test configuration (see the sketch below). But without a champion those can turn from a temporary measure into permanent "solutions", as we do have some tests that have been flaky for a long time, e.g. nodejs/node#23207, nodejs/node#26401, nodejs/node#20750.
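For context, the "flaky status" being PRed here lives in the test status files in nodejs/node (e.g. test/parallel/parallel.status), which the test runner consults. A minimal sketch of such an entry, with a made-up test name:

```
# test/parallel/parallel.status (illustrative entry; the test name is made up)
[$system==win32]
test-some-flaky-case: PASS,FLAKY
```

The CI is typically configured so that a failure of a PASS,FLAKY test on the matching platform is reported but does not turn the run red.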

@sam-github

If it's a strategic initiative, the current count of ignored tests will get reported at every TSC meeting. No guarantee they will get fixed, of course, but at least they will be visible.

@sam-github

I'd like to find a place to point contributors at the flakes issues as good first contributions, but couldn't find one. I was hoping to find somewhere in our docs that pointed to the good first issue label, but the only reference I found was in doc/onboarding-extras.md, which seems a bit obscure. Is there no "so, you want to help out with Node.js, great, look here" section of our docs?

Perhaps I should simply label all the ci flakes as good first issue?

I think they are good first contributions because they often require no knowledge of Node internals. If the tests are written entirely in terms of the public Node.js API, then someone with sufficient knowledge of the API alone could take a crack at any of them. For example, some expert user of workers might not have the C++ ability to contribute to the worker implementation, but be perfectly capable of reasoning carefully about the worker tests and making them more robust. Fixing a flake also adds appreciable value to the project, so they make great drive-by contributions for those so inclined.

@Trott
Member

Trott commented Jun 7, 2019

I'd use help wanted and maybe even mentor available for the flaky tests, but definitely not good first issue. They often require very specific knowledge and/or access to a particular platform, often at a particular version. There's also a lot of what you might call "institutional knowledge" about how to reliably reproduce the flakiness, since a straightforward run of node-stress-single-test often doesn't do it.

@gireeshpunathil
Member

Addressing CI failures requires organized and continued team work. Individual efforts are not sustainable, and people get worn out easily.

I agree with @Trott, these are definitely not good first issues.

Often, the problem determination of these failures takes two phases:

  • recreating the failure and reducing the test to its fundamental form, so that it involves only the failing expression
  • analyzing the faulty path, connecting the dots, coming up with a theory, and deriving or collecting patches to prove or disprove it

The second part is really hard.

@gireeshpunathil
Member

I am +1 on running this through tsc.

@joyeecheung
Member Author

joyeecheung commented Jun 9, 2019

The idea proposed in the OP has never been used in practice. Sometimes I use ncu-ci walk pr to gather suspicious failures and their patterns and open a thread here, but I have not been able to do so for some time now, and ncu-ci now sometimes has trouble understanding Windows failures (somehow Jenkins CI doesn't understand those either). The source of truth about the flakes is still the nodejs/node issue tracker.
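For anyone who has not used it: ncu-ci ships with node-core-utils. A rough sketch of the invocation referred to above, assuming a globally installed and configured node-core-utils:

```sh
# Install the tooling that provides the ncu-ci command
npm install -g node-core-utils

# Walk recent node-test-pull-request CI runs and aggregate suspicious failures
ncu-ci walk pr
```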

I think it's worth being a strategic initiative, but we'll need a champion (or more) who has enough time to keep an eye on this.

@Trott
Member

Trott commented Jun 9, 2019

Arguably, this could be a part of the existing strategic initiative for Build WG resources.

@mhdawson
Member

I'm definitely +1 on making this a strategic initiative if we can find a champion, or if @Trott is volunteering by integrating it into the existing Build strategic initiative. I agree that raising awareness of the flakes by reviewing the numbers in the TSC meeting would be a great first step.

@Trott
Member

Trott commented Jun 11, 2019

I'm happy to highlight a particularly problematic test during the announcements section of each meeting when I'm in attendance. Maybe sam-github or someone else can do it when I'm not around. This week, it would have to be the way sequential/test-cpu-prof is now timing out at or nearly at 100% of the time on ubuntu1604_sharedlibs_debug_x64, making it nearly impossible to land anything that isn't a doc change. This started about 48 hours ago, so only stuff that had CI runs before then tends to be landing right now. :-(

@sam-github

I'll think about what I have time to do (vs what I wish I could do with more time).

I looked at the build strategic initiative, and I'm not sure it is a fit for this. It is related, in that build infrastructure failures can manifest as CI failures. Problems with host memory, disk, Jenkins disconnections, tap2xml failures, and so on are within the scope of build resources.

However, the flaky tests in particular seem to be solidly non-build; they need developer effort. The build team doesn't seem to me like it should take on the extra scope of fixing unreliable github.com/nodejs/node unit tests; it's strained already.

@Trott
Member

Trott commented Jun 11, 2019


Refs: nodejs/node#27611

Also, Joyee has a proposed fix as of 7 hours ago.
nodejs/node#28170

@mhdawson
Member

@Trott while highlighting an urgent problem will have some benefit, I think reporting on the trend in CI stability would be more useful, i.e. are we slowly degrading or getting better, and how quickly. That is something I think the TSC should be on top of and work to try to influence, as opposed to simply informing the TSC that some issue needs more help right now and hoping the TSC members have time to help out on that specific issue. Just my 2 cents though.
