Fix Dr.CI flaky FP when GH fails to dispatch the workflow #4998

Merged

Conversation

huydhn
Contributor

@huydhn huydhn commented Mar 12, 2024

Fixes #4987

The Dr.CI checks isInfraFlakyJob and isLogClassifierFailed have a false positive (FP): they misclassify GitHub's failure to dispatch a whole workflow as flaky, for example in pytorch/pytorch#121317.

These checks should apply only to workflow jobs, not to workflow runs. The two can be told apart by the workflowId field, which is set to null whenever the record is a workflow run.
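
A minimal sketch of the job-versus-run distinction described above, assuming a simplified record shape. `JobOrRun` and `isWorkflowJob` are hypothetical names used for illustration and are not taken from the Dr.CI code; only the `workflowId` field and its null-for-run convention come from this description.

```typescript
// Hypothetical, simplified record shape; the real Dr.CI type has more fields.
interface JobOrRun {
  id: number;
  workflowId?: number | string | null; // null (or absent) for a workflow run
  conclusion: string;
}

// A record is treated as a workflow job only when it carries a workflow ID.
// `!= null` is deliberate: it rejects both null and undefined.
function isWorkflowJob(record: JobOrRun): boolean {
  return record.workflowId != null;
}

const run: JobOrRun = { id: 1, workflowId: null, conclusion: "failure" };
const job: JobOrRun = { id: 2, workflowId: 6420960240, conclusion: "failure" };

console.log(isWorkflowJob(run)); // false
console.log(isWorkflowJob(job)); // true
```

Job-only checks such as the two named above would return early whenever this predicate is false.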

Testing

Unit tests, plus a local curl command that now marks these failures as legitimate:

curl --request POST \
--url "http://localhost:3000/api/drci/drci?prNumber=121317" \
--header "Authorization: TOKEN" \
--data 'repo=pytorch'

@huydhn huydhn requested review from malfet, clee2000 and a team March 12, 2024 02:51

vercel bot commented Mar 12, 2024

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot facebook-github-bot added the CLA Signed label Mar 12, 2024

vercel bot commented Mar 12, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: torchci · Status: ✅ Ready · Updated (UTC): Mar 12, 2024 2:54am


// Has log and failure lines and is a workflow job
mockFailure.workflowId = "A";
expect(await isLogClassifierFailed(mockFailure)).toEqual(false);
Contributor

is this line supposed to be the same as the one above it?

Contributor Author

@huydhn huydhn Mar 12, 2024

I tried to add two tests here:

  1. The first one has no workflow ID, which marks it as a workflow run, so isLogClassifierFailed returns false as the default value.
  2. The second one has a workflow ID (so it's a job), hasS3Log returns true, and it has failure lines, so isLogClassifierFailed should return false.

So yes, both of them expect false as the returned value.
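
The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the Dr.CI implementation: `FailureRecord`, `HasS3Log`, `isLogClassifierFailedSketch`, and `demo` are hypothetical names, the S3 lookup is replaced by an injected stub, and the rule "a failed job without a log or without failure lines counts as a classifier failure" is inferred from this comment rather than confirmed by the source.

```typescript
// Hypothetical, simplified failure record; field names mirror the discussion.
interface FailureRecord {
  workflowId?: string | number | null; // absent or null for a workflow run
  conclusion: string;
  failure_lines?: string[] | null;
}

// Stand-in for the real S3 lookup, injected so the sketch needs no network.
type HasS3Log = (record: FailureRecord) => Promise<boolean>;

async function isLogClassifierFailedSketch(
  record: FailureRecord,
  hasS3Log: HasS3Log
): Promise<boolean> {
  // A workflow run is never checked by the log classifier: default to false.
  if (record.workflowId == null) {
    return false;
  }
  const hasFailureLines = (record.failure_lines ?? []).join("") !== "";
  const hasLog = await hasS3Log(record);
  // Assumed rule: a failed job lacking a log or failure lines is a classifier failure.
  return record.conclusion === "failure" && (!hasLog || !hasFailureLines);
}

async function demo(): Promise<boolean[]> {
  const alwaysHasLog: HasS3Log = async () => true;
  const failure: FailureRecord = { conclusion: "failure", failure_lines: ["ERROR"] };
  // Case 1: no workflow ID, so the record is a workflow run.
  const asRun = await isLogClassifierFailedSketch(failure, alwaysHasLog);
  // Case 2: workflow ID set, log present, failure lines present.
  const asJob = await isLogClassifierFailedSketch({ ...failure, workflowId: "A" }, alwaysHasLog);
  return [asRun, asJob];
}

demo().then((results) => console.log(results)); // [ false, false ]
```

Both cases return false, but for different reasons: the first exits at the workflow-run guard, the second passes the guard and finds both a log and failure lines.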

@@ -334,6 +334,7 @@ describe("Test various utils used by Dr.CI", () => {
});

test("test isInfraFlakyJob", () => {
// Not a workflow job
Contributor

Maybe I'm reading this wrong, but do you have any tests where it would have been flaky if it were a workflow job, but not if it is a workflow run?

Contributor Author

@huydhn huydhn Mar 12, 2024

There are examples from the original PR #4622, which implemented isInfraFlakyJob. They were all jobs that weren't run. The GitHub records are gone, but the jobs are still shown on HUD: https://hud.pytorch.org/pr/110608

Contributor Author

@huydhn huydhn Mar 12, 2024

By querying Rockset, I can confirm that there were infra flaky jobs from 110608. Here is an example:

workflowID = 6420960240 <-- Has workflow ID, so this is a job
id = 17434762139
runnerName = '' <-- Not having a runner name is a signal for infra flaky
authorEmail = 'eltociear@gmail.com'
name = 'pull / linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge)'
jobName = 'linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge)'
conclusion = 'failure'
failure_captures = null

Contributor

Sorry, what I mean is that I don't see tests where the existence of a workflowID has an effect on the function's output. I only see cases that return false regardless of whether the workflowID is there. This makes me think the original code would also have returned false, so I'm not sure whether the change works. Am I missing something?
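
A test of the kind asked about here, where only the workflow ID changes and the result flips, might look like the sketch below. It is not the Dr.CI implementation: `JobRecord` and `isInfraFlakyJobSketch` are hypothetical names, and the combination of conditions (failed, no runner name, no failure captures) is an assumption based on the Rockset record quoted in this thread.

```typescript
// Hypothetical, simplified record; values echo the Rockset example in this thread.
interface JobRecord {
  workflowId?: number | null; // null or absent for a workflow run
  runnerName?: string | null;
  conclusion: string;
  failure_captures?: string[] | null;
}

function isInfraFlakyJobSketch(record: JobRecord): boolean {
  const isJob = record.workflowId != null; // rejects null and undefined
  const noRunner = !record.runnerName; // null, undefined, or ''
  const noFailureCaptures = (record.failure_captures ?? []).length === 0;
  return isJob && record.conclusion === "failure" && noRunner && noFailureCaptures;
}

// Same record twice; only the workflow ID differs.
const asJob: JobRecord = {
  workflowId: 6420960240,
  runnerName: "",
  conclusion: "failure",
  failure_captures: null,
};
const asRun: JobRecord = { ...asJob, workflowId: null };

console.log(isInfraFlakyJobSketch(asJob)); // true
console.log(isInfraFlakyJobSketch(asRun)); // false
```

Without the `isJob` condition both records would be classified as infra flaky, which is the false positive this PR targets.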

@huydhn huydhn requested a review from clee2000 March 12, 2024 20:20
@huydhn huydhn merged commit 630817a into pytorch:main Mar 12, 2024
7 checks passed
Labels
CLA Signed
Development

Successfully merging this pull request may close these issues.

DrCI should never classify failure to run a workflow as flaky
4 participants