Fix Dr.CI flaky FP when GH fails to dispatch the workflow #4998

huydhn · 2024-03-12T02:51:57Z

The Dr.CI logic in isInfraFlakyJob and isLogClassifierFailed has a false positive (FP): when GitHub fails to dispatch a whole workflow, the failure is misclassified as flaky, for example in pytorch/pytorch#121317.

This logic should apply only to workflow jobs, not workflow runs. The two can be told apart by the workflowId field, which is set to null whenever the record is a workflow run.
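As a rough illustration of that check, here is a minimal sketch. The `WorkflowRecord` shape, the `isWorkflowJob` helper, and the sample IDs are hypothetical and not taken from the torchci code; only the rule "no workflowId means workflow run" comes from the description above.

```typescript
// Hypothetical minimal record shape; the real Dr.CI type has more fields.
interface WorkflowRecord {
  id: number;
  name: string;
  workflowId: number | string | null | undefined;
}

// A record with a workflowId is a job that belongs to a workflow.
// A record without one is the workflow run itself.
function isWorkflowJob(record: WorkflowRecord): boolean {
  return record.workflowId !== null && record.workflowId !== undefined;
}

// Illustrative records only; the IDs are made up.
const records: WorkflowRecord[] = [
  { id: 1, name: "pull", workflowId: null }, // workflow run
  { id: 2, name: "pull / linux-jammy-py3.8-gcc11 / test", workflowId: 1 }, // workflow job
];

// Only jobs should reach the flaky heuristics; runs are filtered out first.
const jobsOnly = records.filter(isWorkflowJob);
console.log(jobsOnly.map((r) => r.id)); // only id 2 remains
```

Checking both `null` and `undefined` is a defensive choice in this sketch, since a record built from a partial query result may omit the field entirely.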

Testing

Unit tests, plus the local curl command below, now mark them as legit failures:

curl --request POST \
--url "http://localhost:3000/api/drci/drci?prNumber=121317" \
--header "Authorization: TOKEN" \
--data 'repo=pytorch'


clee2000 · 2024-03-12T19:53:35Z

torchci/test/drciUtils.test.ts

+
+    // Has log and failure lines and is a workflow job
+    mockFailure.workflowId = "A";
+    expect(await isLogClassifierFailed(mockFailure)).toEqual(false);


is this line supposed to be the same as the one above it?

I tried to add 2 tests here:

The first one without workflow ID to mark it as a workflow run, isLogClassifierFailed returns false as the default value

The second one with workflow ID (so it's a job) + hasS3Log returning true + has failure lines, so isLogClassifierFailed should return false

So yeah, both of them expect false as the returned value.
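The two cases described above can be sketched as follows. This is not the torchci implementation: `JobRecord`, `isLogClassifierFailedSketch`, and the injected `hasLog` callback are hypothetical stand-ins, and the rule "the classifier failed when the log is missing or no failure lines were extracted" is an assumption made for the sketch.

```typescript
// Hypothetical, trimmed-down record; field names are illustrative.
interface JobRecord {
  workflowId: string | null; // null means the record is a workflow run
  failureLines: string[];
}

// Stand-in for the S3 log lookup, injected so the sketch runs offline.
type HasLog = (job: JobRecord) => Promise<boolean>;

async function isLogClassifierFailedSketch(
  job: JobRecord,
  hasLog: HasLog
): Promise<boolean> {
  // Workflow runs have no job log to classify, so default to false.
  if (job.workflowId === null) {
    return false;
  }
  // Assumption: a missing log or no extracted failure lines means the
  // classifier did not do its work for this job.
  const logExists = await hasLog(job);
  return !logExists || job.failureLines.length === 0;
}

async function demo(): Promise<boolean[]> {
  const alwaysHasLog: HasLog = async () => true;
  const run: JobRecord = { workflowId: null, failureLines: ["error"] };
  const job: JobRecord = { workflowId: "A", failureLines: ["error"] };
  const jobNoLines: JobRecord = { workflowId: "A", failureLines: [] };
  return [
    await isLogClassifierFailedSketch(run, alwaysHasLog), // first test case
    await isLogClassifierFailedSketch(job, alwaysHasLog), // second test case
    await isLogClassifierFailedSketch(jobNoLines, alwaysHasLog), // contrast case
  ];
}

const results: Promise<boolean[]> = demo();
```

The first two entries mirror the two tests and both resolve to false. The third is an extra contrast case, under the assumption stated above, in which a job with a log but no failure lines resolves to true.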

clee2000 · 2024-03-12T19:56:04Z

torchci/test/drciUtils.test.ts

@@ -334,6 +334,7 @@ describe("Test various utils used by Dr.CI", () => {
  });

  test("test isInfraFlakyJob", () => {
+    // Not a workflow job


Maybe I'm reading this wrong, but do you have any tests where it would have been flaky if it were a workflow job, but not if it is a workflow run?

There are examples from the original PR #4622 that implemented isInfraFlakyJob. They were all jobs that weren't run. The GitHub records are gone, but the jobs are still shown on HUD: https://hud.pytorch.org/pr/110608

By querying Rockset, I can confirm that there were infra flaky jobs from 110608. Here is an example:

workflowID = 6420960240 <-- Has workflow ID, so this is a job
id = 17434762139
runnerName = '' <-- Not having a runner name is a signal for infra flaky
authorEmail = 'eltociear@gmail.com'
name = 'pull / linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge)'
jobName = 'linux-jammy-py3.8-gcc11 / test (distributed, 2, 2, linux.2xlarge)'
conclusion = 'failure'
failure_captures = null

Sorry, what I mean is that I don't see tests where the existence of a workflowID has an effect on the output of the function. I only see cases that return false regardless of whether the workflowID is there. This makes me think that the original code would also have returned false, so I'm confused as to whether the code works. Am I missing something?
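A test pair of the kind asked about here, where only the workflow ID differs between two otherwise identical records, might look like the sketch below. `FailureRecord` and `isInfraFlakySketch` are hypothetical simplifications rather than the real isInfraFlakyJob; the field values are borrowed from the Rockset record quoted above.

```typescript
// Hypothetical, trimmed-down record; the real Dr.CI type has more fields.
interface FailureRecord {
  workflowId: number | null; // null means the record is a workflow run
  runnerName: string;
  failureCaptures: string[] | null;
  conclusion: string;
}

// Simplified stand-in for the infra-flaky heuristic: a failed workflow job
// with no runner name and no failure captures.
function isInfraFlakySketch(r: FailureRecord): boolean {
  const isJob = r.workflowId !== null;
  const noRunner = r.runnerName === "";
  const noCaptures =
    r.failureCaptures === null || r.failureCaptures.length === 0;
  return isJob && r.conclusion === "failure" && noRunner && noCaptures;
}

// Identical signals in both records; only workflowId differs.
const base = { runnerName: "", failureCaptures: null, conclusion: "failure" };
const asJob: FailureRecord = { ...base, workflowId: 6420960240 };
const asRun: FailureRecord = { ...base, workflowId: null };

const flakyAsJob = isInfraFlakySketch(asJob); // job: flagged as infra flaky
const flakyAsRun = isInfraFlakySketch(asRun); // run: not flagged
console.log({ flakyAsJob, flakyAsRun });
```

Because every other field is held constant, a difference between the two results can only come from the workflow ID check, which is the property such a test would need to demonstrate.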

Fix Dr.CI flaky FP when GH fails to dispatch the workflow

2be5a6b

huydhn requested review from malfet, clee2000 and a team March 12, 2024 02:51

facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Mar 12, 2024


huydhn mentioned this pull request Mar 12, 2024

[drci] Workflow file errors remain after they are retried #4969

Closed

clee2000 reviewed Mar 12, 2024


huydhn requested a review from clee2000 March 12, 2024 20:20

clee2000 approved these changes Mar 12, 2024


malfet approved these changes Mar 12, 2024


huydhn merged commit 630817a into pytorch:main Mar 12, 2024
7 checks passed
