Classify infra failures without any associated runners as flaky #4622

huydhn · 2023-10-06T03:38:13Z

This handles cases like pytorch/pytorch#110608 or pytorch/pytorch#110510, where there were a bunch of flaky infra failures in which the runner crashed and no log was found. The runner_name and failure_line fields are both empty in such cases. Having no associated runner strongly suggests that the failure is an unrelated infra flake, though it is not a guarantee.
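
For illustration only, the check described above could be sketched roughly as follows. This is not the code from this PR: the `JobLike` interface, the `isBlank` helper, and the name `looksLikeInfraFlake` are hypothetical, and the real `RecentWorkflowsData` type may name or shape these fields differently.

```typescript
// Hypothetical, minimal job shape (an assumption for this sketch). The field
// names follow the description above; the real type has more fields.
interface JobLike {
  conclusion: string | null;
  runner_name: string | null;
  failure_line: string | null;
}

// True when a field is missing or contains only whitespace.
function isBlank(value: string | null | undefined): boolean {
  return value === null || value === undefined || value.trim() === "";
}

// Heuristic: a failed job with neither a runner nor a failure line is treated
// as a likely infra flake. Any job that has a runner or a captured failure
// line is left to the other classification rules.
function looksLikeInfraFlake(job: JobLike): boolean {
  return (
    job.conclusion === "failure" &&
    isBlank(job.runner_name) &&
    isBlank(job.failure_line)
  );
}

// Usage: a crashed job with no runner and no log, and a job that ran normally.
const crashed: JobLike = { conclusion: "failure", runner_name: "", failure_line: null };
const ranOnRunner: JobLike = {
  conclusion: "failure",
  runner_name: "i-0abf34dcfd6b840ab",
  failure_line: "some captured failure line",
};
console.log(looksLikeInfraFlake(crashed), looksLikeInfraFlake(ranOnRunner));
```

Requiring both fields to be blank keeps the heuristic conservative: a job that actually ran somewhere is never classified as flaky by this rule alone.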

Testing

With Fix typo in BatchLinearAlgebraLibBlas.cpp pytorch#110608

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110608

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 2c38c884c7a8a39d713167cdc789d0e0f332f019 with merge base f17fe89e14ef7c29690d989c857ae011b8589b80:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

With [aotinductor] Avoid generating redundant kernel loading code pytorch#110510

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110510

📄 Preview Python docs built from this PR
📄 Preview C++ docs built from this PR
❓ Need help or want to give feedback on the CI? Visit the bot commands wiki or our office hours

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 7 Unrelated Failures

As of commit b4b55bd4421e4af1f6749a8ebaa557a49e66c9ae with merge base cf1b494afd0d0368c22e70e93d91da3d9fe1ddce:

NEW FAILURE - The following job has failed:

inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor_torchbench, 1, 1, linux.g5.4xlarge.nvidia.gpu) (gh)

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vercel · 2023-10-06T03:38:17Z

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

vercel · 2023-10-06T03:43:46Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
torchci	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Oct 6, 2023 3:46am

huydhn · 2023-10-06T03:46:45Z

torchci/rockset/commons/__sql/commit_failed_jobs.sql

  j.conclusion,
  j.completed_at,
  j.html_url,
  j.head_sha,
  j.head_branch,
  j.torchci_classification.captures AS failure_captures,
  j.torchci_classification.line AS failure_line,
-  j._event_time as time,
+  j._event_time AS time,


I'll follow up with a separate PR to remove the usage of _event_time in these queries.

ZainRizvi · 2023-10-06T18:59:19Z

torchci/lib/drciUtils.ts

@@ -315,3 +315,18 @@ export async function hasSimilarFailures(

  return false;
 }
+
+export function isInfraFlakyJob(job: RecentWorkflowsData): boolean {


Probably worth mentioning that this is not documented GitHub behavior and is subject to change (but we can take advantage of it while it lasts, since it fails gracefully).

Just FYI, there is another kind of infra flake that I still need to figure out how to classify correctly. pytorch/pytorch#110668 is an example where one job fails with the infamous error:

The self-hosted runner: i-0abf34dcfd6b840ab lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

This case is an infra flake, but there could be rare legit OOM build issues (remember the FlashAttentionV2 build) that end up with exactly the same error message, so it's hard to separate them. Basically, if it's really OOM, it's a legit failure. Otherwise, it's flaky infra.
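
To make the ambiguity concrete, here is a small sketch that only detects the error message and deliberately stops short of calling it flaky. The `RunnerLostVerdict` type, the `RUNNER_LOST` pattern, and `classifyRunnerLost` are hypothetical names for this illustration, and the pattern is based solely on the message quoted above.

```typescript
// Hypothetical verdicts: the message alone cannot tell an infra flake from a
// legitimate OOM, so a match is reported as "ambiguous" rather than "flaky".
type RunnerLostVerdict = "ambiguous" | "not-runner-lost";

// Matches the message quoted above; the runner id is treated as an opaque token.
const RUNNER_LOST = /The self-hosted runner: \S+ lost communication with the server/;

// Detects the "lost communication" message without deciding the root cause.
function classifyRunnerLost(message: string): RunnerLostVerdict {
  return RUNNER_LOST.test(message) ? "ambiguous" : "not-runner-lost";
}

// Usage with the message from the example job.
const sample =
  "The self-hosted runner: i-0abf34dcfd6b840ab lost communication with the server. " +
  "Verify the machine is running and has a healthy network connection.";
console.log(classifyRunnerLost(sample));
```

Telling the two cases apart would need an extra signal beyond the message itself, which is exactly the part that is still unresolved.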

huydhn added 2 commits October 5, 2023 20:33

Mark infra failures without any associated runners as flaky

199ac87

Fix test

b349e9f

facebook-github-bot added the CLA Signed label Oct 6, 2023

vercel bot deployed to Preview October 6, 2023 03:43

Fix typo

e507c36

huydhn changed the title ~~Mark infra failures without any associated runners as flaky~~ Classify infra failures without any associated runners as flaky Oct 6, 2023

huydhn requested review from clee2000, ZainRizvi and a team October 6, 2023 03:44

vercel bot deployed to Preview October 6, 2023 03:46

huydhn commented Oct 6, 2023

huydhn marked this pull request as draft October 6, 2023 06:14

huydhn marked this pull request as ready for review October 6, 2023 08:02

ZainRizvi approved these changes Oct 6, 2023

huydhn merged commit f851855 into pytorch:main Oct 6, 2023
4 checks passed

huydhn mentioned this pull request Mar 12, 2024

Fix Dr.CI flaky FP when GH fails to dispatch the workflow #4998

Merged
