Classify infra failures without any associated runners as flaky #4622

Merged
huydhn merged 3 commits into pytorch:main
Oct 6, 2023

huydhn
Contributor

@huydhn huydhn commented Oct 6, 2023

This handles cases like pytorch/pytorch#110608 and pytorch/pytorch#110510, where a number of infra failures occurred because the runner crashed and no log was found. In such cases, the runner_name and failure_line fields are both empty. Having no associated runner guarantees that the failure is an unrelated infra flake.
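
The check described above can be sketched roughly as follows. This is a minimal illustration, not the merged code: `JobRecordSketch`, `isBlank`, and `isInfraFlakyJobSketch` are hypothetical names, and the real `RecentWorkflowsData` type in torchci has more fields and may treat empty values differently.

```typescript
// Hypothetical, trimmed-down job record. Only the two fields named in the
// PR description are modeled here.
interface JobRecordSketch {
  runner_name?: string | null;
  failure_line?: string | null;
}

// Treat null, undefined, and whitespace-only strings as "empty".
// (Assumption: the real data may only ever use one of these forms.)
function isBlank(value: string | null | undefined): boolean {
  return value === null || value === undefined || value.trim() === "";
}

// A job with neither an associated runner nor a captured failure line is
// assumed to have crashed before producing a log, so it is classified as
// an infra flake.
function isInfraFlakyJobSketch(job: JobRecordSketch): boolean {
  return isBlank(job.runner_name) && isBlank(job.failure_line);
}
```

Under these assumptions, a job that has a runner name or a captured failure line is never classified as an infra flake by this check alone.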

Testing

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110608

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (7 Unrelated Failures)

As of commit 2c38c884c7a8a39d713167cdc789d0e0f332f019 with merge base f17fe89e14ef7c29690d989c857ae011b8589b80:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110510

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 7 Unrelated Failures

As of commit b4b55bd4421e4af1f6749a8ebaa557a49e66c9ae with merge base cf1b494afd0d0368c22e70e93d91da3d9fe1ddce:

NEW FAILURE - The following job has failed:

FLAKY - The following jobs failed but were likely due to flakiness present on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

vercel bot commented Oct 6, 2023

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

@facebook-github-bot facebook-github-bot added the CLA Signed label Oct 6, 2023

vercel bot commented Oct 6, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated (UTC)
torchci ✅ Ready (Inspect) Visit Preview 💬 Add feedback Oct 6, 2023 3:46am

@huydhn huydhn changed the title from "Mark infra failures without any associated runners as flaky" to "Classify infra failures without any associated runners as flaky" Oct 6, 2023
@huydhn huydhn requested review from clee2000, ZainRizvi and a team October 6, 2023 03:44
j.conclusion,
j.completed_at,
j.html_url,
j.head_sha,
j.head_branch,
j.torchci_classification.captures AS failure_captures,
j.torchci_classification.line AS failure_line,
-j._event_time as time,
+j._event_time AS time,
Contributor Author

I'll follow up with a separate PR to remove the usage of _event_time in these queries.

@huydhn huydhn marked this pull request as draft October 6, 2023 06:14
@huydhn huydhn marked this pull request as ready for review October 6, 2023 08:02
@@ -315,3 +315,18 @@ export async function hasSimilarFailures(

return false;
}

export function isInfraFlakyJob(job: RecentWorkflowsData): boolean {
Contributor

Probably worth mentioning that this is not documented GitHub behavior and could change (but we can take advantage of it while it lasts, since it fails gracefully).

Contributor Author

Just FYI, there is another kind of infra flake that I still need to figure out how to classify correctly. pytorch/pytorch#110668 is an example where one job fails because of the infamous error:

The self-hosted runner: i-0abf34dcfd6b840ab lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.

This case is an infra flake, but there could be rare legit OOM build issues (remember the FlashAttentionV2 build) that end up with exactly the same error message, so it's hard to separate them. Basically, if it's really OOM, it's a legit failure. Otherwise, it's flaky infra.
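
To illustrate why this case is hard, here is a small sketch that only recognizes the error message quoted above and flags the job as ambiguous instead of deciding between OOM and infra flake. The names `RUNNER_LOST_SKETCH`, `RunnerLostVerdict`, and `classifyRunnerLostSketch` are hypothetical and are not part of torchci; the regular expression is derived only from the message text in this thread.

```typescript
// Verdicts for this sketch. "ambiguous-runner-lost" means the message alone
// cannot tell a real OOM build failure from a flaky runner.
type RunnerLostVerdict = "ambiguous-runner-lost" | "other";

// Matches the "lost communication" message and captures the runner ID.
const RUNNER_LOST_SKETCH =
  /The self-hosted runner: (\S+) lost communication with the server/;

function classifyRunnerLostSketch(
  failureLine: string
): { verdict: RunnerLostVerdict; runner?: string } {
  const match = RUNNER_LOST_SKETCH.exec(failureLine);
  if (match === null) {
    return { verdict: "other" };
  }
  // The same message appears for both OOM and infra problems, so the sketch
  // deliberately stops at "ambiguous" and leaves the final call to a human
  // or to some additional signal not modeled here.
  return { verdict: "ambiguous-runner-lost", runner: match[1] };
}
```

Separating the two cases would need a signal beyond the error text, which is the open question raised in the comment above.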

@huydhn huydhn merged commit f851855 into pytorch:main Oct 6, 2023
4 checks passed