Classify infra failures without any associated runners as flaky #4622
Conversation
@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
```diff
 j.conclusion,
 j.completed_at,
 j.html_url,
 j.head_sha,
 j.head_branch,
 j.torchci_classification.captures AS failure_captures,
 j.torchci_classification.line AS failure_line,
-j._event_time as time,
+j._event_time AS time,
```
I'll follow up with a separate PR to remove the usage of `_event_time` in these queries.
```diff
@@ -315,3 +315,18 @@ export async function hasSimilarFailures(
   return false;
 }

+export function isInfraFlakyJob(job: RecentWorkflowsData): boolean {
```
Prob worth mentioning that this is not a documented GitHub behavior and is subject to change (but we can take advantage of it while it lasts, since it fails gracefully).
Just FYI, there is actually another case of infra flakiness that I still need to figure out how to classify correctly. pytorch/pytorch#110668 is an example where one job fails because of the infamous:

> The self-hosted runner: i-0abf34dcfd6b840ab lost communication with the server. Verify the machine is running and has a healthy network connection. Anything in your workflow that terminates the runner process, starves it for CPU/Memory, or blocks its network access can cause this error.
This case is an infra flake, but there could be rare legit OOM build issues (remember the FlashAttentionV2 build) that end up with exactly the same error message, so it's hard to separate them. Basically, if it's really OOM, it's a legit failure. Otherwise, it's flaky infra.
This handles cases like pytorch/pytorch#110608 or pytorch/pytorch#110510 where there were a bunch of infra flaky failures in which the runner crashed and no log was found. The `runner_name` and `failure_line` fields are both empty in such cases. Having no associated runner guarantees that the failure is an unrelated infra flake.

Testing
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110608
Note: Links to docs will display an error until the docs builds have been completed.
✅ You can merge normally! (7 Unrelated Failures)
As of commit 2c38c884c7a8a39d713167cdc789d0e0f332f019 with merge base f17fe89e14ef7c29690d989c857ae011b8589b80:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/110510
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 7 Unrelated Failures
As of commit b4b55bd4421e4af1f6749a8ebaa557a49e66c9ae with merge base cf1b494afd0d0368c22e70e93d91da3d9fe1ddce:
NEW FAILURE - The following job has failed:
FLAKY - The following jobs failed but were likely due to flakiness present on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes.