Investigate workflow job failing on main: e2ePerformanceTests / Run E2E tests in AWS device farm #48824

Open
github-actions bot opened this issue Sep 9, 2024 · 16 comments

Comments

@github-actions
Contributor

github-actions bot commented Sep 9, 2024

🚨 Failure Summary 🚨:

⚠️ Action Required ⚠️:

🛠️ A recent merge appears to have caused a failure in the job named e2ePerformanceTests / Run E2E tests in AWS device farm.
This issue has been automatically created and labeled with Workflow Failure for investigation.

👀 Please look into the following:

  1. Why did the PR cause the job to fail?
  2. Address any underlying issues.

🐛 We appreciate your help in squashing this bug!

Current Issue Owner: @kirillzyusko
@dangrous
Contributor

Investigation in progress!

@dangrous dangrous added Daily KSv2 and removed Hourly KSv2 labels Sep 13, 2024
@dangrous
Contributor

Working on getting the logs. It's not related to the linked PR, but I'm keeping this open as a daily for that investigation.

@melvin-bot melvin-bot bot added the Overdue label Sep 16, 2024

melvin-bot bot commented Sep 17, 2024

@dangrous Whoops! This issue is 2 days overdue. Let's get this updated quick!

@dangrous
Contributor

I believe the Margelo team is on it, in that same Slack thread. @kirillzyusko let me know if you want me to assign you here!

@melvin-bot melvin-bot bot removed the Overdue label Sep 17, 2024
@kirillzyusko
Contributor

@dangrous yeah, feel free to assign me on this!


melvin-bot bot commented Sep 24, 2024

@dangrous, @kirillzyusko Eep! 4 days overdue now. Issues have feelings too...

@kirillzyusko
Contributor

It failed because of a timeout: we hit the limit of 5400 s (1.5 h).

I think we merged PR #47777, which increases it to 7200 s (2 h). Do you think we can close the issue?

@melvin-bot melvin-bot bot removed the Overdue label Sep 24, 2024
@dangrous
Contributor

From the screengrab, it looks like it crashed though, right? And that's what caused the timeout, since the app never reopened? We should see if we can figure out what that crash was.

@kirillzyusko
Contributor

@dangrous yeah, you are right, but from my observation:

  • these crashes happen only in e2e tests (most likely because we use the Flashlight tool);
  • I tried to read the stack trace, but it was C++ code with hard-to-read stack traces, so I didn't get any useful insight into what is causing these crashes.

In fact, in our e2e tests we allow a test to crash 3 times during its 60 runs, and we rely on this. The problem is that when a test crashes, we wait 5 minutes before force-quitting it (we have a 5-minute timeout per test). If we get 2 random failures in any test, that adds 10 minutes of overhead for 1 test suite. We have 5 test suites, so the retry mechanism can potentially add ~50 minutes to our test run 🤷‍♂️ I think that's why we hit the limit in this particular test.

One optimization I've been thinking of is reducing the timeout interval (from 5 minutes to 2.5 minutes). But I think we need to ask @hannojg why such a relatively long timeout was chosen for e2e tests.
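
For reference, the overhead estimate above can be checked with a quick back-of-the-envelope calculation. This is only a sketch of the arithmetic in this comment, not code from the e2e runner: the function and variable names are hypothetical, and it assumes the worst case described here (2 crashes per suite, each costing one full timeout).

```python
# Back-of-the-envelope estimate of retry overhead in the e2e run.
# All names are hypothetical; the figures come from the comment above.

JOB_LIMIT_S = 5400          # job limit that was hit (1.5 h)
TEST_SUITES = 5             # number of test suites
CRASHES_PER_SUITE = 2       # assumed random crashes per suite


def retry_overhead_minutes(timeout_min: float,
                           crashes_per_suite: int = CRASHES_PER_SUITE,
                           suites: int = TEST_SUITES) -> float:
    """Worst-case minutes spent waiting for crashed tests to time out."""
    return timeout_min * crashes_per_suite * suites


# Compare the current 5-minute timeout with the proposed 2.5 minutes.
for timeout_min in (5, 2.5):
    overhead = retry_overhead_minutes(timeout_min)
    share = overhead * 60 / JOB_LIMIT_S
    print(f"timeout {timeout_min} min -> {overhead:g} min overhead "
          f"({share:.0%} of the {JOB_LIMIT_S} s limit)")
```

Under these assumptions, halving the timeout halves the worst-case retry overhead, from 50 to 25 minutes.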

@dangrous
Contributor

Oh okay, that makes sense. Yeah, I feel like we could even go shorter than 2.5 mins: if something is hanging for more than, say, 1 minute, then something is wrong enough that we should look at it. But I'm curious what @hannojg thinks. Or if he's still OOO, I think we can close this in the meantime.

@hannojg
Contributor

hannojg commented Oct 1, 2024

Agree, we can definitely make this timeout interval shorter!

@dangrous
Contributor

dangrous commented Oct 2, 2024

Great! @kirillzyusko do you want to put up a PR to drop that timeout? Maybe start with 2.5 mins and we see how that goes. We could probably go even shorter, but that seems like a good starting point.

@melvin-bot melvin-bot bot added the Overdue label Oct 2, 2024
@hannojg
Contributor

hannojg commented Oct 3, 2024

Kiryl is OOO, and will be back next week to pick this one up!


melvin-bot bot commented Oct 3, 2024

@dangrous, @kirillzyusko Uh oh! This issue is overdue by 2 days. Don't forget to update your issues!


melvin-bot bot commented Oct 7, 2024

@dangrous, @kirillzyusko 6 days overdue. This is scarier than being forced to listen to Vogon poetry!

3 participants