Packaging tests fail with "Gradle build daemon disappeared unexpectedly" #44623
Pinging @elastic/es-core-infra
It seems not all metal workers are created equal: we have a mix of workers with 32G of memory and others with 64G. That, in conjunction with our recent daemon-leaking issue, is probably causing the OOM killer to blow away the Gradle daemon.
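If daemon memory pressure is the culprit, one mitigation (independent of adding RAM) would be to cap the daemon heap so that even leaked daemons fit within 32G. A minimal sketch; the 2g value and the user-level properties path are assumptions, not the project's actual settings:

```sh
# Cap the Gradle daemon JVM heap via gradle.properties.
# org.gradle.jvmargs is a standard Gradle property; the value here is illustrative.
echo "org.gradle.jvmargs=-Xmx2g" >> ~/.gradle/gradle.properties

# Stop any running daemons so the new setting takes effect on the next build.
./gradlew --stop
```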
I've opened up https://github.com/elastic/infra/issues/13356 to have infra look into beefing up all our metal workers such that they have 64G of RAM.
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+packaging-tests/278/console is another case of this error occurring on a worker with 32GB RAM (https://elasticsearch-ci.elastic.co/computer/worker-854308). But https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+packaging-tests/277/console is a case of the same Jenkins job succeeding on a worker with 32GB RAM (https://elasticsearch-ci.elastic.co/computer/worker-854313). So it sometimes works on 32GB machines. But it is true that the cause of the failure in build 278 was the OOM killer, which I confirmed by looking in the syslog of the worker.
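For reference, a quick way to check a worker's logs for OOM-killer activity; the log paths here are an assumption and vary by distro (e.g. /var/log/messages on RHEL-family systems):

```sh
# Look for kernel OOM-killer messages in the syslog.
grep -i 'out of memory' /var/log/syslog

# The kernel ring buffer usually names the killed process
# (e.g. the java process backing the Gradle daemon).
dmesg | grep -iE 'oom|killed process'
```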
We are working to move these jobs to ephemeral workers using nested virtualization. I think this will solve this problem too.
Given that nested virtualization hasn't worked out as we hoped, and I see some failures as recently as a few days ago, I think it would be best to keep this on our radar. It is my understanding that we have reduced the current need for metal machines to only the vagrant jobs; what about requesting that we limit these jobs to metal workers that have 64GB of RAM?
I don't think we have a way of making that distinction right now. Perhaps we could manually label those workers that we know are "good". That might be good enough, given this is a single job and we run it only once a day.
That's kind of what I had in mind; these jobs would only run on machines with that label.
Any other machines will go unused, so it might make more sense to just remove (as in, return to infra) any machine that does not have 64GB.
I'm going to close this since we are only using the metal workers for the daily vagrant tests, and that is now configured as a matrix job, so memory pressure should be much lower: we are only spinning up a single VM per job. That build is unfortunately failing for other reasons, but we can reopen this if it crops up again.
There were a bunch of packaging test failures on different ES versions with the following error: "Gradle build daemon disappeared unexpectedly".
For example:
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+7.2+packaging-tests/173/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+6.8+packaging-tests/169/console