[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign failing #88337

Closed
mark-vieira opened this issue Jul 7, 2022 · 5 comments · Fixed by #88362
Labels
:ml Machine learning Team:Core/Infra Meta label for core/infra team >test-failure Triaged test failures from CI

Comments

@mark-vieira
Contributor

There have been two instances of this today (both in PR builds) that blew up due to an OOME.

Build scan:
https://gradle-enterprise.elastic.co/s/ryfsljqpbyjlm/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign

Reproduction line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign" -Dtests.seed=A3E13461A939BDF6 -Dtests.locale=ar-IQ -Dtests.timezone=America/Porto_Velho -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT&tests.test=testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([A3E13461A939BDF6]:0)

@mark-vieira mark-vieira added :ml Machine learning >test-failure Triaged test failures from CI labels Jul 7, 2022
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Jul 7, 2022
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

@benwtrent
Member

So, looking at the heap dump:
[image: heap dump]

A new object added in #88097 is taking 250MB of a ~500MB heap.

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/ILMImmutableStateHandlerProvider.java

@grcevski what does this provider do?
You can get the heap dump from the test results (or I can provide it directly; it's much too big to add here).

@grcevski
Contributor

grcevski commented Jul 7, 2022

Oh, that's unexpected. I guess I have a bug there; there should be only one object at the moment. It appears that the ILM plugin reloads many times and we add that immutable state handler many, many times. I'll take this over to fix.
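To illustrate the failure mode described above, here is a minimal sketch of how re-registering a handler on every reload can grow memory without bound, and how keying registration by name makes it idempotent. This is not the Elasticsearch code and not necessarily how the fix was done: `HandlerRegistrySketch`, `Handler`, `LeakyRegistry`, `IdempotentRegistry`, and `onPluginReload` are all hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these types exist in Elasticsearch.
class HandlerRegistrySketch {

    /** Stand-in for a state handler that holds on to some memory. */
    static final class Handler {
        final String name;
        final byte[] payload = new byte[1024]; // pretend per-handler state

        Handler(String name) {
            this.name = name;
        }
    }

    /** Leaky variant: every reload appends another handler instance. */
    static final class LeakyRegistry {
        private final List<Handler> handlers = new ArrayList<>();

        void onPluginReload() {
            handlers.add(new Handler("ilm"));
        }

        int size() {
            return handlers.size();
        }
    }

    /** Guarded variant: registration is keyed by name, so reloads are idempotent. */
    static final class IdempotentRegistry {
        private final Map<String, Handler> handlers = new LinkedHashMap<>();

        void onPluginReload() {
            // Only creates a handler the first time the name is seen.
            handlers.computeIfAbsent("ilm", Handler::new);
        }

        int size() {
            return handlers.size();
        }
    }

    public static void main(String[] args) {
        LeakyRegistry leaky = new LeakyRegistry();
        IdempotentRegistry guarded = new IdempotentRegistry();
        for (int i = 0; i < 1000; i++) {
            leaky.onPluginReload();
            guarded.onPluginReload();
        }
        System.out.println("leaky handlers:   " + leaky.size());
        System.out.println("guarded handlers: " + guarded.size());
    }
}
```

After 1000 simulated reloads, the leaky registry holds 1000 handler instances while the guarded one holds a single instance.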

@grcevski grcevski self-assigned this Jul 7, 2022
@grcevski grcevski added the Team:Core/Infra Meta label for core/infra team label Jul 7, 2022
@elasticmachine
Collaborator

Pinging @elastic/es-core-infra (Team:Core/Infra)

@grcevski grcevski removed the Team:ML Meta label for the ML team label Jul 7, 2022
@davidkyle
Member

davidkyle commented Jul 11, 2022

A different test in the same suite failed, this time with a ResourceNotFoundException, but looking at the logs the JVM is spending nearly all its time garbage collecting, so I'm guessing the root cause is the same.
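For anyone wanting to confirm this kind of GC pressure from inside the JVM rather than from the logs, here is a rough sketch using the standard `java.lang.management` beans. The observation above was made from the test logs, not with this code; `GcShareSketch` and `gcTimeFraction` are hypothetical names, and the result is only an approximation, since collectors that work concurrently can report time that overlaps with application time.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper; not part of Elasticsearch or its test framework.
class GcShareSketch {

    /** Approximate fraction of JVM uptime spent in garbage collection, clamped to [0, 1]. */
    static double gcTimeFraction() {
        long gcMillis = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // returns -1 if the collector does not report it
            if (t > 0) {
                gcMillis += t;
            }
        }
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) gcMillis / uptimeMillis);
    }

    public static void main(String[] args) {
        System.out.printf("approx. GC share of uptime: %.1f%%%n", gcTimeFraction() * 100.0);
    }
}
```

A value close to 1.0 would be consistent with a heap that is nearly full of live objects, as in the heap dump discussed earlier in this thread.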

https://gradle-enterprise.elastic.co/s/i4badycr5nt5k/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign
