[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign failing #88337

mark-vieira · 2022-07-07T00:50:27Z

There have been two instances of this today (both in PR builds) that blew up due to an OutOfMemoryError (OOME).

Build scan:
https://gradle-enterprise.elastic.co/s/ryfsljqpbyjlm/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign

Reproduction line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign" -Dtests.seed=A3E13461A939BDF6 -Dtests.locale=ar-IQ -Dtests.timezone=America/Porto_Velho -Druntime.java=17

Applicable branches:
master

Reproduces locally?:
No

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT&tests.test=testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign

Failure excerpt:

java.lang.Exception: Test abandoned because suite timeout was reached.

  at __randomizedtesting.SeedInfo.seed([A3E13461A939BDF6]:0)


elasticmachine · 2022-07-07T00:50:30Z

Pinging @elastic/ml-core (Team:ML)

benwtrent · 2022-07-07T17:31:48Z

So, looking at the heap dump:

A new object added in #88097 is taking 250MB of a ~500MB heap:

x-pack/plugin/ilm/src/main/java/org/elasticsearch/xpack/ilm/ILMImmutableStateHandlerProvider.java

@grcevski what does this provider do?
You can get the heap dump from the test results (or I can provide it directly; it's much too big to attach here).

grcevski · 2022-07-07T23:10:37Z

Oh, that's unexpected. I guess I have a bug there; there should be only one such object at the moment. It appears that the ILM plugin is reloaded many times, and each time we add that immutable state handler again. I'll take this over to fix.
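
For readers following along, here is a minimal sketch of the kind of leak described above: a handler registered into a static, JVM-wide collection every time a plugin instance is created. All names here (`HandlerLeakSketch`, `Handler`, `LeakyPlugin`, `FixedPlugin`) are hypothetical and are not the actual Elasticsearch classes; the real code and the real fix may look different.

```java
import java.util.ArrayList;
import java.util.List;

public class HandlerLeakSketch {
    /** Hypothetical stand-in for a handler that keeps some state alive. */
    static final class Handler {
        final byte[] state = new byte[1024];
    }

    /** Leaky pattern: a static list shared by every plugin instance in the JVM. */
    static final List<Handler> LEAKY_HANDLERS = new ArrayList<>();

    /** Each construction (for example, each test node start) adds one more handler. */
    static final class LeakyPlugin {
        LeakyPlugin() {
            LEAKY_HANDLERS.add(new Handler());
        }
    }

    /** One possible alternative: the handler list is owned by the instance. */
    static final class FixedPlugin {
        private final List<Handler> handlers = new ArrayList<>();

        FixedPlugin() {
            handlers.add(new Handler());
        }

        List<Handler> handlers() {
            return handlers;
        }
    }

    /** Creates n leaky plugins and reports how many handlers are still reachable. */
    static int leakyCountAfter(int n) {
        LEAKY_HANDLERS.clear();
        for (int i = 0; i < n; i++) {
            new LeakyPlugin();
        }
        return LEAKY_HANDLERS.size();
    }

    /** Creates n fixed plugins and reports the handlers held by the last one. */
    static int fixedCountAfter(int n) {
        FixedPlugin last = null;
        for (int i = 0; i < n; i++) {
            last = new FixedPlugin();
        }
        return last == null ? 0 : last.handlers().size();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + leakyCountAfter(100)); // prints "leaky: 100"
        System.out.println("fixed: " + fixedCountAfter(100)); // prints "fixed: 1"
    }
}
```

In the leaky variant the handler count grows with the number of plugin instances ever created, which in an integration suite that starts many nodes in one JVM can add up to a large share of the heap. In the instance-owned variant, older handlers become unreachable together with their plugin.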

elasticmachine · 2022-07-07T23:10:55Z

Pinging @elastic/es-core-infra (Team:Core/Infra)

davidkyle · 2022-07-11T12:23:06Z

A different test in the same suite failed, and the failure is a ResourceNotFoundException, but looking at the logs the JVM is spending nearly all of its time garbage collecting, so I'm guessing the root cause is the same.

https://gradle-enterprise.elastic.co/s/i4badycr5nt5k/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_StopsDatafeed_GivenJobFailsOnReassign
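
As a rough way to confirm a "nearly all the time in GC" observation from inside a JVM rather than from logs, the standard `java.lang.management` beans can be queried. This is only a sketch: the class name `GcOverheadSketch` is hypothetical, the reported collection times are approximate, and it is not how the logs above were produced.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcOverheadSketch {
    /** Sums the approximate accumulated collection time over all collectors. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** Approximate fraction of JVM uptime spent collecting, clamped to [0, 1]. */
    static double gcFraction() {
        long uptimeMillis = ManagementFactory.getRuntimeMXBean().getUptime();
        if (uptimeMillis <= 0) {
            return 0.0;
        }
        return Math.min(1.0, (double) totalGcMillis() / uptimeMillis);
    }

    public static void main(String[] args) {
        System.out.printf("GC time: %d ms, fraction of uptime: %.3f%n",
            totalGcMillis(), gcFraction());
    }
}
```

A fraction close to 1 over a sustained period is consistent with a heap that is almost full of reachable objects, which is what a leak like the one above would produce.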

mark-vieira added the :ml and >test-failure labels Jul 7, 2022

elasticmachine added the Team:ML label Jul 7, 2022

grcevski self-assigned this Jul 7, 2022

grcevski added the Team:Core/Infra label Jul 7, 2022

grcevski removed the Team:ML label Jul 7, 2022

grcevski mentioned this issue Jul 8, 2022

Fix test memory leak #88362

Merged

grcevski closed this as completed in #88362 Jul 12, 2022
