-
Notifications
You must be signed in to change notification settings - Fork 1.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] OpenSearch 2.0 possible JIT Compiler memory leak or related issue #2791
Comments
Just out of curiosity. Did you try setting
Were you able to see anything in the slow logs (e.g., Your profiling logs from before testing to right before the crash showed the majority of direct memory consumption in the JIT compiler. This is likely just a side effect of some other issue (e.g., a job thread not terminating, etc) but for anyone wanting to dig further into the jvm this sounds something like an offending lambda or functional interface similar to this stackOverflow discussion.
|
Looks like Index Management also depends on job scheduler? I, admittedly, know zero about the job scheduler. Are there trace logs that can be enabled here to ensure jobs aren't going rogue or zombie and continuing after all tests "finish" despite dashboards reporting otherwise? I'm also curious if kotlin is doing some JIT funniness here? Did this behavior occur in 1.3? Does kotlin compiler try to be smart and re-interpret to java functional interfaces or lambdas that might be doing something funny leading to the compiler direct memory explosion? /cc @rmuir @uschindler enjoy digging into the jvm and may have some thoughts |
Yes, it uses job scheduler. I did add logs at one point to confirm no jobs were continuously running on their own, but if there was some method inside of a job that somehow caused it then I could have missed it. I did enable tracelogs the other day while looking, here are two separate times it happened and the last 5000 logs before the node died. Yep, IM is using Kotlin, would like to see if any other plugins are affected at all by this if they run their tests against a cluster and then do not kill the cluster. We have run more tests and so far have only been able to repro on JDK 11, not JDK 17 or 14. I tested once on JDK 11 for 1.3 and did not see it, and also have never seen this specific error previously in general. |
Did you try latest 11.0.14.1 release or an older one? Also I hope the garbage collector is default one and not any of the new ones! |
Is 11.0.2 newer than 11.0.14.1 or is 11.0.14.1 for a different distribution? I see JVM arg: That being said, I am not passing in any new args or customizing any of it on my end, this is the same |
11.0.2 is stone aged. 😆 11.0.14.1 is newest one. The 1 at end is because of a bugfix due to a regression in 11.0.14. |
Ah I think I had a brainfart moment and put a dot between the We can try testing with later version. |
I was running these tests on JDK 11.0.2 on three different ec2 boxes and I was able to reproduce the failure fairly consistently on all of them. When running After upgrading to 11.0.14 I have not yet been able to reproduce the OOM failure while looping the same set of tests for over half an hour. I am going to keep trying for a bit longer and with the full test suite to see if I am able to reproduce the OOM failure in any way on 11.0.14 |
If it does turn out to be fixed w/ newer JDK 11 version.. wondering what our recourse here is. Do we just release OS 2.0 potentially w/ a caveat that says only JDK 11.x.x.x and up is supported? Edit: Seems like Nick mentioned there's a minimum jdk variable we can set so it's more enforced. |
We should probably point people to https://adoptium.net/temurin/archive ... but we also know that OpenSearch works well with other distributions such as Coretto.
I think it's not worth the effort. I recommend closing this issue. If you think something is worth saying, add to https://opensearch.org/docs/latest/opensearch/install/compatibility/? |
Hi,
Generally: yes. I would always tell users to go and use the latest Adoptium/Temurin JDK. Also those shipped with Ubuntu regular updates are good (Ubuntu 20.0.4 LTS has 11.0.14.1). Coretto is also fine! Actually there are no real differences between those variants. It is just the original OpenJDK source code built with custom build scripts and some tools added on top. For Coretto there also seems to be an included "log4j patcher agent" (not 100% sure about if it is bundled, haven't checked), but this one does not work well with security manager as used by OpenSearch, so I would not recommend to enable it. The problem with the OpenJDK home page: They explicitely state on https://jdk.java.net/11/ that it is end of life and you should not use it in production: "The JDK 11 Early Access Program has concluded. Please visit Java SE Downloads for production ready builds. Older releases, which do not include the most up to date security vulnerability fixes and are no longer recommended for use in production, remain available in the OpenJDK Archive." If OpenSearch ships prebuilt binaries/images like Elasticsearch, it should for sure contain the latest OpenJDK version provided by Coretto or Adoptium/Temurin. In the README it should maybe give instructions about where to get a version. I would plain simple say: Use Java 11, but make sure at least 11.0.x (x =14), the one out on the date of relase and the one we tested with. |
Yes. This is not a problem of OpenSearch. It is just buggy runtime. We may improve documentation, but you will always find people using some crazy JDK versions rotting on their machines. This is why I generally recommend to ask users to first update to latest version before submitting bug reports like this one. Maybe add this to the bug report template. |
Yes this can be done. If JDK 8 is no longer the minimum requirement for opensearch, we can for sure use the new Java 9 feature to detect and compare version: https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/lang/Runtime.html#version() This allows to do the startup checks. Parsing system properties is risky, but using those classes makes it very easy to have denylists of versions or enforce minimum versions. In Apache Lucene we also change to this soon. I am not sure if I am allowed to mention this here, the other fork named "Elasticsearch" already implemented support for |
I'll close this. @uschindler Care to open a feature request describing startup checks? OpenSearch 2.0 is min JDK 11 so that can work. |
Will do when at home. The version detection stuff is already available (class |
I opened: #2822 |
Interesting, could it be related to https://bugs.openjdk.java.net/browse/JDK-8259541 ? @dbbaughe out of curiosity, did you enable JVM compiler logging to see what is going on? Thank you. |
Yeah I did, but I was only cross-checking for other similar issues that others had reported.. didn't have enough expertise on those compiler logs to be able to find something on my own without having to spend a significant amount of time. |
Describe the bug
When running Index Management Dashboards cypress tests against an OpenSearch 2.0 cluster, the tests will all pass but if I leave the cluster up for 0-10 minutes then at some point the memory usage starts to spike and go crazy, more specifically the Compiler and Arena Chunk memory grow incredibly fast until my entire instance runs out of memory and then OS kills the process.
I haven't tried this with other plugins yet or OpenSearch by itself, so I can't 100% say this is a problem purely in core and not something in Index Management or Job Scheduler, but so far while debugging what I have done is after the Index Management tests are run, I deleted the
.opendistro-ism-config
index which is where all the IM jobs are located which deschedules anything IM related and it still happened after a few minutes. That along with the major changes done in core vs minimal in Index Management makes me think it's currently something in core, but still trying to root cause it. If no other plugin though is able to replicate this at all, then perhaps it is something in IM still.Here's NMT baseline
https://gist.github.com/dbbaughe/b3eb7457e8380f39db7cfa13ae22c5e7
Here's a diff after it's been spiking and about to go OOM in the next minute.
https://gist.github.com/dbbaughe/85975886d291e8edc0c63bdf85247b72
Went from 5GB committed to 50GB committed, with a 32GB increase in Compiler and 16GB increase in Arena Chunk.
This has happened on my dev machine using JDK11 and Log4jhotpatch disabled.
dmesg:
To Reproduce
Steps to reproduce the behavior: See further below for updated shortened steps..
2.1. While on main branch run
./gradlew run
and make sure JD2.2 If you'd like to see NMT info use
./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=summary"
or./gradlew run -Dtests.jvm.argline="-XX:NativeMemoryTracking=detail
2.3 While cluster is running you can execute
jcmd <pid> VM.native_memory summary
orjcmd <pid> VM.native_memory detail
to see information andjcmd <pid> VM.native_memory baseline
to take a baseline and then compare againstjcmd <pid> VM.native_memory summary.diff
orjcmd <pid> VM.native_memory detail.diff
3.0 Note: I specifically had the backend running on my dev machine and the dashboards running on my mac and ssh tunneled to my dev machine so they could communicate, i.e.
ssh -L 9200:localhost:9200 <dev host>
3.1 cd into plugins directory and pull down Index Management Dashboards
3.2 Make sure your node version is set to 14.18.2 (
nvm use 14.18.2
)3.3 Run
yarn osd bootstrap
while inside the Index Management Dashboards directory3.4 Once that's done, from the OpenSearch Dashboards directory run
yarn start --no-base-path --no-watch
3.5 Once OpenSearch Dashboards is up then from the Index Management Dashboards plugin run
yarn run cypress run
to run the cypress test suite.3.6 After the cypress tests are done... usually it's anywhere from 0 to 10 minutes until memory starts spiking and then goes OOM after a short while. During this time you can run the jcmd commands above to see if the committed memory is growing.
I'm trying to see if I can scope down the steps needed to reproduce to just some backend API calls.
Leaving the cluster up by itself with no calls to it did not cause it to run OOM eventually.
Looks like I was able to also reproduce by just running our backend integ tests against the cluster now..
So shortened steps would be steps 1 and 2 above and then just:
./gradlew integTestRemote -Dtests.rest.cluster=localhost:9200 -Dtests.cluster=localhost:9200 -Dtests.clustername=integTest
From another terminal in the same directory (Index Management).
This wasn't able to make it through IM tests now.. but I believe we do have CI running in Index Management.. checking it looks like a recent failure from 17 hours ago for a commit merged to main.
https://github.com/opensearch-project/index-management/actions/runs/2099364761
Expected behavior
Not going OOM
Plugins
Index Management / Job Scheduler
The text was updated successfully, but these errors were encountered: