Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

JTReg jdk11-m2 Failure: TestJlmRemoteMemoryAuth_0 #9046

Open
M-Davies opened this issue Mar 31, 2020 · 24 comments
Open

JTReg jdk11-m2 Failure: TestJlmRemoteMemoryAuth_0 #9046

M-Davies opened this issue Mar 31, 2020 · 24 comments

Comments

@M-Davies
Copy link

M-Davies commented Mar 31, 2020

Failure link

  • test category, sanity.system
  • OS/architecture, aarch_64
openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+9)
Eclipse OpenJ9 VM AdoptOpenJDK (build openj9-0.20.0-m2, JRE 11 Linux aarch64-64-Bit 20200330_133 (JIT enabled, AOT enabled)
OpenJ9   - c93e4dabc
OMR      - 1b6abd044
JCL      - 644e9abfc0 based on jdk-11.0.7+9)

https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.system_aarch64_linux_xl/73

Optional info

Failure output (captured from console output)

STF 16:44:08.898 - +------ Step 5 - Wait for the processes to complete
STF 16:44:08.899 - | Wait for processes to meet expectations
STF 16:44:08.899 - |   Processes: [LT1, CL1]
STF 16:44:08.899 - |
STF 16:44:08.899 - Monitoring processes: CL1 LT1
CL1 j> 2020/03/30 16:44:09.614 ServerURL=service:jmx:rmi:///jndi/rmi://localhost:1234/jmxrmi
CL1 j> 2020/03/30 16:44:15.507 Monitored VM not ready at Mar 30, 2020, 4:44:15 PM (attempt 0).
CL1 j> 2020/03/30 16:44:15.509 Wait 5 secs and trying again...
CL1 j> 2020/03/30 16:44:18.157 Connection established!
CL1 j> 2020/03/30 16:44:18.811 Starting to write data
CL1 ...j> 2020/03/30 16:44:51.548 Exiting as JVM we are connected to has finished
CL1 stderr Exception in thread "main" java.lang.AssertionError
CL1 stderr 	at org.junit.Assert.fail(Assert.java:86)
CL1 stderr 	at org.junit.Assert.fail(Assert.java:95)
CL1 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaProxy(MemoryProfiler.java:208)
CL1 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:110)
STF 16:44:51.784 - **FAILED** Process CL1 ended with exit code (1) and not the expected exit code/s (0)
STF 16:44:51.784 - Monitoring Report Summary:
STF 16:44:51.785 -   o Process CL1 ended with exit code (1) and not the expected exit code/s (0)
STF 16:44:51.785 -   o Process LT1 is still running as expected
STF 16:44:51.792 - Killing processes: CL1 LT1
STF 16:45:02.796 -   o Process LT1 killed
STF 16:45:02.796 -   o Process CL1 is not running
**FAILED** at step 5 (Wait for the processes to complete). Expected return value=0 Actual=1 at /home/jenkins/workspace/Test_openjdk11_j9_sanity.system_aarch64_linux_xl/openjdk-tests/TKG/../TKG/test_output_15855859707042/TestJlmRemoteMemoryAuth_0/20200330-164402-TestJlmRemoteMemoryAuth/execute.pl line 175.
STF 16:45:02.894 - **FAILED** execute script failed. Expected return value=0 Actual=1
@pshipton
Copy link
Member

@knn-k

@knn-k
Copy link
Contributor

knn-k commented Mar 31, 2020

https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_Personal/2/ was the last time I saw this failure.
Possibly an intermittent problem.

@M-Davies
Copy link
Author

M-Davies commented Apr 2, 2020

yeah it seems intermittent. Couldn't reproduce in a singular grinder https://ci.adoptopenjdk.net/view/Test_grinder/job/Grinder/2721/

@knn-k
Copy link
Contributor

knn-k commented Apr 3, 2020

I ran this test 10 times manually on cent7-aarch64-1, and saw no failures.
It takes 11 minutes to run a single round of this test on the test server.

@M-Davies
Copy link
Author

M-Davies commented Apr 3, 2020

I managed to recreate this.

Again, its on test-packet-armv8-ubuntu-16-04 (See #9047 (comment) and #9048 (comment))

@knn-k
Copy link
Contributor

knn-k commented Apr 8, 2020

I ran this test 20 times on test-packet-armv8-ubuntu-16-04, but I couldn't reproduce the failure.
I used the following version:

OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+9-202004062324)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-956678929, JRE 11 Linux aarch64-64-Bit Compressed References 20200406_263 (JIT enabled, AOT enabled)
OpenJ9   - 956678929
OMR      - 9d422b0b0
JCL      - 634f577760 based on jdk-11.0.7+9)

@knn-k
Copy link
Contributor

knn-k commented Apr 9, 2020

I managed to reproduce this on cent7-aarch64-1.

  • There are two Java processes (LT1, CT1) running at the same time in this test
  • LT1 runs a task with 30 threads
  • CT1 exits with the following exception in the file 4.CL1.stderr:
Exception in thread "main" java.lang.AssertionError
        at org.junit.Assert.fail(Assert.java:86)
        at org.junit.Assert.fail(Assert.java:95)
        at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaProxy(MemoryProfiler.java:208)
        at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:110)

@knn-k
Copy link
Contributor

knn-k commented Apr 10, 2020

The failure was recreated on cent7-aarch64-1 and on test-packet-armv8-ubuntu-16-04, but it takes long time and the rate is around 1/10 - 1/20.

I found this test failed with UnmarshalException caused by SSLException in the CL1 process shown below:

Exception in thread "main" java.lang.AssertionError: UnmarshalException:
Error unmarshaling return header; nested exception is:
        javax.net.ssl.SSLException: Tag mismatch!
        at org.junit.Assert.fail(Assert.java:88)
        at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
Exception in thread "main" java.lang.AssertionError: UnmarshalException:
Error unmarshaling return header; nested exception is:
        javax.net.ssl.SSLException: Received fatal alert: bad_record_mac
        at org.junit.Assert.fail(Assert.java:88)
        at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)

My colleague and I ran the test repeatedly on those test servers with the following options:

  1. -Xint on cent7-aarch64-1 (16 runs)
  2. -Djdk.nativeCrypto=false on test-packet-armv8-ubuntu-16-04 (50+ runs)

We couldn't recreate failures with either of these options.
JIT or native OpenSSL support may be the root cause, but it is difficult to conclude from these results.

@knn-k
Copy link
Contributor

knn-k commented Apr 10, 2020

Both servers above have 96 CPU cores and OpenSSL 1.0.2. We used the following OpenJ9 build, which contains the fix for multi-threading with OpenSSL 1.0.2 ibmruntimes/openj9-openjdk-jdk11#283 :

openjdk version "11.0.7" 2020-04-14
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+9-202004091659)
Eclipse OpenJ9 VM AdoptOpenJDK (build master-c6b594b6c, JRE 11 Linux aarch64-64-Bit Compressed References 20200409_277 (JIT enabled, AOT enabled)
OpenJ9   - c6b594b6c
OMR      - dba60e491
JCL      - 58d6fa1862 based on jdk-11.0.7+9)

I haven't been able to recreate this failure on a quad-core machine with OpenSSL 1.1.1.

@pshipton
Copy link
Member

pshipton commented Apr 15, 2020

https://ci.eclipse.org/openj9/view/Test/job/Grinder/796
0.20.0 XL cent7-aarch64-1
TestJlmRemoteMemoryAuth_0

@0xdaryl
Copy link
Contributor

0xdaryl commented Jun 18, 2020

This issue is difficult to reproduce (only been seen on a large core machine) and may be related to the version of OpenSSL (for example, we have not seen this problem on the machine on which it failed running with OpenSSL 1.1.1). We'll leave it open for a few more days to try and collect more data, but given that AArch64 is still early access there is a good chance this will be deferred.

@knn-k
Copy link
Contributor

knn-k commented Jun 24, 2020

#9769 reports a failure with SSLException: Tag mismatch! with JDK8 OpenJ9 v0.20.0 on amd64.

@pshipton
Copy link
Member

0.21.0 m2 build https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_Personal/5/

Error unmarshaling return header; nested exception is: 
	javax.net.ssl.SSLException: Tag mismatch!
	at org.junit.Assert.fail(Assert.java:88)
	at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)

@pshipton
Copy link
Member

0.21.0 m2 build https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_sanity.system_aarch64_linux_xl_Personal/2

CL2 stderr Exception in thread "main" java.lang.AssertionError: UnmarshalException: 
CL2 stderr Error unmarshaling return header; nested exception is: 
CL2 stderr 	javax.net.ssl.SSLException: Received fatal alert: bad_record_mac
CL2 stderr 	at org.junit.Assert.fail(Assert.java:88)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.recordMemoryStats(MemoryProfiler.java:307)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:255)
CL2 stderr 	at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)

@knn-k
Copy link
Contributor

knn-k commented Jun 29, 2020

I ran the test on cent7-aarch64-1 50 times with -Djdk.nativeDigest=false, and the exception was reproduced as follows. OpenSSL version on the machine is 1.0.2k:

CL1 stderr MessageDigest load - Native crypto library disabled.
CL1 stderr Warning: Native ChaCha20 load failed. Need OpenSSL 1.1.0 or above for ChaCha20 support. Using Java crypto implementation
CL1 stderr CipherCore Load - using native crypto library.
CL1 stderr RSACore load - using Native crypto library.

CL2 stderr Error unmarshaling return header; nested exception is:
CL2 stderr      javax.net.ssl.SSLException: Tag mismatch!
CL2 stderr      at org.junit.Assert.fail(Assert.java:88)
CL2 stderr      at net.adoptopenjdk.test.jlm.resources.MemoryData.writeData(MemoryData.java:480)
CL2 stderr      at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.getStatsViaServer(MemoryProfiler.java:251)
CL2 stderr      at net.adoptopenjdk.test.jlm.remote.MemoryProfiler.main(MemoryProfiler.java:112)

@DanHeidinga
Copy link
Member

Moving this forward as we've completed the milestone 2 builds for 0.21.0 and it's too late to put this in.

@lumpfish
Copy link
Contributor

This is occurring intermittently on AdoptOpenJDK nightly builds - e.g. https://ci.adoptopenjdk.net/job/Test_openjdk11_j9_sanity.system_aarch64_linux_xl/207/consoleFull

I have gone through the most recent runs and only found it occurring on machines test-packet-ubuntu1604-armv8-1 and test-packet-ubuntu1604-armv8-2 - i.e. Ubuntu 16.04 machines. I did not see any failures on the other AdoptOpenJDK test machines running rhel7.6

@pshipton
Copy link
Member

Do we need this in the milestone plan? I'll remove it.

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 14, 2020

Based on the investigations so far, the strong suspicion on this one is that it is an OpenSSL issue. We haven't found any evidence to suggest it is an OpenJ9 problem, hence deferring is the recommended approach.

One useful datapoint would be the version of OpenSSL found on the Ubuntu 16.04 boxes at AdoptOpenJDK and the RHL 7.6 boxes.

@0xdaryl
Copy link
Contributor

0xdaryl commented Dec 14, 2020

The AArch64 builds use --with-openssl=system despite fetching the latest OpenSSL version 1.1.1i. I'm wondering if this failure can be reproduced if we change the build to use --with-openssl=fetched to ensure we're using the latest regardless of the machine we're running on and what version of OpenSSL happens to be installed there.

@pshipton
Copy link
Member

pshipton commented Dec 14, 2020

Perhaps so. The reason this option isn't used now is so the OS or user is responsible for providing openssl updates. Otherwise Adopt is on the hook in case some out of band update is required. Also, the only way to get an update for a bundled library is by getting the latest version from Adopt, or (unsupported) manually copying the latest library to an older build.

@sxa
Copy link
Contributor

sxa commented Feb 5, 2021

Just to add something extra to this - the problem appears to only be shown on the ThunderX aarch64 systems (Not sure which OpenJ9 have but if it's the 96-core ones then it's likely to be ThunderX) so it's possible that a fix will be out of our control if it's hardware related. I may try upgrading one of ours to a later level (I'll do it under but I'm not overly confident that it will really change anything.

Current infrastructure issue at adoptopenjdk is adoptium/infrastructure#1897 - it may ultimately by a wontfix but this infromation at least defines the scope of the issue.

@pshipton
Copy link
Member

pshipton commented Feb 5, 2021

Re-reading Daryl's comment #9046 (comment) I realized I misunderstood the options and my earlier comment was bogus. What is the openssl version available on the build system? Considering it only fails on one target, it's likely the fault of the openssl installed on the target, but I'm not sure of the side affects of compiling against an out-dated openssl version.

@sxa
Copy link
Contributor

sxa commented Feb 9, 2021

As per recent updates on the infrastructure issue I would not assume it's related to the version that OpenJ9 is being linked with as I'm able to reproduce the same failures using the latest openssl111i codebase outside OpenJ9 on multiple ThunderX systems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

7 participants