ThunderX machines cannot reliably git clone due to random gnuTls recv errors #1897
Comments
Problem references:
I'm going to close each of those, as both have now been mitigated: the ThunderX machines appear to be the cause of both and are no longer in use. The openjdk-build issue has a load of history, but it was mostly things that sadly didn't resolve the problem (so switching to this issue as a clean slate, with the history in there, seems like a reasonable course of action). We can continue any investigation towards a resolution here (although it is likely the machines will be decommissioned in the next few months anyway). My plan, which I hadn't created an issue for, was to try an OS upgrade on one of the ThunderX systems (Ubuntu 18.04, for example), but given that the problem has been seen on both CentOS and Ubuntu systems, I'm not convinced that will make a difference if it's hardware related. The only other thing may be to try a TLS implementation built using a different compiler version, in case we're hitting some sort of compiler bug that only affects these systems. |
I'm going to experiment with test-packet-ubuntu1604-armv8-2. Let's see if the problem shows up on there.
No issues with those jobs, although for a period last night I was consistently failing to complete a checkout of openjdk-tests on the machine. After running multiple Grinders, https://ci.adoptopenjdk.net/job/Grinder/6507/console failed (as did the following two; the previous jobs completed OK). Whatever this issue is, it's seemingly only happening at certain times (it's 17:27 as I write this, so the failure was in the last 5 minutes or so).
|
Now failing with this - may need git to be rebuilt (or use the system one)
|
Upgrading to Ubuntu 18 has not resolved this, even going back to the system During the cloning of openjdk-tests I got this on one run:
and on another run:
|
Wrong TLS version or net split? |
It's happening quite frequently and only on this hardware type. The same job, run about 10 times, failed twice, so it's unlikely they've negotiated the wrong TLS version. We've also had it when not checksums (see the build issue referenced above, which appears to be in the same area: failures in crypto code), which wouldn't be affected by a netsplit. |
Random musing ... would it react differently if OpenSSL was built without hardware crypto support (e.g. |
I've now got five loops running in parallel:
Initially I just had the first three, and they all showed the problem at one point (all at around the same ten second period) |
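For anyone wanting to reproduce this kind of soak test, here is a minimal sketch of one such loop. It is not the set of loops actually used above: the function name `repeat_command` and the choice to take the command from the command line are assumptions made for illustration. It reruns a command a fixed number of times and records a UTC timestamp for each failure, so that failures from several parallel loops can be lined up by time.

```python
import subprocess
import sys
import time
from datetime import datetime, timezone


def repeat_command(cmd, iterations, pause_seconds=0.0):
    """Run cmd `iterations` times and return one record per failed run.

    Each record is (UTC timestamp, return code, last line of stderr).
    `repeat_command` is a hypothetical helper, not part of any CI tooling.
    """
    failures = []
    for _ in range(iterations):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
            lines = result.stderr.strip().splitlines()
            failures.append((stamp, result.returncode, lines[-1] if lines else ""))
        time.sleep(pause_seconds)
    return failures


if __name__ == "__main__" and len(sys.argv) > 1:
    # The command to exercise is taken from the command line (assumption:
    # something that goes through the TLS path, such as a git operation).
    for stamp, code, message in repeat_command(sys.argv[1:], iterations=100, pause_seconds=1.0):
        print(f"{stamp} rc={code} {message}")
```

Running the same script in several terminals at once, each with a different command, gives timestamps that can be compared to see whether failures cluster in the same short window.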
Maybe related: adoptium/aqa-systemtest#402 |
Two observations:
On this basis I'm going to set up one of the ThunderX machines with docker containers and attempt to replace the system openssl with one built with |
OK, that experiment didn't quite go as well as I expected. First, some good news: I'd replaced On test-docker-ubuntu2004-armv8-1 On test-docker-ubuntu1804-armv8-1 and test-docker-ubuntu1604-armv8-1 |
These machines will be decommissioned in favour of the Ampere ones being set up as part of #2078 |
ThunderX nodes to be disabled: #1809 (comment)