JLM remote tests failing on s390x openj9-openjdk8 #360
Comments
Looks like this may be the same as #246
So the 'server' process can be fixed by increasing the workload (the number of tests to run). The 'client' process has this loop:
The intent appears to be for this loop to take 5 minutes to complete (30 * 10 seconds). However, examining the job output shows this loop can take much longer than the anticipated 5 minutes, owing to the amount of time it takes to write the test data. On an aarch64 test, this gave the following output:
So the 'Writing report data' step in the run takes an average of about 8 secs, which would be 30 * 8 = 240 secs for the whole test. Adding in the 300 one-second sleeps gives a total run time for the loop of 9 mins (300 + 240 secs), a lot more than the expected 5 mins. In this test run the server process had been set to stop running new tests after 8 minutes; it had therefore completed and, as far as the test was concerned, 'ended unexpectedly'.
The server side process currently has a time limit of 30 minutes, but the number of tests specified might mean it takes nowhere near that time to run them all. On the s390x machine it took about 5 minutes, on the aarch64 machine about 8 minutes. In both cases this was less than the amount of time the client process took to run, so the test failed.
Once the client side process has ended and the test has confirmed that the server side process is still running, the server process is killed. So it shouldn't matter if the workload (number of tests) specified would actually take longer than the 30 minute time limit, because the process will be killed as soon as the client side process completes.
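To make the timing argument concrete, here is a minimal sketch of the arithmetic using the figures quoted above. The function names are hypothetical and are not part of the test code. It assumes the loop does 30 iterations, each with 10 seconds of sleeping, plus a per-iteration write cost.

```python
def client_loop_seconds(iterations, sleep_per_iteration, write_per_iteration):
    """Total time for the client loop: each iteration sleeps, then writes report data."""
    return iterations * (sleep_per_iteration + write_per_iteration)


def server_ends_early(server_seconds, client_seconds):
    """The test fails when the server finishes before the client loop does."""
    return server_seconds < client_seconds


# Intended duration: 30 iterations of 10 seconds, ignoring the write cost.
expected = client_loop_seconds(30, 10, 0)

# Duration on the aarch64 run: roughly 8 seconds of report writing per iteration.
observed = client_loop_seconds(30, 10, 8)

print(expected, observed)                      # 300 540
print(server_ends_early(8 * 60, observed))     # True: 8 minute server run, 9 minute client loop
print(server_ends_early(30 * 60, observed))    # False: a server filling its 30 minute limit outlasts the client
```

The last line is the reason a larger workload is safe: the server only has to outlast the client, and it is killed once the client completes.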
@Mesbah-Alam, @smlambert - could you perhaps take a look at the analysis and the PRs, and see if what I've done looks like a reasonable first step to cleaning up these jlm failures?
@lumpfish - could you please explain this a bit more? If we un-exclude all the jlm tests and run all of them with #361, do we see all of them pass? I think the final step of this clean-up exercise would be to consolidate the excludes in the playlist and only keep those excludes that are tied to issues not related to the current one (e.g. test timing out).
I'll try putting it differently:
I've approved #361
Should those changes be done before we merge adoptium/aqa-tests#1948?
I don't think that's necessary. There are no changes in the playlist for those tests so they'll continue to run as they are now. We may as well get the benefit for the non openj9 specific tests first (or find any new issues with the change!).
adoptium/aqa-tests#1948 is also approved. Please merge the two PRs. Thanks a lot!
Fixed by #361
Test targets affected in this run https://ci.adoptopenjdk.net/job/Test_openjdk8_j9_sanity.system_s390x_linux_xl/205/console were:
Looking at the TestJlmRemoteClassNoAuth failure, this is what appears to be going on:
The output from the test step which fails is:
... and the reason:
a. connects to the server
b. enters a loop which should take 300 secs to complete:
19:50:30.042 - Starting thread. Suite=0 thread=0
19:50:30.044 - Starting thread. Suite=0 thread=1
19:50:30.044 - Starting thread. Suite=0 thread=2
19:50:30.044 - Starting thread. Suite=0 thread=3
19:50:30.044 - Starting thread. Suite=0 thread=4
19:50:30.044 - Starting thread. Suite=0 thread=5
19:50:30.044 - Starting thread. Suite=0 thread=6
19:50:30.044 - Starting thread. Suite=0 thread=7
19:50:30.044 - Starting thread. Suite=0 thread=8
19:50:30.045 - Starting thread. Suite=0 thread=9
19:50:30.045 - Starting thread. Suite=0 thread=10
19:50:30.045 - Starting thread. Suite=0 thread=11
19:50:30.045 - Starting thread. Suite=0 thread=12
19:50:30.045 - Starting thread. Suite=0 thread=13
19:50:30.045 - Starting thread. Suite=0 thread=14
19:50:30.045 - Starting thread. Suite=0 thread=15
19:50:30.045 - Starting thread. Suite=0 thread=16
19:50:30.046 - Starting thread. Suite=0 thread=17
19:50:30.046 - Starting thread. Suite=0 thread=18
19:50:30.046 - Starting thread. Suite=0 thread=19
19:50:30.046 - Starting thread. Suite=0 thread=20
19:50:30.046 - Starting thread. Suite=0 thread=21
19:50:30.046 - Starting thread. Suite=0 thread=22
19:50:30.046 - Starting thread. Suite=0 thread=23
19:50:30.046 - Starting thread. Suite=0 thread=24
19:50:30.046 - Starting thread. Suite=0 thread=25
19:50:30.047 - Starting thread. Suite=0 thread=26
19:50:30.047 - Starting thread. Suite=0 thread=27
19:50:30.047 - Starting thread. Suite=0 thread=28
19:50:30.047 - Starting thread. Suite=0 thread=29
19:50:50.046 - Completed 4.1%. Number of tests started=36949
19:51:10.081 - Completed 10.7%. Number of tests started=96540 (+59591)
19:51:30.084 - Completed 18.4%. Number of tests started=165692 (+69152)
19:51:50.131 - Completed 26.7%. Number of tests started=240584 (+74892)
19:52:10.046 - Completed 35.0%. Number of tests started=314749 (+74165)
19:52:30.042 - Completed 44.4%. Number of tests started=399346 (+84597)
19:52:50.075 - Completed 53.6%. Number of tests started=482697 (+83351)
19:53:10.121 - Completed 63.2%. Number of tests started=568508 (+85811)
19:53:30.094 - Completed 71.8%. Number of tests started=645908 (+77400)
19:53:50.110 - Completed 80.6%. Number of tests started=725710 (+79802)
19:54:10.049 - Completed 89.7%. Number of tests started=807530 (+81820)
19:54:30.095 - Completed 99.6%. Number of tests started=896148 (+88618)
19:54:31.038 - Thread completed. Suite=0 thread=8
19:54:31.039 - Thread completed. Suite=0 thread=24
19:54:31.039 - Thread completed. Suite=0 thread=18
19:54:31.039 - Thread completed. Suite=0 thread=13
19:54:31.040 - Thread completed. Suite=0 thread=29
19:54:31.040 - Thread completed. Suite=0 thread=15
19:54:31.041 - Thread completed. Suite=0 thread=17
19:54:31.041 - Thread completed. Suite=0 thread=4
19:54:31.041 - Thread completed. Suite=0 thread=28
19:54:31.041 - Thread completed. Suite=0 thread=12
19:54:31.041 - Thread completed. Suite=0 thread=22
19:54:31.041 - Thread completed. Suite=0 thread=9
19:54:31.042 - Thread completed. Suite=0 thread=14
19:54:31.043 - Thread completed. Suite=0 thread=2
19:54:31.043 - Thread completed. Suite=0 thread=19
19:54:31.043 - Thread completed. Suite=0 thread=3
19:54:31.043 - Thread completed. Suite=0 thread=20
19:54:31.044 - Thread completed. Suite=0 thread=1
19:54:31.045 - Thread completed. Suite=0 thread=26
19:54:31.045 - Thread completed. Suite=0 thread=7
19:54:31.045 - Thread completed. Suite=0 thread=5
19:54:31.046 - Thread completed. Suite=0 thread=11
19:54:31.047 - Thread completed. Suite=0 thread=0
19:54:31.048 - Thread completed. Suite=0 thread=25
19:54:31.039 - Thread completed. Suite=0 thread=16
19:54:31.048 - Thread completed. Suite=0 thread=27
19:54:31.049 - Thread completed. Suite=0 thread=10
19:54:31.050 - Thread completed. Suite=0 thread=6
19:54:31.053 - Thread completed. Suite=0 thread=21
19:54:31.055 - Thread completed. Suite=0 thread=23
19:54:31.104 - Load test completed
19:54:31.104 - Ran : 900000
19:54:31.104 - Passed : 900000
19:54:31.104 - Failed : 0
19:54:31.104 - Result : PASSED
So it seems that the workload being run by the server is not of sufficient size to guarantee that it will take >300 seconds in this s390x environment.
@Mesbah-Alam - does my analysis look reasonable?