-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Java InterruptedException
at end of Python session
#38
Comments
So, first of all, my assessment in the original post was, to some extent, wrong: The problem was not reliably solved with that development version (1.2.2-dev0) of JPype. It seems (now) that (back then) I did not repeat the tests enough times for (new) problems to become evident. Mea culpa. On that note, as far as I can tell there is no observable difference in behavior between that dev version and the current JPype release, 1.3. I did, at some point, go back to the old dev build just to verify that. To recap… In the current release of MPh, 1.0.4, we force a "hard exit" of the Java VM by calling The idea, therefore, was to leverage the new config options introduced in JPype 1.3, in particular by setting However, in about 1 in 10 runs on Linux and 2 out of 10 on Windows (according to my tests) and something like 6 out of 10 on MacOS (according to Max), we now get access violations / segmentation faults. I've narrowed it down to the following minimal test script: import mph
client = mph.start()
model = client.create('empty')
parameters = model/'parameters'
(parameters/'Parameters 1').rename('parameters') All the script does is, it starts the Comsol client (a Java singleton), creates an empty model (a Java container object), and renames one of the model nodes that Comsol creates automatically from its default name to something else. Notably, no I/O is involved. On Windows, it would actually suffice to just start the client in order to observe the crashes, but on Linux I only see them if I also modify the model object, if only so slightly. I then wrote a second script that runs the above test a number of times as a subprocess and reports non-zero exits codes as well as any unexpected output on On Windows, the test script, the 10 or so percent of the time that it crashes, will exit with error code 1, report an access violation on On Linux (Ubuntu 20.04 LTS), the test script exits with code -6 and For MacOS, Max has reported in Chat that the behavior is pretty much the same as on Windows, crashes (due to a So the error manifests itself somewhat differently across platforms, but is probably the same in nature. Additionally, I can no longer create the coverage report successfully. Something strange is happening there, in that pyTest, towards the end of the test suite's run (when subprocesses are tested), seems to forget its own configuration and somehow adds the test folder to the code coverage, even though that folder should be excluded. (We only want to measure coverage of the actual library code.) I have no explanation for this, and no solution either. It's just frustrating that something so basic doesn't work as expected. It is certainly related to the crashes described above, but may also have something to do with pyTest turning on "faulthandlers" in its own exit handler. I tried to work around this, but to no avail. At this point, I'm quite pessimistic this issue will ever be resolved. I could try and narrow down further what triggers the crashes, like having only direct calls to the JPype API in the test script. And maybe activate tracing to gain further insight. But in the end, Comsol remains a fairly opaque black box to us, and we cannot expect any "upstream" support whatsoever. But if you have any ideas, @Thrameos, I could try and dig a little deeper. It should also be noted that Comsol itself is not the most stable piece of software. There have been many bugs in the past, and certainly many which still remain, that would cause even the GUI to crash and leave one of those fatal error log files in the user's working directory. So this is not unheard of, and it could all just be Comsol's fault somehow. In order to get unstuck, however, I think we should leave things as they are now (and have been for a good while), i.e. keep performing the hard JVM exit and hold on to all the work-arounds preserving the exit code. The |
I am not aware of any good solution here. The best would simply if Comsol client had a shutdown routine that the user can call the would politely terminate any resources so that you can proceed to destroying the JVM. But then needs to be a defined API. My general opinion is the source of crashes are resources that where something contains both Python and Java on the stack at the same time shutting down pulls the rug out and thus some method fails and has no way to return as the caller is no longer available. But as I don't see any usage of JImplements/JProxy in the example that doesn't really fit. I can give some details on the shutdown sequence for Python/JVM. It is of course very complicated as it depends how the sequence was triggered. If shutdown was invoked manually then it is straight forward to work out the order of operations, and you may be able to manually run the Python atexit routines before the shutdown proceeds. If called as part of the Python shutdown then it gets complicated as Python is processing the atexit routines and then we start terminate around it. Other than exit() the next best thing would be for the JVM to not get terminated at all. If we disable the shutdown hook on the atexit then the Python will shut itself down cleanly which then calls exit. Unfortunately if Java calls a Python method during that period then it will fail, though perhaps there is a way to determine if were the case and just fail the method with an exception. Once Python is complete it will terminate the process with exit. No Java shutdown, close, or atexit routines would take place. Unfortunately as this is proprietary software, there is very little advice I can give. If you know exactly where the Java code were going wrong and you need to fix a bug in Comsol itself, then your could use a classloader and asm library to "patch" the method to say catch the exception or handle termination gracefully. But that is a lot of effort. I did something similar to patch byte codes to catch NaN in a variable, but the effort and knowledge level is very high and only works if the library were pure Java. A mixed code with native components would be pretty much hopeless. |
Thanks for the explanations. Yeah, Comsol does have native components beside the Java code. Lots, actually. There's almost certainly no API method to terminate the client: I've spent hours with introspection of the Java object hierarchy. I don't see how Java would ever call anything Python, the way everything is set up. All I know is, without the explicit But I think I'll give it one more shot and try come up with a simple demo that does not use any MPh code, only direct JPype calls. Should be easier to reason about… |
I did some testing here and there and found this really hard to reproduce consistently - thus my idea of anything happening in COMSOL which just needs more time. This is in agreement what we saw with earlier versions. I have little to no hopes that anything will happen on the COMSOL side. They are heavily blocking anything that is not Mathworks for their API and in recent time are pretty clear that they have obligations to do so. Regarding bugs, I can confirm. COMSOL is just a massive black box and everything that comes out there needs more than serious control (guess thats the case with any (commercial) simulation software anyway). To move forward, I would reinforce the thought of not trying to fix too much which is not fixable on our end. Rather spend time and motivation to improve the functionality and stability on our side. I also think it is better to keep the exit routines and the above exception, which occurs rarely, instead of having the SEGFAULT and error log files. |
I am still not clear on the issue. What specific options leads to a SEGFAULT and what leads to the exception? The current path for shutdown.
So the easiest path here is to add a user installed hook that simply waits for 0.5 seconds so that anything else has time to clear before it proceeds from poking it with the sleep interrupted. |
Example. This program adds a hook that adds one second to the shutdown sequence to avoid a race condition.
You can take other actions like scanning the comsol threads to find a specific thread and send it the interrupt signal, then wait etc. Basically I tried to make it so that something with a complex race condition can be resolved by adding a hook that allowing the user to force the issue a particular direction. |
So I spent some more time on this, and I've come to the conclusion that it must be some kind of bug in that Java VM which ships with Comsol. Here is the test script I used. It makes all calls into the Comsol class library directly via JPype, no MPh business. import jpype
import jpype.imports
import jpype.config
print('# Configuring JPype.')
jpype.config.onexit = True # default: True
jpype.config.destroy_jvm = False # default: True
jpype.config.free_resources = True # default: True
print('# Starting JVM.')
root = 'C:/Program Files/COMSOL/COMSOL56/Multiphysics'
jvm = root + '/java/win64/jre/bin/server/jvm.dll'
jars = root + '/plugins/*'
jpype.startJVM(jvm, classpath=jars)
print('# Importing client.')
from com.comsol.model.util import ModelUtil
print('# Initializing client.')
ModelUtil.initStandalone(True)
print('# Creating model.')
model = ModelUtil.createUnique('model')
print('# Shutting down.') I have a second script that repeats the test until a crash occurs. On Windows, the test script crashes with a segmentation fault about 1 out of 10 times. Here, like in the MPh library code, we start Comsol's own Java VM, which is a recent build of the old JDK 8. C:\Program Files\COMSOL\COMSOL56\Multiphysics\java\win64\jre\bin
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.265-b01, mixed mode) However, no crashes occur (over more than 100 runs) with the current JDK 16. C:\programs\Java\bin
$ java -version
openjdk version "16.0.2" 2021-07-20
OpenJDK Runtime Environment Temurin-16.0.2+7 (build 16.0.2+7)
OpenJDK 64-Bit Server VM Temurin-16.0.2+7 (build 16.0.2+7, mixed mode, sharing) So, well, there's that. Ultimately though, it is some kind of design flaw in Comsol's architecture. The only reason we have to skip the In issue #15, specifically this comment, I overlooked that that's what's going on. It's possible to stop the non-daemon threads, as I observed there. They are in fact dead. But if we call I've also just now noticed that the same happens with pure Java code. The Java equivalent of the above Python test script is this: import com.comsol.model.*;
import com.comsol.model.util.*;
public class test {
public static void main(String[] args) {
System.out.println("# Initializing client.");
ModelUtil.initStandalone(false);
System.out.println("# Creating model.");
Model model = ModelUtil.create("model");
System.out.println("# Shutting down.");
}
} It is compiled with
Adding the line Now it's too bad that skipping the For the record, here is the trace log I get when the test script crashes: trace 1. I don't know how accurate these logs are towards the very end. I did run Python in unbuffered mode (with the Anyway, I think this is case closed. Ultimately it is Comsol's fault and there's not much we can do about it, except adding a dependency on a newer Java run-time, which I don't want to do. It should work with Comsol's very own JVM build. Importantly, @Thrameos , the issue is not with JPype. The only reason it once worked with an earlier version, I believe, is that the shutdown procedure was less clean back then, so it glossed over this issue. |
Okay good to hear that the options that I provided are helping. It is likely that offending code is simply a loop which catches all exception and then attempts to restart the threads which would typically happen if someone wants to make code respawn no matter what when wrong. Given your heroic effort in tracking down the issue, there isn't anything else that can be done. Their documentation makes it pretty clear there is no shutdown procedure other than pulling the plug with System.exit. You may want to see if there is a way to disable the JPype shutdown sequence entirely and add an atexit hook that calls the System.exit directly. At least that will keep the user from having to fight with it. |
Yeah, that might explain the behavior of these threads. They seem to be doing some kind of book-keeping as they often do eventually die on their own. Calling I did try out the three JPype config options, but no combination lead to an entirely crash-free shutdown. Though in these tests there was then no |
Tests have shown that this has no effect on Linux. With our without this, the `InterruptedExcetion` discussed in #38 still occurs at exit about the same number of times. So it's best not to touch this.
This issue seems to be resolved with the release of Comsol 6.0. I can no longer produce these errors in repeated tests, which I could before, with Comsol 5.6. I'm pretty sure this is because Comsol upgraded its vendored Java environment from JDK 8 to JDK 11. As I mentioned in an earlier comment, all bug reports on the OpenJDK issue tracker with similar error traces were for JDK 8, but never above. For the record, these were the test scripts I used (now) to confirm this...
import mph
version = '6.0'
mph.option('session', 'client-server')
client = mph.start(cores=1, version=version)
assert client.version == version
model = client.load('capacitor.mph')
model.solve('static')
from subprocess import run, PIPE, STDOUT
from textwrap import indent
from sys import executable as python
n = f = 0
while n < 20:
n += 1
print(f'Test run {n} started.')
command = [python, '-u', 'test.py']
process = run(command, stdout=PIPE, stderr=STDOUT, text=True)
if process.returncode:
print(f' Failed with return code {process.returncode}.')
if process.stdout:
if 'java.lang.InterruptedException' in process.stdout:
print(' Failed with InterruptedException.')
f += 1
else:
print(' Text printed to stdout.')
print(indent(process.stdout, ' '))
if process.stderr:
print(' Text printed to stdout.')
print(indent(process.stdout, ' '))
print(f'{f} runs failed with InterruptedException.') |
On Linux and macOS, we occasionally encounter this error:
The error is harmless, in that it occurs at the very end of the Python session, but never during simulations or any other normal interaction with Comsol. It is not 100% reproducible, though it does occur about half the time the test suite is run on Linux. The error happens when the client disconnects from the server during the shutdown sequence of the Python/Java session. A regular disconnect, pre-shutdown, does not produce the error.
All of this points to some kind of race condition in JPype's shutdown sequence of the Java VM. JPype pull request #937 would fix this, as repeated tests have shown, but work on JPype is currently on hold and it is unclear when that fix will be released. Hence this notice here.
The text was updated successfully, but these errors were encountered: