-
Notifications
You must be signed in to change notification settings - Fork 191
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Engine stability and communications improvements #2158
Merged
Merged
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Also ensure that the Runner's communicator is wrapped in CommunicatorWrapper Otherwise, receiving messages will be run on the communicating thread causing problems with SqlAlchemy sessions between it and the main thread.
When running tests, the communicators should have `testing_mode=True` to make sure that all created queues and exchanges on RabbitMQ are automatically deleted and cleaned up after the tests finish.
The unit tests for `verdi process` used to use separate threads to run the daemon runner and the runner used to submit a test process, however, using threads with SqlAlchemy is not supported, because they use separate sessions and are typically not commensurate even though the unit tests were assuming they are. Instead, we run a daemon runner in a separate process, which simulates the real scenario as closely as possible, while also avoiding the SqlAlchemy session problems.
Now we use representers and constructors registered with yaml to do all our serialization/deserialization of Process instances and payloads for messages sent over RabbitMQ.
When a process is submitted, we don't expect a response as we are returning just the calculation node that we created to represent the process in the database.
with an optional --wait flag to allow waiting for the action to complete (default False).
…epts Some of the calls in `save_checkpoint` and `load_checkpoint` in `AiiDAPersister` can except. The runner or process that will call this to either load or save a checkpoint will catch the `PersistenceError` but any other exception might be ignored and disappear silently. Here we explicitly catch these exceptions and raise `PersistenceError` instead. We also set logging handlers for the `kiwipy` module with default level set to `WARNING` such that they will be logged to the daemon logger.
The user is allowed to place unstored nodes in the context. As a result, the engine needs to ensure that before a workchain transitions from one step to the other, those nodes are stored, because after the state transition the process will be saved. In order to do so, a checkpoint will be created of the serialized process instance. A checkpoint can only be reloaded if all its nodes are stored in the database.
The runner has a poll interval, which should not be 0 as that will lead to the daemon runners using a lot of processor resources. The two important settings for the RabbitMQ connection are the prefetch count and the heartbeat timeout. The former is still experimental. Setting it too high can lead to a daemon runner being overwhelmed, whereas setting it too low can cause a the daemon to be deadlocked. The heartbeat timeout is set to the maximum value that will be accepted anyway by a default server configuration. The changes in kiwipy should make the dropping of the heartbeat very unlikely anyway, but this is just an extra precaution. Note that all these parameters will have to become configurable on a per profile basis at some point.
muhrin
approved these changes
Nov 7, 2018
2 similar comments
This was referenced Nov 7, 2018
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
This PR will update
plumpy
to version0.11.4
which underneath runs onkiwipy
version0.3.*
. The biggest change is that the communications of aRunner
now happen on a separate thread, while all the real operations and callbacks are scheduled on the main thread. This should ensure that even under heavy load, a (daemon) runner remains responsive to heartbeats and remote procedure calls (RPCs) from RabbitMQ. The missing of heartbeats and the subsequent cutting of the connection by RabbitMQ spawned a whole host of other problems that will now no longer surface.Fixes #1994
Fixes #2088
Fixes #1748
Fixes #1426
Fixes #1821
Fixes #1822
Fixes #2157