Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Engine stability and communications improvements #2158

Merged
merged 9 commits into from
Nov 7, 2018
Merged

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Nov 7, 2018

This PR will update plumpy to version 0.11.4 which underneath runs on kiwipy version 0.3.*. The biggest change is that the communications of a Runner now happen on a separate thread, while all the real operations and callbacks are scheduled on the main thread. This should ensure that even under heavy load, a (daemon) runner remains responsive to heartbeats and remote procedure calls (RPCs) from RabbitMQ. The missing of heartbeats and the subsequent cutting of the connection by RabbitMQ spawned a whole host of other problems that will now no longer surface.

Fixes #1994
Fixes #2088
Fixes #1748
Fixes #1426
Fixes #1821
Fixes #1822
Fixes #2157

sphuber and others added 9 commits November 6, 2018 22:19
Also ensure that the Runner's communicator is wrapped in CommunicatorWrapper
Otherwise, receiving messages will be run on the communicating thread
causing problems with SqlAlchemy sessions between it and the main thread.
When running tests, the communicators should have `testing_mode=True`
to make sure that all created queues and exchanges on RabbitMQ
are automatically deleted and cleaned up after the tests finish.
The unit tests for `verdi process` used to use separate threads to
run the daemon runner and the runner used to submit a test process,
however, using threads with SqlAlchemy is not supported, because they
use separate sessions and are typically not commensurate even though
the unit tests were assuming they are. Instead, we run a daemon runner
in a separate process, which simulates the real scenario as closely
as possible, while also avoiding the SqlAlchemy session problems.
Now we use representers and constructors registered with yaml to do all
our serialization/deserialization of Process instances and payloads for
messages sent over RabbitMQ.
When a process is submitted, we don't expect a response as we are
returning just the calculation node that we created to represent
the process in the database.
with an optional --wait flag to allow waiting for the action to complete
(default False).
…epts

Some of the calls in `save_checkpoint` and `load_checkpoint` in
`AiiDAPersister` can except. The runner or process that will call this
to either load or save a checkpoint will catch the `PersistenceError`
but any other exception might be ignored and disappear silently. Here we
explicitly catch these exceptions and raise `PersistenceError` instead.

We also set logging handlers for the `kiwipy` module with default level
set to `WARNING` such that they will be logged to the daemon logger.
The user is allowed to place unstored nodes in the context. As a result,
the engine needs to ensure that before a workchain transitions from one
step to the other, those nodes are stored, because after the state
transition the process will be saved. In order to do so, a checkpoint
will be created of the serialized process instance. A checkpoint can
only be reloaded if all its nodes are stored in the database.
The runner has a poll interval, which should not be 0 as that will lead
to the daemon runners using a lot of processor resources. The two
important settings for the RabbitMQ connection are the prefetch count
and the heartbeat timeout. The former is still experimental. Setting it
too high can lead to a daemon runner being overwhelmed, whereas setting
it too low can cause a the daemon to be deadlocked. The heartbeat
timeout is set to the maximum value that will be accepted anyway by a
default server configuration. The changes in kiwipy should make the
dropping of the heartbeat very unlikely anyway, but this is just an
extra precaution.

Note that all these parameters will have to become configurable on a per
profile basis at some point.
@sphuber sphuber requested a review from muhrin November 7, 2018 08:37
@sphuber sphuber merged commit fae3dfc into develop Nov 7, 2018
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 68.545% when pulling 2e7b0fa on new_engine into a3f8c84 on develop.

2 similar comments
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 68.545% when pulling 2e7b0fa on new_engine into a3f8c84 on develop.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.2%) to 68.545% when pulling 2e7b0fa on new_engine into a3f8c84 on develop.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants