
Application Programming with Coroutines

Sirepo runs simulations on laptops, local clusters, and (now) supercomputers. Asynchrony has always been a part of Sirepo's simulation execution environment, and we had a lot of defects as a result. We relied on Python multi-threading to handle some of the concurrency, and now we're using Python async/await with Tornado managing the event loop. It's worked out, but there are some gotchas we've discovered that haven't been described elsewhere.

We deal with a different class of execution than is found in most web apps. Some simulations run for seconds and others run for hours (or days, if we let them). Simulations can take one core or many hundreds. We also need to perform in situ visualization for Fortran/C++ software by reading files that are being dynamically written. On supercomputers and departmental clusters, we have to manage simulations run via sbatch-style job submission systems. There are other issues, but just to set things up: this isn't a typical job for Celery or another "classical" task engine. One other wrinkle is that the legacy codes cannot be trusted (and sometimes user input files are executed directly), so they have to be containerized and executed in a restricted environment.

Success with async/await

Many people have written about Python's async/await, e.g. Some thoughts on asynchronous API design in a post-async/await world.

We have found async/await to be advantageous over callbacks. We have also found cooperative scheduling easier to reason about than preemptive scheduling.

The main advantage to us is that you don't need locks, which makes it easier to create reliable (deadlock-free) code. This allowed us to build in resource sharing without a lot of complexity, especially given the complexity of the underlying problem. For example, we can simply use the file system for the database (json-store), since we have a limited number of users.
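As a rough sketch (not Sirepo's actual code), here is the kind of read-modify-write that is safe without a lock, precisely because nothing awaits in the middle; the path and function name are made up for illustration:

```python
import json
from pathlib import Path

# Hypothetical location of the json-store; Sirepo's actual layout differs.
_DB_DIR = Path("/srv/sirepo/db")

def db_update(job_id, **changes):
    """Read-modify-write a job record.

    No lock is needed: this function never awaits, so no other
    coroutine can run between the read and the write.
    """
    path = _DB_DIR / (job_id + ".json")
    record = json.loads(path.read_text()) if path.exists() else {}
    record.update(changes)
    path.write_text(json.dumps(record))
    return record
```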

Global State

Execution runs through a central supervisor process that runs independently of Flask. This allowed us to use async/await while the main app server still runs Python 2 under Flask. This modularization is effective, given that we can't easily migrate all our code to Python 3. The Flask server does some heavy lifting that requires preemptive scheduling.

The supervisor holds the state of all jobs in memory or on disk. There is a lot of information about each job: start time, end time, whether it errored or completed successfully, what parameters were used, etc. This global state needs to be shared across requests. For example, a job might be running when a visualization request comes in. Jobs may be canceled from the GUI and quickly restarted if the user switches tabs back and forth. Managing this state has been difficult.
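A minimal sketch of the kind of per-job record the supervisor keeps in memory; the field names are illustrative, not Sirepo's actual schema:

```python
import time

# Illustrative in-memory job table keyed by job id.
_jobs = {}

def job_started(job_id, params):
    _jobs[job_id] = {
        "startTime": time.time(),
        "endTime": None,
        "status": "running",   # running | completed | errored | canceled
        "params": params,
    }

def job_ended(job_id, status):
    job = _jobs[job_id]
    job["status"] = status
    job["endTime"] = time.time()
```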

Complex Resources

Jobs can run locally or remotely on NERSC. Local execution is performed on a Docker cluster. Simulations can easily generate megabytes and sometimes gigabytes of output that must be postprocessed and delivered in real time to the user. Local simulations can take one or many cores so resources need to be dynamically allocated.

Remote execution on NERSC has the same set of problems as local execution, with the addition that there is no direct access to compute nodes; all communication from the compute nodes to the login nodes (for visualization) must occur through files.

Resource Caching

When a user requests local execution, there might already be an existing Docker container running with access to the user's data. The codes are complex so the Docker images are large, and it's too slow to start a container on demand. Due to the restricted execution environment requirement, containers cannot be shared between users.

The input files for some codes are complex so a user can keep a library of such files, which can be referenced dynamically by codes. This library needs to be available in the container or on NERSC. Sometimes these files already exist in the Docker image so we can avoid downloading megabytes on demand when the simulation executes.

Resource Retrieval

While local execution can rely on NFS to download files, remote execution at NERSC cannot. We had to build a protocol to upload and download files via the supervisor. The user's data must be secure (no open s3 buckets here!).

Supervising

The supervisor is the traffic cop. It does no work directly. Agents call in remotely to the supervisor from login nodes at NERSC or from inside Docker containers, and they get access only to a specific user's resources. Compute time has to be limited, because simulations can sometimes get into infinite loops. There is no coordination between run, cancel, or visualization requests: they happen asynchronously in the GUI.

The agents take time to boot and communicate with the supervisor when they are ready. Once the websocket is open, messages can flow asynchronously, and with some latency, across the Internet.

Easier To Ask Permission

In order to coordinate all of this asynchronous resource management, we have found it easier to check for a resource synchronously and only await if it is not available. With preemption, you would use a mutex/lock, but mutexes are complicated to work with and unnecessary for coroutines. However, an await may result in a context switch, which means anything goes. By checking synchronously for all resources (has the agent called in, are there enough cores, etc.), we avoid calling await and ensure a consistent, known state before we do have to await, e.g. to send a message.
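A sketch of the pattern, using asyncio names for brevity (Sirepo runs on Tornado's event loop); the core-counting names are hypothetical:

```python
import asyncio

# Hypothetical bookkeeping for a fixed pool of local cores.
_free_cores = 8
_cores_freed = asyncio.Event()

async def reserve_cores(n):
    """Reserve n cores, awaiting only if they are not free right now."""
    global _free_cores
    while _free_cores < n:
        # Not available: this is the only await, and anything can
        # change while we wait, so loop and check again.
        _cores_freed.clear()
        await _cores_freed.wait()
    # Happy path: no await occurred, so the check is still valid.
    _free_cores -= n

def release_cores(n):
    global _free_cores
    _free_cores += n
    _cores_freed.set()   # wake any coroutine waiting for cores
```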

await and return or restart

When an await does occur, there are two cases: the request can complete, or other resources may have been affected while it waited. For example, to run a simulation, the request has to be number one in the queue (no other simulations for that user running), have enough cores (an appropriately-sized Docker container), and have an agent that has called back to the supervisor. If a simulation waits in the queue, gets cores, and then a cancel comes in, it should not proceed to send the message to the agent. It is no longer in the right position in the queue even though it has cores and an agent. Without locks, there's no way of knowing whether the request can proceed. The coroutine has to recheck all prerequisites after an await.

We first tried to add the necessary checks after each "edge" await, but this is complicated and error-prone. The implementation that evolved was to use a restart exception after edge awaits: whenever an await on some external event has to happen (message receive, resource allocation, etc.), the application throws Awaited, and the supervisor (dispatcher) restarts the request from the beginning.
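A rough sketch of the restart pattern; Awaited is the exception named above, but _dispatch, _run_once, and the request methods are illustrative names only:

```python
class Awaited(Exception):
    """Thrown after an edge await to force a restart of the request."""

async def _dispatch(request):
    # The dispatcher restarts the request from the top whenever an
    # edge await happened, so every prerequisite gets rechecked.
    while True:
        try:
            return await _run_once(request)
        except Awaited:
            continue

async def _run_once(request):
    # Each prerequisite is checked synchronously; if it is missing,
    # await for it and then throw Awaited so everything is rechecked.
    if not request.agent_is_ready():
        await request.start_agent()
        raise Awaited()
    if not request.cores_are_reserved():
        await request.reserve_cores()
        raise Awaited()
    # All prerequisites held without awaiting: safe to send the message.
    return await request.send_to_agent()
```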

Note that Python guarantees that awaiting a function does not suspend execution unless the callee itself suspends. This is why coroutines are better for control flow.
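A small demonstration of that guarantee, using plain asyncio for brevity: awaiting a coroutine that never suspends does not give the event loop a chance to run anything else.

```python
import asyncio

order = []

async def fast():
    # No awaits inside, so this coroutine never suspends.
    order.append("fast")

async def main():
    # Schedule a callback; it can only run when we yield to the loop.
    asyncio.get_running_loop().call_soon(lambda: order.append("callback"))
    await fast()                 # does not suspend; the callback still waits
    order.append("after")
    await asyncio.sleep(0)       # now we yield, and the callback runs
    assert order == ["fast", "after", "callback"]

asyncio.run(main())
```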

Happy Path is Awaitless

Because resources are checked synchronously (and there are no locks), the normal path through the code has no awaits. This makes it easy to reason about the control flow.

This also means that any await which might have to suspend needs a check before it. This requires careful coding, but each check has a very limited scope: check the resource, or wait for it. More importantly, permission (allocation) checks are composable: you can have as many of them as you like.

Since the supervisor has to handle exceptions anyway, the code to manage restarts is easy to implement. All the code down the stack has to handle exceptions (freeing resources), so an Awaited exception is nothing special, really.

This makes the happy path awaitless and exceptionless, i.e. it's "fast". It's only when resources are unavailable and have to be awaited that an exception is thrown and the request has to be restarted.

Cancel is Free

Cancel is a hard problem but so is resource management. The awaitless solution to the resource management problem solves the cancel problem. Any request that can be restarted can also be canceled.

Any simulation can be canceled at any time, and we have to handle that. We can simply cancel the task that is operating on a request, and it will reply "canceled" when the CancelledError is caught in the supervisor.
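A sketch of how cancellation falls out of the same structure, again with illustrative names; Tornado on Python 3 surfaces cancellation as asyncio's CancelledError:

```python
import asyncio

async def _run_request(request):
    try:
        await _dispatch(request)       # the restartable request loop above
    except asyncio.CancelledError:
        request.reply("canceled")      # tell the GUI the job was canceled
        raise                          # let the task finish as cancelled

def cancel_request(task):
    # Any request that can be restarted can also be canceled: the
    # CancelledError is raised at the request's next await.
    task.cancel()
```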