Three main Styx components are described in this section: RunState, EventVisitor and OutputHandlers.
A RunState is an object describing the current state of execution of a workflow instance. As such, a RunState corresponds to a specific state in the state graph (shown in the next section). The RunState object can also carry further information about the execution flow of the workflow instance, for example the execution id of the latest attempt or the current number of attempts. When an event is received for a workflow instance, the associated RunState transitions into a new RunState, provided the state machine rules allow it. The map of RunState objects for all active workflow instances is also a fundamental component of the service. This map can always be restored at Styx's startup using the information from the persistent layer.
The whole set of events that trigger RunState transitions is defined in the EventVisitor class. These events can be generated by Styx itself or received from external systems. In the latter case, the received events are translated into Styx-compliant events before being injected into the system.
When a RunState transition occurs, a set of OutputHandlers is executed accordingly. The OutputHandlers are used to execute all the operations related to a certain transition, for example starting a new Docker run or storing execution information in the persistent layer.
All events and the related RunState transitions happen on a single thread. However, the OutputHandlers are executed on an executor service, so they may run in parallel.
See RunState.java, EventVisitor.java and OutputHandlers.
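The interplay between events, RunState transitions and OutputHandlers can be sketched as below. This is a minimal illustration and not the actual Styx code: the class `TransitionSketch`, the reduced `State` and `Event` enums (only QUEUED appears in this document; the rest are placeholders), the pool size and all method names are assumptions made for the example. It shows transitions being applied on a single thread while handlers are dispatched to a separate executor.

```java
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

class TransitionSketch {

  // Hypothetical, heavily reduced state graph; the real one has more states.
  enum State { QUEUED, RUNNING, DONE }
  enum Event { STARTED, SUCCEEDED }

  // Immutable snapshot of one workflow instance's execution state.
  static final class RunState {
    final String instanceId;
    final State state;
    final int attempts;

    RunState(String instanceId, State state, int attempts) {
      this.instanceId = instanceId;
      this.state = state;
      this.attempts = attempts;
    }
  }

  private final Map<String, RunState> active = new ConcurrentHashMap<>();
  // All transitions are applied on this single thread.
  private final ExecutorService eventThread = Executors.newSingleThreadExecutor();
  // Handlers run on a separate pool, so they may execute in parallel.
  private final ExecutorService handlerPool = Executors.newFixedThreadPool(4);
  private final List<Consumer<RunState>> outputHandlers;

  TransitionSketch(List<Consumer<RunState>> outputHandlers) {
    this.outputHandlers = outputHandlers;
  }

  void register(String instanceId) {
    active.put(instanceId, new RunState(instanceId, State.QUEUED, 0));
  }

  int activeCount() {
    return active.size();
  }

  // Returns the next RunState, or empty if the state machine forbids the transition.
  static Optional<RunState> transition(RunState current, Event event) {
    if (event == Event.STARTED && current.state == State.QUEUED) {
      return Optional.of(new RunState(current.instanceId, State.RUNNING, current.attempts + 1));
    }
    if (event == Event.SUCCEEDED && current.state == State.RUNNING) {
      return Optional.of(new RunState(current.instanceId, State.DONE, current.attempts));
    }
    return Optional.empty();
  }

  void receive(String instanceId, Event event) {
    eventThread.execute(() -> {
      RunState current = active.get(instanceId);
      if (current == null) {
        return; // unknown or already finished instance
      }
      Optional<RunState> maybeNext = transition(current, event);
      if (!maybeNext.isPresent()) {
        return; // transition not allowed: ignore the event
      }
      RunState next = maybeNext.get();
      if (next.state == State.DONE) {
        active.remove(instanceId);
      } else {
        active.put(instanceId, next);
      }
      for (Consumer<RunState> handler : outputHandlers) {
        handlerPool.execute(() -> handler.accept(next));
      }
    });
  }

  // Drains pending events first, then pending handler invocations.
  void shutdown() {
    try {
      eventThread.shutdown();
      eventThread.awaitTermination(5, TimeUnit.SECONDS);
      handlerPool.shutdown();
      handlerPool.awaitTermination(5, TimeUnit.SECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```

Keeping `transition` a pure function of the current state and the event makes the allowed transitions easy to test in isolation, while the single event thread avoids concurrent updates to the same RunState.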
Each workflow instance is executed until it completes or until the maximum number of retries is reached. This means that one or more actual Docker runs will happen per workflow instance. How many depends on the exit codes of the runs:
- Exit code 0 is treated as a successful run and the associated workflow instance is removed from the active set;
- Exit code 20 is considered by Styx as a run that failed due to missing dependencies that are expected to be present in the following executions. The approach in this case is to schedule a retry after a fixed timeout (10 minutes);
- Exit code 50 causes an immediate failure of the workflow instance (no retry is scheduled). A workflow can use this code to indicate an unrecoverable failure and instruct Styx not to retry;
- Other exit codes are treated as generic execution errors. In this case Styx reruns the workflow instance after a timeout that increases with the number of attempts (exponential backoff).
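The exit code rules above can be summarised in a small decision function. The following is a hedged sketch rather than Styx's implementation: `ExitCodePolicy`, `Decision`, the one-minute base delay, the one-hour cap and the `maxAttempts` parameter are all assumptions introduced for illustration, as is the choice to count missing-dependency retries against the same attempt limit. Only the exit codes 0, 20 and 50 and the 10-minute fixed delay come from the text.

```java
import java.time.Duration;

final class ExitCodePolicy {

  enum Action { DONE, RETRY, FAIL }

  // Outcome of evaluating one finished run.
  static final class Decision {
    final Action action;
    final Duration delay;

    Decision(Action action, Duration delay) {
      this.action = action;
      this.delay = delay;
    }
  }

  static final int SUCCESS = 0;
  static final int MISSING_DEPENDENCIES = 20;
  static final int UNRECOVERABLE = 50;

  // Fixed delay stated in the text for missing dependencies.
  static final Duration MISSING_DEPENDENCIES_DELAY = Duration.ofMinutes(10);
  // Assumed values: the text only says the delay grows exponentially.
  static final Duration BASE_DELAY = Duration.ofMinutes(1);
  static final Duration MAX_DELAY = Duration.ofHours(1);

  // attempts = number of runs executed so far, including the one that just exited.
  static Decision decide(int exitCode, int attempts, int maxAttempts) {
    if (exitCode == SUCCESS) {
      return new Decision(Action.DONE, Duration.ZERO);
    }
    if (exitCode == UNRECOVERABLE || attempts >= maxAttempts) {
      return new Decision(Action.FAIL, Duration.ZERO);
    }
    if (exitCode == MISSING_DEPENDENCIES) {
      return new Decision(Action.RETRY, MISSING_DEPENDENCIES_DELAY);
    }
    return new Decision(Action.RETRY, backoff(attempts));
  }

  // Doubles the delay with each attempt, capped at MAX_DELAY.
  static Duration backoff(int attempts) {
    int exponent = Math.min(Math.max(attempts - 1, 0), 20);
    Duration delay = BASE_DELAY.multipliedBy(1L << exponent);
    return delay.compareTo(MAX_DELAY) > 0 ? MAX_DELAY : delay;
  }
}
```

For example, under these assumed constants a generic failure on the second of three allowed attempts yields `decide(1, 2, 3)` with action `RETRY` and a delay of two minutes.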
This whole process follows this state graph:
This is the entry state for a workflow instance.
Workflow instances in this state are waiting for the next execution to start. Retry delays and resource limits are possible reasons for a workflow instance to be in the QUEUED state.
This state looks up metadata about the workflow instance and stores it within the state machine. Metadata is information such as the Docker image, the source code repository, etc.
This state sends a request for starting a new pod to GKE.
The request for starting a new pod has been sent to GKE. Styx is now waiting for GKE to report back that the pod has started.
The pod has started and the container is running. Styx is waiting for the workflow instance to finish.
The container within the pod is done and the pod has terminated.
A workflow instance has succeeded and is done.
A workflow instance has permanently failed. No more retries will be executed.
Submission of the job failed, and Styx tries to determine whether the job should be rescheduled for a retry.