update runner daemon and docs
gpetretto committed Oct 22, 2024
1 parent d01d19e commit 2be9e45
Showing 12 changed files with 494 additions and 33 deletions.
2 changes: 2 additions & 0 deletions doc/source/user/errors.rst
@@ -333,6 +333,8 @@ rerun a Job. In particular
inconsistencies.


.. _errors runner:

Runner errors and Locked jobs
=============================

1 change: 1 addition & 0 deletions doc/source/user/index.rst
@@ -18,6 +18,7 @@ details are found in :ref:`reference`.
tuning
errors
states
runner
advancedoptions
backup

2 changes: 2 additions & 0 deletions doc/source/user/install.rst
@@ -25,6 +25,8 @@ All of these should have a python environment with at least jobflow-remote insta
However, only **USER** and **RUNNER** need to have access to the database. If not overlapping
with the other **RUNNER** only needs ``jobflow-remote`` and its dependencies to be installed.

.. _setup options:

Setup options
=============

217 changes: 217 additions & 0 deletions doc/source/user/runner.rst
@@ -0,0 +1,217 @@
.. _runner:

======
Runner
======

In jobflow-remote the Runner refers to one or more processes that handle the whole
execution of the jobflow workflows, including the interaction with the worker and the
writing of the outputs in the ``JobStore``.

The way the Job states change based on the action of the Runner has already been
described in the introductory :ref:`workingprinciple` section. This section will
instead focus on the technical aspects of the runner execution.

Setup
=====

As explained in the :ref:`setup options` section, and illustrated in the figure below,
the Runner process must have **access to all required resources** as specified in the project
configuration. In particular, it must be able to reach all the workers, the MongoDB database
defined in the ``queue`` section, and the output ``JobStore``.

.. image:: ../_static/img/daemon_schema.svg
   :width: 50%
   :alt: All-in-one configuration
   :align: center

Runner processes
================

The ``Runner`` performs different tasks, which fall into four main groups:

1) checking out jobs from database to start a Flow execution
2) updating the states of the Jobs in the ``queue`` database
3) interacting with the worker hosts to upload/download files and check the job status
4) inserting the output data in the output ``JobStore``

While all of these can be executed in a single process, the default is to start
several daemonized processes, each taking care of one of the actions listed above,
in order to speed up the execution of the jobs. In addition, if many Jobs need to be
handled simultaneously, it is possible to start multiple independent processes for
tasks 3 and 4, which can increase the throughput.

.. note::
    This means that multiple instances of the ``Runner`` process are simultaneously
    running on the machine, each one requiring a certain amount of memory. If the
    system memory is a limiting factor, all the actions can be executed in a single
    process by starting the daemon with the ``--single`` option.

When activated from the CLI, these run as daemonized processes that are handled by
`Supervisor <http://supervisord.org/index.html>`_. The CLI, through the ``DaemonManager``
object, provides an interface to interact with the daemon and to start, stop and kill
the ``Runner`` processes.

Process management
==================

Start
-----

The ``Runner`` is usually started using the CLI command::

    jf runner start

This will start the Supervisor process, which will then spawn the single or
multiple Runner processes. Note that the command does not wait for all the processes
to start, so the successful completion of the command does not necessarily imply
that all the Runner processes are active.

The number of runner processes can only be set at start time. The ``--single``
option will run all the actions described in the previous section in a single
process, instead of in multiple ones, which is the default. The ``--transfer``
and ``--complete`` options allow increasing the number of processes dedicated to
steps 3 and 4.

.. warning::
    The ``Runner`` reads the project configurations when each of the processes is
    started and does not attempt to refresh them during the execution. Whenever
    changing

.. _runner stop:

Stop
----

Executing the stop command::

    jf runner stop

relies on Supervisor to send a ``SIGTERM`` signal (a termination signal that allows
the process to exit cleanly) to all the ``Runner`` processes.
In this case the Supervisor process will remain active. Unless the ``--wait`` option
is specified, the completion of the command does not imply that all the ``Runner``
processes have been terminated.

.. warning::
    The ``Runner`` is designed to recognize the signal and **wait for the completion of
    the action being performed** before actually exiting.
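
The behaviour described in the warning can be pictured with the following minimal
Python sketch. It is not the jobflow-remote implementation: the ``GracefulLoop``
class and the toy actions are hypothetical names used only for illustration, and
the signal is raised from inside an action to mimic a ``SIGTERM`` arriving while
work is in progress.

```python
import signal


class GracefulLoop:
    """Hypothetical loop that finishes the current action before stopping."""

    def __init__(self):
        self.stop_requested = False
        self.completed = []

    def _on_sigterm(self, signum, frame):
        # Do not exit here: just remember that a stop was requested.
        self.stop_requested = True

    def run(self, actions):
        previous = signal.signal(signal.SIGTERM, self._on_sigterm)
        try:
            for action in actions:
                if self.stop_requested:
                    break  # exit only between two actions
                action()
                self.completed.append(action.__name__)
        finally:
            # Restore the handler that was installed before the loop started.
            if previous is not None:
                signal.signal(signal.SIGTERM, previous)


def upload_files():
    # Simulate SIGTERM being delivered while this action is in progress.
    signal.raise_signal(signal.SIGTERM)


def download_files():
    pass


loop = GracefulLoop()
loop.run([upload_files, download_files])
print(loop.completed)
```

The handler only records the request, so the action in progress always runs to
completion and the loop exits before starting the next one.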

.. note::
    Since the Supervisor process remains active, when starting the runner again after
    a stop it is not possible to switch from a single process to a split configuration
    or the other way round. It is necessary to shut down the whole daemon in that case.

Shutdown
--------
Shutting down the runner with the command::

    jf runner shutdown

is equivalent to the :ref:`runner stop`, except that the Supervisor process is
also stopped.

Kill
----

It is possible to directly kill all the processes, without sending the ``SIGTERM``
signal and thus without waiting for the current action to be completed, with the
command::

    jf runner kill

.. warning::
    If an action was being performed, it is possible that the database may be
    left in an inconsistent state and/or that the Job that was being processed
    will be *locked*, as the runner puts an explicit lock on the document while
    working on a Job and/or on a Flow. See the :ref:`errors runner` section for
    how to handle these cases.

Information
-----------

The overall state of the runner daemon can be obtained by executing::

    jf runner status

This returns a custom global state defined in jobflow-remote. Typical
values are ``shut_down``, ``stopped`` and ``running``. A ``partially_running``
state means that some of the daemonized processes are active, while others
are either not yet started or have been stopped.

To get more details about the individual processes it is possible to run::

    jf runner info

This prints a table like::

    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
    ┃ Process                                      ┃ PID   ┃ State   ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
    │ supervisord                                  │ 12305 │ RUNNING │
    │ runner_daemon_checkout:run_jobflow_checkout  │ 90127 │ RUNNING │
    │ runner_daemon_complete:run_jobflow_complete0 │ 90128 │ RUNNING │
    │ runner_daemon_queue:run_jobflow_queue        │ 90129 │ RUNNING │
    │ runner_daemon_transfer:run_jobflow_transfer0 │ 90130 │ RUNNING │
    └──────────────────────────────────────────────┴───────┴─────────┘

providing the state of each individual daemon process and its system
process ID.

Running daemon check
====================

There are no strict limitations on which machine should be used to execute the ``Runner``
as a daemon and, as explained in the :ref:`setup options` section, several
configurations are possible. It is thus possible for a user to mistakenly start
the runner daemon in two different locations. While this should not corrupt
the database, thanks to the locking mechanism, it may be confusing, as a user
may be unaware that a runner is already active on some machine.

To reduce the chance of this happening, jobflow-remote stores in the database
information about the machine where a ``Runner`` daemon is started. The
code will then prevent a daemon from being started on a different machine. All the
commands are instead allowed if the information about the machine where they are
executed matches that in the database.

If a machine where a ``Runner`` was previously active was switched off without
explicitly stopping it, the database will still consider that daemon to be active.
To start the daemon on another machine, if it is certain that the ``Runner`` is not
active anymore, it is possible to clean the database reference to the previous
process with the command::

    jf runner reset

A new ``Runner`` daemon can then be started anywhere.

.. warning::
    This procedure is applied only for ``Runner`` processes started as a daemon.
    No check is done and no data is added to the database if the ``Runner`` is
    started directly. See the :ref:`runner direct` section below.
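
The logic of the check can be summarised with a small Python sketch. This is only
an illustration under stated assumptions: ``RunnerRegistry``, its ``hostname`` field
and the in-memory attribute standing in for the database document are hypothetical
and do not reflect the actual jobflow-remote data model.

```python
class RunnerRegistry:
    """Hypothetical stand-in for the database record of the running daemon."""

    def __init__(self):
        self.running_runner = None  # no daemon registered yet

    def start(self, hostname):
        # Refuse to start if a daemon is registered on a different machine.
        record = self.running_runner
        if record is not None and record["hostname"] != hostname:
            raise RuntimeError(
                f"A Runner daemon is already registered on {record['hostname']}"
            )
        self.running_runner = {"hostname": hostname}

    def reset(self):
        # Counterpart of "jf runner reset": forget the previous daemon.
        self.running_runner = None


registry = RunnerRegistry()
registry.start("machine-a")

try:
    registry.start("machine-b")
    started_elsewhere = True
except RuntimeError:
    started_elsewhere = False

registry.reset()
registry.start("machine-b")  # allowed after the reset
```

Starting again from ``machine-a`` before the reset would have been accepted, since
the machine matches the stored record.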

Backoff algorithm
=================

While performing their actions on the Jobs, the ``Runner`` processes may run into
issues. For example, a connection error may occur. In order to avoid overloading
the processes and/or the resources, when such an error occurs the process will not
immediately retry the action, but will wait an increasingly long amount of time
before each retry. After three failures, the Job will be set to the ``REMOTE_ERROR``
state. See the :ref:`remoteerrors` section for more details.
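
A generic backoff loop of this kind can be sketched in Python as follows. It is not
the code used by jobflow-remote: the ``run_with_backoff`` helper and the delay
values are assumptions chosen for illustration, and only the limit of three
failures and the ``REMOTE_ERROR`` state come from the description above.

```python
import time


def run_with_backoff(action, delays=(30, 300), sleep=time.sleep):
    """Run ``action``, waiting longer after each failure.

    The delays (in seconds) are illustrative. With two delays the action is
    attempted at most three times before giving up.
    """
    max_failures = len(delays) + 1
    for failure_count in range(1, max_failures + 1):
        try:
            return "OK", action()
        except ConnectionError:
            if failure_count == max_failures:
                # Third failure: stop retrying and flag the Job.
                return "REMOTE_ERROR", None
            sleep(delays[failure_count - 1])


# An action that fails twice and then succeeds.
attempts = {"count": 0}


def flaky_action():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("worker unreachable")
    return "done"


def always_failing_action():
    raise ConnectionError("worker unreachable")


waits = []  # record the requested waits instead of really sleeping
flaky_result = run_with_backoff(flaky_action, sleep=waits.append)
failed_result = run_with_backoff(always_failing_action, sleep=lambda seconds: None)
```

Passing ``sleep`` as an argument keeps the sketch testable without real waiting.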


.. _runner direct:

Direct execution
================

This is not the standard usage, but in some cases, for example during development
or debugging, it may be useful to run the ``Runner`` processes directly and not as
a daemon. The simplest way to do that is to run::

    jf runner run

This will start a single ``Runner`` process performing all the actions.
Similarly, it is also possible to do this from the Python API with the code
below:

.. code-block:: python

    from jobflow_remote.jobs.runner import Runner

    Runner(project_name="xxx").run()
14 changes: 11 additions & 3 deletions src/jobflow_remote/cli/admin.py
@@ -1,6 +1,7 @@
 from typing import Annotated, Optional
 
 import typer
+from packaging.version import parse as parse_version
 from rich.prompt import Confirm
 from rich.text import Text
 
@@ -25,6 +26,7 @@
     check_stopped_runner,
     confirm_project_name,
     exit_with_error_msg,
+    exit_with_warning_msg,
     get_config_manager,
     get_job_controller,
     get_job_ids_indexes,
@@ -69,12 +71,20 @@ def upgrade(
 ) -> None:
     """
     Upgrade the jobflow database.
-    WARNING: can modify all the data. Previous version could not be retrieved anymore.
+    WARNING: can modify all the data. Previous version cannot be retrieved anymore.
     It preferable to perform a backup before upgrading.
     """
     check_stopped_runner(error=True)
 
     jc = get_job_controller()
+    upgrader = DatabaseUpgrader(jc)
+    target_version = parse_version(test_version_upgrade) or upgrader.current_version
+    db_version = jc.get_current_db_version()
+    if db_version >= target_version:
+        exit_with_warning_msg(
+            f"Current DB version: {db_version}. No upgrade required for target version {target_version}"
+        )
+
     jobflow_check = jc.upgrade_check_jobflow()
     if jobflow_check:
         out_console.print(jobflow_check)
@@ -89,9 +99,7 @@ def upgrade(
         )
         out_console.print(text)
 
-    upgrader = DatabaseUpgrader(jc)
     if not no_dry_run:
-        target_version = test_version_upgrade or upgrader.current_version
         actions = upgrader.dry_run(target_version=test_version_upgrade)
         if not actions:
             out_console.print(