WIP: Check only one runner #150

Merged
merged 23 commits into from
Oct 24, 2024
3d0c45c
Added is_locked property method for MongoLock.
davidwaroquiers Jul 17, 2024
c98c121
Added lock_auxiliary method in JobController.
davidwaroquiers Jul 17, 2024
61d8be8
First implementation of runner info in auxiliary collection.
davidwaroquiers Jul 17, 2024
6600b55
Fixed problem when running runner document is not yet in the auxiliary
davidwaroquiers Jul 18, 2024
ec4361b
Added test for MongoLock.
davidwaroquiers Jul 18, 2024
dfd79af
Added upgrade utility.
davidwaroquiers Jul 22, 2024
620d96a
Added reset of the runnning_runner in the database for kill and
davidwaroquiers Jul 22, 2024
23e104b
Fixed RuntimeError message in run_one_job.
davidwaroquiers Jul 23, 2024
1342356
Fixed problems for tests failing locally.
davidwaroquiers Jul 23, 2024
1979851
Added some tests.
davidwaroquiers Jul 23, 2024
943ce50
Fixed upgrades. Added tests.
davidwaroquiers Jul 25, 2024
ed0f5b9
Fixes after review.
davidwaroquiers Jul 25, 2024
747e8b5
Updates following review.
davidwaroquiers Jul 26, 2024
2307c6d
Added pytest-mock to test deps.
davidwaroquiers Jul 26, 2024
8114ff5
Reorganized upgrade procedure.
davidwaroquiers Jul 26, 2024
ab9c78e
update tests
gpetretto Oct 3, 2024
c79e134
updates to daemon check procedure
gpetretto Oct 4, 2024
cac2898
refactor upgrade procedure
gpetretto Oct 8, 2024
a9b85d2
upgrade documentation
gpetretto Oct 8, 2024
be10f71
remove print statement
gpetretto Oct 8, 2024
4f7b5c4
update runner daemon and docs
gpetretto Oct 22, 2024
1a1c2e8
fix comments and more docs
gpetretto Oct 23, 2024
8fc6704
fix pyproject
gpetretto Oct 23, 2024
1 change: 1 addition & 0 deletions doc/source/conf.py
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@
"sphinx_copybutton",
"sphinxcontrib.autodoc_pydantic",
"sphinxcontrib.mermaid",
"sphinxcontrib.typer",
]

# Add any paths that contain templates here, relative to this directory.
26 changes: 26 additions & 0 deletions doc/source/user/cli.rst
@@ -0,0 +1,26 @@
.. _cli:

===
CLI
===

Jobflow-remote allows managing Jobs and Flows through the ``jf`` command line
interface (CLI). The most useful commands are already discussed in the
specific sections. A list of all the available commands can be obtained
by running::

    jf --tree

or, for the commands available in a subsection, with, for example::

    jf job --tree

All the commands have an associated help that can be shown with the
``--help`` flag. The help for all the commands available in ``jf`` is
reported below.

.. typer:: jobflow_remote.cli:app
    :preferred: html
    :width: 65
    :show-nested:
    :make-sections:
2 changes: 2 additions & 0 deletions doc/source/user/errors.rst
Expand Up @@ -333,6 +333,8 @@ rerun a Job. In particular
inconsistencies.


.. _errors runner:

Runner errors and Locked jobs
=============================

2 changes: 2 additions & 0 deletions doc/source/user/index.rst
Expand Up @@ -18,8 +18,10 @@ details are found in :ref:`reference`.
tuning
errors
states
runner
advancedoptions
backup
cli

.. toctree::
:hidden:
58 changes: 55 additions & 3 deletions doc/source/user/install.rst
@@ -1,8 +1,8 @@
.. _install:

**********************
Setup and installation
**********************
*******************************
Setup, installation and upgrade
*******************************

Introduction
============
Expand All @@ -25,6 +25,8 @@ All of these should have a python environment with at least jobflow-remote insta
However, only **USER** and **RUNNER** need to have access to the database. If not overlapping
with the other **RUNNER** only needs ``jobflow-remote`` and its dependencies to be installed.

.. _setup options:

Setup options
=============

Expand Down Expand Up @@ -97,6 +99,50 @@ or, for the development version::

pip install git+https://github.com/Matgenix/jobflow-remote.git

.. _upgrade:

Upgrade
=======

If you upgraded ``jobflow-remote`` to a new version and plan to use it with an
already existing project, there may be incompatibilities between the existing
database or project configuration and those used in the upgraded version.
To smooth the upgrade procedure, a tool that upgrades the configuration
has been implemented. It is exposed through the ``jf`` command line tool::

    jf admin upgrade

This performs the following steps:

* Compare the version of the installed ``jobflow-remote`` with the one stored
  in the database (set when executing a ``jf admin reset``) and use this as a
  reference to determine which upgrades will be applied.
* Check the version of ``jobflow`` installed and compare with the version stored
  in the database. Optionally compare the versions of all the other packages
  installed (use the ``--check-env`` option).
* Provide a list of upgrades that will be performed.
* Ask the user for confirmation.
* Sequentially apply the required upgrades.
* Update the version information in the database.

This will resolve potential incompatibilities and make the configuration compatible
with the current version of ``jobflow-remote``.

.. warning::
    It is advisable to perform a backup of the content of the queue database
    before performing the upgrade. See the :ref:`backup` section for more details.

.. note::
    The version will be upgraded in steps, so that if multiple versions have
    been skipped before the current upgrade, the code will proceed by upgrading
    between subsequent versions, one at a time.

.. note::
    A difference in the packages does not necessarily imply issues for the upgrade.
    The comparison may help to check whether anything problematic or unexpected
    is present.
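
The stepwise procedure described above can be illustrated with a minimal sketch.
It is not the implementation used by ``jobflow-remote``: the ``UPGRADES`` registry,
the ``register`` decorator, the version numbers, the upgrade bodies and the plain
``dict`` standing in for the database are all hypothetical, and the confirmation
prompt is omitted.

```python
from typing import Callable, Dict, List, Tuple

Version = Tuple[int, ...]

# Hypothetical registry: target version -> function applying that upgrade.
UPGRADES: Dict[Version, Callable[[dict], None]] = {}


def parse_version(text: str) -> Version:
    # Assumption: versions are plain dotted integers such as "0.1.3".
    return tuple(int(part) for part in text.split("."))


def register(version: str):
    """Register a function as the upgrade leading to ``version``."""
    def decorator(func):
        UPGRADES[parse_version(version)] = func
        return func
    return decorator


@register("0.1.3")  # illustrative version number
def _to_0_1_3(db: dict) -> None:
    db.setdefault("auxiliary", {})["running_runner"] = None


@register("0.1.5")  # illustrative version number
def _to_0_1_5(db: dict) -> None:
    db["example_flag"] = True


def pending_upgrades(stored: str, installed: str) -> List[Version]:
    """Upgrades newer than the stored version, up to the installed one."""
    low, high = parse_version(stored), parse_version(installed)
    return sorted(v for v in UPGRADES if low < v <= high)


def upgrade(db: dict, installed: str) -> List[str]:
    """Apply the pending upgrades one at a time, then record the version."""
    applied = []
    for version in pending_upgrades(db["version"], installed):
        UPGRADES[version](db)
        db["version"] = ".".join(str(n) for n in version)
        applied.append(db["version"])
    db["version"] = installed
    return applied
```

Starting from ``{"version": "0.1.0"}`` and calling ``upgrade(db, "0.1.5")``, both
registered upgrades are applied in order, so skipped versions are never jumped over.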


Environments
============

Expand Down Expand Up @@ -216,4 +262,10 @@ As a last step you should reset the database with the command::
This will also delete the content of the database. If you are reusing an existing database
and do not want to erase your data, skip this step.

.. note::

    This will also set the information about the ``jobflow-remote`` version and
    python environment in the database. This will be used during the :ref:`upgrade`
    procedure.

You are now ready to start running workflows with jobflow-remote!
3 changes: 2 additions & 1 deletion doc/source/user/introduction.rst
Expand Up @@ -57,7 +57,7 @@ equivalent for the other kinds of setup.

.. image:: ../_static/img/daemon_schema.svg
:width: 50%
:alt: All-in-one configuration
:alt: Runner daemon
:align: center

Once the daemon is started, the runner loops over the different actions that it can
Expand All @@ -68,6 +68,7 @@ perform and updates the state of Jobs in the database performing some actions on
- resolving all the references of the Job from the database (including everything in additional stores)
- using those data to generate a JSON representation of the Job without external references
- uploading a JSON file with this information on the runner

Once this is done, the state of the Job is ``UPLOADED``.
* The runner generates a submission script suitable for the type of chosen worker.
Uploads it and submits the job. The Job is now ``SUBMITTED``.
220 changes: 220 additions & 0 deletions doc/source/user/runner.rst
@@ -0,0 +1,220 @@
.. _runner:

======
Runner
======

In jobflow-remote the Runner refers to one or more processes that handle the whole
execution of the jobflow workflows, including the interaction with the worker and the
writing of the outputs in the ``JobStore``.

The way the Job states change based on the action of the Runner has already been
described in the introductory :ref:`workingprinciple` section. This section will
instead focus on the technical aspects of the runner execution.

Setup
=====

As explained in the :ref:`setup options` section, and exemplified in the figure below,
the Runner process must have **access to all required resources** as specified in the project
configuration. In particular, this includes all the workers, the MongoDB database defined
in the ``queue`` section and the output ``JobStore``.

.. image:: ../_static/img/daemon_schema.svg
    :width: 50%
    :alt: Runner daemon
    :align: center

Runner processes
================

The ``Runner`` performs different tasks, mainly divided into:

1) checking out jobs from the database to start a Flow execution
2) updating the states of the Jobs in the ``queue`` database
3) interacting with the worker hosts to upload/download files and check the job status
4) inserting the output data in the output ``JobStore``

While all these can be executed in a single process, to speed up the execution of the jobs,
the default option is to start different daemonized processes, each of which takes care
of one of the actions listed above. In addition, if many Jobs need to be dealt with
simultaneously, it is also possible to start multiple independent processes that
handle tasks 3 and 4, which can increase the throughput.

.. note::
    This means that multiple instances of the ``Runner`` process are simultaneously
    running on the machine, each one requiring a certain amount of memory. If the
    system memory is a limiting factor, all the actions can be executed in a single
    process by starting the daemon with the ``--single`` option.

When activated from the CLI, these run as daemonized processes, which are handled by
`Supervisor <http://supervisord.org/index.html>`_. The CLI, through the ``DaemonManager``
object, provides an interface to interact with the daemon and to start, stop and kill
the ``Runner`` processes.

Process management
==================

Start
-----

The ``Runner`` is usually started using the CLI command::

    jf runner start

This will start the Supervisor process, which will then spawn the single or
multiple Runner processes. Note that the command will not wait for all the processes
to start, so the successful completion of the command does not necessarily imply
that all the Runner processes are active.
The number of runner processes can only be managed at start time. The ``--single``
option will run all the actions described in the previous section in a single
process, instead of in multiple ones, which is the default. The ``--transfer``
and ``--complete`` options allow increasing the number of processes dedicated to
steps 3 and 4.

.. warning::
    The ``Runner`` **reads the project configuration when the processes are
    started** and does not attempt to refresh it during the execution. Whenever
    the project configuration is changed, the ``Runner`` needs to be restarted.

.. _runner stop:

Stop
----

Executing the stop command::

    jf runner stop

relies on Supervisor to send a ``SIGTERM`` signal (a termination signal that allows
the process to exit cleanly) to all the ``Runner`` processes.
In this case the Supervisor process will remain active. Unless the ``--wait`` option
is specified, the completion of the command will not imply that all the ``Runner``
processes have been terminated.

.. warning::
    The ``Runner`` is designed to recognize the signal and **wait for the completion of
    the action being performed**, before actually exiting.

.. note::
    Since the Supervisor process remains active, when starting the runner again after
    a stop it is not possible to switch from a single process to a split configuration
    or the other way round. It is necessary to shut down the whole daemon in that case.
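
The behaviour described in the warning above (record the ``SIGTERM``, finish the
current action, then exit) can be sketched in a few lines of Python. This is only an
illustration of the general pattern, not the code of the ``Runner``: the
``GracefulLoop`` class and its methods are hypothetical names.

```python
import signal


class GracefulLoop:
    """Toy loop that finishes the current action before honouring SIGTERM."""

    def __init__(self):
        self.stop_requested = False

    def handle_sigterm(self, signum, frame):
        # Do not exit here: only record that a stop was requested.
        self.stop_requested = True

    def install(self):
        # Register the handler; must be called from the main thread.
        signal.signal(signal.SIGTERM, self.handle_sigterm)

    def run(self, actions):
        completed = []
        for action in actions:
            # The flag is checked only between actions, never in the middle
            # of one, so an action that has started always runs to the end.
            if self.stop_requested:
                break
            completed.append(action())
        return completed
```

Because the handler only sets a flag, a signal arriving in the middle of an action
does not interrupt it. This is also why the stop command may return before the
processes have actually exited.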

Shutdown
--------
Shutting down the runner with the command::

    jf runner shutdown

is equivalent to the :ref:`runner stop` command, except that the Supervisor process
will also be stopped.

Kill
----

It is possible to directly kill all the processes, without sending the ``SIGTERM``
signal and thus without waiting for the current action to be completed, with the
command::

    jf runner kill

.. warning::
    If an action was being performed, it is possible that the database may be
    left in an inconsistent state and/or that the Job that was being processed
    will be *locked*, as the runner puts an explicit lock on the document while
    working on a Job and/or on a Flow. See the :ref:`errors runner` section for
    how to handle these cases.

Information
-----------

It is possible to get an overall state of the runner daemon by executing::

    jf runner status

This returns a custom global state defined in jobflow-remote. Typical
values are ``shut_down``, ``stopped`` and ``running``. A ``partially_running``
state means that some of the daemonized processes are active, while others
are either not yet started or have been stopped.
To get more details about the single processes it is possible to run::

    jf runner info

This prints a table like::

    ┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━┳━━━━━━━━━┓
    ┃ Process                                      ┃ PID   ┃ State   ┃
    ┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━╇━━━━━━━━━┩
    │ supervisord                                  │ 12305 │ RUNNING │
    │ runner_daemon_checkout:run_jobflow_checkout  │ 90127 │ RUNNING │
    │ runner_daemon_complete:run_jobflow_complete0 │ 90128 │ RUNNING │
    │ runner_daemon_queue:run_jobflow_queue        │ 90129 │ RUNNING │
    │ runner_daemon_transfer:run_jobflow_transfer0 │ 90130 │ RUNNING │
    └──────────────────────────────────────────────┴───────┴─────────┘

providing the state of each individual daemon process and its system
process ID.

Running daemon check
====================

There are no strict limitations on which machine should be used to execute the ``Runner``
as a daemon and, as explained in the :ref:`setup options` section, there are several
possible configurations. It is thus possible for a user to mistakenly start
the runner daemon in two different locations. While this should not corrupt
the database, thanks to the locking mechanism, it may still generate errors:
the outputs of a Job could be downloaded by one of the runners while another one
tries to complete it without having access to the downloaded files.
This can be confusing, as a user may be unaware that a runner is already active on some machine.

To reduce the chance of this happening, jobflow-remote also stores in the database
information about the machine where a ``Runner`` daemon is started. The
code will then prevent a daemon from being started on a different machine. All the
commands are instead allowed if the information about the machine where they are
executed matches that in the database.

If a machine where a ``Runner`` was previously active was switched off without
explicitly stopping it, the database will still consider that daemon to be active.
To start the daemon on another machine, if it is certain that the ``Runner`` is not
active anymore, it is possible to clean the database reference to the previous
process with the command::

    jf runner reset

A new ``Runner`` daemon can then be started anywhere.

.. warning::
    This procedure is applied only for ``Runner`` processes started as a daemon.
    No check is done and no data is added to the database if the ``Runner`` is
    started directly. See the :ref:`runner direct` section below.
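
The logic of the check described in this section can be sketched as follows. This is
a simplified illustration under stated assumptions, not the actual code: a plain
``dict`` stands in for the database document, the ``running_runner`` key and the
function names are hypothetical, and the hostname alone is assumed to identify a
machine.

```python
import socket


def current_machine() -> dict:
    # Assumption: the hostname is enough to identify a machine in this sketch.
    return {"hostname": socket.gethostname()}


def claim_runner(auxiliary: dict, machine: dict) -> None:
    """Register ``machine`` as the one running the daemon, or refuse."""
    running = auxiliary.get("running_runner")
    if running is not None and running != machine:
        raise RuntimeError(
            f"A runner daemon is registered on {running['hostname']}; "
            "stop it there or reset the reference first."
        )
    auxiliary["running_runner"] = machine


def reset_runner(auxiliary: dict) -> None:
    """Forget the registered machine (the role played by the reset command)."""
    auxiliary["running_runner"] = None
```

Claiming twice from the same machine succeeds, claiming from a different machine
raises an error, and after a reset any machine can claim the daemon again.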

Backoff algorithm
=================

While performing their actions on the Jobs, the ``Runner`` processes may run into
issues. For example, a connection error may occur. In order to avoid overloading
the processes and/or the resources, when such an error occurs the process will not
immediately retry the action, but will wait an increasingly larger amount of time
before retrying. After three failures, the Job will be set to the ``REMOTE_ERROR``
state. See the :ref:`remoteerrors` section for more details.
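
A minimal sketch of such a backoff strategy is given below. It is an illustration
only: the function name, the doubling of the delay, the 30 second base value and the
``"OK"`` placeholder are assumptions, and retrying in a blocking loop is a
simplification of however the ``Runner`` actually schedules the retries. Only the
limit of three failures and the ``REMOTE_ERROR`` state come from the description above.

```python
import time


def run_with_backoff(action, max_failures=3, base_delay=30.0, sleep=time.sleep):
    """Retry ``action`` with growing delays; give up after ``max_failures``.

    Returns a ``(state, value)`` pair, where ``value`` is the result of the
    action on success or the last exception on failure.
    """
    failures = 0
    while True:
        try:
            return "OK", action()  # "OK" is a placeholder, not a real Job state
        except ConnectionError as exc:
            failures += 1
            if failures >= max_failures:
                return "REMOTE_ERROR", exc
            # Assumed schedule: the delay doubles after each failure.
            sleep(base_delay * 2 ** (failures - 1))
```

Passing a custom ``sleep`` callable makes it possible to inspect the delays without
actually waiting, which is convenient when experimenting with the schedule.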


.. _runner direct:

Direct execution
================

It is not the standard usage, but in some cases, for example during development
or debugging, it may be useful to run the ``Runner`` processes directly and not as
a daemon. The simplest way to do that is to run::

    jf runner run

This will start a single ``Runner`` process performing all the actions.
Similarly, it is also possible to execute this from the Python API with the code
below:

.. code-block:: python

    from jobflow_remote.jobs.runner import Runner

    # Replace "xxx" with the name of the project to run.
    Runner(project_name="xxx").run()
3 changes: 2 additions & 1 deletion pyproject.toml
Expand Up @@ -38,14 +38,15 @@ dependencies = [

[project.optional-dependencies]
dev = ["pre-commit>=3.0.0"]
tests = ["docker ~= 7.0", "pytest ~= 8.0", "pytest-cov >= 4,< 6"]
tests = ["docker ~= 7.0", "pytest ~= 8.0", "pytest-cov >= 4,< 6", "pytest-mock ~= 3.14"]
docs = [
"autodoc_pydantic>=2.0.0",
"pydata-sphinx-theme",
"sphinx",
"sphinx-copybutton",
"sphinx_design",
"sphinxcontrib-mermaid",
"sphinxcontrib-typer[html]",
]

[project.scripts]