
👌 IMPROVE: add DaemonClient.lock() #4924

Closed
wants to merge 1 commit into from

Conversation

chrisjsewell
Member

@chrisjsewell chrisjsewell commented May 6, 2021

This is a proposal for part of the solution to #4321: it provides a basic locking mechanism for the daemon, to "ensure" that it cannot be running or started from another process:

In [29]: from aiida.engine.daemon.client import get_daemon_client
    ...: client = get_daemon_client()
    ...: with client.lock():
    ...:     # try to get another client
    ...:     get_daemon_client()
    ...: 
---------------------------------------------------------------------------
DaemonLock                                Traceback (most recent call last)
<ipython-input-29-58ecd1c03adb> in <module>
      3 with client.lock():
      4     # try to get another client
----> 5     get_daemon_client()
      6 

~/Documents/GitHub/aiida-core-origin/aiida/engine/daemon/client.py in get_daemon_client(profile_name)
     61         profile = config.current_profile
     62 
---> 63     return DaemonClient(profile)
     64 
     65 

~/Documents/GitHub/aiida-core-origin/aiida/engine/daemon/client.py in __init__(self, profile)
     87         self._DAEMON_TIMEOUT: int = config.get_option('daemon.timeout')  # pylint: disable=invalid-name
     88         if Path(self.daemon_lock_file).exists():
---> 89             raise DaemonLock(
     90                 f"The '{self.profile.name}' daemon is locked for use. "
     91                 'Use `verdi daemon unlock` if this is unexpected.'

DaemonLock: The quicksetup daemon is locked for use. Use `verdi daemon unlock` if this is unexpected.

cc @giovannipizzi, @sphuber for comment

(if you think it's ok, I'll add tests; I was also thinking of adding verdi daemon unlock, in case for whatever reason the lock file is not removed)

@sphuber
Contributor

sphuber commented May 6, 2021

Concept looks good I think. There is just one edge case that I realized, which I wanted to note before I forget. The check that the "daemon is running" actually checks that the circus daemon PID file exists. However, that does not check that there are daemon worker processes running. Typically, if the circus daemon is not running there shouldn't be daemon workers, but that is not guaranteed: it is perfectly possible to launch a daemon worker outside the circus daemon. For example, verdi devel run_daemon directly runs a daemon worker process, without circus. Since this is under verdi devel we might not want or need to take this into account, but I just wanted to mention it for discussion.

def lock(self):
    if self.is_daemon_running:
        raise RuntimeError(f"Cannot lock the '{self.profile.name}' daemon whilst it is running.")
    Path(self.daemon_lock_file).touch()
Member

I think here you need to avoid concurrent creation of the file succeeding silently, so use exist_ok=False
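
For illustration, a minimal sketch (not part of this diff) of how Path.touch(exist_ok=False) turns concurrent creation into an explicit error; the lock-file path below is a placeholder:

from pathlib import Path

lock_file = Path('/tmp/hypothetical-daemon.lock')  # placeholder path for illustration
lock_file.touch()  # create the lock file (succeeds even if it already exists)
try:
    lock_file.touch(exist_ok=False)  # a second creation attempt now raises instead of silently succeeding
except FileExistsError:
    print('lock already held by another process')
finally:
    lock_file.unlink()  # clean up the illustration file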

@@ -95,6 +102,14 @@ def daemon_name(self) -> str:
        """
        return self._DAEMON_NAME.format(name=self.profile.name)

    @contextmanager
    def lock(self):
        if self.is_daemon_running:
Member

I think you are mixing the "old" check (with the PID file) with the "new" check (with the lock file). However, the logic sounds a bit complex to me, because the check on whether it is locked is actually done in __init__. Shouldn't this function raise if there is already a lock file? I.e. in the line with the touch below, using exist_ok=False, catch that exception and re-raise it as the DaemonLock exception? Maybe I'm just missing how this lock manager would actually be used in practice (I know it's in the commit message, but I'm still not sure how you envision using it in practice).

Member Author

Yeh I can do that as well 👍
The key point is to inhibit the instantiation of any further DaemonClient (from this or any other process), since that is required for both verdi daemon start-circus and verdi devel run_daemon.
Naturally you could bypass this in some way, so it's not a "100% guarantee", but I think this is a 99% solution to provide a good level of assurance that no daemons are running while we clean the repo.

            raise RuntimeError(f"Cannot lock the '{self.profile.name}' daemon whilst it is running.")
        Path(self.daemon_lock_file).touch()
        yield self
        Path(self.daemon_lock_file).unlink()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you actually want to do a:

try:
    ...
    yield self
finally:
    Path(self.daemon_lock_file).unlink()

to ensure the file is removed even if there are exceptions
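
Putting the two review suggestions together, a possible shape for the context manager (a sketch, not the actual diff; DaemonLock is the exception introduced by this PR):

@contextmanager
def lock(self):
    """Lock the daemon for this profile for the duration of the with block."""
    if self.is_daemon_running:
        raise RuntimeError(f"Cannot lock the '{self.profile.name}' daemon whilst it is running.")
    try:
        # exist_ok=False makes a concurrent lock attempt fail loudly instead of silently succeeding
        Path(self.daemon_lock_file).touch(exist_ok=False)
    except FileExistsError:
        raise DaemonLock(f"The '{self.profile.name}' daemon is locked for use.")
    try:
        yield self
    finally:
        # remove the lock file even if the body of the with block raised
        Path(self.daemon_lock_file).unlink()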

Member Author

@chrisjsewell chrisjsewell May 6, 2021

Oh yeah, I was confusing it with pytest fixtures, which have this try/finally logic built in.
I thought that was the case here, but no.

@giovannipizzi
Member

I think this is a 99% solution to provide a good level of assurance that no daemons are running while we clean the repo

I see. However, in principle, we'd also like to avoid having verdi shells running while the repo is cleaned (as well as avoiding two cleaners working at the same time).

What about this different approach instead?

In load_profile

  • This should currently be called by any process that needs to use data from AiiDA, so verdi shells, Jupyter notebooks, daemons, etc.
  • This already calls load_backend_environment, which already does checks, e.g. on the schema version from DbSetting:
    def load_backend_environment(self, profile, validate_schema=True, **kwargs):
        """Load the backend environment.

        :param profile: the profile whose backend environment to load
        :param validate_schema: boolean, if True, validate the schema first before loading the environment.
        :param kwargs: keyword arguments that will be passed on to the backend specific scoped session getter function.
        """
        self._load_backend_environment(**kwargs)
        if validate_schema:
            self.validate_schema(profile)
  • We add another check of a new locked_profile key (whose value is set to the PID of the process that has locked the profile)
    • if the key exists, stop and exit
      • if the value (that is, the PID of the process locking the profile) is a valid PID of a running process, mention in the message the PID of the process that is locking this profile (a minimal sketch of such a PID check follows this list)
      • if there is no process with that PID, maybe mention a verdi-level unlock-daemon command if people are sure? (then implement this command, but also apply the same logic there, to avoid a user just calling the command because some time has passed, which would prevent the lock from working properly)
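
A minimal, POSIX-only sketch of the "is the locking PID still a running process" check mentioned above (os.kill with signal 0 is the usual idiom; this helper is not part of AiiDA):

import os

def pid_is_alive(pid: int) -> bool:
    """Return True if a process with the given PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs error checking without actually sending a signal
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True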

In any process that needs to acquire a lock (verdi clean subcommands etc.)

Call a locking mechanism (with try/finally or a context manager, to be implemented).
The locking mechanism should do the following checks:

  • it should already load the dbenv beforehand, so it does the checks above
  • it should perform a SQL query asking PostgreSQL how many connections there are to the current DB; stop if there is any other connection except itself [1]
  • add (in an atomic way, and failing if a row with the key already exists) a new row in DbSetting, with key locked_profile and value the PID of the current process. This can be achieved by just adding a row to DbSetting, where the table has a uniqueness constraint on the key (locked_profile in this case)
    • if the transaction fails, it means someone else acquired a lock. Print a message (maybe mention the PID) and exit
    • if it goes through, the lock was correctly acquired and no one else could have acquired it or can acquire it in the future; continue
  • there is still the possibility that another process called load_profile/dbenv at the same time. Do the check [1] above on the connections again; if there is some other process, remove the lock and exit with an error message saying that there is some other process (same error as in [1])
  • At this point, it's safe to assume that a single process has locked the profile and no other process can either acquire another lock or use the DB. Start performing the operations
  • Make sure to unlock (remove the DbSetting row) in a try/finally block (or in a context manager) so we reduce the risks that the profile remains locked inadvertently

This should be quite safe to avoid multiple uses from any client. It only remains to check which query to run to accomplish [1], but I'm quite sure a google search will help.
The only risk is that the python process is killed (e.g. with kill -9, SIGSEGV, system reboot) and does not remove the row in DbSetting. But then the check in load_profile will detect that the PID no longer exists and suggest that the user run the verdi-level unlock-daemon command.

Also, in this way, we can keep the current logic for the daemons avoiding to touch that part.
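
For the "add a row in an atomic way, failing if the key already exists" step, a rough psycopg2-level sketch; the table name db_profile_lock and its columns are hypothetical placeholders, not the actual DbSetting schema:

import os
import psycopg2

def try_acquire_profile_lock(conn) -> bool:
    """Try to acquire the profile lock by inserting a uniquely-keyed row; return True on success."""
    try:
        with conn.cursor() as cursor:
            # hypothetical table with a UNIQUE constraint on "key"; in AiiDA this would be the DbSetting row
            cursor.execute(
                "INSERT INTO db_profile_lock (key, val) VALUES ('locked_profile', %s)",
                (str(os.getpid()),),
            )
        conn.commit()
        return True
    except psycopg2.errors.UniqueViolation:
        # another process inserted the row first, so the lock is already held
        conn.rollback()
        return False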

@giovannipizzi
Member

giovannipizzi commented May 7, 2021

For [1] the query could be:

select pid, client_port from pg_stat_activity where datname='PUT_YOUR_DB_NAME_HERE';

This seems to correctly count the number of connections (e.g. 1 per daemon, 1 per verdi shell, and itself from psql).
However, the pid is not the python PID but the PID of the PostgreSQL backend process launched to serve the connection.
Maybe the port can help with checking, but with this query it's not possible to easily check which python process is connected (so as to detect "yourself").
Maybe it's enough (and one just checks that there is only 1 connection?), or maybe there is a way to either get the actual python process ID, or to check (from the python side) the port from which we are connected to postgres (via the psycopg2 library or similar?).

EDIT: at least we can remove "ourselves" from the query using the pg_backend_pid() function that returns the backend PID for the current connection.
To check it:

select pg_backend_pid();

For the query above:

select pid, client_port from pg_stat_activity where datname='PUT_YOUR_DB_NAME_HERE' and pid != pg_backend_pid();

would return a non-empty result if there is anybody else.

This is enough I think: we can tell the user that there are XX other connections, and the PIDs of the backends (even if not the PIDs of the python processes - but we could provide 1. a utility function in AiiDA to check it in the verdi shell, and 2. print it in the logs when a daemon is started). And then we would just put our current PID in the DbSetting table, so load_profile knows which (python) PID to check at load, and in the unlock part.
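
A rough sketch of running check [1] from Python with psycopg2 (connection parameters are placeholders; in AiiDA they would come from the profile configuration):

import psycopg2

# placeholder connection parameters for illustration
conn = psycopg2.connect(dbname='PUT_YOUR_DB_NAME_HERE', user='aiida', host='localhost', port=5432)

with conn.cursor() as cursor:
    # every connection to this database except our own backend
    cursor.execute(
        "SELECT pid, client_port FROM pg_stat_activity "
        "WHERE datname = current_database() AND pid != pg_backend_pid()"
    )
    others = cursor.fetchall()

if others:
    print(f'Refusing to lock: {len(others)} other connection(s) to the database: {others}')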

@dev-zero
Contributor

When I was reading the summary of this I got confused: why would you need a lock file when you already have a pid file, which in the Linux world is usually sufficient to determine whether a service is up? Only when I finished reading did I realize that you need it to make sure no daemon gets started (or is running).
Since a lock (file) is often meant to prevent a task from running more than once at the same time, I would therefore suggest renaming it to something more specific, like inhibitor_lock (as used in the context of inhibiting the start of a screensaver), or maybe call it maintenance_mode (often used in web apps) or profile_lock (rather than a daemon lock).

I'd prefer @giovannipizzi's approach: since we have a DB with OS-portable transactions, it should be preferred. Only change: also add a timestamp (that would come for free in a file-based lock and gives the user more information to decide whether it is safe to manually unlock) and call the command unlock-profile.

@chrisjsewell
Member Author

Yeah, cheers for the feedback @dev-zero; indeed it was very much a rough initial proposal. I'll take the feedback and do a "take 2".

@sphuber
Contributor

sphuber commented Jan 20, 2022

@chrisjsewell I take it this can be closed, since @ramirezfranciscof is implementing the approach that also includes the functionality of tracking access by processes other than the daemon workers.

@sphuber
Contributor

sphuber commented Jan 21, 2022

Superseded by #5270

@sphuber sphuber closed this Jan 21, 2022
@sphuber sphuber deleted the daemon-lock branch January 27, 2022 16:21