👌 IMPROVE: add DaemonClient.lock()
#4924
Conversation
Concept looks good, I think. There is just one edge case I realized I wanted to note before I forget. The check that the "daemon is running" actually checks that the circus daemon PID file exists. However, that doesn't check that there are daemon worker processes running. Typically, if the circus daemon is not running, there shouldn't be daemon workers, but that is not a guarantee: it is perfectly possible to launch a daemon worker outside the circus daemon, for example.
```python
def lock(self):
    if self.is_daemon_running:
        raise RuntimeError(f"Cannot lock the '{self.profile.name}' daemon whilst it is running.")
    Path(self.daemon_lock_file).touch()
```
I think here you need to prevent concurrent creation of the file from succeeding silently, so pass `exist_ok=False`.
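For illustration, a tiny sketch of what `exist_ok=False` buys; the lock file path here is hypothetical, not the real daemon lock file:

```python
import tempfile
from pathlib import Path

# Hypothetical lock file path, for illustration only.
lock_file = Path(tempfile.gettempdir()) / 'hypothetical-daemon.lock'
lock_file.unlink(missing_ok=True)

lock_file.touch(exist_ok=False)  # first creation succeeds
try:
    # A second creation now raises instead of silently succeeding.
    lock_file.touch(exist_ok=False)
    raced = False
except FileExistsError:
    raced = True

lock_file.unlink()
print(raced)  # True: the concurrent-creation case is now detectable
```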
```
@@ -95,6 +102,14 @@ def daemon_name(self) -> str:
        """
        return self._DAEMON_NAME.format(name=self.profile.name)

    @contextmanager
    def lock(self):
        if self.is_daemon_running:
```
I think you are mixing the "old" check with the PID file with the "new" check with the lock file. However, the logic sounds a bit complex to me, because the check on whether it is locked is actually done in the `__init__`. Shouldn't this function raise if there is already a lock file? I.e., in the line with the `touch` below, use `exist_ok=False`, catch that exception and re-raise a `DaemonLock` exception? Maybe I'm just missing how this lock manager would actually be used in practice (I know it's in the commit message, but I'm still not sure how you envision using it in practice).
Yeh, I can do that as well 👍

The key point is to inhibit the instantiation of any further `DaemonClient` (from this or any other process), since this is required for both `verdi daemon start-circus` and `verdi devel run-daemon`.

Naturally you could bypass this in some way, so it's not a "100% guarantee", but I think this is a 99% solution that provides a good level of insurance that no daemons are running while we clean the repo.
```
            raise RuntimeError(f"Cannot lock the '{self.profile.name}' daemon whilst it is running.")
        Path(self.daemon_lock_file).touch()
        yield self
        Path(self.daemon_lock_file).unlink()
```
I think you actually want to do:

```python
try:
    ...
    yield self
finally:
    Path(self.daemon_lock_file).unlink()
```

to ensure the file is removed even if there are exceptions.
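Putting the two suggestions together (`exist_ok=False` on creation, `try/finally` around the `yield`), the manager could be sketched roughly as below; `DaemonClientSketch` and its attributes are stand-ins for illustration, not the actual aiida-core class:

```python
import tempfile
from contextlib import contextmanager
from pathlib import Path


class DaemonClientSketch:
    """Minimal stand-in for DaemonClient; only the locking logic is sketched."""

    def __init__(self, lock_path):
        self.daemon_lock_file = lock_path

    @contextmanager
    def lock(self):
        try:
            # exist_ok=False turns a concurrent lock attempt into an explicit error.
            Path(self.daemon_lock_file).touch(exist_ok=False)
        except FileExistsError:
            raise RuntimeError(f"daemon already locked via '{self.daemon_lock_file}'")
        try:
            yield self
        finally:
            # Guarantee the lock file is removed even if the body raises.
            Path(self.daemon_lock_file).unlink()


client = DaemonClientSketch(Path(tempfile.gettempdir()) / 'sketch-daemon.lock')
Path(client.daemon_lock_file).unlink(missing_ok=True)

with client.lock():
    locked_during = Path(client.daemon_lock_file).exists()
locked_after = Path(client.daemon_lock_file).exists()
print(locked_during, locked_after)  # True False
```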
Oh yeh, I was confusing it with pytest fixtures, which have this try/finally logic built in. Thought that was the case here, but no.
I see. However, in principle, we'd also like to avoid that there are any other connections to the database at all. What, instead, about this different approach? In the following function:
```python
def load_backend_environment(self, profile, validate_schema=True, **kwargs):
    """Load the backend environment.

    :param profile: the profile whose backend environment to load
    :param validate_schema: boolean, if True, validate the schema first before loading the environment.
    :param kwargs: keyword arguments that will be passed on to the backend specific scoped session getter function.
    """
    self._load_backend_environment(**kwargs)

    if validate_schema:
        self.validate_schema(profile)
```
add a check for a `locked_profile` key (whose value is set to the PID of the process that has locked the profile):
- if the key exists, stop and exit
- if the value (that is, the PID of the process locking the profile) is a valid PID of a running process, mention in the message the PID of the process that is locking this profile
- if there is no process with that PID, maybe suggest `verdi devel unlock-daemon` if people are sure? (then implement this command, but also apply the same logic there, to avoid that a user just calls the command because time has passed, which would prevent the lock from working properly)
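The stale-PID part of the checks above might look roughly like this; `check_profile_lock` and its messages are hypothetical, and the `locked_profile` value is passed in directly instead of being read from `DbSetting`:

```python
import os


def is_pid_alive(pid):
    """Return True if a process with this PID exists (POSIX: signal 0 probes without sending)."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_profile_lock(locked_pid):
    """Mimic the proposed load-time check; locked_pid is the stored 'locked_profile' value, or None."""
    if locked_pid is None:
        return 'unlocked'
    if is_pid_alive(locked_pid):
        return f'locked by running process {locked_pid}'
    return f'stale lock from dead process {locked_pid}; an explicit unlock command could be suggested'


print(check_profile_lock(None))         # unlocked
print(check_profile_lock(os.getpid()))  # locked by running process <our PID>
```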
In any process that needs to acquire a lock (`verdi clean` subcommands etc.), call a locking mechanism (with `try/finally` or a context manager, to be implemented). The locking mechanism should do the following checks:
- it should already have loaded the dbenv beforehand, so the checks above have been performed
- it should perform a SQL query asking PostgreSQL how many connections there are to the current DB, and stop if there is any connection other than itself [1]
- add (in an atomic way, failing if a row with the key already exists) a new row in `DbSetting`, with key `locked_profile` and value the PID of the current process. This can be achieved by just adding a row to `DbSetting`, where the table has a uniqueness constraint on the key (`locked_profile` in this case)
  - if the transaction fails, it means someone else acquired a lock: print a message (maybe mentioning the PID) and exit
  - if it goes through, the lock was correctly acquired and no one else could have acquired it or can acquire it in the future, so continue
- there is still the possibility that another process called `load_profile`/`dbenv` at the same time. Do the check [1] above on the processes again; if there is some other process, remove the lock and exit with an error message saying that there is some other process (same error as in [1])
- at this point, it's safe to assume that a single process has locked the profile and that no other process can either acquire another lock or use the DB. Start performing the operations
- make sure to unlock (remove the `DbSetting` row) in a `try/finally` block (or in a context manager) so we reduce the risk that the profile remains locked inadvertently

This should be quite safe against multiple concurrent clients. It only remains to check which query to run to accomplish [1], but I'm quite sure a google search will help.
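The whole acquire/release sequence could be sketched as below, with an in-memory dict standing in for the `DbSetting` table (so the "atomic insert" is only mimicked here, not actually atomic) and a stubbed-out version of query [1]:

```python
import os
from contextlib import contextmanager

# Stand-in for the DbSetting table; inserting an existing key mimics (but does not
# truly provide) the uniqueness-constraint failure described above.
settings = {}


def count_other_connections():
    """Stand-in for the pg_stat_activity query [1]; pretend we are the only connection."""
    return 0


@contextmanager
def profile_lock():
    if count_other_connections():
        raise RuntimeError('other database connections exist')  # check [1]
    if 'locked_profile' in settings:
        raise RuntimeError(f"profile locked by PID {settings['locked_profile']}")
    settings['locked_profile'] = os.getpid()
    try:
        if count_other_connections():  # re-check [1] to catch a racing load_profile
            raise RuntimeError('another process connected while locking')
        yield
    finally:
        settings.pop('locked_profile', None)  # always release, even on error


with profile_lock():
    held = 'locked_profile' in settings
released = 'locked_profile' not in settings
print(held, released)  # True True
```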
The only risk is that the Python process is killed (e.g. with `kill -9`, SIGSEGV, or a system reboot) and does not remove the row in `DbSetting`. But then the check in `load_profile` will detect that the PID no longer exists and suggest that the user run `verdi devel unlock-daemon`.

Also, in this way, we can keep the current logic for the daemons, avoiding touching that part.
For [1] the query could be: `select pid, client_port from pg_stat_activity where datname='PUT_YOUR_DB_NAME_HERE';`. This seems to correctly count the number of connections (e.g. 1 per daemon, 1 per `verdi shell`, and itself).

EDIT: at least we can exclude "ourselves" from the query using `select pg_backend_pid();`. For the query above, `select pid, client_port from pg_stat_activity where datname='PUT_YOUR_DB_NAME_HERE' and pid != pg_backend_pid();` would return non-zero results if there is anybody else. This is enough I think: we can tell the user that there are XX other connections, and the PID of the backends (even if not the PID of the Python process, but we could provide 1. a utility function in AiiDA to check it in `verdi shell`, and 2. print it in the logs when a daemon is started). And then we would just put our current PID in the `DbSetting` table.
When I was reading the summary of this I got confused: why would you need a lock file when you have a database? I'd prefer @giovannipizzi's approach: since we have a DB with OS-portable transactions, it should be preferred. Only change: add a timestamp to it as well (that would be included in a file-based lock file for free and can give the user more info to decide whether it is safe to manually unlock) and call it
Yeh cheers for the feedback @dev-zero, indeed it was very much a rough initial proposal. I'll take the feedback and do a "take 2".
@chrisjsewell I take it this can be closed since @ramirezfranciscof is implementing the approach that also includes the functionality of tracking access by processes other than the daemon workers.
Superseded by #5270
This is a proposal for part of the solution to #4321, in order to provide a basic locking mechanism for the daemon, to "ensure" that it cannot be running or started from another process.

cc @giovannipizzi, @sphuber for comment

(If you think it's ok, I'll add tests; I was also thinking of adding `verdi daemon unlock`, in case for whatever reason the lock file is not removed.)