Initial Feedback #4

chrisjsewell · 2021-09-06T11:20:15Z

chrisjsewell
Sep 6, 2021
Maintainer

The proposal is fully described in the README.md, but essentially a lot less code and complexity (compared to kiwipy+rabbitmq), yet better stability and functionality

Feedback welcome and encouraged (see if you can break it!) 😄

muhrin · 2021-09-06T11:44:42Z

muhrin
Sep 6, 2021

Very cool Chris. I've also been increasingly thinking that moving away from RMQ (if possible) would be a good thing. There is some nice functionality that we get from RMQ but I'm not sure it's worth the complexity, especially in light of the mismatch of use-cases you mention in the README.

So, in your proposed system, what handles:

- the locking of a process (such that it is only being executed by one worker)?
- the unlocking of a process, e.g. when there is a crash, such that a worker will restart working on a process that was unexpectedly interrupted?

7 replies

muhrin Sep 6, 2021

It's hard for me to say without studying your proposal more deeply, but I think you're pretty well versed now in what kiwiPy/RMQ do here. In short, the goals were to create a separation of concerns where the workflow engine could submit tasks which had certain guarantees associated with them (e.g. only one worker working on a task at a time, automatic requeueing of tasks if the worker died for whatever reason).

In the back of my mind was also to keep the door open to usage scenarios such as multiple users per database and multiple daemons running on different clusters pulling jobs from the same database. Naturally, in these kinds of scenarios it's useful to have a central 'router' like RMQ that can either distribute tasks or route actions (e.g. kill, pause) rather than point-to-point comms. I think it's unlikely that AiiDA will evolve in this direction so it's not clear we need this.

chrisjsewell Sep 6, 2021
Maintainer Author

Thanks, food for thought 🤔

As you say, I think we probably should not get too hung up on use-cases that may never come to fruition. That being said, I'm not necessarily convinced a central router could actually achieve this any better, as opposed to distributed routers accessing a central database

muhrin Sep 6, 2021

Yup, no probs.

Actually, that reminds me of another of the original goals. As you say, the DB can play the role of the 'router' using polling, which is kind of how the old workflow system worked (of course, point-to-point comms can still be done by sockets as you do, but this isn't suitable for persistable messages).

Now, @giovannipizzi explicitly didn't want any solution that would cause a CPU load when the daemon was doing nothing which meant that we would have to raise the poll interval leading to a decreased performance for scenarios involving lots of very short jobs (e.g. Seb's CIF cleaners). This is why 'push based' solutions became attractive which typically involve a separate service that also has persistence.

In short, whatever solution is adopted it should try to achieve these two goals simultaneously:

- Low background CPU loads
- Low latency in launching new jobs

chrisjsewell Sep 6, 2021
Maintainer Author

> didn't want any solution that would cause a CPU load when the daemon was doing nothing
> This is why 'push based' solutions became attractive which typically involve a separate service that also has persistence.

Hmm, I'd note the daemon will have lower CPU with a separate service, but then you have to take into account the CPU of that service as well, plus the heartbeat running on all the workers (I guess it's not too much though).
I reckon it would be possible though to utilise the controller here in "push" mode or potentially a hybrid push/pull, e.g.

- When the server starts, it reads the database to get "actionable" processes (as it does currently)
- It then polls the database at a slow rate or even not at all (addressing point 2)
- At the same time, the controller accepts client connection(s) from AiiDA
- When AiiDA submits a process or performs an action (kill/pause/play), it writes it to the database, and then also messages the controller (if running), to inform it to immediately check the database (similar to how it does with RabbitMQ, addressing point 1)
- These messages could/should be throttled, so if you submit 1000s of processes in a short period it would not check the database for all these messages
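
The hybrid push/pull idea above can be sketched roughly as follows. This is only an illustration and not the code in the repository: the `Controller` class, `notify`, `check_database`, `poll_interval` and `min_gap` are all hypothetical names, and the client connection is reduced to a direct method call.

```python
import asyncio


class Controller:
    """Hypothetical controller: slow polling plus throttled "check now" nudges."""

    def __init__(self, check_database, poll_interval=30.0, min_gap=0.5):
        self._check_database = check_database  # callable that scans the DB for actionable processes
        self._poll_interval = poll_interval    # slow background poll, in seconds
        self._min_gap = min_gap                # minimum time between two DB checks (the throttle)
        self._nudge = asyncio.Event()

    def notify(self):
        """Called when a client reports a submit/kill/pause/play; cheap and idempotent."""
        self._nudge.set()

    async def run(self):
        while True:
            self._check_database()  # the first iteration is the read at startup
            # Throttle: nudges arriving during this gap are coalesced into one check.
            await asyncio.sleep(self._min_gap)
            try:
                await asyncio.wait_for(self._nudge.wait(), self._poll_interval)
            except asyncio.TimeoutError:
                pass  # nobody nudged us: fall back to the slow poll
            self._nudge.clear()


async def demo():
    checks = []
    controller = Controller(lambda: checks.append(1), poll_interval=60.0, min_gap=0.05)
    task = asyncio.create_task(controller.run())
    await asyncio.sleep(0.01)
    for _ in range(1000):  # a burst of 1000 submissions
        controller.notify()
    await asyncio.sleep(0.2)
    task.cancel()
    return len(checks)
```

In `demo`, the burst of 1000 notifications results in a single extra database check after the startup read, and the controller is otherwise idle between polls. The controller is constructed inside the running event loop, which avoids loop-binding problems with `asyncio.Event` on older Python versions.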

chrisjsewell Sep 6, 2021
Maintainer Author

See: #5

muhrin · 2021-09-06T11:54:42Z

muhrin
Sep 6, 2021

Sweet, I'll have a closer read.

One thing I thought workers could do for this kind of database-centred approach is just have a thread periodically writing a heartbeat timestamp to the table every x seconds. This way, e.g., a daemon started on a different machine could just assume that anything with a timestamp older than some multiple of x is a stale process.
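
A minimal sketch of such a heartbeat table, using SQLite purely for illustration. The table name `worker_heartbeat`, its columns, and the values of `HEARTBEAT_INTERVAL` and `STALE_MULTIPLE` are assumptions made for this example, not part of the proposal.

```python
import sqlite3
import time

HEARTBEAT_INTERVAL = 5.0  # "x": seconds between heartbeats (assumed value)
STALE_MULTIPLE = 3        # a worker counts as stale after missing this many intervals


def create_table(conn):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS worker_heartbeat "
        "(worker_id TEXT PRIMARY KEY, last_seen REAL NOT NULL)"
    )


def beat(conn, worker_id, now=None):
    """Record that `worker_id` is alive; a worker would call this every x seconds."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO worker_heartbeat (worker_id, last_seen) VALUES (?, ?)",
        (worker_id, now),
    )
    conn.commit()


def stale_workers(conn, now=None):
    """Return the ids of workers whose last heartbeat is older than the cutoff."""
    now = time.time() if now is None else now
    cutoff = now - STALE_MULTIPLE * HEARTBEAT_INTERVAL
    rows = conn.execute(
        "SELECT worker_id FROM worker_heartbeat WHERE last_seen < ? ORDER BY worker_id",
        (cutoff,),
    )
    return [row[0] for row in rows]
```

The `now` parameter exists only so the logic can be exercised deterministically. Across machines, the timestamps would have to come from a shared clock (for example the database server's), otherwise clock skew could mark healthy workers as stale.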

1 reply

chrisjsewell Sep 6, 2021
Maintainer Author

Yeh possibly cheers.
I actually replaced all thread-based code with asyncio in #2; I had a heartbeat between the coordinator (server) and worker (client), but felt that was overcomplicated.

Having a heartbeat write to the database, though, may indeed be a better approach.
I guess for the heartbeat you would have to add threading back, to make the heartbeat more "responsive" than is possible with asyncio (particularly when you have many processes running on the worker).

But yeh, once an "inactive" worker is identified, it would be easy for the coordinator to simply call the circus controller, to kill that process.

ltalirz · 2021-09-06T12:42:47Z

ltalirz
Sep 6, 2021

Thanks a lot @chrisjsewell !
I'll let @muhrin and @sphuber comment on the implementation details.

From my side, reducing complexity is very valuable, especially in the engine code that has proven to be very difficult to debug (e.g. aiidateam/aiida-core#4603), so everything else being equal I would certainly be in favor.

I just wanted to add one minor comment since you mention circus: as one can see from the contributors tab, circus was developed in 2012-2015, with active maintenance until ~2017.
Since then maintenance has been more "barebones"; there have been <100 commits to the code base, some of them by us because circus was not keeping up with the rest of the Python ecosystem. Today, this makes circus a problematic dependency to base our engine on.

I never figured out what happened in 2014/2015, i.e. why the development stopped and whether there was some alternative that people switched to instead. Perhaps we should just ask the creator Tarek Ziade for advice.

1 reply

chrisjsewell Sep 6, 2021
Maintainer Author

Yep cheers, indeed it would be nice if circus could e.g. move from tornado to asyncio.
It would definitely be good to look into this and see if there are any other alternatives (or, if push comes to shove, fork the repo).

That being said, the actual system design, IMO is not problematic, i.e. even if we replaced circus, I don't feel it would really change the SysML model I put forward in the README.
This is the key thing for me, more so than the underlying code

giovannipizzi · 2021-09-07T08:49:13Z

giovannipizzi
Sep 7, 2021

Thanks Chris!
Looks very interesting, and indeed removing one of the server dependencies would be very beneficial to users and developers.

I put here a few questions/comments for discussion - scheduling tasks is always a tricky business, not so much in designing the system for normal usage, but in ensuring all goes well also when things go bad :-D Therefore it would be good that we collectively find a way to debug the system in corner cases, to see how it responds (many of my comments below are written with this in mind).

You don't need to comment point by point (if you don't want), probably the easiest is that you think about it, and then we organise a meeting next week with interested people as, in one hour, I think we can discuss and better understand the design and ask questions live.

- What are the possible ways the system can "break" such that the user is asked to manually "recover"? From the discussion above I see that you mention that there is no heartbeat anymore - is this a potential issue, and what can go wrong?
  - for instance (related to the discussion above with Martin): at the moment there is no heartbeat, if I understand correctly - what happens when a worker crashes? Does the system detect it, or should the user detect it? And is there some guarantee that the process is not scheduled twice (even in these cases)?
- I suggest using PostgreSQL for stress testing - as you say there is only one writer to a SQLite DB, and this will remove concurrency and reduce the throughput, possibly hiding potential issues. I suggest having the option to run with PSQL, and submitting something of the order of 10'000 processes on a few (~4-8) workers, which should be something that AiiDA currently can handle, to see how it manages this.
  - it would actually be ideal to design a similar "test" both for AiiDA (with PSQL and RMQ) and your new system (with PSQL), submitting say 10'000 processes that just log the start time, maybe sleep randomly between 0 and 1 seconds, and log the end time; and you also log the time each process was scheduled, and the time it was considered as completed by the system. So we can check/monitor: 1. the throughput; 2. the latency for each job (e.g. if the sleep is zero, the time between submission and considering it finished); 3. the CPU and memory usage (as Martin said, if there is nothing to be done, the polling should be very inexpensive on CPU and not create infinite loops with 100% CPU; I guess your async.sleep should do this, if the sleep time is tuned to also guarantee some low latency).
- Is there still a concept of "slots" per worker, like now (i.e. how many AiiDA processes a worker can take care of)?
  - related to this: will the system still work as now (i.e., if a process calls a subprocess, does the process still block a slot until the subprocess is completed)?
- From the README, do I understand correctly that (like now) an entry in db_dbnode is created as soon as the node is created, and the entry in db_dbprocess is created at the same time, and the db_dbprocess entry is deleted when the process execution is complete?
- I see you mention some security issues at the end - what would those be, and how can these be alleviated? For me it would be important to at least know that there is a way to run securely, even if not implemented at the very beginning (in particular, so that a different Unix user on the same machine, or even on a different machine, cannot schedule tasks without at least some password protection like the DB password).
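
To make the proposed benchmark concrete, here is a small in-process stand-in for it. It does not touch AiiDA, PSQL or RMQ: the "processes" are plain coroutines, the worker slots are modelled with a semaphore, and all names (`dummy_process`, `run_benchmark`, the record fields) are hypothetical. It only shows which timestamps would be logged and how throughput and latency could be derived from them.

```python
import asyncio
import random
import time


async def dummy_process(pid, slots, records, max_sleep):
    scheduled = time.perf_counter()  # approximates the submission time
    async with slots:                # wait for a free worker slot
        started = time.perf_counter()
        await asyncio.sleep(random.uniform(0, max_sleep))
        ended = time.perf_counter()
    records.append({"pid": pid, "scheduled": scheduled, "started": started, "ended": ended})


async def run_benchmark(n_processes=100, n_slots=8, max_sleep=0.01):
    slots = asyncio.Semaphore(n_slots)
    records = []
    t0 = time.perf_counter()
    await asyncio.gather(
        *(dummy_process(i, slots, records, max_sleep) for i in range(n_processes))
    )
    elapsed = time.perf_counter() - t0
    # Queueing latency: how long each process waited before it started running.
    latencies = [r["started"] - r["scheduled"] for r in records]
    return {
        "completed": len(records),
        "throughput": len(records) / elapsed,  # processes per second
        "mean_latency": sum(latencies) / len(latencies),
        "max_latency": max(latencies),
    }
```

Running `asyncio.run(run_benchmark(10_000, n_slots=8, max_sleep=1.0))` would mirror the numbers suggested above. In a real comparison the same four timestamps would be logged by both engines, and CPU and memory usage would be sampled externally.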

Here are some suggestions for stress testing that I think would need to be done, while many processes are running:

- check what happens if kill -9 is issued on one of the workers
- check what happens if kill -9 is issued on the singleton coordinator
- check what happens if the machine is rebooted without soft stopping the processes
- check what happens with repeated loops of starting/stopping the daemon
- check what happens with a slow machine, e.g. with all RAM used and swapping (maybe one can simulate this with some long sleep? or maybe there is no need to test this, depending on the design that I couldn't check yet)
- check what happens if each process runs a long, CPU-intensive job that does not return control to asyncio (e.g. some numpy calculation) - this is already something that happens in AiiDA (people e.g. doing expensive numpy stuff in a calcfunction) and it would be good to understand if something breaks - especially if we reintroduce the heartbeat, or in general how to detect whether the worker is just slow in calculating, or is dead. [A multithreaded heartbeat can help, but wouldn't solve the problem of long numpy calculations that don't release the GIL; so maybe one can just store the Unix PID of the worker, and check if it's still alive and the correct process to determine if it's alive - if we are OK that the workers should live on the same machine as the singleton process, which to me seems a reasonable limitation.]
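
For the PID-based liveness check mentioned in the last point, a minimal POSIX-only sketch could look like this. The function name `pid_is_alive` is made up for the example, and it assumes the worker and the checker run on the same machine.

```python
import os


def pid_is_alive(pid):
    """Return True if a process with this PID currently exists (POSIX only)."""
    try:
        os.kill(pid, 0)  # signal 0 performs the existence/permission check without sending anything
    except ProcessLookupError:
        return False     # no such process
    except PermissionError:
        return True      # the process exists but belongs to another user
    return True
```

This does not cover the "and the correct process" part: PIDs can be reused, so the stored PID would need to be paired with something like the process start time (available, for example, via `psutil.Process(pid).create_time()`) and compared on each check. Do not use this on Windows, where `os.kill` with an arbitrary signal number terminates the target process.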

Happy to discuss this further!

0 replies

giovannipizzi · 2021-09-07T08:52:42Z

giovannipizzi
Sep 7, 2021

One more thing: for executing "actions" like kill, pause, play, ... -> do we need just a single "place" (i.e. a single column for the row in db_dbprocess), i.e. there is at most one action happening at a given time? Or do we need a "queue" of tasks?
Of course if we are sure that only one at a given time is enough, the design is simpler (e.g. once the action is kill, no other actions can be performed; if it's play or pause and another one is triggered, this replaces the previous one, ...). But we need to be extra careful that e.g. there are no two concurrent requests at the same time (e.g. from two different verdi shells) and one request "eats" the other one (e.g. you kill from shell 1, pause from shell 2 -> in this scenario, in whichever way it's implemented, I think that the intended behaviour should be that eventually the process is killed, and not just paused, i.e. action from shell 2 should not override action from shell 1).
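
One way the single-column variant could guarantee that a kill is never overwritten is a conditional update, so that the check and the write happen in one statement. The sketch below uses SQLite and a hypothetical `process` table with an `action` column as a stand-in for db_dbprocess; it is an illustration of the idea, not the schema in the proposal.

```python
import sqlite3


def create_table(conn):
    conn.execute("CREATE TABLE process (pk INTEGER PRIMARY KEY, action TEXT)")


def request_action(conn, pk, action):
    """Store the pending action unless a 'kill' is already pending.

    Returns True if the action was stored, False if it was rejected.
    """
    if action not in {"kill", "pause", "play"}:
        raise ValueError(f"unknown action: {action}")
    # A single conditional UPDATE: two concurrent clients cannot interleave
    # between the "is a kill pending?" check and the write.
    cursor = conn.execute(
        "UPDATE process SET action = ? "
        "WHERE pk = ? AND (action IS NULL OR action != 'kill')",
        (action, pk),
    )
    conn.commit()
    return cursor.rowcount == 1
```

With this rule, "kill from shell 1, pause from shell 2" ends with `kill` stored regardless of ordering: if the pause arrives second it is rejected, and if it arrives first the kill replaces it. Pause and play simply replace each other.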

0 replies

sphuber · 2021-09-08T15:09:07Z

sphuber
Sep 8, 2021

Thanks @chrisjsewell. I am in principle supportive of this change if it really means that we can remove RabbitMQ as a requirement while maintaining feature, stability and performance parity, or improving on it. If we have a slight downgrade in performance, then that could be excused given the reduced complexity. I think @giovannipizzi and @muhrin have voiced most of the concerns that I had. The only remaining question that sprang to mind was the usage of broadcasting, where processes broadcast state changes. There are some basic usages of this functionality, for example verdi process watch, which allows one to watch a process' state changes, but these are not that important. What is important, though, is how workchains use this functionality to quickly start running again once the child processes they were waiting for reach the terminated state:

https://github.com/aiidateam/aiida-core/blob/706e48ba0a7ae050dddaac915db57aa76f7b984a/aiida/engine/processes/workchains/workchain.py#L320

This feature is crucial when running many workchains that launch short-lived sub processes. If you rely solely on polling, there can be long unnecessary waits. I think that the work I did for the 3DCD would have taken a lot longer (especially the CIF cleaning) if the broadcast system hadn't been there. This would be important to include in the benchmarking.
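
The difference from polling can be illustrated with a small in-process sketch. It is not how aiida-core or kiwipy implement broadcasts: `Broadcaster`, `wait_terminated` and `broadcast_terminated` are hypothetical names, and the "processes" are plain coroutines.

```python
import asyncio


class Broadcaster:
    """Hypothetical in-process stand-in for state-change broadcasts."""

    def __init__(self):
        self._waiters = {}  # child pk -> futures to resolve when that child terminates

    def wait_terminated(self, pk):
        future = asyncio.get_running_loop().create_future()
        self._waiters.setdefault(pk, []).append(future)
        return future

    def broadcast_terminated(self, pk):
        for future in self._waiters.pop(pk, []):
            if not future.done():
                future.set_result(pk)


async def child(pk, broadcaster, duration):
    await asyncio.sleep(duration)  # the short-lived sub process
    broadcaster.broadcast_terminated(pk)


async def parent(broadcaster, child_pks):
    # Resumes as soon as the last child terminates; there is no poll interval to wait out.
    return await asyncio.gather(*(broadcaster.wait_terminated(pk) for pk in child_pks))


async def demo():
    broadcaster = Broadcaster()
    waiting = asyncio.create_task(parent(broadcaster, [1, 2, 3]))
    await asyncio.sleep(0)  # let the parent register its waiters first
    await asyncio.gather(*(child(pk, broadcaster, 0.01 * pk) for pk in (3, 1, 2)))
    return await waiting
```

Note that a broadcast sent before the parent has registered is simply lost in this sketch, so a real system would still need to check the persisted state after subscribing (or keep a slow poll as a fallback).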

0 replies
