Initial Feedback #4
Replies: 6 comments 9 replies
-
Very cool Chris. I've also been increasingly thinking that moving away from RMQ (if possible) would be a good thing. There is some nice functionality that we get from RMQ but I'm not sure it's worth the complexity, especially in light of the mismatch of use-cases you mention in the README. So, in your proposed system what handles
?
-
Sweet, I'll have a closer read. One thing I thought workers could do for this kind of database-centred approach is just have a thread periodically write a heartbeat timestamp to the table every x seconds. This way, e.g., a daemon started on a different machine could just assume that anything with a timestamp older than some multiple of x is a stale process.
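To make the heartbeat idea concrete, here is a minimal sketch, assuming an SQLite database and a hypothetical `worker` table (the table name, column names, and the factor of 3 are illustrative assumptions, not part of the proposal). Each worker runs `heartbeat_loop` in a background thread; any other process can call `stale_workers` to list workers whose last heartbeat is older than some multiple of the interval.

```python
import sqlite3
import threading
import time

STALE_FACTOR = 3  # assumed multiple of the interval "x" after which a worker counts as stale


def init_db(conn):
    # Hypothetical table: one row per worker, holding its last heartbeat time.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS worker ("
        "id TEXT PRIMARY KEY, heartbeat REAL NOT NULL)"
    )
    conn.commit()


def write_heartbeat(conn, worker_id, now=None):
    # Insert the row on first call, overwrite the timestamp afterwards.
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO worker (id, heartbeat) VALUES (?, ?)",
        (worker_id, now),
    )
    conn.commit()


def stale_workers(conn, interval, factor=STALE_FACTOR, now=None):
    # A worker is stale if its heartbeat is older than factor * interval.
    now = time.time() if now is None else now
    cutoff = now - factor * interval
    rows = conn.execute(
        "SELECT id FROM worker WHERE heartbeat < ? ORDER BY id", (cutoff,)
    )
    return [row[0] for row in rows]


def heartbeat_loop(db_path, worker_id, interval, stop_event):
    # Runs in a background thread; the thread opens its own connection
    # because sqlite3 connections are not shared across threads by default.
    conn = sqlite3.connect(db_path)
    try:
        while not stop_event.is_set():
            write_heartbeat(conn, worker_id)
            stop_event.wait(interval)  # sleeps, but wakes early on stop
    finally:
        conn.close()
```

A worker would start the loop with something like `threading.Thread(target=heartbeat_loop, args=(path, "worker-1", 5.0, stop), daemon=True).start()`, where `stop` is a `threading.Event`. Note that comparing wall-clock timestamps across machines assumes their clocks are reasonably synchronised.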
-
Thanks a lot @chrisjsewell! From my side, reducing complexity is very valuable, especially in the engine code, which has proven to be very difficult to debug (e.g. aiidateam/aiida-core#4603), so everything else being equal I would certainly be in favor. I just wanted to add one minor comment since you mention I never figured out what happened in 2014/2015, i.e. why the development stopped and whether there was some alternative that people switched to instead. Perhaps we should just ask the creator Tarek Ziade for advice.
-
Thanks Chris! I put here a few questions/comments for discussion. Scheduling tasks is always a tricky business, not so much in designing the system for normal usage, but in ensuring that all goes well when things go bad :-D Therefore it would be good for us to collectively find a way to debug the system in corner cases, to see how it responds (many of my comments below are written with this in mind). You don't need to comment point by point (if you don't want to); probably the easiest is that you think about it, and then we organise a meeting next week with interested people, since in one hour I think we can discuss and better understand the design and ask questions live.
Here are some suggestions for stress testing that I think would need to be done while many processes are running:
Happy to discuss this further!
-
One more thing: for executing "actions" like kill, pause, play, etc., do we need just a single "place" (i.e. a single column for the row in db_dbprocess), meaning there is at most one action pending at a given time? Or do we need a "queue" of tasks?
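For discussion, the "queue" option could be sketched as follows, assuming SQLite and a hypothetical `process_action` table (table and function names are illustrative only, not taken from the proposal). With a single column on the process row, a second request would overwrite the first; with a separate table, requests are kept and consumed in insertion order.

```python
import sqlite3


def init_db(conn):
    # Hypothetical queue table: one row per requested action.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS process_action ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "process_pk INTEGER NOT NULL, "
        "action TEXT NOT NULL CHECK (action IN ('kill', 'pause', 'play')))"
    )
    conn.commit()


def request_action(conn, process_pk, action):
    # Called by a client (e.g. a CLI command) to enqueue an action.
    conn.execute(
        "INSERT INTO process_action (process_pk, action) VALUES (?, ?)",
        (process_pk, action),
    )
    conn.commit()


def pop_next_action(conn, process_pk):
    # Called by the worker running the process: take the oldest pending
    # action, or return None if nothing is queued. This sketch assumes a
    # single consumer per process, so no extra locking is done.
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, action FROM process_action "
            "WHERE process_pk = ? ORDER BY id LIMIT 1",
            (process_pk,),
        ).fetchone()
        if row is None:
            return None
        conn.execute("DELETE FROM process_action WHERE id = ?", (row[0],))
        return row[1]
```

Whether the extra table is worth it depends on whether two actions can meaningfully be pending at once, e.g. a pause followed by a kill before the worker has handled the pause.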
-
Thanks @chrisjsewell. I am in principle supportive of this change if it really means that we can remove RabbitMQ as a requirement while maintaining feature, stability and performance parity, or improving on it. If we have a slight downgrade in performance, then that could be excused given the reduced complexity. I think @giovannipizzi and @muhrin have voiced most of the concerns that I had. The only remaining question that sprang to mind was the usage of broadcasting, where processes broadcast state changes. There are some basic usages of this functionality, for example This feature is crucial when running many workchains that launch sub-processes that are short-lived. If you rely solely on polling, there can be long unnecessary waits. I think that the work I did for the 3DCD would have taken a lot longer (especially the CIF cleaning) if the broadcast system hadn't been there. This would be important to include in the benchmarking.
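A rough back-of-the-envelope model of the polling concern, as a sketch only: it assumes a parent that launches its sub-processes one after another, starts polling when each child is launched, and only notices a finished child at the next poll. The function names and all numbers are illustrative assumptions, not measurements of AiiDA.

```python
import math


def chain_runtime_polling(n_children, child_duration, poll_interval):
    # The parent notices each child at the first poll at or after it
    # finishes, so every child costs a whole number of poll intervals.
    polls_needed = math.ceil(child_duration / poll_interval)
    return n_children * polls_needed * poll_interval


def chain_runtime_broadcast(n_children, child_duration):
    # With a state-change broadcast the parent continues as soon as the
    # child terminates (notification latency is ignored here).
    return n_children * child_duration


# Illustrative numbers: 100 sequential children of 0.5 s each,
# with a 10 s polling interval.
polling_total = chain_runtime_polling(100, 0.5, 10.0)
broadcast_total = chain_runtime_broadcast(100, 0.5)
print(polling_total, broadcast_total)
```

Under these assumptions the polling variant spends 1000 s on work that takes 50 s with broadcasts, which is why short-lived sub-processes would be a useful case to include in the benchmarks.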
-
The proposal is fully described in the README.md, but essentially it means a lot less code and complexity (compared to kiwipy+rabbitmq), yet better stability and functionality.
Feedback welcome and encouraged (see if you can break it!) 😄