
AEP: Replace RabbitMQ #30

Open · wants to merge 3 commits into base: master

Conversation

@muhrin commented Dec 9, 2021

  • Used AEP template from AEP 0
  • Status is submitted
  • Added type & status labels to PR
  • Added AEP to README.md
  • Provided github handles for authors

@chrisjsewell (Member)

Cheers @muhrin, very helpful, will read through in due course.
Also just to link to the documentation and discussion I've recently had on this topic:

@giovannipizzi (Member)

Thanks! Looks good. I'm not sure, however, that the requirement "The daemon running processes and the user submitting them should be able to be on separate machines with communication handled by standard protocols (e.g. TCP/IP)" is really needed; I would drop it. I don't know if this is what you mention in your AEP, since the numbers are different (you discuss dropping requirement 5, but I see a different number), but from the content it seems we're discussing the same thing?

@giovannipizzi (Member)

(sorry, closed by accident)

@muhrin (Author) commented Dec 9, 2021

Thanks @giovannipizzi, you're quite right. This point isn't explained very well; I'll rewrite it tomorrow, starting from the use case I had in mind many years ago when we started this process. Basically, I would like it to be possible to have, say, a group AiiDA instance with many users concurrently interacting with the same AiiDA DB.

2. Remote Procedure Calls (e.g. asking a process to pause/play/kill)
3. Broadcasts (e.g. a parent listening for the termination of a child)
1. An AiiDA process should only ever be run by one worker at a time
2. An error can lead to an AiiDA process never running, but should not lead to the process being executed by multiple workers simultaneously
@chrisjsewell (Member) commented Dec 10, 2021

This seems very lenient, no?
For sure, it is a problem that more than one person has encountered with the current system: RabbitMQ loses the task and the ProcessNode is now in an unterminated state, with a checkpoint but nothing trying to run it, and with no way to know whether RabbitMQ still has the task or not (e.g. verdi process list will tell you it is running, but that is incorrect).
Then, if you gamble that there is indeed no current task, you have to run some API "hacks" to submit a new continue task.

Would it then be a change of requirements that there should indeed be a guarantee that processes are run to termination?
Or at least that the task queue is easily introspectable, and that there is a front-end mechanism in AiiDA to start a new continue task on an unterminated ProcessNode that no longer has an associated task.

Contributor:

I haven't had the time to go through the entire proposal yet, but I will just say that I agree with this observation. Not being able to easily introspect the state of the task queue has been a great source of problems; it made things very difficult to debug, even with access to the machine. At the very least this behaviour should be made very clear if we cannot guarantee that it will not occur; of course, guaranteeing that it cannot occur would always be best.

@muhrin (Author):

@chrisjsewell, yes, this isn't very clear. This point was meant to address a binary decision that must be made by any concurrent system. In such systems one can only guarantee one or the other of the following:

  1. A task will run at least once, but possibly more times. This is the most appropriate choice for systems whose tasks are idempotent and where the resources consumed by a task are not a concern. Clearly, this choice is not a good fit for AiiDA.
  2. A task will run zero or one times. This is pretty much the only choice that makes sense for AiiDA, hence the wording of requirement 2.

All this said, this requirement wasn't meant to imply that we shouldn't do everything in our power to keep the probability that a task runs zero times low (and in fact, this is exactly one of the things that RMQ is meant to do).

As for making it easy to resubmit a task that failed to run at all, I think we all agree that this should be a high priority.
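The binary decision above can be sketched in a few lines. This toy in-memory broker (all names here are hypothetical; this is not kiwipy or RabbitMQ code) shows how the guarantee follows purely from when the task is acknowledged relative to when it is run:

```python
import queue


class Broker:
    """Toy in-memory task queue that redelivers unacknowledged tasks."""

    def __init__(self):
        self._tasks = queue.Queue()

    def publish(self, task):
        self._tasks.put(task)

    def get(self):
        return self._tasks.get_nowait()

    def nack(self, task):
        # Negative acknowledgement: requeue the task for redelivery.
        self._tasks.put(task)


def run_at_most_once(broker, run):
    # The task is removed (implicitly acknowledged) *before* running:
    # a crash mid-run means zero executions, but never a duplicate.
    task = broker.get()
    try:
        run(task)
    except Exception:
        pass  # the task is gone for good


def run_at_least_once(broker, run):
    # The task is acknowledged only *after* a successful run: a failure
    # triggers redelivery, so the task may execute more than once.
    task = broker.get()
    try:
        run(task)
    except Exception:
        broker.nack(task)
```

With `run_at_least_once`, a worker that finishes `run(task)` but crashes before acknowledging would cause a second execution, which is exactly why requirement 2 insists on the zero-or-one semantics for AiiDA processes.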

@muhrin (Author):

Added a clarification to the SI of the document.

1. Fixed requirements numbering
2. Clarified point on requirement related to race conditions (processes should run zero or one times)
3. Expanded on requirements for the running of long and short jobs
4. Removed incorrect statement claiming that we would need to drop requirement 6 to support using the PID to detect a dead worker

Using Chris' prototype as a guide, I've extracted the high-level features that we would need to implement to be able to support a RMQ replacement.
@sphuber sphuber changed the title Proposal to replace RabbitMQ AEP: Replace RabbitMQ Dec 16, 2021
@sphuber sphuber added type/standard Standard Track AEP and removed type/process Process AEP labels Dec 16, 2021
Comment on lines +45 to +47
### Proposed replacement

This section will evolve into an actionable solution but, for now, will be used to flesh out and discuss a solution.
Contributor:

I think that for a balanced and more complete discussion, we should explicitly consider alternative solutions in addition to Chris' pure-Python prototype. I understand that needing this service comes with a lot of issues in itself, as outlined in the introduction, but I would like to be able to say with confidence that the problems to be addressed by the removal of RabbitMQ could not also be addressed by:

  1. Using RabbitMQ differently. Is it an absolute requirement that RabbitMQ manages the complete state of the process over its lifetime? Or could we also just push events like the "request to start a process", "a change in process state", and so on and then have more lightweight process handlers react to that?
  2. Would a switch to something like Apache Kafka (or a pure-Python alternative like faust-streaming) be a suitable alternative?

Especially using something like Kafka would potentially give us some avenues for network-distributed and multi-user support. It is possible (maybe even probable) that those are not the right models, but I think we should make sure that we can say that with confidence before venturing on this major architecture change.

Contributor:

I think you raise a valid point @csadorf: this change should certainly not be taken too lightly, and so it merits having other potential solutions investigated.

Just wanted to quickly give some feedback on your point 1. I think we currently already use RabbitMQ in as light a manner as possible. What happens is exactly as you describe: when you submit a process, a request to do so is sent to RabbitMQ, which then sends this "task" to a worker. As part of its guarantee that each task will at some point be executed, RabbitMQ waits for the worker to confirm that the task is done; if the heartbeat is missed (and so it has to assume the worker died), it will at some point send the task to another worker. This is where the problem lies, though: RabbitMQ was designed for short-running tasks (on the order of seconds or minutes), not the tasks typically run with AiiDA (hours or days). This can be configured, but not out of the box, which is the reason we are considering replacing it. I don't think we can get around this feature of RabbitMQ, though, as the whole reason we selected it in the first place was its guarantee of eventually completing the tasks it receives, and its persistence.
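As a concrete example of the "can be configured, but not out of the box" remark: the acknowledgement timeout can be raised in `rabbitmq.conf`, assuming RabbitMQ 3.8.15 or later, where the `consumer_timeout` key is available (the value shown is illustrative, not a recommendation):

```ini
# rabbitmq.conf: raise the delivery acknowledgement timeout so that
# long-running AiiDA tasks are not redelivered to another worker while
# the original worker is still alive. Value is in milliseconds.
consumer_timeout = 3600000000
```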

Contributor:

So what we really need is a robust queuing system, not a message broker, correct?

Contributor:

At least for that part, but we also use the message-broker part. As a process changes state, we emit a broadcast. This is very useful as it makes the system event-based instead of polling-based. For example, a parent process that is waiting for its child process to finish will be notified immediately when this happens and can continue running. With a polling system you always have a delay, and making this delay too short tends to overload the system unnecessarily.
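A toy sketch of the difference (class and method names here are hypothetical, not aiida-core's API): with a broadcast the parent's callback fires the instant the child terminates, while a polling parent only notices on its next check and so wastes up to one full interval per child:

```python
import time


class ProcessNode:
    """Toy process node that broadcasts its own termination."""

    def __init__(self):
        self._listeners = []
        self.terminated = False

    def on_terminated(self, callback):
        # Event-based: the interested party registers once and is pushed to.
        self._listeners.append(callback)

    def terminate(self):
        self.terminated = True
        for callback in self._listeners:
            callback(self)  # parents are notified immediately


def poll_until_terminated(node, interval=0.05, timeout=1.0):
    # Polling alternative: the parent repeatedly checks, sleeping between
    # checks, so it reacts with an average delay of interval / 2.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if node.terminated:
            return True
        time.sleep(interval)
    return False
```

Shrinking `interval` reduces the reaction delay but multiplies the number of wasted checks across all waiting parents, which is the overload trade-off mentioned above.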

Contributor:

I am not even remotely convinced that this is the right solution, but trying to understand: in principle we could push a request for a process to be executed onto a Kafka topic (event log), which could then be picked up by a worker. The same topic could then also be used to communicate back any state changes.

@muhrin (Author):

I do like the sound of the Kafka architecture, but conceptually, for this use case, we wouldn't really be using it in its most powerful mode (as an event log). I did once, mostly jokingly, propose rewriting the AiiDA DAG in Kafka, which would be a good fit for an event store.

Anyway, to the matter at hand: I think Kafka could work, but there is a pretty large advantage to not requiring the installation of a system-level service (which I believe Kafka is). I don't know if faust-streaming would get around this.

Essentially, whatever solution we adopt there will need to be at least two components:

  1. A mechanism to handle incoming messages (through polling or events), and,
  2. A persistent store of messages, ideally with some guarantees.

I played around last year with using xmlrpc (a standard-library Python module) for the event-based messaging part (it was pretty easy to create a kiwipy-compatible communicator), but this would still require the storage component, which in Chris' proposal comes in the form of additional database tables. Personally, I'm not so keen for AiiDA to take responsibility for the storage of events, but then I haven't seen any persistent solutions that don't rely on an OS-level service.
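As an illustration of the xmlrpc experiment described above (the method names, port, and `start_worker` helper are made up for this sketch; this is not the kiwipy communicator), a worker can expose pause-style RPCs over plain TCP/IP using only the standard library:

```python
import threading
from xmlrpc.client import ServerProxy
from xmlrpc.server import SimpleXMLRPCServer


def start_worker(host="127.0.0.1", port=8899):
    """Expose a toy worker's RPC endpoints on a background thread."""
    server = SimpleXMLRPCServer((host, port), allow_none=True, logRequests=False)
    state = {"paused": False}
    # dict.update returns None, so the `or` yields the confirmation string.
    server.register_function(lambda: state.update(paused=True) or "paused", "pause")
    server.register_function(lambda: state["paused"], "is_paused")
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server


# A client (e.g. a verdi-like front end) then issues RPCs over HTTP/TCP:
# client = ServerProxy("http://127.0.0.1:8899")
# client.pause()
```

This covers only component 1 of the two listed above (handling incoming messages); the persistent message store, component 2, is exactly the part xmlrpc gives you nothing for.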

Member:

> to not requiring the installation of a system level service

Yeah, I'm very keen to avoid swapping out one "user complication" for another 😬
I'll have a look at faust, though.

Just to note here: I'm going to be delving into the internals of RabbitMQ a bit more in the coming month(s), to understand it more thoroughly.
Crudely speaking, though, it's not magic; it is a daemon service that uses a database to persist data and a JSON-based communication protocol (with additional caching features etc.: https://blog.rabbitmq.com/posts/2015/10/new-credit-flow-settings-on-rabbitmq-3-5-5).
With the AiiDA daemon we already have a daemon service that uses a database to persist data, plus we also have direct control over what processes are running etc., which RabbitMQ does not, and I feel we are duplicating effort to some extent.

Member:

(I agree that a push model would be better than polling though)

Member:

Here's a nice summary: https://hvalls.dev/posts/polling-long-polling-push, which specifically references:

  • RabbitMQ uses a push model, which is appropriate when you have limited clients/consumers and messages
  • Kafka is an example of technology implementing a long polling model
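The short-poll versus long-poll distinction from that summary can be mimicked with nothing but a `queue.Queue` (a toy stand-in for the broker, not RabbitMQ or Kafka code); the push model RabbitMQ uses would instead have the broker invoke a consumer callback itself:

```python
import queue


def short_poll(q):
    # Plain polling: return immediately, possibly empty-handed, so the
    # consumer must keep re-checking on its own schedule.
    try:
        return q.get_nowait()
    except queue.Empty:
        return None


def long_poll(q, timeout=0.5):
    # Long polling (Kafka-style consumer fetch): the call blocks until a
    # message arrives or the timeout elapses, avoiding busy re-checking.
    try:
        return q.get(timeout=timeout)
    except queue.Empty:
        return None
```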

@csadorf (Contributor) commented Feb 24, 2022

> Essentially, whatever solution we adopt there will need to be at least two components:
>
>   1. A mechanism to handle incoming messages (through polling or events), and,
>   2. A persistent store of messages, ideally with some guarantees.

Unless I completely misunderstand the scope and mission of Kafka, that is exactly its purpose. So it sounds to me like https://github.com/faust-streaming might indeed be important to evaluate, not only with respect to its fit as a technical solution, but also with respect to its sustainability (the project was abandoned by its original creator, Robinhood, and is currently maintained as a community fork).

Edit: Just wanted to mention that apparently my browser spaced out and didn't show @chrisjsewell 's last two comments before mine, so I did not take those into account for this comment.

@unkcpz (Member) commented Nov 11, 2024

Hi all,
I opened a new AEP describing how I think this can be implemented at #42; please have a look and give your feedback.

7 participants