RabbitMQ debug tools #5718

sphuber · 2022-10-24T22:30:21Z

No description provided.

codecov · 2022-10-27T10:09:03Z

Codecov Report

Base: 79.84% // Head: 79.66% // Decreases project coverage by -0.19% ⚠️

Coverage data is based on head (b3ea08f) compared to base (9a9440b).
Patch coverage: 74.53% of modified lines in pull request are covered.

❗ Current head b3ea08f differs from pull request most recent head 6e28f08. Consider uploading reports for the commit 6e28f08 to get more accurate results

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5718      +/-   ##
==========================================
- Coverage   79.84%   79.66%   -0.18%     
==========================================
  Files         491      496       +5     
  Lines       36760    36873     +113     
==========================================
+ Hits        29348    29370      +22     
- Misses       7412     7503      +91

Impacted Files	Coverage Δ
aiida/cmdline/commands/cmd_devel.py	`48.28% <ø> (-19.58%)`	⬇️
aiida/engine/processes/control.py	`69.50% <0.00%> (-0.25%)`	⬇️
aiida/manage/manager.py	`87.75% <33.34%> (-0.49%)`	⬇️
aiida/cmdline/commands/cmd_rabbitmq.py	`65.12% <65.12%> (ø)`
aiida/manage/external/rmq/utils.py	`91.67% <91.67%> (ø)`
aiida/manage/external/rmq/client.py	`93.94% <93.94%> (ø)`
aiida/cmdline/utils/decorators.py	`91.67% <100.00%> (+1.51%)`	⬆️
aiida/manage/external/rmq/__init__.py	`100.00% <100.00%> (ø)`
aiida/manage/external/rmq/defaults.py	`100.00% <100.00%> (ø)`
aiida/manage/external/rmq/launcher.py	`74.14% <100.00%> (ø)`
... and 15 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

sphuber · 2022-11-03T21:11:11Z

Anyone want to review this? @chrisjsewell @ramirezfranciscof @ltalirz @unkcpz ? I would like to get this in 2.1 as it will help users massively with debugging and fixing RabbitMQ issues.

ltalirz · 2022-11-04T10:32:40Z

Let me have a look

ltalirz

thanks @sphuber !

I clearly see the usefulness of these tools in helping debug issues, so this is a great addition.

Just some minor questions about where the code should go.
I've tested it locally and it worked fine for me; I have not read line by line through all of the code you added - if there is something in particular you'd like me to look at, let me know.

aiida/cmdline/commands/cmd_rabbitmq.py

ltalirz · 2022-11-04T10:30:11Z

aiida/cmdline/commands/cmd_rabbitmq.py

+
+@wrapt.decorator
+@click.pass_context
+def with_client(ctx, wrapped, _, args, kwargs):


a significant portion of the code in this file seems quite independent of aiida and could possibly fit inside, say, kiwipy as well.

I'm not saying this has to happen, just worth consideration.
We may also want to comment in the module docstring in why these tools are needed, and under which circumstances we may want to remove them again (if that is the case).

Fair, except for the tasks analysis command though, that is purely AiiDA specific. But you are right, the rest could be abstracted (currently depends tightly on AiiDA's Manager to get the communicator) and be put in kiwipy. For me would be fine to move it there at some point if we find it is useful more generically. I had a quick search and there are a few Python libraries out there that provide an interface to the HTTP API, but they are not well maintained.

environment.yml

aiida/cmdline/commands/cmd_rabbitmq.py

aiida/manage/external/rmq/client.py

ltalirz · 2022-11-04T10:50:09Z

P.S. I happened to not have any processes running (and the list of rabbitmq tasks is empty), but I still had two queues

$ verdi devel rabbitmq queues list
Name                                                    Messages  State
----------------------------------------------------  ----------  -------
aiida-9e50e1e81b3c4dccadabde65a8a745bc.process.queue           0  running
aiida-e86ff3e139f3443a959445e9444f2e3f.process.queue           0  running

I guess this points to some issue?

Are these scoped by profile?

unkcpz · 2022-11-04T11:56:14Z

I am also interested in reviewing this.

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.
Then I cannot start the management plugin. I using the docker to start the rabbitmq as instructed here. I start the plugin with the instruction, the "Enabling plugins" section but still get the error "Critical: Could not connect to the management API.", any idea how to start it?

sphuber · 2022-11-04T12:19:14Z

I guess this points to some issue?

No issue here. These are the process task queues and they are declared as being persistent, even if they are empty. Since there is just one per profile, the overhead of keeping them is a lot less then deleting and recreating it each time it is emptied.

Are these scoped by profile?

Yes they are, the UUID you see is the profile UUID.

sphuber · 2022-11-04T12:22:42Z

I am also interested in reviewing this.

That'd be great, thanks!

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.

I think I added a bug by adding a check on the lower bound, but I inverted the condition. Will fix it now.

Then I cannot start the management plugin. I using the docker to start the rabbitmq as instructed here. I start the plugin with the instruction, the "Enabling plugins" section but still get the error "Critical: Could not connect to the management API.", any idea how to start it?

The HTTP API is served on port 15672, is that accessible from outside the container? Are you forwarding it just like the 5672 port?

sphuber · 2022-11-04T12:40:02Z

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.

This should now be fixed. I have also added explicit tests for this.

sphuber · 2022-11-04T12:43:15Z

@ltalirz thanks for the review. I have addressed your comments. I agree that some of the code is generic enough to be moved to a separate library (kiwipy or a standalone one) but don't want to do this now if you don't mind. I don't need you to go through the whole code, but it would be good if you could sign off that you think it is ok to drop support for RabbitMQ 3.5 and older.

unkcpz · 2022-11-04T12:48:30Z

The HTTP API is served on port 15672, is that accessible from outside the container? Are you forwarding it just like the 5672 port?

Ah, true, I forget to forward this port. It works now. Thanks!

ltalirz · 2022-11-04T12:48:54Z

I guess this points to some issue?

No issue here. These are the process task queues and they are declared as being persistent, even if they are empty. Since there is just one per profile, the overhead of keeping them is a lot less then deleting and recreating it each time it is emptied.

Would it be possible to add a first column with the profile name that corresponds to the UUID?
Most users probably never see the profile UUID and so they will anyhow need to look it up in order to understand what the output of the command means.

ltalirz · 2022-11-04T12:51:36Z

@ltalirz thanks for the review. I have addressed your comments. I agree that some of the code is generic enough to be moved to a separate library (kiwipy or a standalone one) but don't want to do this now if you don't mind.

Fine

I don't need you to go through the whole code, but it would be good if you could sign off that you think it is ok to drop support for RabbitMQ 3.5 and older.

Yes. I haven't seen anyone using rabbitmq <3.6 for a very long time and its EOL was in 2016.

sphuber · 2022-11-04T17:11:10Z

Would it be possible to add a first column with the profile name that corresponds to the UUID?

Possible yes, but not very straightforward. When running normally, these aren't the only queues that there will be, plus RabbitMQ might be used for something other than AiiDA that creates queues that have nothing to do with AiiDA. So for those a profile wouldn't even make sense. I don't think this command will be for normal users anyway. These will at best be for us at developers, so don't think we should tightly couple the output to AiiDA profiles.

unkcpz

@sphuber Thanks! This is a very useful feature. I did not review it in detail but just give it a try with some questions I have in mind.

If I understand correctly, the devel rabbitmq tasks analyze --fix will run the revive_processces and what the reviev_processes did is to requeue the process?
I run it analyze --fix for the lost process (the workchain process that has child process killed by ctrl+C), when I run analyze it again they disappear but still staled process in verdi process list, where is the process list read from, do this command need to take care of clean it?

unkcpz · 2022-11-04T16:40:09Z

.github/workflows/ci-code.yml

@@ -66,9 +66,10 @@ jobs:
        ports:
          - 5432:5432
      rabbitmq:
-        image: rabbitmq:latest
+        image: rabbitmq:3.8-management


3.8-management points to 3.8.34, there is the tag 3.8.14-management can be used here. Same for image tags below.

Ah great, will change that then, thanks

unkcpz · 2022-11-04T17:01:20Z

tests/cmdline/commands/test_rabbitmq.py

+    # Need to manually stop the daemon such that it cleans all state related to the submitted processes. It is not quite
+    # understood how, but leaving the daemon running, a following test that will submit to the daemon may hit an
+    # an exception because the node submitted here would no longer exist and the daemon would try to operate on it.
+    started_daemon_client.stop_daemon(wait=True)


If understand correctly, this stops the daemon the second time since stop_daemon is already called from daemon_client fixture of started_daemon_client?

Hmm, I think it might indeed no longer be necessary. When I introduced the daemon_client fixture, it was originally session-scoped, just for efficiency to only stop it once at the end of the session. But I think exactly for this reason I decided to make it function scoped because it can cause side effects between tests. I can go back and remove it and see if the tests run. They probably will now.

aiida/manage/external/rmq/client.py

sphuber · 2022-11-04T17:50:59Z

If I understand correctly, the devel rabbitmq tasks analyze --fix will run the revive_processces and what the reviev_processes did is to requeue the process?

Yes, revive_processes simply send a new task to the process queue for each of the processes it gets.

I run it analyze --fix for the lost process (the workchain process that has child process killed by ctrl+C), when I run analyze it again they disappear but still staled process in verdi process list

Not sure I understand what you mean. The command is not supposed to clean up killed processes. When you say there is still a stale process in verdi process list based on what do you say that?

where is the process list read from, do this command need to take care of clean it?

The analyze command gets the active process tasks from the process queue by reading all the tasks it has in it (each task contains the process id it corresponds to). Then it gets the "active" processes from the database, by querying for all process nodes that have a process state in waiting, created and running. Then it does a cross-correlation and check the difference between these sets. Any tasks that do not have a corresponding active node in the database get discarded. Any active process nodes in the database that don't have a corresponding task, get their task recreated ("revived").

unkcpz · 2022-11-04T18:01:46Z

The command is not supposed to clean up killed processes. When you say there is still a stale process in verdi process list based on what do you say that?

It is not the killed processes, they are still all in Waiting state and related to the issue #3776. I have a workchain run without using the daemon, when I kill it by ctrl+C the child process is not properly killed. They are all listed by the tasks analyze command, and --fix drop them from the list I run task analyze second time but they still in the verdi process list. But it is the issue of #3776 so may not be related to the rabbitmq feature here.

unkcpz

It is all good to me. 😄

This is in anticipation of more code being added soon and the file was already getting large and difficult to read. Turning it into a package with individual modules helps readability greatly.

RabbitMQ 3.5 has been EOL since 31 October 2016: https://www.rabbitmq.com/versions.html This version required to explicitly run `support_deprecated_rabbitmq` on the `pamqp` library (which is used all the way down the stack by `aiormq` to handle interactions with RabbitMQ) to enable compatibility. In removing this, the explicit dependency on `pamqp` is also removed. The `aiida.manage.manager.is_rabbitmq_version_supported` function which is used to check compatibility when a profile is loaded is updated to include the lower bound.

RabbitMQ provides a plugin `rabbitmq_management` that provides an HTTP API to manage a RabbitMQ instance. For example, it can be used to inspect existing queues, create new ones, or delete queues. For the client to work, the plugin should first be enabled with: sudo rabbitmq-plugins enable rabbitmq_management If the API cannot be connected to a `ManagementApiConnectionError` exception is raised. In the Github Action workflows, the command above won't work, since RabbitMQ is installed as a service in a Docker container. However, it can be enabled in the Docker container by adding the `-management` suffix to the version specification of the RabbitMQ service. Since the API is exposed on port 15672, that is also added to the forward list.

The command group has three subcommands: * `server-properties`: Report properties of RabbitMQ server * `queues`: Interact with RabbitMQ queues * `tasks`: Deal with messages stored in tasks queues The first command is a simple utility that can come in handy during debugging as it reports various server settings, more so than the version of RabbitMQ that is reported by `verdi status`. The `queues` command is also mostly useful for debugging. It currently allows to list, create and delete queues. The most important command is the `tasks` command. Currently, there is a bug where it is possible that tasks in the process queue (which are consumed by daemon workers to continue submitted processes) can get lost, despite the persistence requirements on the queue and the messages. When a tasks gets lost, the corresponding process will become a "zombie" and seems stuck. Currently there was no easy way to confirm that a process had become a zombie due to its task having been lost. The `list` method fixes this problem as it will introspect all task messages in the process queue and collect the process ids that they correspond to. This can be used to see which active processes no longer have a corresponding task and so are a zombie. The `analysis` command is a utility command that automates this process. It will get all process ids of the existing tasks in the process queue and cross-reference those with the active processes as stored in the database of AiiDA's storage backend. By default it reports if there are inconsistencies between the two sets. If there are, the user can run the command again with the `--fix` flag and the command will automatically discard tasks that are incorrect and recreate those that are missing.

The command was recently added before the `verdi devel rabbitmq` subcommand existed. At this point it makes more sense to group it there because it is directly related to RabbitMQ and if this service is ever removed, this command also no longer serves a purpose.

sphuber · 2022-11-04T18:53:46Z

Thanks a lot for the reviews @ltalirz and @unkcpz !

sphuber force-pushed the feature/cli-rabbitmq branch 5 times, most recently from 6f36bdc to b3ea08f Compare October 27, 2022 09:45

sphuber force-pushed the feature/cli-rabbitmq branch from b3ea08f to b86ff5e Compare October 27, 2022 11:56

sphuber force-pushed the feature/cli-rabbitmq branch 2 times, most recently from 0028190 to 6e28f08 Compare November 3, 2022 22:14

sphuber added this to the v2.1.0 milestone Nov 4, 2022

sphuber force-pushed the feature/cli-rabbitmq branch from 6e28f08 to c931df8 Compare November 4, 2022 07:50

ltalirz self-requested a review November 4, 2022 10:35

ltalirz reviewed Nov 4, 2022

View reviewed changes

sphuber requested a review from unkcpz November 4, 2022 12:27

sphuber force-pushed the feature/cli-rabbitmq branch from c931df8 to 20e7bdf Compare November 4, 2022 12:37

unkcpz reviewed Nov 4, 2022

View reviewed changes

aiida/manage/external/rmq/client.py Show resolved Hide resolved

sphuber force-pushed the feature/cli-rabbitmq branch from 20e7bdf to 8031be8 Compare November 4, 2022 17:44

unkcpz approved these changes Nov 4, 2022

View reviewed changes

sphuber added 5 commits November 4, 2022 19:23

Refactor: Turn aiida.manage.external.rmq into a package

da51c93

This is in anticipation of more code being added soon and the file was already getting large and difficult to read. Turning it into a package with individual modules helps readability greatly.

sphuber force-pushed the feature/cli-rabbitmq branch from 8031be8 to 60342fe Compare November 4, 2022 18:23

sphuber merged commit 257a509 into aiidateam:main Nov 4, 2022

sphuber deleted the feature/cli-rabbitmq branch November 4, 2022 18:53

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RabbitMQ debug tools #5718

RabbitMQ debug tools #5718

sphuber commented Oct 24, 2022

codecov bot commented Oct 27, 2022 •

edited

Loading

sphuber commented Nov 3, 2022

ltalirz commented Nov 4, 2022

ltalirz left a comment •

edited

Loading

ltalirz Nov 4, 2022

sphuber Nov 4, 2022

ltalirz commented Nov 4, 2022 •

edited

Loading

unkcpz commented Nov 4, 2022 •

edited

Loading

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

unkcpz commented Nov 4, 2022

ltalirz commented Nov 4, 2022

ltalirz commented Nov 4, 2022

sphuber commented Nov 4, 2022

unkcpz left a comment

unkcpz Nov 4, 2022

sphuber Nov 4, 2022

unkcpz Nov 4, 2022

sphuber Nov 4, 2022

sphuber commented Nov 4, 2022

unkcpz commented Nov 4, 2022

unkcpz left a comment

sphuber commented Nov 4, 2022

RabbitMQ debug tools #5718

RabbitMQ debug tools #5718

Conversation

sphuber commented Oct 24, 2022

codecov bot commented Oct 27, 2022 • edited Loading

Codecov Report

sphuber commented Nov 3, 2022

ltalirz commented Nov 4, 2022

ltalirz left a comment • edited Loading

Choose a reason for hiding this comment

ltalirz Nov 4, 2022

Choose a reason for hiding this comment

sphuber Nov 4, 2022

Choose a reason for hiding this comment

ltalirz commented Nov 4, 2022 • edited Loading

unkcpz commented Nov 4, 2022 • edited Loading

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

sphuber commented Nov 4, 2022

unkcpz commented Nov 4, 2022

ltalirz commented Nov 4, 2022

ltalirz commented Nov 4, 2022

sphuber commented Nov 4, 2022

unkcpz left a comment

Choose a reason for hiding this comment

unkcpz Nov 4, 2022

Choose a reason for hiding this comment

sphuber Nov 4, 2022

Choose a reason for hiding this comment

unkcpz Nov 4, 2022

Choose a reason for hiding this comment

sphuber Nov 4, 2022

Choose a reason for hiding this comment

sphuber commented Nov 4, 2022

unkcpz commented Nov 4, 2022

unkcpz left a comment

Choose a reason for hiding this comment

sphuber commented Nov 4, 2022

codecov bot commented Oct 27, 2022 •

edited

Loading

ltalirz left a comment •

edited

Loading

ltalirz commented Nov 4, 2022 •

edited

Loading

unkcpz commented Nov 4, 2022 •

edited

Loading