Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RabbitMQ debug tools #5718

Merged
merged 5 commits into from
Nov 4, 2022
Merged

RabbitMQ debug tools #5718

merged 5 commits into from
Nov 4, 2022

Conversation

sphuber
Copy link
Contributor

@sphuber sphuber commented Oct 24, 2022

No description provided.

@sphuber sphuber force-pushed the feature/cli-rabbitmq branch 5 times, most recently from 6f36bdc to b3ea08f Compare October 27, 2022 09:45
@codecov
Copy link

codecov bot commented Oct 27, 2022

Codecov Report

Base: 79.84% // Head: 79.66% // Decreases project coverage by -0.19% ⚠️

Coverage data is based on head (b3ea08f) compared to base (9a9440b).
Patch coverage: 74.53% of modified lines in pull request are covered.

❗ Current head b3ea08f differs from pull request most recent head 6e28f08. Consider uploading reports for the commit 6e28f08 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5718      +/-   ##
==========================================
- Coverage   79.84%   79.66%   -0.18%     
==========================================
  Files         491      496       +5     
  Lines       36760    36873     +113     
==========================================
+ Hits        29348    29370      +22     
- Misses       7412     7503      +91     
Impacted Files Coverage Δ
aiida/cmdline/commands/cmd_devel.py 48.28% <ø> (-19.58%) ⬇️
aiida/engine/processes/control.py 69.50% <0.00%> (-0.25%) ⬇️
aiida/manage/manager.py 87.75% <33.34%> (-0.49%) ⬇️
aiida/cmdline/commands/cmd_rabbitmq.py 65.12% <65.12%> (ø)
aiida/manage/external/rmq/utils.py 91.67% <91.67%> (ø)
aiida/manage/external/rmq/client.py 93.94% <93.94%> (ø)
aiida/cmdline/utils/decorators.py 91.67% <100.00%> (+1.51%) ⬆️
aiida/manage/external/rmq/__init__.py 100.00% <100.00%> (ø)
aiida/manage/external/rmq/defaults.py 100.00% <100.00%> (ø)
aiida/manage/external/rmq/launcher.py 74.14% <100.00%> (ø)
... and 15 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@sphuber sphuber force-pushed the feature/cli-rabbitmq branch from b3ea08f to b86ff5e Compare October 27, 2022 11:56
@sphuber
Copy link
Contributor Author

sphuber commented Nov 3, 2022

Anyone want to review this? @chrisjsewell @ramirezfranciscof @ltalirz @unkcpz ? I would like to get this in 2.1 as it will help users massively with debugging and fixing RabbitMQ issues.

@sphuber sphuber force-pushed the feature/cli-rabbitmq branch 2 times, most recently from 0028190 to 6e28f08 Compare November 3, 2022 22:14
@sphuber sphuber added this to the v2.1.0 milestone Nov 4, 2022
@sphuber sphuber force-pushed the feature/cli-rabbitmq branch from 6e28f08 to c931df8 Compare November 4, 2022 07:50
@ltalirz
Copy link
Member

ltalirz commented Nov 4, 2022

Let me have a look

@ltalirz ltalirz self-requested a review November 4, 2022 10:35
Copy link
Member

@ltalirz ltalirz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @sphuber !

I clearly see the usefulness of these tools in helping debug issues, so this is a great addition.

Just some minor questions about where the code should go.
I've tested it locally and it worked fine for me; I have not read line by line through all of the code you added - if there is something in particular you'd like me to look at, let me know.

aiida/cmdline/commands/cmd_rabbitmq.py Outdated Show resolved Hide resolved

@wrapt.decorator
@click.pass_context
def with_client(ctx, wrapped, _, args, kwargs):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a significant portion of the code in this file seems quite independent of aiida and could possibly fit inside, say, kiwipy as well.

I'm not saying this has to happen, just worth consideration.
We may also want to comment in the module docstring in why these tools are needed, and under which circumstances we may want to remove them again (if that is the case).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair, except for the tasks analysis command though, that is purely AiiDA specific. But you are right, the rest could be abstracted (currently depends tightly on AiiDA's Manager to get the communicator) and be put in kiwipy. For me would be fine to move it there at some point if we find it is useful more generically. I had a quick search and there are a few Python libraries out there that provide an interface to the HTTP API, but they are not well maintained.

environment.yml Show resolved Hide resolved
aiida/cmdline/commands/cmd_rabbitmq.py Outdated Show resolved Hide resolved
aiida/cmdline/commands/cmd_rabbitmq.py Outdated Show resolved Hide resolved
aiida/manage/external/rmq/client.py Show resolved Hide resolved
@ltalirz
Copy link
Member

ltalirz commented Nov 4, 2022

P.S. I happened to not have any processes running (and the list of rabbitmq tasks is empty), but I still had two queues

$ verdi devel rabbitmq queues list
Name                                                    Messages  State
----------------------------------------------------  ----------  -------
aiida-9e50e1e81b3c4dccadabde65a8a745bc.process.queue           0  running
aiida-e86ff3e139f3443a959445e9444f2e3f.process.queue           0  running

I guess this points to some issue?

Are these scoped by profile?

@unkcpz
Copy link
Member

unkcpz commented Nov 4, 2022

I am also interested in reviewing this.

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.
Then I cannot start the management plugin. I using the docker to start the rabbitmq as instructed here. I start the plugin with the instruction, the "Enabling plugins" section but still get the error "Critical: Could not connect to the management API.", any idea how to start it?

@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

I guess this points to some issue?

No issue here. These are the process task queues and they are declared as being persistent, even if they are empty. Since there is just one per profile, the overhead of keeping them is a lot less then deleting and recreating it each time it is emptied.

Are these scoped by profile?

Yes they are, the UUID you see is the profile UUID.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

I am also interested in reviewing this.

That'd be great, thanks!

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.

I think I added a bug by adding a check on the lower bound, but I inverted the condition. Will fix it now.

Then I cannot start the management plugin. I using the docker to start the rabbitmq as instructed here. I start the plugin with the instruction, the "Enabling plugins" section but still get the error "Critical: Could not connect to the management API.", any idea how to start it?

The HTTP API is served on port 15672, is that accessible from outside the container? Are you forwarding it just like the 5672 port?

@sphuber sphuber requested a review from unkcpz November 4, 2022 12:27
@sphuber sphuber force-pushed the feature/cli-rabbitmq branch from c931df8 to 20e7bdf Compare November 4, 2022 12:37
@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

But first I still get a warning with Warning: RabbitMQ v3.8.14 is not supported and will cause unexpected problems!. I think v3.8.14 is supported.

This should now be fixed. I have also added explicit tests for this.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

@ltalirz thanks for the review. I have addressed your comments. I agree that some of the code is generic enough to be moved to a separate library (kiwipy or a standalone one) but don't want to do this now if you don't mind. I don't need you to go through the whole code, but it would be good if you could sign off that you think it is ok to drop support for RabbitMQ 3.5 and older.

@unkcpz
Copy link
Member

unkcpz commented Nov 4, 2022

The HTTP API is served on port 15672, is that accessible from outside the container? Are you forwarding it just like the 5672 port?

Ah, true, I forget to forward this port. It works now. Thanks!

@ltalirz
Copy link
Member

ltalirz commented Nov 4, 2022

I guess this points to some issue?

No issue here. These are the process task queues and they are declared as being persistent, even if they are empty. Since there is just one per profile, the overhead of keeping them is a lot less then deleting and recreating it each time it is emptied.

Would it be possible to add a first column with the profile name that corresponds to the UUID?
Most users probably never see the profile UUID and so they will anyhow need to look it up in order to understand what the output of the command means.

@ltalirz
Copy link
Member

ltalirz commented Nov 4, 2022

@ltalirz thanks for the review. I have addressed your comments. I agree that some of the code is generic enough to be moved to a separate library (kiwipy or a standalone one) but don't want to do this now if you don't mind.

Fine

I don't need you to go through the whole code, but it would be good if you could sign off that you think it is ok to drop support for RabbitMQ 3.5 and older.

Yes. I haven't seen anyone using rabbitmq <3.6 for a very long time and its EOL was in 2016.

@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

Would it be possible to add a first column with the profile name that corresponds to the UUID?

Possible yes, but not very straightforward. When running normally, these aren't the only queues that there will be, plus RabbitMQ might be used for something other than AiiDA that creates queues that have nothing to do with AiiDA. So for those a profile wouldn't even make sense. I don't think this command will be for normal users anyway. These will at best be for us at developers, so don't think we should tightly couple the output to AiiDA profiles.

Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sphuber Thanks! This is a very useful feature. I did not review it in detail but just give it a try with some questions I have in mind.

If I understand correctly, the devel rabbitmq tasks analyze --fix will run the revive_processces and what the reviev_processes did is to requeue the process?
I run it analyze --fix for the lost process (the workchain process that has child process killed by ctrl+C), when I run analyze it again they disappear but still staled process in verdi process list, where is the process list read from, do this command need to take care of clean it?

@@ -66,9 +66,10 @@ jobs:
ports:
- 5432:5432
rabbitmq:
image: rabbitmq:latest
image: rabbitmq:3.8-management
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3.8-management points to 3.8.34, there is the tag 3.8.14-management can be used here. Same for image tags below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah great, will change that then, thanks

Comment on lines 135 to 138
# Need to manually stop the daemon such that it cleans all state related to the submitted processes. It is not quite
# understood how, but leaving the daemon running, a following test that will submit to the daemon may hit an
# an exception because the node submitted here would no longer exist and the daemon would try to operate on it.
started_daemon_client.stop_daemon(wait=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If understand correctly, this stops the daemon the second time since stop_daemon is already called from daemon_client fixture of started_daemon_client?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, I think it might indeed no longer be necessary. When I introduced the daemon_client fixture, it was originally session-scoped, just for efficiency to only stop it once at the end of the session. But I think exactly for this reason I decided to make it function scoped because it can cause side effects between tests. I can go back and remove it and see if the tests run. They probably will now.

@sphuber sphuber force-pushed the feature/cli-rabbitmq branch from 20e7bdf to 8031be8 Compare November 4, 2022 17:44
@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

If I understand correctly, the devel rabbitmq tasks analyze --fix will run the revive_processces and what the reviev_processes did is to requeue the process?

Yes, revive_processes simply send a new task to the process queue for each of the processes it gets.

I run it analyze --fix for the lost process (the workchain process that has child process killed by ctrl+C), when I run analyze it again they disappear but still staled process in verdi process list

Not sure I understand what you mean. The command is not supposed to clean up killed processes. When you say there is still a stale process in verdi process list based on what do you say that?

where is the process list read from, do this command need to take care of clean it?

The analyze command gets the active process tasks from the process queue by reading all the tasks it has in it (each task contains the process id it corresponds to). Then it gets the "active" processes from the database, by querying for all process nodes that have a process state in waiting, created and running. Then it does a cross-correlation and check the difference between these sets. Any tasks that do not have a corresponding active node in the database get discarded. Any active process nodes in the database that don't have a corresponding task, get their task recreated ("revived").

@unkcpz
Copy link
Member

unkcpz commented Nov 4, 2022

The command is not supposed to clean up killed processes. When you say there is still a stale process in verdi process list based on what do you say that?

It is not the killed processes, they are still all in Waiting state and related to the issue #3776. I have a workchain run without using the daemon, when I kill it by ctrl+C the child process is not properly killed. They are all listed by the tasks analyze command, and --fix drop them from the list I run task analyze second time but they still in the verdi process list. But it is the issue of #3776 so may not be related to the rabbitmq feature here.

Copy link
Member

@unkcpz unkcpz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is all good to me. 😄

This is in anticipation of more code being added soon and the file was
already getting large and difficult to read. Turning it into a package
with individual modules helps readability greatly.
RabbitMQ 3.5 has been EOL since 31 October 2016:

    https://www.rabbitmq.com/versions.html

This version required to explicitly run `support_deprecated_rabbitmq` on
the `pamqp` library (which is used all the way down the stack by
`aiormq` to handle interactions with RabbitMQ) to enable compatibility.
In removing this, the explicit dependency on `pamqp` is also removed.

The `aiida.manage.manager.is_rabbitmq_version_supported` function which
is used to check compatibility when a profile is loaded is updated to
include the lower bound.
RabbitMQ provides a plugin `rabbitmq_management` that provides an HTTP
API to manage a RabbitMQ instance. For example, it can be used to
inspect existing queues, create new ones, or delete queues.

For the client to work, the plugin should first be enabled with:

    sudo rabbitmq-plugins enable rabbitmq_management

If the API cannot be connected to a `ManagementApiConnectionError`
exception is raised.

In the Github Action workflows, the command above won't work, since
RabbitMQ is installed as a service in a Docker container. However, it
can be enabled in the Docker container by adding the `-management`
suffix to the version specification of the RabbitMQ service. Since the
API is exposed on port 15672, that is also added to the forward list.
The command group has three subcommands:

 * `server-properties`: Report properties of RabbitMQ server
 * `queues`: Interact with RabbitMQ queues
 * `tasks`: Deal with messages stored in tasks queues

The first command is a simple utility that can come in handy during
debugging as it reports various server settings, more so than the
version of RabbitMQ that is reported by `verdi status`.

The `queues` command is also mostly useful for debugging. It currently
allows to list, create and delete queues.

The most important command is the `tasks` command. Currently, there is a
bug where it is possible that tasks in the process queue (which are
consumed by daemon workers to continue submitted processes) can get
lost, despite the persistence requirements on the queue and the
messages. When a tasks gets lost, the corresponding process will become
a "zombie" and seems stuck. Currently there was no easy way to confirm
that a process had become a zombie due to its task having been lost.

The `list` method fixes this problem as it will introspect all task
messages in the process queue and collect the process ids that they
correspond to. This can be used to see which active processes no longer
have a corresponding task and so are a zombie.

The `analysis` command is a utility command that automates this process.
It will get all process ids of the existing tasks in the process queue
and cross-reference those with the active processes as stored in the
database of AiiDA's storage backend. By default it reports if there are
inconsistencies between the two sets. If there are, the user can run the
command again with the `--fix` flag and the command will automatically
discard tasks that are incorrect and recreate those that are missing.
The command was recently added before the `verdi devel rabbitmq`
subcommand existed. At this point it makes more sense to group it there
because it is directly related to RabbitMQ and if this service is ever
removed, this command also no longer serves a purpose.
@sphuber sphuber force-pushed the feature/cli-rabbitmq branch from 8031be8 to 60342fe Compare November 4, 2022 18:23
@sphuber sphuber merged commit 257a509 into aiidateam:main Nov 4, 2022
@sphuber sphuber deleted the feature/cli-rabbitmq branch November 4, 2022 18:53
@sphuber
Copy link
Contributor Author

sphuber commented Nov 4, 2022

Thanks a lot for the reviews @ltalirz and @unkcpz !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants