-
Notifications
You must be signed in to change notification settings - Fork 192
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
RabbitMQ debug tools #5718
RabbitMQ debug tools #5718
Conversation
6f36bdc
to
b3ea08f
Compare
Codecov ReportBase: 79.84% // Head: 79.66% // Decreases project coverage by
Additional details and impacted files@@ Coverage Diff @@
## main #5718 +/- ##
==========================================
- Coverage 79.84% 79.66% -0.18%
==========================================
Files 491 496 +5
Lines 36760 36873 +113
==========================================
+ Hits 29348 29370 +22
- Misses 7412 7503 +91
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
b3ea08f
to
b86ff5e
Compare
Anyone want to review this? @chrisjsewell @ramirezfranciscof @ltalirz @unkcpz ? I would like to get this in 2.1 as it will help users massively with debugging and fixing RabbitMQ issues. |
0028190
to
6e28f08
Compare
6e28f08
to
c931df8
Compare
Let me have a look |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sphuber !
I clearly see the usefulness of these tools in helping debug issues, so this is a great addition.
Just some minor questions about where the code should go.
I've tested it locally and it worked fine for me; I have not read line by line through all of the code you added - if there is something in particular you'd like me to look at, let me know.
|
||
@wrapt.decorator | ||
@click.pass_context | ||
def with_client(ctx, wrapped, _, args, kwargs): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a significant portion of the code in this file seems quite independent of aiida and could possibly fit inside, say, kiwipy as well.
I'm not saying this has to happen, just worth consideration.
We may also want to comment in the module docstring in why these tools are needed, and under which circumstances we may want to remove them again (if that is the case).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair, except for the tasks analysis
command though, that is purely AiiDA specific. But you are right, the rest could be abstracted (currently depends tightly on AiiDA's Manager
to get the communicator) and be put in kiwipy
. For me would be fine to move it there at some point if we find it is useful more generically. I had a quick search and there are a few Python libraries out there that provide an interface to the HTTP API, but they are not well maintained.
P.S. I happened to not have any processes running (and the list of rabbitmq tasks is empty), but I still had two queues
I guess this points to some issue? Are these scoped by profile? |
I am also interested in reviewing this. But first I still get a warning with |
No issue here. These are the process task queues and they are declared as being persistent, even if they are empty. Since there is just one per profile, the overhead of keeping them is a lot less then deleting and recreating it each time it is emptied.
Yes they are, the UUID you see is the profile UUID. |
That'd be great, thanks!
I think I added a bug by adding a check on the lower bound, but I inverted the condition. Will fix it now.
The HTTP API is served on port 15672, is that accessible from outside the container? Are you forwarding it just like the 5672 port? |
c931df8
to
20e7bdf
Compare
This should now be fixed. I have also added explicit tests for this. |
@ltalirz thanks for the review. I have addressed your comments. I agree that some of the code is generic enough to be moved to a separate library ( |
Ah, true, I forget to forward this port. It works now. Thanks! |
Would it be possible to add a first column with the profile name that corresponds to the UUID? |
Fine
Yes. I haven't seen anyone using rabbitmq <3.6 for a very long time and its EOL was in 2016. |
Possible yes, but not very straightforward. When running normally, these aren't the only queues that there will be, plus RabbitMQ might be used for something other than AiiDA that creates queues that have nothing to do with AiiDA. So for those a profile wouldn't even make sense. I don't think this command will be for normal users anyway. These will at best be for us at developers, so don't think we should tightly couple the output to AiiDA profiles. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@sphuber Thanks! This is a very useful feature. I did not review it in detail but just give it a try with some questions I have in mind.
If I understand correctly, the devel rabbitmq tasks analyze --fix
will run the revive_processces
and what the reviev_processes did is to requeue the process?
I run it analyze --fix
for the lost process (the workchain process that has child process killed by ctrl+C), when I run analyze it again they disappear but still staled process in verdi process list
, where is the process list read from, do this command need to take care of clean it?
.github/workflows/ci-code.yml
Outdated
@@ -66,9 +66,10 @@ jobs: | |||
ports: | |||
- 5432:5432 | |||
rabbitmq: | |||
image: rabbitmq:latest | |||
image: rabbitmq:3.8-management |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3.8-management
points to 3.8.34
, there is the tag 3.8.14-management
can be used here. Same for image tags below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah great, will change that then, thanks
# Need to manually stop the daemon such that it cleans all state related to the submitted processes. It is not quite | ||
# understood how, but leaving the daemon running, a following test that will submit to the daemon may hit an | ||
# an exception because the node submitted here would no longer exist and the daemon would try to operate on it. | ||
started_daemon_client.stop_daemon(wait=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If understand correctly, this stops the daemon the second time since stop_daemon
is already called from daemon_client
fixture of started_daemon_client
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think it might indeed no longer be necessary. When I introduced the daemon_client
fixture, it was originally session-scoped, just for efficiency to only stop it once at the end of the session. But I think exactly for this reason I decided to make it function scoped because it can cause side effects between tests. I can go back and remove it and see if the tests run. They probably will now.
20e7bdf
to
8031be8
Compare
Yes,
Not sure I understand what you mean. The command is not supposed to clean up killed processes. When you say there is still a stale process in
The |
It is not the killed processes, they are still all in |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is all good to me. 😄
This is in anticipation of more code being added soon and the file was already getting large and difficult to read. Turning it into a package with individual modules helps readability greatly.
RabbitMQ 3.5 has been EOL since 31 October 2016: https://www.rabbitmq.com/versions.html This version required to explicitly run `support_deprecated_rabbitmq` on the `pamqp` library (which is used all the way down the stack by `aiormq` to handle interactions with RabbitMQ) to enable compatibility. In removing this, the explicit dependency on `pamqp` is also removed. The `aiida.manage.manager.is_rabbitmq_version_supported` function which is used to check compatibility when a profile is loaded is updated to include the lower bound.
RabbitMQ provides a plugin `rabbitmq_management` that provides an HTTP API to manage a RabbitMQ instance. For example, it can be used to inspect existing queues, create new ones, or delete queues. For the client to work, the plugin should first be enabled with: sudo rabbitmq-plugins enable rabbitmq_management If the API cannot be connected to a `ManagementApiConnectionError` exception is raised. In the Github Action workflows, the command above won't work, since RabbitMQ is installed as a service in a Docker container. However, it can be enabled in the Docker container by adding the `-management` suffix to the version specification of the RabbitMQ service. Since the API is exposed on port 15672, that is also added to the forward list.
The command group has three subcommands: * `server-properties`: Report properties of RabbitMQ server * `queues`: Interact with RabbitMQ queues * `tasks`: Deal with messages stored in tasks queues The first command is a simple utility that can come in handy during debugging as it reports various server settings, more so than the version of RabbitMQ that is reported by `verdi status`. The `queues` command is also mostly useful for debugging. It currently allows to list, create and delete queues. The most important command is the `tasks` command. Currently, there is a bug where it is possible that tasks in the process queue (which are consumed by daemon workers to continue submitted processes) can get lost, despite the persistence requirements on the queue and the messages. When a tasks gets lost, the corresponding process will become a "zombie" and seems stuck. Currently there was no easy way to confirm that a process had become a zombie due to its task having been lost. The `list` method fixes this problem as it will introspect all task messages in the process queue and collect the process ids that they correspond to. This can be used to see which active processes no longer have a corresponding task and so are a zombie. The `analysis` command is a utility command that automates this process. It will get all process ids of the existing tasks in the process queue and cross-reference those with the active processes as stored in the database of AiiDA's storage backend. By default it reports if there are inconsistencies between the two sets. If there are, the user can run the command again with the `--fix` flag and the command will automatically discard tasks that are incorrect and recreate those that are missing.
The command was recently added before the `verdi devel rabbitmq` subcommand existed. At this point it makes more sense to group it there because it is directly related to RabbitMQ and if this service is ever removed, this command also no longer serves a purpose.
8031be8
to
60342fe
Compare
No description provided.