Skip to content

Commit

Permalink
Auto-delete the failed quorum rabbit queues
Browse files Browse the repository at this point in the history
When rabbit is failing for a specific quorum queue, the only thing to
do is to delete the queue (as per rabbit doc, see [1]).

So, to avoid the RPC service to be broken until an operator eventually
do a manual fix on it, catch any INTERNAL ERROR (code 541) and trigger
the deletion of the failed queues under those conditions.
So on next queue declare (triggered from various retries), the queue
will be created again and the service will recover by itself.

Closes-Bug: #2028384
Related-bug: #2031497

[1] https://www.rabbitmq.com/quorum-queues.html#availability

Signed-off-by: Arnaud Morin <arnaud.morin@ovhcloud.com>
Change-Id: Ib8dba833542973091a4e0bf23bb593aca89c5905
  • Loading branch information
arnaudmorin committed Nov 11, 2023
1 parent f23f327 commit 8e3c523
Show file tree
Hide file tree
Showing 2 changed files with 39 additions and 4 deletions.
34 changes: 30 additions & 4 deletions oslo_messaging/_drivers/impl_rabbit.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,7 +28,7 @@
from urllib import parse
import uuid

from amqp import exceptions as amqp_exec
from amqp import exceptions as amqp_ex
import kombu
import kombu.connection
import kombu.entity
Expand Down Expand Up @@ -419,7 +419,7 @@ def declare(self, conn):
conn.connection_id, self.queue_name)
try:
self.queue.declare()
except amqp_exec.PreconditionFailed as err:
except amqp_ex.PreconditionFailed as err:
# NOTE(hberaud): This kind of exception may be triggered
# when a control exchange is shared between services and
# when services try to create it with configs that differ
Expand Down Expand Up @@ -457,6 +457,14 @@ def declare(self, conn):
'Queue: [%(queue)s], '
'error message: [%(err_str)s]', info)
time.sleep(interval)
if self.queue_arguments.get('x-queue-type') == 'quorum':
# Before re-declare queue, try to delete it
# This is helping with issue #2028384
# NOTE(amorin) we need to make sure the connection is
# established again, because when an error occur, the
# connection is closed.
conn.ensure_connection()
self.queue.delete()
self.queue.declare()
else:
raise
Expand Down Expand Up @@ -499,6 +507,24 @@ def consume(self, conn, tag):
nowait=self.nowait)
else:
raise
except amqp_ex.InternalError as exc:
if self.queue_arguments.get('x-queue-type') == 'quorum':
# Before re-consume queue, try to delete it
# This is helping with issue #2028384
if exc.code == 541:
LOG.warning('Queue %s seems broken, will try delete it '
'before starting over.', self.queue.name)
# NOTE(amorin) we need to make sure the connection is
# established again, because when an error occur, the
# connection is closed.
conn.ensure_connection()
self.queue.delete()
self.declare(conn)
self.queue.consume(callback=self._callback,
consumer_tag=str(tag),
nowait=self.nowait)
else:
raise

def cancel(self, tag):
LOG.trace('ConsumerBase.cancel: canceling %s', tag)
Expand Down Expand Up @@ -1208,7 +1234,7 @@ def _heartbeat_thread_job(self):
ConnectionRefusedError,
OSError,
kombu.exceptions.OperationalError,
amqp_exec.ConnectionForced) as exc:
amqp_ex.ConnectionForced) as exc:
LOG.info("A recoverable connection/channel error "
"occurred, trying to reconnect: %s", exc)
self.ensure_connection()
Expand Down Expand Up @@ -1410,7 +1436,7 @@ def _publish(self, exchange, msg, routing_key=None, timeout=None,
if not (exchange.passive or exchange.name in self._declared_exchanges):
try:
exchange(self.channel).declare()
except amqp_exec.PreconditionFailed as err:
except amqp_ex.PreconditionFailed as err:
# NOTE(hberaud): This kind of exception may be triggered
# when a control exchange is shared between services and
# when services try to create it with configs that differ
Expand Down
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
---
fixes:
- |
Auto-delete the failed quorum rabbit queues.
When rabbit is failing for a specific quorum queue, delete the queue
before trying to recreate it.
This may happen if the queue is not recoverable on rabbit side.
See https://www.rabbitmq.com/quorum-queues.html#availability for more
info on this specific case.

0 comments on commit 8e3c523

Please sign in to comment.