
Unstable behavior of rabbitmq pubsub with resilience policies (with message loss!) #2557

Open
VuiDJi opened this issue Feb 19, 2023 · 13 comments
Labels: help wanted (Extra attention is needed), kind/bug (Something isn't working)
Milestone: v1.15

Comments

VuiDJi commented Feb 19, 2023

Hello everyone!

We have encountered a serious problem when using dapr resilience policies with rabbitmq pubsub.

Previously we used kafka: if a message that could not be processed landed in a topic with no dead letter topic enabled, dapr kept retrying it every 5 seconds, and that was fine.

After migrating to rabbitmq, we found that a broken message was redelivered immediately after each failure, resulting in thousands of processing attempts per minute.
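For context, this immediate redelivery is how AMQP nack/requeue behaves: RabbitMQ returns the rejected message to the head of the queue and redelivers it as soon as the consumer is ready, with no broker-side delay. A minimal consumer sketch (not dapr's actual code; the queue name and the always-failing process function are hypothetical) shows the tight loop a permanently broken message causes:

package main

import (
	"errors"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

var errBroken = errors.New("invalid phone number")

// process stands in for the app handler; like our broken SMS message, it always fails.
func process(body []byte) error { return errBroken }

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// autoAck=false: acks and nacks are sent manually.
	msgs, err := ch.Consume("sms-queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		if err := process(d.Body); err != nil {
			// requeue=true: the broker redelivers the message immediately,
			// so without a consumer-side delay this loop spins thousands of
			// times per minute. The resiliency policy is what is supposed
			// to supply that delay.
			d.Nack(false, true)
			continue
		}
		d.Ack(false)
	}
}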

To solve this, we added a DefaultRetryPolicy resiliency policy with a constant 5-second retry interval:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultRetryPolicy:
        policy: constant
        duration: 5s

  targets:
    apps:
      # apps and their applied policies here

It worked: retries began to occur every 5 seconds. One corrupted message remained in the queue (a request to send an SMS, but with an invalid phone number), and it turned out that the reprocessing frequency kept growing hour after hour; after 18 hours it reached 400-430 attempts per MINUTE instead of the expected 12. In effect, we DDoSed our own SMS gateway.
[screenshot]

We redeployed our applications and monitored the exception frequency, and the bug was caught! I wasn't quite right earlier: the interval remains 5 seconds, but at some point the message starts being processed twice (even though there is only one message).
[screenshot]
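Doing the math on the graphs: 400-430 attempts per minute against the expected 12 is roughly 35 parallel 5-second retry loops, and 35 loops accumulated over 18 hours is about one new loop every 30 minutes. A hedged reading (my assumption, not confirmed anywhere in this thread) is that each redelivery of the still-unacked message starts a fresh retry chain alongside the old ones; notably, 30 minutes is also RabbitMQ's default delivery-acknowledgement timeout in recent versions. A toy sketch of that accumulation:

package main

import (
	"fmt"
	"time"
)

// retryChain mimics one constant-interval retry loop over the same message.
func retryChain(id int, interval time.Duration, attempts chan<- int) {
	for {
		attempts <- id
		time.Sleep(interval)
	}
}

func main() {
	attempts := make(chan int)
	go retryChain(1, 5*time.Second, attempts) // the original delivery

	go func() {
		// Hypothetical: every ~30 minutes the unacked message is redelivered
		// and another independent chain starts. After 18 hours that is ~36
		// chains, i.e. ~432 attempts per minute instead of 12.
		for id := 2; ; id++ {
			time.Sleep(30 * time.Minute)
			go retryChain(id, 5*time.Second, attempts)
		}
	}()

	for id := range attempts {
		fmt.Printf("attempt from chain %d\n", id)
	}
}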

Then we hit an even more serious problem, which in my opinion is critical.
We assumed the bug might be limited to the constant policy and rewrote it as an exponential one:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultPubsubComponentInboundRetryPolicy:
        policy: exponential
        duration: 5s
        maxInterval: 60s
        maxRetries: -1 # Retry indefinitely
  targets:
    apps:
      # apps and their applied policies here

First, the retry period turned out to be unstable and went beyond the configured bounds.

1. At first the interval was less than 5 seconds:

[screenshot]

2. Then the interval began to grow, but chaotically: it could drop 3x, then immediately jump 6x, and at times exceeded 60 seconds, i.e. went above the configured maxInterval:

[screenshot]

[screenshot]

3. Worst of all, at some point the processing attempts stopped: not because the message was finally processed (that was impossible, it was broken), but because the message disappeared from the queue! It looks like dapr sent a successful ack to rabbitmq, even though maxRetries=-1 and the pubsub component (note autoAck: false and requeueInFailure: true) specifies different behavior:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: donum-pubsub
spec:
  type: pubsub.rabbitmq
  version: v1
  metadata:
  - name: host
    value: "amqp://guest:guest@rabbitmq:5672"
  - name: durable
    value: true
  - name: deletedWhenUnused
    value: false
  - name: autoAck
    value: false
  - name: deliveryMode
    value: 2
  - name: requeueInFailure
    value: true
  - name: prefetchCount
    value: 0
  - name: reconnectWait
    value: 0
  - name: concurrencyMode
    value: parallel
  - name: publisherConfirm
    value: true
  - name: enableDeadLetter
    value: false

There is nothing new in the dapr sidecar logs; the last entry is still about the failure to process the message:
[screenshot]

VuiDJi added the kind/bug label on Feb 19, 2023
VuiDJi (Author) commented Feb 19, 2023

I've now noticed that according to the documentation, "duration" should only be specified for the "constant" policy type. I'm not sure whether it affects anything, but I updated the policy and restarted the applications. I'll report back if the problem persists.
Current policy:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      DefaultPubsubComponentInboundRetryPolicy:
        policy: exponential
        maxInterval: 60s
        maxRetries: -1

  targets:
    apps:

In any case, the problem of exceeding the allowed maximum interval remains:
[screenshot]
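The over-the-cap intervals here are consistent with jittered exponential backoff. As far as I can tell, dapr's resiliency retries are built on github.com/cenkalti/backoff (via dapr/kit); that library applies a randomization factor (default 0.5) after capping the base interval at MaxInterval, so the observed wait can be anywhere from 0.5x to 1.5x the base, i.e. up to ~90s for maxInterval: 60s. A standalone sketch (these are the library's defaults, not necessarily the exact values dapr configures):

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 500 * time.Millisecond
	b.Multiplier = 1.5
	b.RandomizationFactor = 0.5      // each wait is scaled by a random factor in [0.5, 1.5]
	b.MaxInterval = 60 * time.Second // caps the base interval, not the randomized result
	b.MaxElapsedTime = 0             // keep going forever for this demo

	for i := 0; i < 25; i++ {
		// Early on the waits are well under 5s; once the base interval has
		// grown to 60s, NextBackOff returns values anywhere in [30s, 90s],
		// i.e. both far below and above the configured maxInterval.
		fmt.Println(b.NextBackOff())
	}
}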

VuiDJi (Author) commented Feb 19, 2023

Gotcha!
It turns out that sometimes, if I terminate the application (for example, by scaling the replicas down to 0), the message disappears from the queue because dapr sends an ack to rabbitmq at that moment, though not every time.
[screenshot]

VuiDJi (Author) commented Feb 19, 2023

Updated information:
A few minutes ago dapr again sent an ack to rabbitmq:
[screenshot]

The app has been up for 23 minutes, but 10 minutes ago dapr suddenly sent an ack to rabbitmq and finished processing the message. Here are the latest dapr sidecar logs (nothing new):
[screenshot]

yaron2 (Member) commented Feb 21, 2023

Thanks for reporting this.

VuiDJi (Author) commented Feb 21, 2023

I have updated information. Apparently the message didn't just disappear from rabbitmq: dapr tried to send it to a DLQ.
So we have three issues:

1. With infinite retries configured at a constant 5 seconds, dapr eventually starts retrying more and more often, duplicating calls and reaching thousands of retries per minute.

2. With the exponential policy, dapr sets the interval incorrectly: it swings sharply up, then down, and exceeds maxInterval.

3. With the exponential policy, dapr ignores an infinite or large maxRetries (-1 or, say, 1,000,000) and at some point after ~15 minutes still sends the message to the DLQ.
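The ~15 minutes in item 3 is suspicious: cenkalti/backoff's NewExponentialBackOff() defaults MaxElapsedTime to 15 minutes, after which NextBackOff() returns backoff.Stop and the caller treats the retry chain as exhausted. If dapr maps only maxRetries and maxInterval onto the library and leaves that default in place (an assumption I have not verified against the dapr source), an "infinite" policy would still give up after about 15 minutes and dead-letter the message. A sketch of that failure mode, using a fake clock so it runs instantly:

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// fakeClock lets the sketch advance time instantly instead of sleeping.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time { return c.now }

func main() {
	clock := &fakeClock{now: time.Now()}
	b := backoff.NewExponentialBackOff()
	b.Clock = clock
	b.Reset() // start measuring elapsed time on the fake clock
	fmt.Println("default MaxElapsedTime:", b.MaxElapsedTime) // 15m0s

	for attempt := 1; ; attempt++ {
		next := b.NextBackOff()
		if next == backoff.Stop {
			// After ~15 minutes of (simulated) elapsed time the library says
			// "stop", even though no retry counter was ever exhausted; a
			// caller that intended maxRetries = -1 would give up here and
			// ack or dead-letter the message.
			fmt.Printf("Stop after %d attempts, elapsed %s\n",
				attempt-1, b.GetElapsedTime())
			return
		}
		clock.now = clock.now.Add(next) // pretend we slept for the backoff
	}
}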

yaron2 (Member) commented Feb 21, 2023

cc @halspang

berndverst added this to the v1.11 milestone on Feb 21, 2023
berndverst added the help wanted label on Feb 24, 2023
VuiDJi (Author) commented Mar 13, 2023

Hi all!
Is there any news, please?

halspang (Contributor) commented

@VuiDJi - Sorry for not responding to this earlier, I thought I did but I guess not.

From what I'm seeing, the main thing that I worry about is that the listener is receiving the message again after some kind of lease duration on the message times out. With infinite retries and the component not interacting with runtime retries, it seems plausible that every so often it's just starting a new retry chain as it thinks the message has been dropped and needs to be read again.

As for the two issues with exponential retries, I think they could be partially due to what I explained above, but they need further investigation.
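One concrete candidate for that "lease duration" (my assumption, not something stated by the dapr team): RabbitMQ's delivery acknowledgement timeout (consumer_timeout, 30 minutes by default in recent versions). When a delivery stays unacked past it, the broker closes the channel with a PRECONDITION_FAILED error and requeues the message, so a consumer layer that transparently reopens the channel would indeed see the same message "again" and start a new retry chain. A sketch of observing that close with amqp091-go (the queue name is hypothetical):

package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// NotifyClose fires when the broker closes the channel, e.g. with
	// PRECONDITION_FAILED "delivery acknowledgement timed out" once a
	// delivery has stayed unacked past consumer_timeout.
	closed := ch.NotifyClose(make(chan *amqp.Error, 1))

	msgs, err := ch.Consume("sms-queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case d := <-msgs:
			_ = d // hold the delivery unacked while the retry chain runs
		case err := <-closed:
			// The broker has requeued the unacked message; if the consumer
			// now silently reopens the channel and re-consumes, the
			// redelivered message starts a second retry chain next to the
			// first one.
			log.Printf("channel closed by broker: %v", err)
			return
		}
	}
}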

VuiDJi (Author) commented May 13, 2023

Hi all!
Is there any news, please?

berndverst (Member) commented

This was not investigated for release 1.11.

berndverst modified the milestones: v1.11, v1.12 on May 31, 2023
artursouza (Member) commented

@halspang do we have anyone looking into resiliency policies?

halspang (Contributor) commented

@artursouza - Component resiliency is in the v1.12 release planning, though I don't know if anyone has committed to doing it.

ItalyPaleAle modified the milestones: v1.12, v1.13 on Sep 12, 2023
VuiDJi (Author) commented Dec 17, 2023

Hi all!
Is there any news, please?

ItalyPaleAle modified the milestones: v1.13, v1.14 on Feb 26, 2024
berndverst modified the milestones: v1.14, v1.15 on Jul 2, 2024