
Unstable behavior of rabbitmq pubsub with resilience policies (with message loss!) #2557

Open
VuiDJi opened this issue Feb 19, 2023 · 13 comments
Labels: help wanted (Extra attention is needed), kind/bug (Something isn't working)
Milestone: v1.15

Comments

VuiDJi commented Feb 19, 2023

Hello everyone!

We have encountered a serious problem when using dapr resilience policies with rabbitmq pubsub.

Previously we used kafka: if a message that could not be processed landed in a topic with no dead letter topic enabled, dapr kept retrying it every 5 seconds, and that was fine.

After migrating to rabbitmq, we found that a broken message was redelivered immediately after each failure, resulting in thousands of processing attempts per minute.
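For context, this immediate redelivery is how AMQP nack/requeue behaves: RabbitMQ returns the rejected message to the head of the queue and redelivers it as soon as the consumer is ready, with no broker-side delay. A minimal consumer sketch (not dapr's actual code; the queue name and the always-failing process function are hypothetical) shows the tight loop a permanently broken message causes:

package main

import (
	"errors"
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

var errBroken = errors.New("invalid phone number")

// process stands in for the app handler; like our broken SMS message, it always fails.
func process(body []byte) error { return errBroken }

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// autoAck=false: acks and nacks are sent manually.
	msgs, err := ch.Consume("sms-queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for d := range msgs {
		if err := process(d.Body); err != nil {
			// requeue=true: the broker redelivers the message immediately,
			// so without a consumer-side delay this loop spins thousands of
			// times per minute. The resiliency policy is what is supposed
			// to supply that delay.
			d.Nack(false, true)
			continue
		}
		d.Ack(false)
	}
}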

To solve this, we added a DefaultRetryPolicy resiliency policy with a constant 5-second retry interval:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultRetryPolicy:
        policy: constant
        duration: 5s

  targets:
    apps:
      # apps and their applied policies here

It worked: retries began to occur every 5 seconds. One corrupted message remained in the queue (a request to send an SMS, but with an invalid phone number), and it turned out that the reprocessing frequency kept growing hour after hour; after 18 hours it reached 400-430 attempts per MINUTE instead of the expected 12. In effect, we DDoSed our own SMS gateway.
[screenshot]

We redeployed our applications and monitored the exception frequency, and the bug was caught! I wasn't quite right earlier: the interval remains 5 seconds, but at some point the message starts being processed twice (even though there is only one message).
[screenshot]
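Doing the math on the graphs: 400-430 attempts per minute against the expected 12 is roughly 35 parallel 5-second retry loops, and 35 loops accumulated over 18 hours is about one new loop every 30 minutes. A hedged reading (my assumption, not confirmed anywhere in this thread) is that each redelivery of the still-unacked message starts a fresh retry chain alongside the old ones; notably, 30 minutes is also RabbitMQ's default delivery-acknowledgement timeout in recent versions. A toy sketch of that accumulation:

package main

import (
	"fmt"
	"time"
)

// retryChain mimics one constant-interval retry loop over the same message.
func retryChain(id int, interval time.Duration, attempts chan<- int) {
	for {
		attempts <- id
		time.Sleep(interval)
	}
}

func main() {
	attempts := make(chan int)
	go retryChain(1, 5*time.Second, attempts) // the original delivery

	go func() {
		// Hypothetical: every ~30 minutes the unacked message is redelivered
		// and another independent chain starts. After 18 hours that is ~36
		// chains, i.e. ~432 attempts per minute instead of 12.
		for id := 2; ; id++ {
			time.Sleep(30 * time.Minute)
			go retryChain(id, 5*time.Second, attempts)
		}
	}()

	for id := range attempts {
		fmt.Printf("attempt from chain %d\n", id)
	}
}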

Then we hit an even more serious problem, which in my opinion is critical.
We assumed the bug might be limited to the constant policy and rewrote it as an exponential one:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      # Global Retry Policy
      DefaultPubsubComponentInboundRetryPolicy:
        policy: exponential
        duration: 5s
        maxInterval: 60s
        maxRetries: -1 # Retry indefinitely
  targets:
    apps:
      # apps and their applied policies here

First, the retry period turned out to be unstable and went beyond the configured bounds.

1. At first the interval was less than 5 seconds:

[screenshot]

2. Then the interval began to grow, but chaotically: it could drop 3x, then immediately jump 6x, and at times exceeded 60 seconds, i.e. went above the configured maxInterval:

[screenshot]

[screenshot]

3. Worst of all, at some point the processing attempts stopped: not because the message was finally processed (that was impossible, it was broken), but because the message disappeared from the queue! It looks like dapr sent a successful ack to rabbitmq, even though maxRetries=-1 and the pubsub component (note autoAck: false and requeueInFailure: true) specifies different behavior:
apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: donum-pubsub
spec:
  type: pubsub.rabbitmq
  version: v1
  metadata:
  - name: host
    value: "amqp://guest:guest@rabbitmq:5672"
  - name: durable
    value: true
  - name: deletedWhenUnused
    value: false
  - name: autoAck
    value: false
  - name: deliveryMode
    value: 2
  - name: requeueInFailure
    value: true
  - name: prefetchCount
    value: 0
  - name: reconnectWait
    value: 0
  - name: concurrencyMode
    value: parallel
  - name: publisherConfirm
    value: true
  - name: enableDeadLetter
    value: false

There is nothing new in the dapr sidecar logs; the last entry is still about the failure to process the message:
[screenshot]

VuiDJi added the kind/bug label on Feb 19, 2023
VuiDJi (Author) commented Feb 19, 2023

I've now noticed that according to the documentation, "duration" should only be specified for the "constant" policy type. I'm not sure whether it affects anything, but I updated the policy and restarted the applications. I'll report back if the problem persists.
Current policy:

apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: donum-resiliency
spec:
  policies:
    retries:
      DefaultPubsubComponentInboundRetryPolicy:
        policy: exponential
        maxInterval: 60s
        maxRetries: -1

  targets:
    apps:

In any case, the problem of exceeding the allowed maximum interval remains:
[screenshot]
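The over-the-cap intervals here are consistent with jittered exponential backoff. As far as I can tell, dapr's resiliency retries are built on github.com/cenkalti/backoff (via dapr/kit); that library applies a randomization factor (default 0.5) after capping the base interval at MaxInterval, so the observed wait can be anywhere from 0.5x to 1.5x the base, i.e. up to ~90s for maxInterval: 60s. A standalone sketch (these are the library's defaults, not necessarily the exact values dapr configures):

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

func main() {
	b := backoff.NewExponentialBackOff()
	b.InitialInterval = 500 * time.Millisecond
	b.Multiplier = 1.5
	b.RandomizationFactor = 0.5      // each wait is scaled by a random factor in [0.5, 1.5]
	b.MaxInterval = 60 * time.Second // caps the base interval, not the randomized result
	b.MaxElapsedTime = 0             // keep going forever for this demo

	for i := 0; i < 25; i++ {
		// Early on the waits are well under 5s; once the base interval has
		// grown to 60s, NextBackOff returns values anywhere in [30s, 90s],
		// i.e. both far below and above the configured maxInterval.
		fmt.Println(b.NextBackOff())
	}
}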

VuiDJi (Author) commented Feb 19, 2023

Gotcha!
It turns out that sometimes, if I terminate the application (for example, by scaling the replicas down to 0), the message disappears from the queue because dapr sends an ack to rabbitmq at that moment, though not every time.
[screenshot]

VuiDJi (Author) commented Feb 19, 2023

Updated information:
A few minutes ago dapr again sent an ack to rabbitmq:
[screenshot]

The app has been up for 23 minutes, but 10 minutes ago dapr suddenly sent an ack to rabbitmq and finished processing the message. Here are the latest dapr sidecar logs (nothing new):
[screenshot]

yaron2 (Member) commented Feb 21, 2023

Thanks for reporting this.

VuiDJi (Author) commented Feb 21, 2023

I have updated information. Apparently the message didn't just disappear from rabbitmq: dapr tried to send it to a DLQ.
So we have three issues:

1. With infinite retries configured at a constant 5 seconds, dapr eventually starts retrying more and more often, duplicating calls and reaching thousands of retries per minute.

2. With the exponential policy, dapr sets the interval incorrectly: it swings sharply up, then down, and exceeds maxInterval.

3. With the exponential policy, dapr ignores an infinite or large maxRetries (-1 or, say, 1,000,000) and at some point after ~15 minutes still sends the message to the DLQ.
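The ~15 minutes in item 3 is suspicious: cenkalti/backoff's NewExponentialBackOff() defaults MaxElapsedTime to 15 minutes, after which NextBackOff() returns backoff.Stop and the caller treats the retry chain as exhausted. If dapr maps only maxRetries and maxInterval onto the library and leaves that default in place (an assumption I have not verified against the dapr source), an "infinite" policy would still give up after about 15 minutes and dead-letter the message. A sketch of that failure mode, using a fake clock so it runs instantly:

package main

import (
	"fmt"
	"time"

	"github.com/cenkalti/backoff/v4"
)

// fakeClock lets the sketch advance time instantly instead of sleeping.
type fakeClock struct{ now time.Time }

func (c *fakeClock) Now() time.Time { return c.now }

func main() {
	clock := &fakeClock{now: time.Now()}
	b := backoff.NewExponentialBackOff()
	b.Clock = clock
	b.Reset() // start measuring elapsed time on the fake clock
	fmt.Println("default MaxElapsedTime:", b.MaxElapsedTime) // 15m0s

	for attempt := 1; ; attempt++ {
		next := b.NextBackOff()
		if next == backoff.Stop {
			// After ~15 minutes of (simulated) elapsed time the library says
			// "stop", even though no retry counter was ever exhausted; a
			// caller that intended maxRetries = -1 would give up here and
			// ack or dead-letter the message.
			fmt.Printf("Stop after %d attempts, elapsed %s\n",
				attempt-1, b.GetElapsedTime())
			return
		}
		clock.now = clock.now.Add(next) // pretend we slept for the backoff
	}
}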

yaron2 (Member) commented Feb 21, 2023

cc @halspang

berndverst added this to the v1.11 milestone on Feb 21, 2023
berndverst added the help wanted label on Feb 24, 2023
VuiDJi (Author) commented Mar 13, 2023

Hi all!
Is there any news, please?

halspang (Contributor) commented

@VuiDJi - Sorry for not responding to this earlier, I thought I did but I guess not.

From what I'm seeing, the main thing that I worry about is that the listener is receiving the message again after some kind of lease duration on the message times out. With infinite retries and the component not interacting with runtime retries, it seems plausible that every so often it's just starting a new retry chain as it thinks the message has been dropped and needs to be read again.

As for the two issues with exponential retries, I think they could be partially due to what I explained above, but they need further investigation.
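One concrete candidate for that "lease duration" (my assumption, not something stated by the dapr team): RabbitMQ's delivery acknowledgement timeout (consumer_timeout, 30 minutes by default in recent versions). When a delivery stays unacked past it, the broker closes the channel with a PRECONDITION_FAILED error and requeues the message, so a consumer layer that transparently reopens the channel would indeed see the same message "again" and start a new retry chain. A sketch of observing that close with amqp091-go (the queue name is hypothetical):

package main

import (
	"log"

	amqp "github.com/rabbitmq/amqp091-go"
)

func main() {
	conn, err := amqp.Dial("amqp://guest:guest@rabbitmq:5672")
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	ch, err := conn.Channel()
	if err != nil {
		log.Fatal(err)
	}

	// NotifyClose fires when the broker closes the channel, e.g. with
	// PRECONDITION_FAILED "delivery acknowledgement timed out" once a
	// delivery has stayed unacked past consumer_timeout.
	closed := ch.NotifyClose(make(chan *amqp.Error, 1))

	msgs, err := ch.Consume("sms-queue", "", false, false, false, false, nil)
	if err != nil {
		log.Fatal(err)
	}

	for {
		select {
		case d := <-msgs:
			_ = d // hold the delivery unacked while the retry chain runs
		case err := <-closed:
			// The broker has requeued the unacked message; if the consumer
			// now silently reopens the channel and re-consumes, the
			// redelivered message starts a second retry chain next to the
			// first one.
			log.Printf("channel closed by broker: %v", err)
			return
		}
	}
}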

VuiDJi (Author) commented May 13, 2023

Hi all!
Is there any news, please?

berndverst (Member) commented

This was not investigated for release 1.11.

berndverst modified the milestones: v1.11, v1.12 on May 31, 2023
artursouza (Member) commented

@halspang do we have anyone looking into resiliency policies?

halspang (Contributor) commented

@artursouza - Component resiliency is in the v1.12 release planning, though I don't know if anyone has committed to doing it.

ItalyPaleAle modified the milestones: v1.12, v1.13 on Sep 12, 2023
VuiDJi (Author) commented Dec 17, 2023

Hi all!
Is there any news, please?

ItalyPaleAle modified the milestones: v1.13, v1.14 on Feb 26, 2024
berndverst modified the milestones: v1.14, v1.15 on Jul 2, 2024