unstable work of rabbitmq pubsub with resilience policies (with the loss of messages!) #2557
Thanks for reporting this.
I have updated the information. Apparently, the message didn't just disappear from RabbitMQ; Dapr tried to send it to the DLQ.
cc @halspang
@VuiDJi - Sorry for not responding to this earlier, I thought I did but I guess not. From what I'm seeing, the main thing that worries me is that the listener is receiving the message again after some kind of lease duration on the message times out. With infinite retries and the component not interacting with runtime retries, it seems plausible that every so often it simply starts a new retry chain because it thinks the message has been dropped and needs to be read again. As for the two issues with exponential retries, I do think they could be partially due to what I explained above, but they need further investigation.
This was not investigated for release 1.11. |
@halspang do we have anyone looking into resiliency policies? |
@artursouza - Component resiliency is in the v1.12 release planning, though I don't know if anyone has committed to doing it.
Hello everyone!
We have run into a serious problem when using Dapr resilience policies with the RabbitMQ pub/sub component.
Previously we used Kafka: if a message that could not be processed landed in a topic with no dead-letter topic configured, Dapr simply kept retrying it every 5 seconds, which was acceptable.
After migrating to RabbitMQ, the broken message was redelivered immediately after each failure, resulting in thousands of processing attempts per minute.
To solve this, we added a DefaultRetryPolicy resiliency policy with a retry interval of 5 seconds:
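The original YAML was not captured in this page. A minimal sketch of such a policy, based on the Dapr Resiliency spec (the resource name `myresiliency` is a placeholder; Dapr treats a retry policy named `DefaultRetryPolicy` as the implicit default for all targets):

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency   # placeholder name
spec:
  policies:
    retries:
      DefaultRetryPolicy:
        policy: constant
        duration: 5s     # fixed 5-second interval between attempts
        maxRetries: -1   # retry indefinitely
```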
This worked at first: retries began to occur every 5 seconds. One corrupted message remained in the queue (a request to send an SMS, but with an invalid phone number), and it turned out that the reprocessing frequency increased every hour: after 18 hours it reached 400-430 attempts per MINUTE instead of the expected 12 (60 s / 5 s). In effect, we DDoSed our own SMS gateway.
We redeployed our applications, monitored the exception frequency, and pinned the bug down. I wasn't quite right above: the interval stays at 5 seconds, but at some point the message starts to be processed twice per interval, even though there is only one message in the queue.
Then we ran into an even more serious problem, which in my opinion is critical.
We assumed the bug might be specific to the constant policy, so we rewrote it as an exponential policy:
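Again, the snippet itself is missing from this page. A hedged sketch of what an exponential variant of the same default policy could look like, per the Dapr Resiliency spec (the `maxInterval` value here is illustrative, not the reporter's actual setting):

```yaml
apiVersion: dapr.io/v1alpha1
kind: Resiliency
metadata:
  name: myresiliency   # placeholder name
spec:
  policies:
    retries:
      DefaultRetryPolicy:
        policy: exponential
        maxInterval: 60s   # cap on the backoff interval (illustrative)
        maxRetries: -1     # retry indefinitely
```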
First, we found that the retry intervals were unstable and went beyond the configured bounds.
There is nothing new in the Dapr sidecar logs; the last entry is still the error about being unable to process the message.