Exception From Send (Rate Exceeded) #866
Comments
@Selimmo Thanks for that feedback. I'll forward this for triage. Just to be sure, this is currently an issue in your development environment? Open a support case via https://particular.net/support if this is affecting you in production. Could you share more details like throughput and number of endpoints/instances? How is this affecting you? Is the system eventually able to recover from this state? |
Hi @ramonsmits Yet we have another set of services that we are also trying to migrate to SQS/SNS, but those are still in a test environment and, as said above, each subscribes to 40 topics on average. This exception happens a lot when just running some tests. There are around 30 endpoints in total. The majority have their own deployment infrastructure (ECS/Fargate), but some still share an EC2 instance, and these are the ones we are migrating right now (not that the underlying infrastructure makes any difference here, but I thought I'd share more details). Unfortunately, it's not recovering: the code in NServiceBus.AmazonSQS tries to dispatch the message, but it first tries to retrieve the subscriptions, which fails with Rate Exceeded (the limit is 30 transactions per second), and this causes the event to not be dispatched. I also happen to have a question here:
|
Ok, so not a production environment. Does this also happen outside of tests?
Does this mean that these messages never get processed and are forwarded to the error queue, or are you experiencing other behavior? Are you dispatching these messages outside of an incoming message context (via |
Good question, and unfortunately there aren't any code comments that clarify anything. I'm not familiar with this code base and will forward this question. |
This question is spot on. When I check the outbox table (because we are in compatibility mode and I expect the messages to be there as well), I don't see any non-dispatched messages (
My theory so far is that the dispatch operation gets retried somehow, but a bit late, and since our tests poll for a result (in those contexts), they fail before the dispatch finally succeeds. Those services rely heavily on polling with a specific timeout, so even if the events get dispatched eventually, a web user (for example) will think that their request failed. I only thought of this theory now; I will try to prove it and come back to you. Proving it doesn't help much, but at least we would know that the events are not abandoned (which is good news for the services that are already in prod).
With events we only use |
Using the outbox should result in at-least-once delivery. Yes, it could be that there is some additional latency due to the immediate and maybe even delayed retries. For sure, the current state is not optimal. From what I understood from one of our developers, the current implementation executes this code to prevent issues caused by a misconfigured "hybrid" configuration. When relying solely on native pub/sub, this code path is not required. |
Thanks @Selimmo for that additional info. For now I can share that this is a confirmed bug and that our triage process has added this issue to our internal moderate-priority bug list. Could you please reach out to me via support@particular.net (just mention my name) using your corporate details so that we can link your account to this issue and I can provide you with follow-up on any prioritization/progress. |
Hi @ramonsmits
Does this mean that if I remove the compatibility mode, I will not see this issue?
Can you please give me a very rough estimate of when this will be complete? Like I said, we have merged many other services and those are the remaining ones, and we can't complete this without those services.
Done |
That's correct. The bug surfaces only when using the hybrid mode for the reasons explained here: NServiceBus.AmazonSQS/src/NServiceBus.Transport.SQS/MessageDispatcher.cs Lines 248 to 254 in 37edd0d
We started working on this today. The resolution time really depends on the way we're going to fix it. We might have a few more questions for you, and we will reach out through this issue soon. |
Thanks @mauroservienti, I really can't wait. |
@Selimmo we seem to be on the right track. We've introduced a rate limiter concept to guarantee that we don't exceed the Initial experiments with a publisher endpoint (in hybrid mode) publishing 300 events in a loop to a native pub/sub subscriber AND to a message-driven pub/sub subscriber work as expected and don't seem to show any performance penalty. |
@Selimmo we have a couple of questions for you. It would be of great help if you could provide estimated numbers for the following scenarios:
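For readers unfamiliar with the idea, a rate limiter of the kind mentioned above can be sketched as follows. This is a minimal sliding-window illustration in Python, not the code that went into the transport: the class name, its parameters, and the injectable `clock`/`sleep` hooks are all assumptions made for this example.

```python
import threading
import time
from collections import deque


class SlidingWindowRateLimiter:
    """Allows at most `max_calls` acquisitions in any `period`-second window.

    Hypothetical helper for illustration; not part of NServiceBus or the AWS SDK.
    """

    def __init__(self, max_calls, period=1.0, clock=time.monotonic, sleep=time.sleep):
        self._max_calls = max_calls
        self._period = period
        self._clock = clock      # injectable so the limiter can be tested without waiting
        self._sleep = sleep
        self._lock = threading.Lock()
        self._timestamps = deque()  # times of the calls still inside the window

    def acquire(self):
        """Block until one more call is allowed, then record it."""
        while True:
            with self._lock:
                now = self._clock()
                # Forget calls that have already left the window.
                while self._timestamps and now - self._timestamps[0] >= self._period:
                    self._timestamps.popleft()
                if len(self._timestamps) < self._max_calls:
                    self._timestamps.append(now)
                    return
                # Window is full: wait until the oldest call expires.
                wait = self._period - (now - self._timestamps[0])
            self._sleep(wait)
```

A caller would invoke `limiter.acquire()` immediately before each rate-limited request, for example with `SlidingWindowRateLimiter(max_calls=30, period=1.0)` for a 30-per-second limit. Sleeping outside the lock lets other threads keep checking the window while one thread waits.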
Thanks! |
Hi @mauroservienti
1- What's the publishers' message publishing throughput (msgs/sec)? I will have to come back to you on that, but I will explain a case at the end that will hopefully help you a bit.
2- We always use
3- Yes. To explain this: we have new services (microservices) using .NET Core; those are migrated to use
One thing I noticed: when I was trying to migrate the old services (again, those are using .NET Framework) to native pub/sub, some of them created the SNS topic twice, and I can see the same queue subscribed twice to the same topic (2 versions of the topic), which was weird, and I couldn't understand why it happens. This could be insignificant, but I thought I'd mention it.
On question 1 above: with the new services migrated, we also get this error
But again, I will come back to you on the average throughput per second. Hopefully this gets you going with the problem for now. |
I also want to add that those old services are the reason we are still in hybrid mode. Once we move them to native pub/sub, we can switch off the compatibility mode and, from what I understand, everything should be fine. |
This is extremely strange, the SNS API should be idempotent. By any chance, do you remember the topics' names and the names/types of the events that were causing that behavior? |
Thanks. Can you give us a rough estimate of how many event types the hybrid publisher is publishing, and how many native subscribers and non-native subscribers are subscribed to those events? |
Unfortunately not, but there were lots of them. Gonna have to deploy this branch again to get the answer |
So I dug a little, and I found through the logs that we have a few events with high frequencies. I took the highest one, which I will refer to as Event This case is on the old services that we couldn't migrate, but if I were to migrate those now, of course it would fail miserably. Also, notice that both the Service across the new and old services we have some events flying, but not with a high frequency like the above case, which is why I dismissed those. In the new services that are already migrated (using native pub/sub in hybrid mode), this only happened when we had a for loop publishing 1000+ events. The service itself was the receiver of those events, which is why it was easy to switch them to commands instead (of course, in the future we will have other services subscribing, and this fix will not work anymore). Q: if I wanted to use something other than logs to find out how many |
Thanks for the detailed answer @Selimmo. I'll have a deeper look later this (my) morning. I guess that closing the issue was not the intention, right? |
Yup apologies, i reopened again |
Yes, ServicePulse or ServiceInsight are one option. They source data from ServiceControl which consumes messages from the audit queue. So, another option would be to enable auditing on the endpoints and consume messages yourself from the audit queue to build the stats you need. This is the link to the new licensing model which includes all the products and is based on consumption. If you are on a different licensing model, I suggest writing an email to |
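As a rough illustration of building such stats yourself, the sketch below computes the peak per-second rate for each message type from a list of already-consumed audit records. The record shape (a dict with `message_type` and `timestamp` keys, the timestamp in non-negative seconds) is an assumption made for this example, not the actual audit message format, and reading from the audit queue is left out.

```python
from collections import Counter, defaultdict


def peak_rate_per_type(records):
    """Return {message_type: highest message count seen in any one-second bucket}.

    `records` is an iterable of dicts with hypothetical keys
    "message_type" (str) and "timestamp" (non-negative seconds, int or float).
    """
    buckets = defaultdict(Counter)
    for record in records:
        second = int(record["timestamp"])  # truncate to a one-second bucket
        buckets[record["message_type"]][second] += 1
    return {mtype: max(counts.values()) for mtype, counts in buckets.items()}
```

Peak rate rather than average is computed here because a burst, such as a loop publishing many events at once, is what runs into a per-second API limit.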
Thanks @mauroservienti |
If you want access to the raw messages you could configure ServiceControl to forward audits to a log queue. |
@Selimmo quick update: we have a fix in place (#933), we are dealing with some flaky tests, though. We also have another question. Given that SQS/SNS are |
That's great news @mauroservienti. I had a quick look as well, and it's looking good. |
Absolutely not. We discussed the option to completely remove the deduplication check for outgoing messages, the same as your PR does. In theory, if you are using the outbox or your receiving code is idempotent, everything should be fine even without the dedup check. However, we discarded that option for the following reasons:
On the fix: We have identified the cause of the flaky tests. It is due to eventual consistency in the propagation of subscription policies in the cluster. Given that each test run creates a different infrastructure, it's very likely that, under high load, eventual consistency in policy propagation impacts test execution. When running tests on a fixed, well-known infrastructure, they are always green. We identified a potential remedy that we'll try soon. |
@Selimmo we have a green build 🕺🏻 it took a while to stabilize all tests. It'll take a few more days to get the release out of the door due to documentation and the release process. |
@mauroservienti that's really great news. Thank you very much for keeping me posted, I can't wait for this problem to be finally solved. |
@Selimmo we're in the process of releasing 5.4. I'll ping here when the package is available on NuGet. Thanks for your patience. |
The package is available on NuGet and the documentation has been updated to reflect the changes. This is done. |
Thank you very much @mauroservienti 🎉 |
Hi Guys
We are trying to convert some services from NServiceBus.AmazonSQS 4.4.2 to the new 5.3.1. Those services subscribe to 40 topics each on average, they are all deployed to one EC2 instance, and we are still in compatibility mode.
I chose not to let NServiceBus create the SNS policies and instead created them myself to avoid running into the AWS IAM policy size limits.
The problem is that we have been facing this exception a lot, and it causes failures with some of the events. We have some sagas that cause several events to be dispatched at the same time, which was working perfectly with the message-driven style but is causing trouble with the SQS/SNS style.
All of that was discovered by some coverage tests that we have. Again, those run fine with 4.4.2, but not with 5.3.1. When we introduced some waits in the tests between those operations, they started succeeding and I no longer saw the exception, but in a high-load prod environment we can't control that.
And the main reason for that is the call that happens at
NServiceBus.AmazonSQS/src/NServiceBus.Transport.SQS/Extensions/SnsClientExtensions.cs
Line 19 in c652985
As you know, it's using ListTopics, which has a hard limit of 30 transactions per second from AWS.
Now i was wondering
I would really appreciate some help here, as this gets in the way of moving to SQS/SNS subscriptions instead of message-driven ones.
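One general way to reduce pressure on a rate-limited lookup such as the one described above is to cache its results for a short time, so that repeated publishes of the same event do not trigger a new API call each time. The Python sketch below only illustrates that idea; it is not what the transport does, and `TopicArnCache`, its `lookup` callable, and the TTL value are hypothetical names chosen for this example.

```python
import time


class TopicArnCache:
    """Caches topic-name -> ARN lookups for `ttl_seconds`.

    `lookup` is a caller-supplied function (hypothetical here) that performs
    the expensive, rate-limited call and returns the ARN for a topic name.
    """

    def __init__(self, lookup, ttl_seconds=60.0, clock=time.monotonic):
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # topic name -> (arn, time the entry was stored)

    def get(self, topic_name):
        now = self._clock()
        entry = self._entries.get(topic_name)
        if entry is not None and now - entry[1] < self._ttl:
            return entry[0]  # still fresh: no remote call needed
        arn = self._lookup(topic_name)
        self._entries[topic_name] = (arn, now)
        return arn
```

The trade-off is staleness: a topic created or deleted within the TTL window is not noticed until the entry expires, so the TTL has to be chosen with that in mind.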