
Exception From Send (Rate Exceeded) #866

Closed
Selimmo opened this issue Jun 18, 2021 · 32 comments · Fixed by #998

Comments

Selimmo commented Jun 18, 2021

Hi guys,
We are trying to convert some services from NServiceBus.AmazonSQS 4.4.2 to the new 5.3.1. These services subscribe to about 40 topics each on average, they are all deployed to one EC2 instance, and we are still in compatibility mode.

I chose not to let NServiceBus create the SNS policies; instead I created them myself to avoid any problems with the AWS IAM policy size limit.

The problem is that we have been hitting the exception below a lot, and it causes failures for some of the events. We have sagas that cause several events to be dispatched at the same time, which worked perfectly with the message-driven style but is causing trouble with the SQS/SNS style.

All of this was discovered by some coverage tests we have. Again, those run fine with 4.4.2 but not with 5.3.1; when we introduce a wait between those operations in the tests, they start succeeding and I don't see the exception, but in a high-load production environment we can't control that.

exceptionType: "Amazon.SimpleNotificationService.AmazonSimpleNotificationServiceException",
exceptionMessage: "Rate exceeded",
stackTrace:
"   at Amazon.Runtime.Internal.HttpErrorResponseExceptionHandler.HandleException(IExecutionContext executionContext, HttpErrorResponseException exception)
   at Amazon.Runtime.Internal.ErrorHandler.ProcessException(IExecutionContext executionContext, Exception exception)
   at Amazon.Runtime.Internal.ErrorHandler.<InvokeAsync>d__5`1.MoveNext()
   at Amazon.Runtime.Internal.CallbackHandler.<InvokeAsync>d__9`1.MoveNext()
   at Amazon.Runtime.Internal.EndpointDiscoveryHandler.<InvokeAsync>d__2`1.MoveNext()
   at Amazon.Runtime.Internal.CredentialsRetriever.<InvokeAsync>d__7`1.MoveNext()
   at Amazon.Runtime.Internal.RetryHandler.<InvokeAsync>d__10`1.MoveNext()
   at Amazon.Runtime.Internal.CallbackHandler.<InvokeAsync>d__9`1.MoveNext()
   at Amazon.Runtime.Internal.ErrorCallbackHandler.<InvokeAsync>d__5`1.MoveNext()
   at Amazon.Runtime.Internal.MetricsHandler.<InvokeAsync>d__1`1.MoveNext()
   at Amazon.SimpleNotificationService.AmazonSimpleNotificationServiceClient.<FindTopicAsync>d__6.MoveNext()
   at NServiceBus.Transport.SQS.Extensions.SnsClientExtensions.<FindMatchingSubscription>d__1.MoveNext() in /_/src/NServiceBus.Transport.SQS/Extensions/SnsClientExtensions.cs:line 19
   at NServiceBus.Transport.SQS.MessageDispatcher.<PrepareMessage>d__10`1.MoveNext() in /_/src/NServiceBus.Transport.SQS/MessageDispatcher.cs:line 257
   at NServiceBus.Transport.SQS.MessageDispatcher.<Dispatch>d__7.MoveNext() in /_/src/NServiceBus.Transport.SQS/MessageDispatcher.cs:line 206
   at NServiceBus.Transport.SQS.MessageDispatcher.<Dispatch>d__1.MoveNext() in /_/src/NServiceBus.Transport.SQS/MessageDispatcher.cs:line 59",

And the main reason for that is the call that happens at

var existingTopic = await snsClient.FindTopicAsync(topicName).ConfigureAwait(false);

As you know, this uses ListTopics, which has a hard limit of 30 transactions per second on the AWS side.

Now I was wondering:

  • Is there any caching solution that could be introduced to avoid this? (See the rough sketch below.)
  • Is there any extra configuration we missed that is causing this to happen over and over?
  • Could we introduce a retry mechanism around the above line, in case it was a throttling exception?
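
To make the caching/retry idea concrete, here is a rough sketch of what I have in mind. This is hypothetical code, not anything that exists in NServiceBus.AmazonSQS; the class name, the 5-second cache TTL, and the retry policy are all assumptions.

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

// Hypothetical sketch: a short-lived cache plus a throttling-aware retry around the topic lookup.
class CachedTopicLookup
{
    readonly AmazonSimpleNotificationServiceClient snsClient;
    readonly ConcurrentDictionary<string, (Topic Topic, DateTime CachedAt)> cache =
        new ConcurrentDictionary<string, (Topic, DateTime)>();
    static readonly TimeSpan Ttl = TimeSpan.FromSeconds(5);

    public CachedTopicLookup(AmazonSimpleNotificationServiceClient snsClient)
    {
        this.snsClient = snsClient;
    }

    public async Task<Topic> FindTopic(string topicName)
    {
        // Serve recent lookups from the cache so repeated publishes of the same event
        // don't hit ListTopics every time.
        if (cache.TryGetValue(topicName, out var entry) && DateTime.UtcNow - entry.CachedAt < Ttl)
        {
            return entry.Topic;
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                // FindTopicAsync pages through ListTopics, which AWS throttles at ~30 requests/second.
                var topic = await snsClient.FindTopicAsync(topicName).ConfigureAwait(false);
                cache[topicName] = (topic, DateTime.UtcNow);
                return topic;
            }
            catch (AmazonSimpleNotificationServiceException ex)
                when (ex.Message.Contains("Rate exceeded") && attempt < 5)
            {
                // Back off briefly and retry when throttled.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * attempt)).ConfigureAwait(false);
            }
        }
    }
}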

I would really appreciate some help here, as this is getting in the way of moving to the SQS/SNS subscriptions instead of message-driven.

ramonsmits (Member) commented Jun 18, 2021

@Selimmo Thanks for that feedback. I'll forward this for triage.

Just to be sure, this is currently an issue in your development environment? Open a support case via https://particular.net/support if this is affecting you in production.

Could you share more details like throughput and number of endpoints/instances?

How is this affecting you? Is the system eventually able to recover from this state?

Selimmo (Author) commented Jun 20, 2021

Hi @ramonsmits,
We have migrated many services to use the SQS/SNS subscriptions and we see this exception every once in a while. Those already-migrated services don't subscribe to a lot of events.

We have another set of services we are also trying to migrate to SQS/SNS, but those are still in a test environment, and as said above each subscribes to 40 topics on average, and this exception happens a lot just from running some tests.

There are around 30 endpoints in total; the majority have their own deployment infrastructure (ECS/Fargate), but we still have some sharing an EC2 instance, and those are the ones we are migrating right now. (Not that the underlying infrastructure makes any difference here, but I thought I'd share more details.)

Unfortunately it's not recovering. The code in NServiceBus.AmazonSQS tries to dispatch the message, but it first tries to retrieve the subscriptions, which fails with Rate Exceeded (the API allows up to 30 transactions per second), and this causes the event not to be dispatched.

And I have a question here: why are we getting the subscriptions? We don't do anything with them in the rest of the code; in other words, why not just dispatch regardless?

ramonsmits (Member)

those are still in a test environment, and this exception happens a lot just from running some tests.

Ok, so not a production environment. Does this also happen outside of tests?

Unfortunately it's not recovering. The code in NServiceBus.AmazonSQS tries to dispatch the message, but it first tries to retrieve the subscriptions, which fails with Rate Exceeded (the API allows up to 30 transactions per second), and this causes the event not to be dispatched.

Does this mean that these messages never get processed and are forwarded to the error queue or are you experiencing other behavior?

Are you dispatching these messages outside of an incoming message context (via IMessageSession.Publish) or inside a message context (in a handler via context.Publish)?

ramonsmits (Member)

Why are we getting the subscriptions? We don't do anything with them in the rest of the code; in other words, why not just dispatch regardless?

Good question, and unfortunately there aren't any code comments that clarify it. I'm not familiar with this code base and will forward the question.

Selimmo (Author) commented Jun 21, 2021

  • It happens in production as well (once every couple of weeks).
  • For one user in testing it might not happen, but we are sure it will happen in prod (a lot) with multiple users, if we go ahead with the new set of services we want to migrate.
  • We know this because the test repeats one operation back to back but with different users. The operation results in some sagas that fire a lot of events, and some of those events are handled in more than one context. When we just let the test execute normally, it fails (the second operation fails); when we introduce a sleep between them, it succeeds and we never see the exception.

Does this mean that these messages never get processed and are forwarded to the error queue or are you experiencing other behavior?

This question is spot on. When I check the outbox table (because we are in compatibility mode and I expect the messages to be there as well), I don't see any undispatched messages (dispatched = 0), but since the test is failing, this might mean one of two things:

  • The events never got dispatched in the first place, although that would not explain why I can't find any messages in the outbox table with dispatched = 0; this would only happen if the dispatch operation never gets retried.
    public async Task Dispatch(TransportOperations outgoingMessages, TransportTransaction transaction, ContextBag context)
  • The events "eventually" got dispatched, but a bit late, and since our tests are polling for a result, they fail before the event eventually gets dispatched.

My theory so far is that the dispatch operation gets retried somehow, but a bit late, and since our tests are polling for a result (in those contexts), they fail before the dispatch finally succeeds. These services rely heavily on polling, which has a specific timeout, so even if the events do get processed eventually, a web user (for example) will think that their request failed.

I only just thought of this theory, so I will try to prove it and come back to you. Proving it doesn't help much by itself, but at least we would know that events are not abandoned (which is good news for the services already in prod).

Are you dispatching these messages outside of an incoming message context (via IMessageSession.Publish) or inside a message context (in a handler via context.Publish)?

For events we only use context.Publish.

ramonsmits (Member)

Using the outbox should result in at-least-once delivery. Yes, it could be that there is some additional latency due to the immediate and maybe even delayed retries.

For sure the current state is not optimal. From what I've understood from one of our developers, the current implementation executes this code to prevent issues due to a misconfigured "hybrid" configuration. When solely relying on native pub/sub this code path is not required.

ramonsmits (Member)

Thanks @Selimmo for that additional info.

For now I can share that this is a confirmed bug and that our triage process has added this issue to our internal moderate-priority bug list.

Could you please reach out to me via support@particular.net (just mention my name) using your corporate details so that we can link your account to this issue and I can provide you with follow-up on any prioritization/progress.

Selimmo (Author) commented Jun 23, 2021

Hi @ramonsmits

From what I've understood from one of our developers, the current implementation executes this code to prevent issues due to a misconfigured "hybrid" configuration. When solely relying on native pub/sub this code path is not required.

Does this mean that if I remove compatibility mode, I will not see this issue?

For now I can share that this is a confirmed bug and that our triage process has added this issue to our internal moderate-priority bug list.

Can you please give me a very rough estimate/timing for this to be complete? Like I said, we have already migrated many other services and these are the remaining ones, and we can't complete the migration without them.

Could you please reach out to me via support@particular.net (just mention my name) using your corporate details so that we can link your account to this issue and I can provide you with follow-up on any prioritization/progress.

Done

mauroservienti (Member)

From what I've understood from one of our developers, the current implementation executes this code to prevent issues due to a misconfigured "hybrid" configuration. When solely relying on native pub/sub this code path is not required.

Does this mean that if I remove compatibility mode, I will not see this issue?

That's correct. The bug surfaces only when using the hybrid mode for the reasons explained here:

// The following check is required by the message-driven pub/sub hybrid mode in Core
// to allow endpoints to migrate from message-driven pub/sub to native pub/sub
// If the message we're trying to dispatch is a unicast message with a `Publish` intent
// but the subscriber is also subscribed via SNS we don't want to dispatch the message twice
// the subscriber will receive it via SNS and not via a unicast send.
// We can improve the situation a bit by caching the information and thus reduce the amount of times we hit the SNS API.
// We need to think about what happens in case the destination endpoint unsubscribes from the event.
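
To make that concrete, the check boils down to something like the following simplified sketch. This is not the actual MessageDispatcher code; the method name, the parameters, and the omission of result paging are all simplifications for illustration.

using System.Linq;
using System.Threading.Tasks;
using Amazon.SimpleNotificationService;
using Amazon.SimpleNotificationService.Model;

static class HybridModeCheckSketch
{
    // Returns true when the unicast copy of a published event can be skipped because
    // the destination queue already receives it through an SNS subscription.
    public static async Task<bool> ShouldSkipUnicastPublish(
        AmazonSimpleNotificationServiceClient snsClient,
        string topicName,           // topic derived from the event type
        string destinationQueueArn) // ARN of the unicast destination queue
    {
        // FindTopicAsync pages through ListTopics, which AWS throttles at roughly
        // 30 requests/second - the origin of the "Rate exceeded" exception.
        var existingTopic = await snsClient.FindTopicAsync(topicName).ConfigureAwait(false);
        if (existingTopic == null)
        {
            return false; // no topic, so the unicast send is the only delivery path
        }

        // If the destination queue is subscribed to the topic, the subscriber will get
        // the event via SNS and the unicast copy would be a duplicate.
        var subscriptions = await snsClient.ListSubscriptionsByTopicAsync(
            new ListSubscriptionsByTopicRequest { TopicArn = existingTopic.TopicArn }).ConfigureAwait(false);
        return subscriptions.Subscriptions.Any(s => s.Endpoint == destinationQueueArn);
    }
}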

For now I can share that this is a confirmed bug and that our triage process has added this issue to our internal moderate-priority bug list.

Can you please give me a very rough estimate/timing for this to be complete? Like I said, we have already migrated many other services and these are the remaining ones, and we can't complete the migration without them.

We started working on this today; the resolution time really depends on how we're going to fix it.

We might have a few more questions for you and we will reach out through this issue soon.

Selimmo (Author) commented Aug 10, 2021

Thanks @mauroservienti, I really can't wait.
It has been happening every day now in prod (on the set of services that have been migrated to use SNS in compatibility mode), and we still have others awaiting this fix (because we can't afford to push them now given this issue).
And today was the first day we saw an event going to the error queue because of this (after many retries).
Thanks again, and feel free to reach out if you have any questions please.

mauroservienti (Member)

@Selimmo we managed to reproduce the problem in an acceptance test, #933

mauroservienti (Member)

@Selimmo we seem to be on the right track. We've introduced a rate limiter concept to guarantee that we don't exceed the ListTopics API request limits. If the limit is reached, the endpoint will throttle code execution to get back on the right track. We're also using the ListSubscriptionsByTopic API, but for now we've decided that it's not worth limiting that too. We have decided, though, to cache subscription lookup results (for a very limited number of seconds), so that if a message gets published more than once we get a chance to hit the cache. (See the rough sketch below.)

Initial experiments with a publisher endpoint (in hybrid mode) publishing 300 events in a loop to a native pub/sub subscriber AND to a message-driven pub/sub subscriber are working as expected and don't seem to show any performance penalty.
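
To illustrate the rate limiter concept, here is a hypothetical sketch of the idea. It is not the actual implementation shipping with the fix; the class name and the way the one-second window is handled are assumptions.

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: never start more than maxCallsPerSecond ListTopics calls within any
// one-second window; extra callers simply wait instead of hitting the AWS limit.
class ListTopicsRateLimiterSketch
{
    readonly SemaphoreSlim slots;
    readonly TimeSpan window = TimeSpan.FromSeconds(1);

    public ListTopicsRateLimiterSketch(int maxCallsPerSecond = 30)
    {
        slots = new SemaphoreSlim(maxCallsPerSecond, maxCallsPerSecond);
    }

    public async Task<T> Execute<T>(Func<Task<T>> snsCall)
    {
        // Take a slot; when all slots for the current window are in use, this await
        // throttles the caller instead of letting AWS reject the request.
        await slots.WaitAsync().ConfigureAwait(false);
        try
        {
            return await snsCall().ConfigureAwait(false);
        }
        finally
        {
            // Give the slot back only after the window elapses, so at most
            // maxCallsPerSecond calls start per second.
            _ = ReleaseAfterWindow();
        }
    }

    async Task ReleaseAfterWindow()
    {
        await Task.Delay(window).ConfigureAwait(false);
        slots.Release();
    }
}

Wrapping the lookup as rateLimiter.Execute(() => snsClient.FindTopicAsync(topicName)) would then slow the endpoint down instead of letting AWS reject the request with "Rate exceeded".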

mauroservienti (Member)

@Selimmo we have a couple of questions for you. It would be of great help if you could provide estimated numbers for the following scenarios:

  • What's the publishers' message publishing throughput (msgs/sec)?
  • Are publishers publishing events in the context of an incoming message using the IMessageHandlerContext, or outside of one using an endpoint instance (e.g. from a web API controller action), or both?
    • If both, is there any meaningful throughput difference?
  • You mentioned that there are about 40 topics, which should map to 40 different event types. Do you have cases in which an event is subscribed to by a message-driven pub/sub subscriber only and there is no corresponding topic?

Thanks!

Selimmo (Author) commented Aug 20, 2021

Hi @mauroservienti,
Thanks for getting back to me on this.

1. What's the publishers' message publishing throughput (msgs/sec)? I will have to come back to you on that, but I'll explain a case at the end that will hopefully help a bit.

2. We always use IMessageHandlerContext to publish events.

3. Yes. To explain: we have new services (microservices) using .NET Core that have been migrated to use SNS native pub/sub, and we also have old monolith services that haven't been migrated yet (due to this problem) and still rely on message-driven pub/sub. There are events flying around between the old services and the new services and vice versa (so no corresponding topics).

One thing I noticed when I was trying to migrate the old services to native pub/sub (again, those are on .NET Framework): some of them created the SNS topic twice, and I can see the same queue subscribed twice to the same topic (two versions of the topic), which was weird and I couldn't understand why it happens. (This could be insignificant, but I thought I'd mention it.)

On question 1 above: with the new migrated services we also get this Rate Exceeded error. We have some services generating payment files and they push a lot of events (inside a for loop); at one point we had 1,000 events inside the loop and we started to see the error every day, and eventually I had to convert it to a command temporarily until this fix happens.

Again, I will come back to you on the average throughput per second, but hopefully this should get you going with the problem for now.

Selimmo (Author) commented Aug 20, 2021

I also want to add that those old services are the reason we are still on hybrid mode; once we move them to native pub/sub we can switch off compatibility mode, and from what I understand everything should then be fine.

mauroservienti (Member)

One thing I noticed when I was trying to migrate the old services to native pub/sub (again, those are on .NET Framework): some of them created the SNS topic twice, and I can see the same queue subscribed twice to the same topic (two versions of the topic), which was weird and I couldn't understand why it happens. (This could be insignificant, but I thought I'd mention it.)

This is extremely strange, the SNS API should be idempotent. By any chance, do you remember the topics' names and the names/types of the events that were causing that behavior?

mauroservienti (Member)

Yes. To explain: we have new services (microservices) using .NET Core that have been migrated to use SNS native pub/sub, and we also have old monolith services that haven't been migrated yet (due to this problem) and still rely on message-driven pub/sub. There are events flying around between the old services and the new services and vice versa (so no corresponding topics).

Thanks. Can you give us a rough estimate of how many event types the hybrid publisher is publishing, and how many native subscribers and non-native subscribers are subscribed to those events?
We could use these numbers to build a more robust series of acceptance tests.

Selimmo (Author) commented Aug 23, 2021

One thing I noticed when I was trying to migrate the old services to native pub/sub (again, those are on .NET Framework): some of them created the SNS topic twice, and I can see the same queue subscribed twice to the same topic (two versions of the topic), which was weird and I couldn't understand why it happens. (This could be insignificant, but I thought I'd mention it.)

This is extremely strange, the SNS API should be idempotent. By any chance, do you remember the topics' names and the names/types of the events that were causing that behavior?

Unfortunately not, but there were lots of them. I'm going to have to deploy this branch again to get the answer.

Selimmo closed this as completed Aug 24, 2021
Selimmo (Author) commented Aug 24, 2021

Yes. To explain: we have new services (microservices) using .NET Core that have been migrated to use SNS native pub/sub, and we also have old monolith services that haven't been migrated yet (due to this problem) and still rely on message-driven pub/sub. There are events flying around between the old services and the new services and vice versa (so no corresponding topics).

Thanks. Can you give us a rough estimate of how many event types the hybrid publisher is publishing, and how many native subscribers and non-native subscribers are subscribed to those events?
We could use these numbers to build a more robust series of acceptance tests.

So I dug a little, and I found through the logs that we have a few events with high frequencies. I took the highest one, which I will refer to as Event X.
The event is published by Service A, then it is handled in both Service A (the publisher) and another Service B.
I could see the highest occurrence of this event was 202 per second (this of course exceeds the 30-transactions-per-second limit of the ListTopics AWS API).

This case is in the old services that we couldn't migrate, but if I were to migrate those now, of course it would fail miserably.

Also, note that Service A (event publisher and receiver) and Service B (receiver) are both still on the old message-driven pub/sub (neither has been migrated).

Across the new and old services we have some events flying around, but not with a high frequency like the above case; that's why I dismissed those.

In the new services that are already migrated (using native pub/sub in hybrid mode), this only happened when we had a for loop publishing 1,000+ events. The service itself was the receiver of those events, which is why it was easy to switch them to commands instead. (Of course, in the future we will have other services subscribing and that workaround will not work anymore.)

Q: If I wanted to use something other than the logs to find out how many events are published by a specific endpoint and which endpoints handle a given event, what do you advise instead of looking at the logs and code together? I know you have ServicePulse and ServiceInsight, but is that the right way? Do we need another license for them? Could you give me a link to the pricing, and is there a trial period?

mauroservienti (Member)

Thanks for the detailed answer @Selimmo. I'll have a deeper look later this (my) morning. I guess that closing the issue was not the intention, right?

Selimmo reopened this Aug 24, 2021
Selimmo (Author) commented Aug 24, 2021

Yup, apologies, I've reopened it.

mauroservienti (Member)

Q: If I wanted to use something other than the logs to find out how many events are published by a specific endpoint and which endpoints handle a given event, what do you advise instead of looking at the logs and code together? I know you have ServicePulse and ServiceInsight, but is that the right way? Do we need another license for them? Could you give me a link to the pricing, and is there a trial period?

Yes, ServicePulse or ServiceInsight is one option. They source their data from ServiceControl, which consumes messages from the audit queue. So another option would be to enable auditing on the endpoints and consume messages from the audit queue yourself to build the stats you need, roughly along the lines of the sketch below.
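
As a rough sketch of that do-it-yourself option (hypothetical code: it assumes the SQS transport wraps each message in a JSON envelope with a Headers dictionary and that the audit queue is named "audit"; both may differ in your setup):

using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Threading.Tasks;
using Amazon.SQS;
using Amazon.SQS.Model;

class AuditQueueStats
{
    public static async Task Main()
    {
        var sqs = new AmazonSQSClient();
        // "audit" is a placeholder; use the configured audit queue name.
        var queueUrl = (await sqs.GetQueueUrlAsync("audit")).QueueUrl;
        var countsByMessageType = new Dictionary<string, int>();

        for (var i = 0; i < 100; i++) // sample a batch of audited messages
        {
            var response = await sqs.ReceiveMessageAsync(new ReceiveMessageRequest
            {
                QueueUrl = queueUrl,
                MaxNumberOfMessages = 10,
                WaitTimeSeconds = 5
            });

            foreach (var message in response.Messages)
            {
                // Assumes the transport's JSON envelope exposes the NServiceBus headers;
                // the exact envelope format may differ between transport versions.
                using var envelope = JsonDocument.Parse(message.Body);
                var headers = envelope.RootElement.GetProperty("Headers");
                var messageType = headers.GetProperty("NServiceBus.EnclosedMessageTypes").GetString();
                countsByMessageType[messageType] = countsByMessageType.GetValueOrDefault(messageType) + 1;
                // Messages are intentionally not deleted, so the sample is non-destructive
                // until the visibility timeout expires.
            }
        }

        foreach (var (type, count) in countsByMessageType)
        {
            Console.WriteLine($"{type}: {count}");
        }
    }
}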

This is the link to the new licensing model which includes all the products and is based on consumption. If you are on a different licensing model, I suggest writing an email to support [at] particular [.] net and someone from sales will take care of answering all your licensing questions.

Selimmo (Author) commented Aug 24, 2021

Thanks @mauroservienti.
The other option (ServiceControl + audit) is what we have; it puts the messages in logs, and that's where I got you some of those stats.

mauroservienti (Member)

The other option (ServiceControl + audit) is what we have; it puts the messages in logs, and that's where I got you some of those stats.

If you want access to the raw messages you could configure ServiceControl to forward audits to a log queue.

mauroservienti (Member)

@Selimmo quick update: we have a fix in place (#933), we are dealing with some flaky tests, though.

We also have another question. Given that SQS/SNS are at-least-once message transports, is your receiving code idempotent or are you using the Outbox feature? I'm asking because if that was the case we could simplify the migration mode and provide a better experience while solving the issue you are facing.

Selimmo (Author) commented Aug 31, 2021

@Selimmo quick update: we have a fix in place (#933), we are dealing with some flaky tests, though.

We also have another question. Given that SQS/SNS are at-least-once message transports, is your receiving code idempotent or are you using the Outbox feature? I'm asking because if that was the case we could simplify the migration mode and provide a better experience while solving the issue you are facing.

That's great news @mauroservienti, I had a quick look as well and it's looking good.
We are actually using the outbox feature for deduplication. Is that a problem?

mauroservienti (Member)

We are actually using the outbox feature for deduplication. Is that a problem?

Absolutely not. We discussed the option to completely remove the deduplication check for outgoing messages, the same thing your PR does. In theory, if you are using the outbox or your receiving code is idempotent, everything should be fine even without the dedup check. However, we discarded that option for the following reasons:

  • Under high throughput, the endpoint would still be rate limited by trying to look for non-existing topics (when publishing in hybrid mode to legacy subscribers only). This means we need some form of rate limiting and caching at the endpoint level anyway.
  • Endpoints in hybrid mode could be publishing up to double the expected messages, which can have a significant cost impact.

On the fix: we have identified the cause of the flaky tests; it is due to eventual consistency in subscription policy propagation in the cluster. Given that each test run creates a different infrastructure, it's very likely that, under high load, this eventual consistency impacts test execution. When running tests on a fixed, well-known infrastructure they are always green. We have identified a potential remedy that we'll try soon.

mauroservienti (Member)

@Selimmo we have a green build 🕺🏻 It took a while to stabilize all the tests. It'll take a few more days to get the release out the door due to documentation and the release process.

Selimmo (Author) commented Sep 10, 2021

@mauroservienti that's really great news, thank you very much for keeping me posted; I can't wait for this problem to be finally solved.

mauroservienti (Member)

@Selimmo we're in the process of releasing 5.4. I'll ping here when the package is available on NuGet. Thanks for your patience.

mauroservienti (Member)

The package is available on NuGet and the documentation has been updated to reflect the changes.

This is done.

Selimmo (Author) commented Sep 16, 2021

Thank you very much @mauroservienti 🎉
