Use CloudWatch Events to manage delayed messages #244

Open
mauroservienti opened this issue Sep 27, 2018 · 4 comments

@mauroservienti
Member

spun off from #191

Quoted from #191 (comment)

The idea is to use CloudWatch Events to trigger the timeout. This lets an AWS-native service own the timing of the trigger, without a complex algorithm that repeatedly re-sends messages and without requiring FIFO queues.

A simple algorithm could be:

  • On Outgoing: When a timeout request is detected:

    1. Serialise the message and store it as an S3 object (could use the existing large messages bucket)
    2. Create a CloudWatch Event Rule with a cron expression for the requested timeout time (see https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html). The event rule should send an SQS message to the appropriate queue, and the parameter should be the message ID (or the S3 object URI)
  • On Incoming: When a CloudWatch event message is detected on the queue:

    1. Delete the CloudWatch Event Rule
    2. Retrieve the message from S3 and use it as the timeout message that is passed through the rest of the NServiceBus pipeline
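
For illustration, a minimal sketch of how the cron expression in the outgoing step could be built for a single point in time. The helper name `one_shot_cron` is hypothetical and nothing here calls AWS. It assumes the documented field order (minutes, hours, day-of-month, month, day-of-week, year) and rounds up to the next whole minute so the rule does not fire before the requested time, which is a design choice and not something stated in this thread.

```python
from datetime import datetime, timedelta, timezone


def one_shot_cron(due: datetime) -> str:
    """Return a CloudWatch Events cron expression matching one UTC minute.

    Hypothetical helper: `due` must be timezone-aware. Schedules are
    evaluated in UTC with one-minute precision, so seconds are rounded up.
    """
    if due.tzinfo is None:
        raise ValueError("due must be timezone-aware")
    utc = due.astimezone(timezone.utc)
    # Round up to the next whole minute so the rule never fires early.
    if utc.second or utc.microsecond:
        utc = utc.replace(second=0, microsecond=0) + timedelta(minutes=1)
    # Field order: minutes hours day-of-month month day-of-week year.
    # Day-of-week is '?' because day-of-month is given explicitly.
    return f"cron({utc.minute} {utc.hour} {utc.day} {utc.month} ? {utc.year})"


if __name__ == "__main__":
    print(one_shot_cron(datetime(2018, 9, 27, 14, 30, 45, tzinfo=timezone.utc)))
```

Because the year is part of the expression, a rule like this matches only once, which is why the incoming step can simply delete it afterwards.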

This would permit indefinite timeouts without FIFO queues, without satellite queues and without special algorithms to check and resend the message every 15 minutes.

@danielmarbach
Contributor

Here is my original internal comment from when we considered multiple options to implement native deferral:

Another angle that might be worth considering is AWS CloudWatch.

https://docs.aws.amazon.com/AmazonCloudWatch/latest/events/ScheduledEvents.html

With CloudWatch it is possible to schedule events at a given rate or time. CloudWatch natively supports SQS and/or SNS as a target.

Caveats though:

You can create rules that self-trigger on an automated schedule in CloudWatch Events using cron or rate expressions. All scheduled events use UTC time zone and the minimum precision for schedules is 1 minute.

CloudWatch Events does not provide second-level precision in schedule expressions. The finest resolution using a cron expression is a minute. Due to the distributed nature of the CloudWatch Events and the target services, the delay between the time the scheduled rule is triggered and the time the target service honors the execution of the target resource might be several seconds. Your scheduled rule is triggered within that minute, but not on the precise 0th second.

An API is available.

It would require creating trigger rules with an SQS target. It looks like the event details are fixed, but it is possible to pass arbitrary JSON to the target. I haven't done more investigation. Just an idea to consider or leave out. I'll leave that up to the TF.

The rule management might not be trivial. We realized it would require many rules, and you are only allowed to have 100 rules per account per region, which makes CloudWatch a no-go.

@chrissimon-au
Contributor

Hi @danielmarbach - thanks for posting, that's a great analysis!

I can add some info that may or may not adjust your thinking:

100-rule limit

You can request an increase on the rule limit - I have checked and in our region (ap-southeast-2) the hard limit is 2000 rules per account. This still may be prohibitive for some use cases.

1-minute resolution

In our case, this would be acceptable, as long as the SQS deferral (with more precise resolution I think?) was used for timeouts < 15 minutes. Once the timeout is > 15 minutes we tend not to have second-resolution requirements. So if the implementation could use native deferral when the timeout is < 15 minutes, and a CloudWatch Event rule when it is > 15 minutes, that would be fine for us.
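
A minimal sketch of that split, assuming the SQS per-message delay cap of 900 seconds (15 minutes). The function name and the returned labels are hypothetical and only illustrate the routing decision.

```python
from datetime import timedelta

# SQS allows a per-message delay of at most 900 seconds (15 minutes).
SQS_MAX_DELAY = timedelta(minutes=15)


def choose_mechanism(delay: timedelta) -> str:
    """Pick how a timeout could be delivered (labels are illustrative only)."""
    if delay <= timedelta(0):
        return "send_now"           # already due, no deferral needed
    if delay <= SQS_MAX_DELAY:
        return "sqs_delay"          # native SQS delay, second-level precision
    return "cloudwatch_rule"        # scheduled rule, minute-level precision


if __name__ == "__main__":
    for minutes in (0, 5, 15, 90):
        print(minutes, choose_mechanism(timedelta(minutes=minutes)))
```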

Rule management

I agree - it would not be trivial, but I don't think it would be too complex either. I think the key would be to consider each rule to be associated with a specific datetime (minute) rather than a specific timeout event. Each rule can have up to 5 triggers associated with it, and each trigger can carry a payload with potentially multiple message IDs.

So, as timeouts are raised, the algorithm could be:

Outbound

  • Store the message to S3 keyed by the message ID

  • If a rule already exists for the requested time, add the message ID to that rule

  • If not, create a rule with the message ID in the payload

When adding a message id to a rule:

  • If an existing trigger is for the target queue, add to the payload (the input of the rule)
  • If not, create a new trigger for the target queue with the message id in the payload

When adding a message id to a payload:

  • If the payload size would not exceed the limit, update the rule with the payload (I'm struggling to find documentation for the limit - I'll run some tests later and let you know - I suspect 256 KB).
  • If the payload size would exceed the limit, add a new trigger with the message id

(There may also need to be some concurrency handling to avoid losing message ids in a payload)
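
The outbound bookkeeping above can be sketched as an in-memory model. This is only an illustration: it makes no AWS calls, the class and function names are hypothetical, the limit of 5 triggers per rule is the figure quoted above, and the 256 KB payload limit is the unverified guess from the list. It also ignores the concurrency handling mentioned above.

```python
import json
from dataclasses import dataclass, field

MAX_TARGETS_PER_RULE = 5        # figure quoted in the comment above
MAX_PAYLOAD_BYTES = 256 * 1024  # unverified guess from the comment above


@dataclass
class Target:
    """One trigger on a rule: a destination queue plus its message IDs."""
    queue: str
    message_ids: list = field(default_factory=list)

    def payload_size(self, extra_id=None) -> int:
        """Size in bytes of the JSON payload, optionally with one more ID."""
        ids = self.message_ids + ([extra_id] if extra_id is not None else [])
        return len(json.dumps({"messageIds": ids}).encode("utf-8"))


@dataclass
class Rule:
    """One rule per due minute, holding up to MAX_TARGETS_PER_RULE targets."""
    minute: str
    targets: list = field(default_factory=list)


def add_timeout(rules, minute, queue, message_id,
                max_payload_bytes=MAX_PAYLOAD_BYTES):
    """Attach message_id to the rule for `minute`, creating what is missing."""
    rule = rules.setdefault(minute, Rule(minute))
    # Reuse a target for the same queue if the payload still fits.
    for target in rule.targets:
        if (target.queue == queue
                and target.payload_size(message_id) <= max_payload_bytes):
            target.message_ids.append(message_id)
            return rule
    # Otherwise add a new target, as long as the rule has room for one.
    if len(rule.targets) >= MAX_TARGETS_PER_RULE:
        raise RuntimeError(f"rule for {minute} has no free targets")
    rule.targets.append(Target(queue, [message_id]))
    return rule


if __name__ == "__main__":
    rules = {}
    add_timeout(rules, "2018-09-27T14:30", "sales", "m1")
    add_timeout(rules, "2018-09-27T14:30", "sales", "m2")
    add_timeout(rules, "2018-09-27T14:30", "billing", "m3")
    print(rules["2018-09-27T14:30"])
```

In a real implementation each branch would map to a rule or target update against the CloudWatch Events API, and the read-modify-write on the payload is where the concurrency concern applies.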

Inbound

  • Retrieve the message from S3
  • Delete the rule

Variable resolution

To accommodate the rule limit, it may be acceptable as a compromise to scale the resolution - e.g. within 24 hours support 1 minute resolution, after 24 hours support 5 minute resolution, and after 48 hours, support 1 hour resolution - so only 24 rules would be required to support any given hour on a given day.
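
A minimal sketch of that scaling, using the thresholds from the paragraph above (1 minute within 24 hours, 5 minutes up to 48 hours, 1 hour beyond that). The function name is hypothetical, both arguments are assumed to be timezone-aware, and rounding up rather than down is an assumption made so that a timeout never fires early.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bucket_due_time(now: datetime, due: datetime) -> datetime:
    """Round `due` up to the rule resolution that applies to its distance.

    Both arguments must be timezone-aware datetimes.
    """
    delay = due - now
    if delay <= timedelta(hours=24):
        step = timedelta(minutes=1)
    elif delay <= timedelta(hours=48):
        step = timedelta(minutes=5)
    else:
        step = timedelta(hours=1)
    elapsed = due - EPOCH
    # Ceiling division on timedeltas: -(-a // b) rounds up to a whole step.
    buckets = -(-elapsed // step)
    return EPOCH + buckets * step


if __name__ == "__main__":
    now = datetime(2018, 9, 27, 12, 0, tzinfo=timezone.utc)
    print(bucket_due_time(now, now + timedelta(hours=30, minutes=2)))
```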

Why bother?

Our main reason for being interested in this is that we can't use FIFO queues, which are a requirement for the alternative algorithm that has already been implemented. We are looking at Hangfire for recurring activity, but we still have some use cases where NServiceBus timeouts are a better fit, for non-recurring timeouts that exceed 15 minutes.

I appreciate if our use case is not too common that there may not be enough appetite for this :)

@danielmarbach
Contributor

Hi @chrissimon-au

Thanks for the input. We will definitely discuss it, but I have a hunch we'd rather wait for Amazon to support FIFO queues in your region. I pinged Amazon Support and I'm meeting with an Amazon representative tomorrow to see where they are. I'll keep you posted.

Daniel

@danielmarbach
Contributor

Hi @chrissimon-au

I should have followed up earlier on this one. It seems that FIFO queues have been supported in Asia Pacific (Tokyo and Sydney) since November 2018:

https://aws.amazon.com/about-aws/whats-new/2018/11/amazon-sqs-fifo-asia-pacific-tokyo-sydney/

Have you been able to switch to FIFO queues?

Daniel
