This repository contains a CloudFormation stack that reruns ECS tasks on EC2 instances if they fail before reaching the EC2 instance. This typically occurs when the ECS Agent running on the EC2 instance is disconnected. Refer to API failure reasons for a comprehensive list of reasons why the ECS API may fail, particularly for the RunTask and StartTask actions.
The stack creates the following resources:
- EventBridge (and rules)
- Step Functions
- SNS Topic
- Lambda Function
The stack also creates some IAM roles and policies.
When an ECS API failure occurs, EventBridge catches the error and triggers the Step Functions state machine. The state machine then publishes the error to the SNS Topic, which triggers a Lambda Function that sends a message to Slack. If the reason for the failure is AGENT, the task is retried up to three times.
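The retry decision described above can be sketched in Python. This is only an illustration of the rule, not the stack's actual Step Functions definition: the function name `should_retry` and the `retries_done` counter are hypothetical, and the input is assumed to look like the `failures` list returned by the ECS RunTask and StartTask actions, where each entry carries a `reason` field.

```python
from typing import Any

# Per the description above: AGENT failures are retried up to three times.
MAX_RETRIES = 3
RETRYABLE_REASON = "AGENT"


def should_retry(failures: list[dict[str, Any]], retries_done: int) -> bool:
    """Return True if another retry is allowed and at least one failure is AGENT.

    `failures` is assumed to have the shape of the `failures` list in an
    ECS RunTask/StartTask response (hypothetical input for this sketch).
    """
    if retries_done >= MAX_RETRIES:
        return False
    return any(f.get("reason") == RETRYABLE_REASON for f in failures)
```

For example, `should_retry([{"reason": "AGENT"}], 0)` allows a retry, while the same failure after three retries, or a failure with any other reason, does not.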
To create the stack, use the quick-create links. If you don't require Slack notifications, you can use the alternative quick-create links instead.
If you wish to reproduce this scenario for testing purposes, you can apply the configuration in the terraform directory to create the environment shown in the left thumbnail.
Then, modify the outbound rules of the subnet where the EC2 instance exists to block traffic to the ECS API endpoint. For more information, refer to How to manually trigger an AGENT error when running ECS RunTask.