fivexl/terraform-aws-ecs-fargate-spot-handler

ECS Fargate Spot Handler

A Terraform module that handles AWS ECS Fargate Spot interruption notices by forcing a new deployment of each affected service. Redeploying as soon as the notice arrives lets ECS start replacement tasks before the interrupted tasks are stopped, which helps keep the service available.

Features

  • Automatic Spot Interruption Handling: Monitors AWS EventBridge for ECS Fargate Spot termination events
  • Proactive Service Redeployment: Triggers forced redeployments before hard interruption occurs
  • Comprehensive Error Handling: Implements retry logic with exponential backoff for resilient operation
  • Structured Logging: Provides detailed JSON-formatted logs for monitoring and debugging
  • Minimal Permissions: Follows the principle of least privilege for security
  • Configurable: Supports customization of Lambda function settings and resource naming

Architecture

graph TB
    A[ECS Fargate Spot Instance] -->|Interruption Notice| B[AWS EventBridge]
    B -->|Filtered Event| C[Lambda Function]
    C -->|Force New Deployment| D[ECS Service]
    C -->|Logs| E[CloudWatch Logs]
    
    subgraph "Terraform Module"
        F[EventBridge Rule]
        G[Lambda Function]
        H[IAM Role & Policies]
        I[CloudWatch Log Group]
    end
    
    B -.-> F
    C -.-> G
    G -.-> H
    E -.-> I

The module creates:

  1. EventBridge Rule: Filters ECS Task State Change events with stopCode: "SpotInterruption"
  2. Lambda Function: Processes interruption events and triggers service redeployments
  3. IAM Role & Policies: Provides minimal required permissions for ECS operations
  4. CloudWatch Log Group: Centralized logging with configurable retention
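
For orientation, an event pattern that selects these events might look like the sketch below. The module's actual pattern is not reproduced in this README: only the stopCode filter is confirmed, while the source and detail-type keys are assumptions taken from the sample event under Event Structure. The matches helper is hypothetical and implements exact-value matching only, a small subset of EventBridge pattern semantics.

```python
# Assumed pattern: only stopCode "SpotInterruption" is confirmed by the README.
SPOT_INTERRUPTION_PATTERN = {
    "source": ["aws.ecs"],
    "detail-type": ["ECS Task State Change"],
    "detail": {"stopCode": ["SpotInterruption"]},
}


def matches(pattern: dict, event: dict) -> bool:
    """Return True if every pattern key matches the event (exact values only)."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            # Nested pattern: recurse into the nested event object.
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            # Leaf pattern: a list of allowed values.
            return False
    return True
```

Serializing the dictionary with json.dumps gives the JSON form that an EventBridge rule expects for its event pattern.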

Usage

Basic Usage

module "ecs_spot_handler" {
  source = "path/to/terraform-aws-ecs-fargate-spot-handler"

  # Optional: Customize function name
  lambda_function_name = "my-spot-handler"
  
  # Optional: Add resource tags
  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

Advanced Configuration

module "ecs_spot_handler" {
  source = "path/to/terraform-aws-ecs-fargate-spot-handler"

  # Lambda Configuration
  lambda_function_name                  = "custom-spot-handler"
  lambda_timeout                        = 120
  lambda_memory_size                    = 256
  lambda_reserved_concurrent_executions = 10
  log_level                             = "DEBUG"

  # CloudWatch Configuration
  log_retention_days = 30

  # EventBridge Configuration
  eventbridge_rule_name  = "custom-spot-rule"
  eventbridge_rule_state = "ENABLED"

  # Resource Naming
  name_prefix = "prod"

  # Tags
  tags = {
    Environment = "production"
    Team        = "platform"
    Module      = "ecs-spot-handler"
  }
}

Multi-Environment Setup

# Production Environment
module "ecs_spot_handler_prod" {
  source = "path/to/terraform-aws-ecs-fargate-spot-handler"

  name_prefix          = "prod"
  lambda_function_name = "ecs-spot-handler"
  lambda_timeout       = 90
  lambda_memory_size   = 256
  log_retention_days   = 30
  log_level            = "INFO"

  tags = {
    Environment = "production"
    Team        = "platform"
  }
}

# Staging Environment
module "ecs_spot_handler_staging" {
  source = "path/to/terraform-aws-ecs-fargate-spot-handler"

  name_prefix          = "staging"
  lambda_function_name = "ecs-spot-handler"
  lambda_timeout       = 60
  lambda_memory_size   = 128
  log_retention_days   = 14
  log_level            = "DEBUG"

  tags = {
    Environment = "staging"
    Team        = "platform"
  }
}

Requirements

| Name | Version |
|------|---------|
| terraform | >= 1.0 |
| aws | >= 5.0 |

Providers

| Name | Version |
|------|---------|
| aws | >= 5.0 |

Modules

| Name | Source | Version |
|------|--------|---------|
| spot_handler_lambda | terraform-aws-modules/lambda/aws | ~> 8.0 |

Resources

| Name | Type |
|------|------|
| aws_cloudwatch_event_rule.spot_interruption | resource |
| aws_cloudwatch_event_target.lambda_target | resource |
| aws_lambda_permission.allow_eventbridge | resource |
| aws_iam_policy_document.ecs_operations | data source |

Inputs

| Name | Description | Type | Default | Required |
|------|-------------|------|---------|----------|
| eventbridge_rule_name | Name of the EventBridge rule | string | "ecs-spot-interruption" | no |
| eventbridge_rule_state | State of the EventBridge rule (ENABLED or DISABLED) | string | "ENABLED" | no |
| lambda_function_name | Name of the Lambda function | string | "ecs-fargate-spot-handler" | no |
| lambda_memory_size | Lambda function memory size in MB | number | 128 | no |
| lambda_reserved_concurrent_executions | Reserved concurrent executions for the Lambda function. Set to -1 for unreserved | number | -1 | no |
| lambda_timeout | Lambda function timeout in seconds | number | 60 | no |
| log_level | Log level for Lambda function | string | "INFO" | no |
| log_retention_days | CloudWatch log retention period in days | number | 14 | no |
| name_prefix | Prefix for resource names. If empty, no prefix will be used | string | "" | no |
| tags | A map of tags to apply to all resources | map(string) | {} | no |

Outputs

| Name | Description |
|------|-------------|
| cloudwatch_log_group_arn | The Amazon Resource Name (ARN) of the CloudWatch log group |
| cloudwatch_log_group_name | The name of the CloudWatch log group for the Lambda function |
| eventbridge_rule_arn | The Amazon Resource Name (ARN) of the EventBridge rule |
| eventbridge_rule_id | The ID of the EventBridge rule |
| eventbridge_rule_name | The name of the EventBridge rule |
| eventbridge_target_id | The ID of the EventBridge target |
| lambda_execution_role_arn | The Amazon Resource Name (ARN) of the Lambda execution role |
| lambda_execution_role_name | The name of the Lambda execution role |
| lambda_execution_role_unique_id | The unique ID of the Lambda execution role |
| lambda_function_arn | The Amazon Resource Name (ARN) of the Lambda function |
| lambda_function_invoke_arn | The invoke ARN of the Lambda function, used for API Gateway integration |
| lambda_function_name | The name of the Lambda function |
| lambda_function_qualified_arn | The qualified ARN of the Lambda function (includes version) |
| lambda_function_version | The version of the Lambda function |
| module_name | The name of this Terraform module |
| module_version | The version of this Terraform module |

How It Works

Event Processing Flow

  1. Spot Interruption Detection: AWS sends an ECS Task State Change event to EventBridge when a Fargate Spot task receives an interruption notice
  2. Event Filtering: The EventBridge rule filters for events with stopCode: "SpotInterruption"
  3. Lambda Invocation: Matching events trigger the Lambda function
  4. Event Validation: Lambda validates the event structure and extracts cluster/service information
  5. Service Redeployment: Lambda calls ECS UpdateService with forceNewDeployment=True
  6. Error Handling: Transient failures are retried with exponential backoff; permission and validation errors fail immediately

Event Structure

The Lambda function processes EventBridge events with the following structure:

{
  "version": "0",
  "id": "9bcdac79-b31f-4d3d-9410-fbd727c29fab",
  "detail-type": "ECS Task State Change",
  "source": "aws.ecs",
  "account": "111122223333",
  "time": "2023-01-01T12:00:00Z",
  "region": "us-east-1",
  "resources": [
    "arn:aws:ecs:us-east-1:111122223333:task/b99d40b3-5176-4f71-9a52-9dbd6f1cebef"
  ],
  "detail": {
    "clusterArn": "arn:aws:ecs:us-east-1:111122223333:cluster/default",
    "stopCode": "SpotInterruption",
    "group": "service:my-service"
  }
}
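
To make the validation step concrete, the sketch below pulls the cluster ARN and service name out of an event shaped like the sample above. It assumes a Python handler; extract_target and SERVICE_GROUP_PREFIX are hypothetical names, and the module's real validation may check more fields.

```python
from typing import Optional, Tuple

SERVICE_GROUP_PREFIX = "service:"


def extract_target(event: dict) -> Optional[Tuple[str, str]]:
    """Return (cluster_arn, service_name), or None if the event is not actionable."""
    detail = event.get("detail")
    if not isinstance(detail, dict):
        return None
    if detail.get("stopCode") != "SpotInterruption":
        return None
    cluster_arn = detail.get("clusterArn")
    group = detail.get("group", "")
    # Tasks started by a service carry a group of the form "service:<name>".
    if not cluster_arn or not isinstance(group, str):
        return None
    if not group.startswith(SERVICE_GROUP_PREFIX):
        return None
    service_name = group[len(SERVICE_GROUP_PREFIX):]
    if not service_name:
        return None
    return cluster_arn, service_name
```

Returning None for groups without the service: prefix means standalone tasks, which have no service to redeploy, are skipped instead of causing an error.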

IAM Permissions

The Lambda function requires the following minimal IAM permissions:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ecs:DescribeServices",
        "ecs:UpdateService",
        "ecs:DescribeTasks"
      ],
      "Resource": "*"
    }
  ]
}

Monitoring and Logging

CloudWatch Logs

The Lambda function provides structured JSON logging with the following fields:

  • timestamp: ISO 8601 timestamp
  • level: Log level (DEBUG, INFO, WARNING, ERROR)
  • message: Human-readable message
  • function: Function name where log was generated
  • line: Line number
  • cluster: ECS cluster ARN (when applicable)
  • service: ECS service name (when applicable)
  • task_arn: ECS task ARN (when applicable)
  • request_id: Lambda request ID for correlation

Example Log Entry

{
  "timestamp": "2023-01-01T12:00:00.000Z",
  "level": "INFO",
  "message": "Service redeployment triggered successfully",
  "function": "trigger_service_redeployment",
  "line": 245,
  "cluster": "arn:aws:ecs:us-east-1:111122223333:cluster/default",
  "service": "my-service",
  "deployment_id": "arn:aws:ecs:us-east-1:111122223333:service/default/my-service/deployment/123456789"
}
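
A formatter along the following lines could produce entries like the one above. This is a minimal sketch built on Python's standard logging module, assuming a Python handler; it is not the module's actual code, and JsonFormatter, build_logger, and CONTEXT_FIELDS are hypothetical names.

```python
import json
import logging
from datetime import datetime, timezone

# Context keys copied into the JSON payload when passed via `extra=`.
CONTEXT_FIELDS = ("cluster", "service", "task_arn", "request_id", "deployment_id")


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        timestamp = created.isoformat(timespec="milliseconds")
        entry = {
            "timestamp": timestamp.replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            "function": record.funcName,
            "line": record.lineno,
        }
        # Only include context fields that were actually supplied.
        for field in CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        return json.dumps(entry)


def build_logger(name: str = "spot_handler", level: str = "INFO") -> logging.Logger:
    """Create a logger that writes one JSON object per line to stderr."""
    logger = logging.getLogger(name)
    logger.handlers.clear()
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False
    return logger
```

Context is attached per call, for example logger.info("Service redeployment triggered successfully", extra={"cluster": cluster_arn, "service": service_name}), where cluster_arn and service_name are placeholders for values taken from the event.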

CloudWatch Metrics

Monitor the following CloudWatch metrics:

  • Lambda Function Metrics:

    • AWS/Lambda/Invocations: Number of function invocations
    • AWS/Lambda/Errors: Number of function errors
    • AWS/Lambda/Duration: Function execution duration
    • AWS/Lambda/Throttles: Number of throttled invocations
  • EventBridge Metrics:

    • AWS/Events/MatchedEvents: Number of events matching the rule
    • AWS/Events/Invocations: Number of target invocations
    • AWS/Events/FailedInvocations: Number of failed invocations

Error Handling

The module implements comprehensive error handling:

Retry Logic

  • Exponential Backoff: Retries with exponential backoff and jitter
  • Maximum Retries: Up to 3 retry attempts for transient errors
  • Non-Retryable Errors: Immediate failure for permission and validation errors
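
The retry behaviour described above can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: the delay constants are made up, RetryableError is a hypothetical stand-in for transient failures, and the full-jitter strategy is one common choice among several.

```python
import random
import time

MAX_RETRIES = 3            # retries after the first attempt, as listed above
BASE_DELAY_SECONDS = 0.5   # assumed value
MAX_DELAY_SECONDS = 8.0    # assumed value


class RetryableError(Exception):
    """Stand-in for a transient failure such as throttling or a network error."""


def backoff_delay(attempt: int) -> float:
    """Full-jitter delay for the given retry attempt (1-based)."""
    ceiling = min(MAX_DELAY_SECONDS, BASE_DELAY_SECONDS * 2 ** (attempt - 1))
    return random.uniform(0, ceiling)


def call_with_retry(operation, sleep=time.sleep):
    """Run operation(); retry RetryableError up to MAX_RETRIES times.

    Any other exception is treated as non-retryable and propagates at once.
    """
    for attempt in range(MAX_RETRIES + 1):
        try:
            return operation()
        except RetryableError:
            if attempt == MAX_RETRIES:
                raise
            sleep(backoff_delay(attempt + 1))
```

With a boto3 ECS client named ecs, the wrapped operation would be something like lambda: ecs.update_service(cluster=cluster_arn, service=service_name, forceNewDeployment=True), with throttling errors translated into RetryableError before they reach call_with_retry.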

Error Scenarios

  • Service Not Found: Logged as warning, returns success (service may have been deleted)
  • Cluster Not Found: Logged as error, returns failure
  • Access Denied: Logged as error, returns failure
  • Throttling: Automatic retry with exponential backoff
  • Network Errors: Automatic retry with exponential backoff
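
One way to express the scenarios above is a lookup from error code to outcome, as in this sketch. The error code strings are assumptions: the README names the scenarios but not the exact codes the handler checks. Outcome, classify, and classify_exception are hypothetical names, and Python's built-in connection errors stand in for network failures.

```python
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"   # nothing left to do; report success
    FAILURE = "failure"   # report failure without retrying
    RETRY = "retry"       # retry with exponential backoff


# Assumed ECS error codes for the scenarios listed above.
OUTCOME_BY_ERROR_CODE = {
    "ServiceNotFoundException": Outcome.SUCCESS,
    "ClusterNotFoundException": Outcome.FAILURE,
    "AccessDeniedException": Outcome.FAILURE,
    "ThrottlingException": Outcome.RETRY,
}


def classify(error_code: str) -> Outcome:
    """Map an API error code to an outcome; unknown codes fail (an assumption)."""
    return OUTCOME_BY_ERROR_CODE.get(error_code, Outcome.FAILURE)


def classify_exception(exc: Exception) -> Outcome:
    """Classify an exception raised while calling ECS."""
    # botocore's ClientError exposes the code at exc.response["Error"]["Code"].
    response = getattr(exc, "response", None)
    if isinstance(response, dict):
        return classify(response.get("Error", {}).get("Code", ""))
    # Built-in connection errors stand in for network failures in this sketch.
    if isinstance(exc, (ConnectionError, TimeoutError)):
        return Outcome.RETRY
    return Outcome.FAILURE
```

Keeping the mapping in one table makes it easy to see at a glance which failures are retried and which end the invocation.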

Security Considerations

IAM Permissions

  • Principle of Least Privilege: Only grants necessary ECS permissions
  • Resource Scoping: Permissions use Resource "*", so they apply to every ECS cluster and service rather than to specific ARNs (required for cross-service operation)
  • Managed Policies: Uses AWS managed policies where appropriate

Network Security

  • VPC Configuration: Lambda function can be configured to run in VPC if required
  • Security Groups: Standard Lambda security group rules apply
  • Encryption: All logs are encrypted at rest using CloudWatch default encryption

Troubleshooting

Common Issues

  1. Lambda Function Not Triggered

    • Check EventBridge rule is enabled
    • Verify event pattern matches actual ECS events
    • Check Lambda permissions for EventBridge invocation
  2. Permission Denied Errors

    • Verify IAM role has required ECS permissions
    • Check if Lambda execution role is properly attached
    • Ensure ECS resources exist in the same account/region
  3. Service Redeployment Fails

    • Check if ECS service exists
    • Verify cluster ARN is correct
    • Ensure service is not already updating

Debug Mode

Enable debug logging by setting log_level = "DEBUG":

module "ecs_spot_handler" {
  source = "path/to/terraform-aws-ecs-fargate-spot-handler"
  
  log_level = "DEBUG"
}

Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Add tests for new functionality
  5. Run terraform fmt and terraform validate
  6. Submit a pull request

License

This module is licensed under the MIT License. See LICENSE for details.

Authors

Created and maintained by FivexL.

Usage

provider "aws" {
  region = var.aws_region
}

# Basic usage of the ECS Fargate Spot Handler module
module "ecs_spot_handler" {
  source = "../../"

  # Optional: Customize function name
  lambda_function_name = var.lambda_function_name

  # Optional: Add resource tags
  tags = var.tags
}
