
Add wait_for_steady_state attribute to aws_ecs_service #3485

Merged
merged 3 commits into hashicorp:master on Oct 29, 2020

Conversation

@mmdriley (Contributor)

This allows for cases where the old and new service tasks have different resource dependencies -- for example, an old S3 bucket or database -- and the old resources should only be deleted once all of the old service tasks that depended on them are gone.

In the case that I hit, our service relied on an auxiliary aws_ecs_task_definition that became inactive while there were still service tasks that expected to use it with RunTask. Now we can set wait_for_steady_state and know the old task definition will stay alive as long as it's needed.

This also allows authors to make the success of their deployment conditional on the success of the service update.

To the extent possible, I modeled this after the logic of aws ecs wait services-stable. That includes the definition of "stable" as well as the timeout (10m).

Fixes #3107

@ghost ghost added the size/M Managed by automation to categorize the size of a PR. label Feb 22, 2018
@Ninir Ninir added enhancement Requests to existing resources that expand the functionality or scope. service/ecs Issues and PRs that pertain to the ecs service. labels Feb 22, 2018

steadyStateConf := &resource.StateChangeConf{
Pending: []string{"false"},
Target: []string{"true"},
@mmdriley (Contributor Author) commented on Feb 22, 2018

Using a bool state as a string feels a bit weird, but there's precedent:

https://github.com/terraform-providers/terraform-provider-aws/blob/490897d8f64608a336ab43564c5dfe681509b213/aws/resource_aws_network_interface.go#L205-L208

I considered returning the service state or "STABLE", but then we're relying on ECS never introducing another state.
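As a rough illustration for readers of this thread, here is a minimal sketch of what a refresh function with string-typed "true"/"false" states could look like. It assumes the aws-sdk-go v1 ECS client and the provider's usual meta.(*AWSClient).ecsconn accessor, and it mirrors the services-stable criteria (exactly one deployment and runningCount == desiredCount) rather than reproducing the PR's exact code:

// Illustrative sketch only -- not the PR's verbatim implementation.
// Assumed imports: github.com/aws/aws-sdk-go/aws, github.com/aws/aws-sdk-go/service/ecs,
// helper/schema and helper/resource from the Terraform plugin SDK, and fmt.
func resourceAwsEcsServiceIsSteadyStateFunc(d *schema.ResourceData, meta interface{}) resource.StateRefreshFunc {
	conn := meta.(*AWSClient).ecsconn // assumed provider client accessor

	return func() (interface{}, string, error) {
		out, err := conn.DescribeServices(&ecs.DescribeServicesInput{
			Cluster:  aws.String(d.Get("cluster").(string)),
			Services: []*string{aws.String(d.Id())},
		})
		if err != nil {
			return nil, "", err
		}
		if len(out.Services) != 1 {
			return nil, "", fmt.Errorf("expected 1 ECS service, got %d", len(out.Services))
		}

		svc := out.Services[0]

		// Mirrors `aws ecs wait services-stable`: a single deployment and
		// all desired tasks running means the service is steady.
		if len(svc.Deployments) == 1 && aws.Int64Value(svc.RunningCount) == aws.Int64Value(svc.DesiredCount) {
			return svc, "true", nil
		}
		return svc, "false", nil
	}
}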

@@ -63,6 +63,7 @@ into consideration during task placement. The maximum number of
* `placement_constraints` - (Optional) rules that are taken into consideration during task placement. Maximum number of
`placement_constraints` is `10`. Defined below.
* `network_configuration` - (Optional) The network configuration for the service. This parameter is required for task definitions that use the awsvpc network mode to receive their own Elastic Network Interface, and it is not supported for other network modes.
* `wait_for_steady_state` - (Optional) If `true`, Terraform will wait for the service to reach a steady state (like [`aws ecs wait services-stable`](https://docs.aws.amazon.com/cli/latest/reference/ecs/wait/services-stable.html)) before continuing. Default `false`.
@mmdriley (Contributor Author) commented on Feb 22, 2018

I wanted to write an acceptance test for when this is true, but I couldn't come up with a reliable, repeatable way to test we weren't hitting a race condition.

This would also be the first test case that required container instances (or which used Fargate and was locked to us-east-1).

I have verified manually that this waits for a new deployment to finish.

Contributor

What race conditions were you thinking of?

I would have thought that you could simply create an ECS service in one step (using your wait_for_steady_state to make sure it doesn't complete until there is a task running somewhere), then another step that increases the desired_count, and then another step that changes the task definition by setting an environment variable or something.

Creating container instances shouldn't be difficult; I'd just use the latest ECS optimised AMI and set the user data so that it joins the ECS cluster you create for the acceptance test.

@mmdriley (Contributor Author)

The race condition I have in mind is the one we hit without this flag: in the case where the to-be-replaced ECS service definition references a resource that will be deleted, Terraform and ECS race and Terraform can delete the resource while there are still tasks referencing it.

The problem is finding a test that would reliably fail when this flag is unset and reliably pass when it is. The test you describe would pass without this flag, right?

The best idea I have for testing this is a service binary that takes N minutes to start responding to healthchecks, then verify that service update takes at least N minutes. Not great, and for small N it's entirely possible the test would pass without this flag if service update were slow.

Contributor

If we create an ALB and Target Group then the acceptance test can be something as simple as launching busybox sleep 600 in that service.

Since it will never pass health checks the service will never become stable.

@tomelliff (Contributor) left a review comment

I was just thinking about working on this because I have a service that needs to have finished deploying before I invalidate the cache on CloudFront, so thanks for picking this up. I've added a few comments on things I might have approached differently.


// Returns a StateRefreshFunc for a given service. That function will return "true" if the service is in a
// steady state, "false" if the service is running but not in a steady state, and an error otherwise.
func resourceAwsEcsServiceIsSteadyStateFunc(d *schema.ResourceData, meta interface{}) resource.StateRefreshFunc {
Contributor

Any reason you're not just using https://docs.aws.amazon.com/sdk-for-go/api/service/ecs/#ECS.WaitUntilServicesStable here?

Contributor

You probably also want a configurable timeout period here. Although I can't see it in the Go SDK, the botocore link lower down shows a 600s total wait time (15s * 40 attempts), which seems like a reasonable default.

@mmdriley (Contributor Author)

> Any reason you're not just using https://docs.aws.amazon.com/sdk-for-go/api/service/ecs/#ECS.WaitUntilServicesStable here?

It seemed more in keeping with the Terraform code to use StateChangeConf, especially since that code has more elaborate and configurable timeout logic than the AWS SDK -- which will help with implementing configurable timeouts, as you describe.

$ grep -R . -e "\.WaitUntil" --exclude './vendor/*'
./aws/resource_aws_ebs_snapshot.go:	err := conn.WaitUntilSnapshotCompleted(req)
./aws/resource_aws_iam_instance_profile.go:	err = iamconn.WaitUntilInstanceProfileExists(waiterRequest)

> Although I can't see it in the Go SDK the botocore link lower down shows a 600s total wait time (15s * 40 attempts) which seems like a reasonable default.

That's exactly where I got 10 * time.Minute from above. :)
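For context, this is roughly how a configurable timeout could later be layered on top of StateChangeConf: a sketch (not the PR's code) that wires the waiter to the standard resource timeouts block via d.Timeout(schema.TimeoutUpdate), assuming the resource declares a Timeouts block; the 15-second MinTimeout mirrors the CLI waiter's poll interval.

// Hypothetical follow-up: let users override the wait via the resource's
// `timeouts` block instead of a hard-coded 10 minutes.
if d.Get("wait_for_steady_state").(bool) {
	stateConf := &resource.StateChangeConf{
		Pending:    []string{"false"},
		Target:     []string{"true"},
		Refresh:    resourceAwsEcsServiceIsSteadyStateFunc(d, meta),
		Timeout:    d.Timeout(schema.TimeoutUpdate), // assumes Timeouts is declared on the resource schema
		MinTimeout: 15 * time.Second,                // matches the CLI waiter's 15s poll delay
	}
	if _, err := stateConf.WaitForState(); err != nil {
		return fmt.Errorf("error waiting for ECS service (%s) to reach a steady state: %s", d.Id(), err)
	}
}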


@mmdriley (Contributor Author) left a review comment

Thanks for the comments!


@mmdriley mmdriley force-pushed the wait-for-services branch from f732521 to f115c53 on March 13, 2018
@mmdriley (Contributor Author)

Despite claiming to wait on creates and updates, my implementation previously only waited on updates. Sorry about that! I've updated the PR.

@mmdriley mmdriley force-pushed the wait-for-services branch from f115c53 to c00c0d8 on March 13, 2018
@mmdriley (Contributor Author)

We've seen a problem on create where the ECS service seems to have disappeared due to eventual consistency, similar to the problem addressed in 7551f92. Change coming.

@mmdriley mmdriley force-pushed the wait-for-services branch from c00c0d8 to fb6a5eb on March 28, 2018
@mmdriley (Contributor Author)

Now updated with code to handle stale reads. Ready for review.
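For readers wondering what "handle stale reads" can look like in practice, here is a hedged sketch (not necessarily the approach the PR takes) that retries DescribeServices briefly when the just-created service is not yet visible, using the helper/resource retry utilities; conn, cluster, serviceName, and the 2-minute window are placeholders/assumptions for values available in the create function.

// Illustrative only: tolerate ECS eventual consistency right after CreateService
// by treating an empty DescribeServices result as retryable for a short window.
err := resource.Retry(2*time.Minute, func() *resource.RetryError {
	out, err := conn.DescribeServices(&ecs.DescribeServicesInput{
		Cluster:  aws.String(cluster),                 // placeholder
		Services: []*string{aws.String(serviceName)}, // placeholder
	})
	if err != nil {
		return resource.NonRetryableError(err)
	}
	if len(out.Services) == 0 {
		// Newly created service not visible yet; retry.
		return resource.RetryableError(fmt.Errorf("ECS service %s not found yet", serviceName))
	}
	return nil
})
if err != nil {
	return err
}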

@bitbrain commented on May 2, 2018

Any news on this? Having wait_for_steady_state in Terraform would be helpful, especially for deployment pipelines.

@bgiorgini

Is this trying to implement a concept similar to this AWS CLI waiter? https://docs.aws.amazon.com/cli/latest/reference/ecs/wait/services-stable.html
Is there any news on this PR? I also think this would be very helpful for deployment pipelines.

@tomelliff (Contributor)

@bgiorgini yep, exactly like that.

As a workaround for now I'm using a null resource to shell out to the ECS waiter:

resource "null_resource" "wait_for_service_deploy" {
  triggers {
    task_definition = "${aws_ecs_service.web_service.task_definition}"
  }

  provisioner "local-exec" {
    command = "aws ecs wait services-stable --services ${aws_ecs_service.web_service.name} --cluster ${var.cluster_name} --region ${data.aws_region.current.name}"
  }
}

@PaulAtkins (Contributor)

Thanks for the workaround @tomelliff!

We found when testing that the ecs wait services-stable command works slightly differently from the ECS service stable event message, and doesn't wait for new tasks to become healthy in an ALB.

So we added a second null_resource using elbv2 wait target-in-service with a trigger on the first one & the task definition.

The triggers cause Terraform to only run elbv2 wait target-in-service after services-stable completes, and prevent the task definition from being destroyed until the new instances are healthy. (One downside is that an apply will fail if the existing targets aren't already healthy.)

resource "null_resource" "wait_for_service_deploy" {
  triggers {
    task_definition = "${aws_ecs_service.web_service.task_definition}"
  }

  provisioner "local-exec" {
    command = "aws ecs wait services-stable --services ${aws_ecs_service.web_service.name} --cluster ${var.cluster_name} --region ${data.aws_region.current.name}"
  }
}


resource "null_resource" "wait_for_service_alb_healthy" {
  triggers {
    task_definition = "${aws_ecs_service.web_service.task_definition}"
    service_stable = "${null_resource.wait_for_service_deploy.id}"
  }

  provisioner "local-exec" {
    command = "aws elbv2 wait target-in-service --target-group-arn ${aws_alb_target_group.web_service_tg.arn} --region ${data.aws_region.current.name}"
  }
}

@tomelliff (Contributor)

In my experience it does seem to catch when a service is fundamentally unhealthy (the container is missing something, so it fails to start properly), and this fails our deployment CI jobs when it happens.

What is unfortunate is that if another deployment happens that still doesn't fix things, it seems to return as stable even though the new task definition is still broken. This happens the majority of the time with back-to-back failing deploys, and it will usually also stop the original healthy tasks. Oddly this isn't 100% reproducible, and sometimes ECS keeps the original healthy tasks running until a new task definition is genuinely healthy.

I'm not sure exactly what's happening in these situations, but they're rare enough for us that I've not looked into them beyond a cursory glance when alerting points out that 0 tasks are running. It also hasn't been an issue for us because deploy failures always get flushed out in staging, since we aren't yet (and may never be) deploying straight to production.

@derickson82

I've been looking for this exact feature and would love to use it instead of the workaround. Any clue when this will be ready?

I'd be happy to help; let me know if there is anything I can do to help finish this.

@mmdriley (Contributor Author)

If this gets attention from the Terraform folks I can probably try to help shepherd it through, but otherwise I've pretty much forgotten about it.

To be practically deployable, this probably needs to allow configurable timeouts to accommodate large service deployments that will take a long time to update.

@tomelliff (Contributor)

@mmdriley first passes on a feature normally add the minimal thing to get it working; if people then have a requirement for something else (e.g. needing to wait more than 10 minutes for an unusually slow-starting service), it can be made configurable in a follow-up PR.

@bflad any chance this can get a review? A few people are interested in it and we've had a couple of issues raised looking for this functionality.

Comment on lines 1385 to 1496
data "aws_ami" "test" {
most_recent = true
owners = ["amazon"]

filter {
name = "name"
values = ["amzn2-ami-ecs-hvm-*"]
}
}

data "aws_availability_zones" "available" {
state = "available"

filter {
name = "opt-in-status"
values = ["opt-in-not-required"]
}
}

resource "aws_launch_template" "test" {
image_id = data.aws_ami.test.id
instance_type = "t3.micro"
name = %[3]q
}

resource "aws_autoscaling_group" "test" {
availability_zones = data.aws_availability_zones.available.names
desired_capacity = 1
max_size = 1
min_size = 1
name = %[3]q

launch_template {
id = aws_launch_template.test.id
}

tags = [
{
key = "foo"
value = "bar"
propagate_at_launch = true
},
]
}

@anGie44 (Contributor) commented on Oct 26, 2020

Added because DescribeServices was throwing "was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your cluster", but a new error persists: "was unable to place a task because no container instance met all of its requirements. Reason: No Container Instances were found in your capacity provider". I originally followed these instructions https://docs.aws.amazon.com/AmazonECS/latest/developerguide/launch_container_instance.html but was hitting the first error.

@anGie44 anGie44 force-pushed the wait-for-services branch 3 times, most recently from 440a662 to 28df995 Compare October 29, 2020 15:33
network_configuration {
  security_groups  = [aws_security_group.test.id]
  subnets          = aws_subnet.test[*].id
  assign_public_ip = true
Contributor

Required for the task to reach the RUNNING state -- otherwise the task fails with CannotPullContainerError: Error response from daemon: Get https://registry-name/: net/http: request canceled while waiting for connection (Client.Timeout exceeded while awaiting headers)

Comment on lines +1568 to +1585
resource "aws_internet_gateway" "test" {
vpc_id = aws_vpc.test.id
}

resource "aws_route_table" "test" {
vpc_id = aws_vpc.test.id

route {
cidr_block = "0.0.0.0/0"
gateway_id = aws_internet_gateway.test.id
}
}

resource "aws_route_table_association" "test" {
count = 2
subnet_id = element(aws_subnet.test.*.id, count.index)
route_table_id = aws_route_table.test.id
}
@anGie44 (Contributor) commented on Oct 29, 2020

Required to run the task; otherwise it receives a CannotPullContainerError and the task reaches the STOPPED state.

@anGie44 (Contributor) commented on Oct 29, 2020

Output of added tests:

--- PASS: TestAccAWSEcsService_withLaunchTypeFargateAndWaitForSteadyState (188.43s)
--- PASS: TestAccAWSEcsService_withLaunchTypeFargateAndUpdateWaitForSteadyState (212.50s)

Output of pre-existing tests:

--- PASS: TestAccAWSEcsService_withEcsClusterName (83.90s)
--- PASS: TestAccAWSEcsService_withDeploymentValues (91.25s)
--- PASS: TestAccAWSEcsService_withPlacementStrategy_Type_Missing (1.69s)
--- PASS: TestAccAWSEcsService_withDeploymentController_Type_External (95.34s)
--- PASS: TestAccAWSEcsService_disappears (106.82s)
--- PASS: TestAccAWSEcsService_withDeploymentMinimumZeroMaximumOneHundred (111.37s)
--- PASS: TestAccAWSEcsServiceDataSource_basic (113.75s)
--- PASS: TestAccAWSEcsService_withUnnormalizedPlacementStrategy (118.03s)
--- PASS: TestAccAWSEcsService_withARN (127.97s)
--- PASS: TestAccAWSEcsService_withForceNewDeployment (128.12s)
--- PASS: TestAccAWSEcsService_basicImport (128.57s)
--- PASS: TestAccAWSEcsService_withFamilyAndRevision (137.70s)
--- PASS: TestAccAWSEcsService_withIamRole (154.10s)
--- PASS: TestAccAWSEcsService_withMultipleCapacityProviderStrategies (163.97s)
--- PASS: TestAccAWSEcsService_withPlacementConstraints_emptyExpression (71.78s)
--- PASS: TestAccAWSEcsService_withPlacementConstraints (78.70s)
--- PASS: TestAccAWSEcsService_withDaemonSchedulingStrategy (57.36s)
--- PASS: TestAccAWSEcsService_withDaemonSchedulingStrategySetDeploymentMinimum (57.95s)
--- PASS: TestAccAWSEcsService_withRenamedCluster (213.87s)
--- PASS: TestAccAWSEcsService_withPlacementStrategy (134.03s)
--- PASS: TestAccAWSEcsService_withReplicaSchedulingStrategy (93.96s)
--- PASS: TestAccAWSEcsService_withCapacityProviderStrategy (244.64s)
--- PASS: TestAccAWSEcsService_withLaunchTypeEC2AndNetworkConfiguration (127.50s)
--- PASS: TestAccAWSEcsService_ManagedTags (84.81s)
--- PASS: TestAccAWSEcsService_withLaunchTypeFargate (156.91s)
--- PASS: TestAccAWSEcsService_withLaunchTypeFargateAndPlatformVersion (157.19s)
--- PASS: TestAccAWSEcsService_withLbChanges (272.96s)
--- PASS: TestAccAWSEcsService_healthCheckGracePeriodSeconds (282.99s)
--- PASS: TestAccAWSEcsService_Tags (120.85s)
--- PASS: TestAccAWSEcsService_withAlb (296.80s)
--- PASS: TestAccAWSEcsService_withDeploymentController_Type_CodeDeploy (302.03s)
--- PASS: TestAccAWSEcsService_withServiceRegistries_container (164.96s)
--- PASS: TestAccAWSEcsService_withServiceRegistries (180.50s)
--- PASS: TestAccAWSEcsService_withMultipleTargetGroups (361.04s)
--- PASS: TestAccAWSEcsService_PropagateTags (215.32s)

@anGie44 anGie44 added this to the v3.13.0 milestone Oct 29, 2020
@bflad (Contributor) left a review

Looks good! 🚀

@anGie44 anGie44 merged commit 759424c into hashicorp:master Oct 29, 2020
anGie44 added a commit that referenced this pull request Oct 29, 2020
@ghost commented on Oct 29, 2020

This has been released in version 3.13.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.

For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template for triage. Thanks!

@ghost commented on Nov 28, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked as resolved and limited conversation to collaborators Nov 28, 2020
Labels
* documentation: Introduces or discusses updates to documentation.
* enhancement: Requests to existing resources that expand the functionality or scope.
* service/ecs: Issues and PRs that pertain to the ecs service.
* size/XL: Managed by automation to categorize the size of a PR.
* tests: PRs: expanded test coverage. Issues: expanded coverage, enhancements to test infrastructure.
Successfully merging this pull request may close these issues.

Wait until ECS deployment is complete?