Timeout error retries for SSM resources #8992

Merged
merged 3 commits into from
Jun 19, 2019

ryndaniels
Contributor

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" comments, they generate extra noise for pull request followers and do not help prioritize the request

Related #7873

Release note for CHANGELOG:

BUG FIXES:
* resource/aws_ssm_document - Final retries when creating and deleting SSM documents
* resource/aws_ssm_resource_data_sync - Final retry when creating SSM resource data sync
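
For readers unfamiliar with the pattern, the "final retry" named in the release notes can be sketched roughly as below. This is a stdlib-only illustration and not the provider's code: `retry`, `errTimeout`, `errTransient`, and `createWithFinalRetry` are hypothetical stand-ins for `resource.Retry`, the `isResourceTimeoutError` check, and the SSM create call. The idea is that when the retry loop gives up because of a timeout, the operation is attempted one more time and that last result decides the outcome.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins; the provider uses its own timeout and AWS error types.
var (
	errTimeout   = errors.New("timeout while waiting")
	errTransient = errors.New("transient failure")
)

// retry calls fn until it succeeds, reports a non-retryable error,
// or the timeout has elapsed (in which case errTimeout is returned).
func retry(timeout, interval time.Duration, fn func() (retryable bool, err error)) error {
	deadline := time.Now().Add(timeout)
	for {
		retryable, err := fn()
		if err == nil || !retryable {
			return err
		}
		if !time.Now().Before(deadline) {
			return errTimeout
		}
		time.Sleep(interval)
	}
}

// createWithFinalRetry retries create on transient errors and, if the
// retry loop times out, makes one last attempt outside the loop.
func createWithFinalRetry(timeout, interval time.Duration, create func() (string, error)) (string, error) {
	var out string // declared outside the closure so the final attempt can set it too
	err := retry(timeout, interval, func() (bool, error) {
		var callErr error
		out, callErr = create()
		return errors.Is(callErr, errTransient), callErr
	})
	if errors.Is(err, errTimeout) {
		out, err = create() // the "final retry"
	}
	if err != nil {
		return "", fmt.Errorf("error creating document: %w", err)
	}
	return out, nil
}

func main() {
	calls := 0
	// Fails once with a transient error, then succeeds.
	create := func() (string, error) {
		calls++
		if calls == 1 {
			return "", errTransient
		}
		return "doc-1", nil
	}
	// A zero timeout forces the loop to time out after the first attempt,
	// so the success comes from the final attempt.
	name, err := createWithFinalRetry(0, time.Millisecond, create)
	fmt.Println(name, err, calls) // prints: doc-1 <nil> 2
}
```

Declaring the result variable outside the closure mirrors the `var resp *ssm.CreateDocumentOutput` line in the diff: both the retried call and the final call need to write to it.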

Output from acceptance testing:

$ make testacc TESTARGS="-run=TestAccAWSSSMDocument"
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -parallel 20 -run=TestAccAWSSSMDocument -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSSSMDocument_basic
=== PAUSE TestAccAWSSSMDocument_basic
=== RUN   TestAccAWSSSMDocument_update
=== PAUSE TestAccAWSSSMDocument_update
=== RUN   TestAccAWSSSMDocument_permission_public
=== PAUSE TestAccAWSSSMDocument_permission_public
=== RUN   TestAccAWSSSMDocument_permission_private
=== PAUSE TestAccAWSSSMDocument_permission_private
=== RUN   TestAccAWSSSMDocument_permission_batching
=== PAUSE TestAccAWSSSMDocument_permission_batching
=== RUN   TestAccAWSSSMDocument_permission_change
=== PAUSE TestAccAWSSSMDocument_permission_change
=== RUN   TestAccAWSSSMDocument_params
=== PAUSE TestAccAWSSSMDocument_params
=== RUN   TestAccAWSSSMDocument_automation
=== PAUSE TestAccAWSSSMDocument_automation
=== RUN   TestAccAWSSSMDocument_session
=== PAUSE TestAccAWSSSMDocument_session
=== RUN   TestAccAWSSSMDocument_DocumentFormat_YAML
=== PAUSE TestAccAWSSSMDocument_DocumentFormat_YAML
=== RUN   TestAccAWSSSMDocument_Tags
=== PAUSE TestAccAWSSSMDocument_Tags
=== CONT  TestAccAWSSSMDocument_basic
=== CONT  TestAccAWSSSMDocument_Tags
=== CONT  TestAccAWSSSMDocument_DocumentFormat_YAML
=== CONT  TestAccAWSSSMDocument_permission_public
=== CONT  TestAccAWSSSMDocument_permission_private
=== CONT  TestAccAWSSSMDocument_permission_change
=== CONT  TestAccAWSSSMDocument_session
=== CONT  TestAccAWSSSMDocument_update
=== CONT  TestAccAWSSSMDocument_permission_batching
=== CONT  TestAccAWSSSMDocument_automation
=== CONT  TestAccAWSSSMDocument_params
--- PASS: TestAccAWSSSMDocument_basic (27.58s)
--- PASS: TestAccAWSSSMDocument_session (30.04s)
--- PASS: TestAccAWSSSMDocument_permission_public (30.88s)
--- PASS: TestAccAWSSSMDocument_permission_private (32.35s)
--- PASS: TestAccAWSSSMDocument_params (33.34s)
--- PASS: TestAccAWSSSMDocument_permission_batching (37.19s)
--- PASS: TestAccAWSSSMDocument_automation (44.41s)
--- PASS: TestAccAWSSSMDocument_DocumentFormat_YAML (46.44s)
--- PASS: TestAccAWSSSMDocument_Tags (65.37s)
--- PASS: TestAccAWSSSMDocument_permission_change (69.51s)
--- PASS: TestAccAWSSSMDocument_update (71.77s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       73.176s


$ make testacc TESTARGS="-run=TestAccAWSSsmResourceDataSync"
==> Checking that code complies with gofmt requirements...
TF_ACC=1 go test ./... -v -parallel 20 -run=TestAccAWSSsmResourceDataSync -timeout 120m
?       github.com/terraform-providers/terraform-provider-aws   [no test files]
=== RUN   TestAccAWSSsmResourceDataSync_basic
=== PAUSE TestAccAWSSsmResourceDataSync_basic
=== RUN   TestAccAWSSsmResourceDataSync_update
=== PAUSE TestAccAWSSsmResourceDataSync_update
=== RUN   TestAccAWSSsmResourceDataSync_import
=== PAUSE TestAccAWSSsmResourceDataSync_import
=== CONT  TestAccAWSSsmResourceDataSync_basic
=== CONT  TestAccAWSSsmResourceDataSync_import
=== CONT  TestAccAWSSsmResourceDataSync_update
--- PASS: TestAccAWSSsmResourceDataSync_basic (58.64s)
--- PASS: TestAccAWSSsmResourceDataSync_import (62.04s)
--- PASS: TestAccAWSSsmResourceDataSync_update (112.33s)
PASS
ok      github.com/terraform-providers/terraform-provider-aws/aws       114.737s

@ghost added the size/XS and service/ssm labels Jun 14, 2019
@ryndaniels added the bug label Jun 14, 2019
@ryndaniels requested a review from bflad June 14, 2019 06:21
@bflad self-assigned this Jun 18, 2019
@@ -164,21 +164,26 @@ func resourceAwsSsmDocumentCreate(d *schema.ResourceData, meta interface{}) erro
}

log.Printf("[DEBUG] Waiting for SSM Document %q to be created", d.Get("name").(string))
var resp *ssm.CreateDocumentOutput
Contributor

😍

@@ -369,6 +374,11 @@ func resourceAwsSsmDocumentDelete(d *schema.ResourceData, meta interface{}) erro

return resource.RetryableError(fmt.Errorf("SSM Document (%s) still exists", d.Id()))
})
if isResourceTimeoutError(err) {
Contributor

Woo! Okay this one is strange currently, let's fix it!

The resource.Retry() function above uses DescribeDocument calls to wait explicitly for the InvalidDocument error from the API to say "hey! this is actually gone after you deleted it". So in our case, we'll likely need some additional logic here, not only to retry the last call, but also to say "great! you deleted it!" after this new DescribeDocument call.

So maybe this logic can look something like:

	input := &ssm.DescribeDocumentInput{
		Name: aws.String(d.Get("name").(string)),
	}

	log.Printf("[DEBUG] Waiting for SSM Document %q to be deleted", d.Get("name").(string))
	err = resource.Retry(10*time.Minute, func() *resource.RetryError {
		_, err := ssmconn.DescribeDocument(input)

		if isAWSErr(err, ssm.ErrCodeInvalidDocument, "") {
			return nil
		}

		if err != nil {
			return resource.NonRetryableError(err)
		}

		return resource.RetryableError(fmt.Errorf("SSM Document (%s) still exists", d.Id()))
	})

	if isResourceTimeoutError(err) {
		_, err = ssmconn.DescribeDocument(input)
	}

	if isAWSErr(err, ssm.ErrCodeInvalidDocument, "") {
		return nil
	}

	if err != nil {
		return fmt.Errorf("error waiting for SSM Document (%s) deletion: %s", d.Id(), err)
	}

Please reach out if you have any questions 👍

Contributor Author

@bflad gotcha! Pushed an update to this, but I am left with one question. After the retry, we return nil if it was an ErrCodeInvalidDocument (indicating it was deleted), and return an error if there was one. But, if there wasn't an error, wouldn't that imply that it still described the document and so it wasn't deleted?

Contributor

Aye, that's the rabbit hole with this type of deletion logic. You are correct that it should probably return an error if the document is still found, but it's up to you whether to implement that now; this logic is considered best effort. 😅
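
To make the open question concrete, here is a minimal sketch of a deletion wait that also reports the "still exists" case. It uses only the standard library and is not the merged code: `waitForDeletion`, `errNotFound` (standing in for the API's InvalidDocument error), and `errTimeout` are assumptions made for illustration. If the final describe call after a timeout succeeds, the document is still there, so an explicit error is returned instead of falling through.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Hypothetical stand-ins for the provider's timeout error and the
// InvalidDocument error returned once the document is gone.
var (
	errTimeout  = errors.New("timeout while waiting")
	errNotFound = errors.New("InvalidDocument")
)

// waitForDeletion polls describe until it reports errNotFound.
// describe returns nil while the document can still be described.
func waitForDeletion(id string, timeout, interval time.Duration, describe func() error) error {
	deadline := time.Now().Add(timeout)
	var err error
	for {
		err = describe()
		if err != nil {
			break // either the document is gone or a real failure occurred
		}
		if !time.Now().Before(deadline) {
			err = errTimeout
			break
		}
		time.Sleep(interval)
	}

	if errors.Is(err, errTimeout) {
		// Final attempt after the timeout.
		err = describe()
		if err == nil {
			// The describe call succeeded, so the document was not deleted.
			return fmt.Errorf("SSM Document (%s) still exists after waiting %s", id, timeout)
		}
	}

	if errors.Is(err, errNotFound) {
		return nil // deletion confirmed
	}
	return fmt.Errorf("error waiting for SSM Document (%s) deletion: %w", id, err)
}

func main() {
	// The document never goes away: the final check reports it explicitly.
	stuck := waitForDeletion("doc-1", 0, time.Millisecond, func() error { return nil })
	fmt.Println(stuck) // prints: SSM Document (doc-1) still exists after waiting 0s

	// The document is already gone: the wait returns nil.
	gone := waitForDeletion("doc-1", 0, time.Millisecond, func() error { return errNotFound })
	fmt.Println(gone) // prints: <nil>
}
```

Whether the extra error is worth adding is a judgment call, as noted above; the sketch only shows where such a check would sit relative to the final describe call.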

@bflad assigned ryndaniels and unassigned bflad Jun 18, 2019
@bflad added the waiting-response label Jun 18, 2019
@ghost added the size/S label and removed the size/XS label Jun 18, 2019
@ryndaniels removed the waiting-response label Jun 18, 2019
@bflad added this to the v2.16.0 milestone Jun 18, 2019
Contributor

@bflad bflad left a comment

LGTM! 🚀

--- PASS: TestAccAWSSsmResourceDataSync_basic (13.51s)
--- PASS: TestAccAWSSsmResourceDataSync_import (14.01s)
--- PASS: TestAccAWSSSMDocument_basic (17.97s)
--- PASS: TestAccAWSSSMDocument_permission_batching (19.40s)
--- PASS: TestAccAWSSsmResourceDataSync_update (21.13s)
--- PASS: TestAccAWSSSMDocument_session (21.67s)
--- PASS: TestAccAWSSSMDocument_permission_public (27.22s)
--- PASS: TestAccAWSSSMDocument_update (29.41s)
--- PASS: TestAccAWSSSMDocument_DocumentFormat_YAML (32.20s)
--- PASS: TestAccAWSSSMDocument_params (32.64s)
--- PASS: TestAccAWSSSMDocument_Tags (36.51s)
--- PASS: TestAccAWSSSMDocument_permission_private (38.84s)
--- PASS: TestAccAWSSSMDocument_automation (47.26s)
--- PASS: TestAccAWSSSMDocument_permission_change (87.00s)

@ryndaniels merged commit d92923f into master Jun 19, 2019
@bflad added a commit that referenced this pull request Jun 20, 2019
@ryndaniels deleted the rfd-retry-ssm branch June 20, 2019 11:53
Contributor

bflad commented Jun 20, 2019

This has been released in version 2.16.0 of the Terraform AWS provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading.


ghost commented Nov 3, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost locked and limited conversation to collaborators Nov 3, 2019