
r/resource_aws_eip: implement retry reading EIPs #1053

Merged
merged 1 commit into hashicorp:master on Jul 18, 2017

Conversation

s-urbaniak

We experience a significant number of flakes when destroying a cluster. The error that occurs is the following:

Error applying plan:
1 error(s) occurred:
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

Jenkins CI log, filtered for the relationship between aws_nat_gateway and nat_eip:

$ egrep 'nat_eip|aws_nat_gateway' jenkins-tectonic-installer.prod.coreos.systems.txt 
+ module.vpc.aws_eip.nat_eip.0
+ module.vpc.aws_eip.nat_eip.1
+ module.vpc.aws_eip.nat_eip.2
+ module.vpc.aws_nat_gateway.nat_gw.0
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.1
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.2
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
module.vpc.aws_eip.nat_eip.1: Creating...
module.vpc.aws_eip.nat_eip.2: Creating...
module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.0: Creating...
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)
module.vpc.aws_nat_gateway.nat_gw.1: Creating...
module.vpc.aws_nat_gateway.nat_gw.0: Creating...
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Creation complete (ID: nat-0a6cebc0930c8adb4)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Creation complete (ID: nat-0036bd8641f070425)
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

A noteworthy observation: while the first two EIPs (nat_eip.0, nat_eip.1) have their allocation IDs from AWS printed, in this example nat_eip.2 (it can be any other EIP) completes without an allocation ID being printed:

module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)

There is currently exactly one place in the code where the ID is explicitly set to the empty string: https://github.com/hashicorp/terraform/blob/v0.9.9/builtin/providers/aws/resource_aws_eip.go#L137-L140

The read of EIPs happens immediately after a create, but internally the AWS control plane might need longer to propagate the new address, causing the effect and flake above.

This PR makes the read of EIPs retryable. Setting the ID to the empty string is not meaningful here, because the resource would then not exist from a dependency-graph perspective. With this patch we did not observe the flake any more.
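
Roughly, the change wraps the DescribeAddresses call in a waiter. The sketch below follows the diff fragment quoted in the review further down (describeAddressesFunc, Delay, NotFoundChecks and the 15-minute timeout are taken from that fragment and the review); the request construction, the state names and the surrounding imports (helper/resource, aws-sdk-go's ec2, time, fmt) are assumptions, not the exact merged code:

// Sketch only: refresh function polling for the EIP until AWS reports it.
func describeAddressesFunc(conn *ec2.EC2, req *ec2.DescribeAddressesInput) resource.StateRefreshFunc {
    return func() (interface{}, string, error) {
        resp, err := conn.DescribeAddresses(req)
        if err != nil {
            return nil, "", err
        }
        if len(resp.Addresses) == 0 {
            // Not visible yet; report a pending state instead of wiping the ID.
            return resp, "notfound", nil
        }
        return resp.Addresses[0], "found", nil
    }
}

// ...inside the EIP Read function, instead of failing on an empty result:
stateConf := &resource.StateChangeConf{
    Pending:        []string{"notfound"},
    Target:         []string{"found"},
    Delay:          10 * time.Second,
    Timeout:        15 * time.Minute,
    Refresh:        describeAddressesFunc(ec2conn, req),
    NotFoundChecks: 90,
}
if _, err := stateConf.WaitForState(); err != nil {
    return fmt.Errorf("error reading EIP %s: %s", d.Id(), err)
}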

Fixes coreos/tectonic-installer#1246

@radeksimko this change is a bit more intrusive. Please advise if this is a good direction. Thanks a lot!

/cc @jasminSPC

Member

@radeksimko radeksimko left a comment


I'm generally feeling positive about this PR. I believe this is happening, and I have observed some of our nightly EIP tests being a bit flaky, so 👍 for having a retry in place.

Admittedly the error message here is a bit confusing

* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in: ${aws_eip.nat_eip.*.id[count.index]}

but I think it reflects the test failures I have seen. Basically what's happening here is that Terraform follows Create() -> Update() -> Read() until it actually returns the resource to the user as "Created".

Because Read() is also used during refresh (which by default runs prior to plan/apply/destroy), and it is our promise to reflect reality in the state, we have to wipe that resource from state (via d.SetId("")) if it genuinely doesn't exist.
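
For reference, the conventional refresh behaviour being described looks roughly like the sketch below (an illustration, not the exact provider code): when the API no longer returns the address on a plain refresh, the resource is dropped from state.

resp, err := ec2conn.DescribeAddresses(req)
if err != nil {
    return err
}
if len(resp.Addresses) == 0 {
    // The EIP genuinely no longer exists (e.g. deleted out of band),
    // so remove it from state and let the next plan recreate it.
    d.SetId("")
    return nil
}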

I left you some comments there on how to address this situation. I hope it's helpful.

Let me know if anything's unclear.

Delay: 10 * time.Second,
Refresh: describeAddressesFunc(ec2conn, req),
NotFoundChecks: 90,
}
Member


Since we're retrying on errors rather than checking an actual state returned by the API, I think resource.Retry() would suffice. It has a much simpler interface than StateChangeConf here, and we don't have to make up state names like notfound and found. 😉

While 15 minutes seems fairly high, I'd be OK with such a timeout if we only waited in the context of creation, not in every Read. Only right after creation can we be almost sure the resource should exist.

That d.SetId("") is there for a good reason.
It's a promise of Terraform to recover (ideally as quickly as possible) from situation where the user or another tool steps in and deletes resource Terraform has created before.

You can wrap the waiter in d.IsNewResource() to satisfy both use cases. 🙂
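
Put together, the suggestion amounts to something like the sketch below. This is only an illustration of the resource.Retry() + d.IsNewResource() combination, not the code that was eventually merged; describeEIP and the error messages are made up for the example.

// Sketch: retry the lookup only for a freshly created EIP, keep the
// normal "gone from AWS => drop from state" behaviour on plain refresh.
describeEIP := func() (bool, error) {
    resp, err := ec2conn.DescribeAddresses(req)
    if err != nil {
        return false, err
    }
    return len(resp.Addresses) > 0, nil
}

if d.IsNewResource() {
    // Right after creation the EIP must exist, so wait for the AWS
    // control plane to catch up instead of failing the first Read.
    err := resource.Retry(15*time.Minute, func() *resource.RetryError {
        found, err := describeEIP()
        if err != nil {
            return resource.NonRetryableError(err)
        }
        if !found {
            return resource.RetryableError(fmt.Errorf("EIP %s not yet visible", d.Id()))
        }
        return nil
    })
    if err != nil {
        return err
    }
} else {
    found, err := describeEIP()
    if err != nil {
        return err
    }
    if !found {
        // Deleted out of band: keep the d.SetId("") promise.
        d.SetId("")
        return nil
    }
}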

@radeksimko radeksimko added bug Addresses a defect in current functionality. waiting-response Maintainers are waiting on response from community or contributor. labels Jul 4, 2017
@radeksimko
Member

FYI https://github.com/terraform-providers/terraform-provider-aws/pull/1039/files#diff-1ea983c00dd6493ff49f61e711e647c0R138 - I think it's a very similar problem, just a different resource.

@s-urbaniak
Author

@radeksimko Thanks a lot for the valuable feedback! I will adjust the PR accordingly.

@s-urbaniak
Author

s-urbaniak commented Jul 5, 2017

@radeksimko PTAL, I addressed the review comments to the best of my knowledge and tested locally.

Side note: the snippet at https://github.com/terraform-providers/terraform-provider-aws/pull/1053/files#diff-0485863f9f2c42f2a162b547c419db50R186 looks very suspicious; the error is checked there even though it was already checked above.

@s-urbaniak
Author

@radeksimko ping, did you have a chance to look at it? thanks! :-)

@radeksimko radeksimko removed the waiting-response Maintainers are waiting on response from community or contributor. label Jul 17, 2017
@radeksimko
Member

Hey @s-urbaniak
sorry for the delay, we had a company summit last week.

Functionally this looks good; I just took the liberty of cleaning up and simplifying the code, e.g. removing the switch statements, as they seemed a bit too verbose in this (boolean) context.

I hope you don't mind - it was merely to speed things up 😄

@radeksimko radeksimko merged commit ef28258 into hashicorp:master Jul 18, 2017
@s-urbaniak
Author

@radeksimko no worries and thanks for merging! Admittedly I liked the switch case better (less if-else nesting), but that is also perfectly fine ;-)

@ghost

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 11, 2020
Labels
bug Addresses a defect in current functionality.
Development

Successfully merging this pull request may close these issues.

aws flake aws_nat_gateway.nat_gw[n]: index n out of range
3 participants