
r/resource_aws_eip: implement retry reading EIPs #1053

Merged
merged 1 commit into hashicorp:master on Jul 18, 2017

Conversation

s-urbaniak

We experience a significant number of flakes when destroying a cluster. The error that occurs is the following:

Error applying plan:
1 error(s) occurred:
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

Jenkins CI log, filtered for the relationship between aws_nat_gateway and nat_eip:

$ egrep 'nat_eip|aws_nat_gateway' jenkins-tectonic-installer.prod.coreos.systems.txt 
+ module.vpc.aws_eip.nat_eip.0
+ module.vpc.aws_eip.nat_eip.1
+ module.vpc.aws_eip.nat_eip.2
+ module.vpc.aws_nat_gateway.nat_gw.0
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.1
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
+ module.vpc.aws_nat_gateway.nat_gw.2
    allocation_id:        "${aws_eip.nat_eip.*.id[count.index]}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
    nat_gateway_id:             "${element(aws_nat_gateway.nat_gw.*.id, count.index)}"
module.vpc.aws_eip.nat_eip.1: Creating...
module.vpc.aws_eip.nat_eip.2: Creating...
module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.0: Creating...
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)
module.vpc.aws_nat_gateway.nat_gw.1: Creating...
module.vpc.aws_nat_gateway.nat_gw.0: Creating...
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m0s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m10s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m20s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m30s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m40s elapsed)
module.vpc.aws_nat_gateway.nat_gw.1: Creation complete (ID: nat-0a6cebc0930c8adb4)
module.vpc.aws_nat_gateway.nat_gw.0: Still creating... (1m50s elapsed)
module.vpc.aws_nat_gateway.nat_gw.0: Creation complete (ID: nat-0036bd8641f070425)
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in:
${aws_eip.nat_eip.*.id[count.index]}

A noteworthy observation: while the first two EIPs (nat_eip.0, nat_eip.1) have their allocation IDs from AWS printed, in this example nat_eip.2 (it can be any other EIP) completes without an allocation ID being printed:

module.vpc.aws_eip.nat_eip.1: Creation complete (ID: eipalloc-1314807a)
module.vpc.aws_eip.nat_eip.2: Creation complete
module.vpc.aws_eip.nat_eip.0: Creation complete (ID: eipalloc-c268fcab)

There is currently exactly one place in the code where the ID is explicitly set to the empty string: https://github.com/hashicorp/terraform/blob/v0.9.9/builtin/providers/aws/resource_aws_eip.go#L137-L140

The read of EIPs happens immediately after a create, but internally the AWS control plane might need longer to propagate the new address, causing the effect and flake above.

This PR makes the read of EIPs retryable. Setting the ID to the empty string is not meaningful here, because the resource would then not exist from a dependency-graph perspective. With this patch we did not observe the flake any more.
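
Roughly, the change wraps the DescribeAddresses call in a waiter. The sketch below follows the diff fragment quoted in the review further down (describeAddressesFunc, Delay, NotFoundChecks and the 15-minute timeout are taken from that fragment and the review); the request construction, the state names and the surrounding imports (helper/resource, aws-sdk-go's ec2, time, fmt) are assumptions, not the exact merged code:

// Sketch only: refresh function polling for the EIP until AWS reports it.
func describeAddressesFunc(conn *ec2.EC2, req *ec2.DescribeAddressesInput) resource.StateRefreshFunc {
    return func() (interface{}, string, error) {
        resp, err := conn.DescribeAddresses(req)
        if err != nil {
            return nil, "", err
        }
        if len(resp.Addresses) == 0 {
            // Not visible yet; report a pending state instead of wiping the ID.
            return resp, "notfound", nil
        }
        return resp.Addresses[0], "found", nil
    }
}

// ...inside the EIP Read function, instead of failing on an empty result:
stateConf := &resource.StateChangeConf{
    Pending:        []string{"notfound"},
    Target:         []string{"found"},
    Delay:          10 * time.Second,
    Timeout:        15 * time.Minute,
    Refresh:        describeAddressesFunc(ec2conn, req),
    NotFoundChecks: 90,
}
if _, err := stateConf.WaitForState(); err != nil {
    return fmt.Errorf("error reading EIP %s: %s", d.Id(), err)
}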

Fixes coreos/tectonic-installer#1246

@radeksimko this change is a bit more intrusive. Please advise if this is a good direction. Thanks a lot!

/cc @jasminSPC

Member

@radeksimko radeksimko left a comment


I'm generally feeling positive about this PR. I believe this is happening, and I have observed some of our nightly EIP tests being a bit flaky, so 👍 for having a retry in place.

Admittedly the error message here is a bit confusing

* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in: ${aws_eip.nat_eip.*.id[count.index]}

but I think it reflects the test failures I have seen. Basically what's happening here is that Terraform follows Create() -> Update() -> Read() until it actually returns the resource to the user as "Created".

Because Read() is also used during refresh (which by default runs prior to plan/apply/destroy), and it is our promise to reflect reality in the state, we have to wipe that resource from state (via d.SetId("")) if it genuinely doesn't exist.
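
For reference, the conventional refresh behaviour being described looks roughly like the sketch below (an illustration, not the exact provider code): when the API no longer returns the address on a plain refresh, the resource is dropped from state.

resp, err := ec2conn.DescribeAddresses(req)
if err != nil {
    return err
}
if len(resp.Addresses) == 0 {
    // The EIP genuinely no longer exists (e.g. deleted out of band),
    // so remove it from state and let the next plan recreate it.
    d.SetId("")
    return nil
}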

I left you some comments there on how to address this situation. I hope it's helpful.

Let me know if anything's unclear.

Delay: 10 * time.Second,
Refresh: describeAddressesFunc(ec2conn, req),
NotFoundChecks: 90,
}
Member


Since we're retrying on errors rather than checking an actual state returned by the API, I think resource.Retry() would suffice. It has a much simpler interface than StateChangeConf here, and we don't have to make up state names like notfound and found. 😉

While 15 minutes seems fairly high, I'd be OK with such a timeout if we only waited in the context of creation, not in every Read. Only right after creation can we be almost sure the resource should exist.

That d.SetId("") is there for a good reason.
It's a promise of Terraform to recover (ideally as quickly as possible) from situation where the user or another tool steps in and deletes resource Terraform has created before.

You can wrap the waiter in d.IsNewResource() to satisfy both use cases. 🙂
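
Put together, the suggestion amounts to something like the sketch below. This is only an illustration of the resource.Retry() + d.IsNewResource() combination, not the code that was eventually merged; describeEIP and the error messages are made up for the example.

// Sketch: retry the lookup only for a freshly created EIP, keep the
// normal "gone from AWS => drop from state" behaviour on plain refresh.
describeEIP := func() (bool, error) {
    resp, err := ec2conn.DescribeAddresses(req)
    if err != nil {
        return false, err
    }
    return len(resp.Addresses) > 0, nil
}

if d.IsNewResource() {
    // Right after creation the EIP must exist, so wait for the AWS
    // control plane to catch up instead of failing the first Read.
    err := resource.Retry(15*time.Minute, func() *resource.RetryError {
        found, err := describeEIP()
        if err != nil {
            return resource.NonRetryableError(err)
        }
        if !found {
            return resource.RetryableError(fmt.Errorf("EIP %s not yet visible", d.Id()))
        }
        return nil
    })
    if err != nil {
        return err
    }
} else {
    found, err := describeEIP()
    if err != nil {
        return err
    }
    if !found {
        // Deleted out of band: keep the d.SetId("") promise.
        d.SetId("")
        return nil
    }
}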

@radeksimko radeksimko added bug Addresses a defect in current functionality. waiting-response Maintainers are waiting on response from community or contributor. labels Jul 4, 2017
@radeksimko
Member

FYI https://github.com/terraform-providers/terraform-provider-aws/pull/1039/files#diff-1ea983c00dd6493ff49f61e711e647c0R138 - I think it's a very similar problem, just a different resource.

@s-urbaniak
Author

@radeksimko Thanks a lot for the valuable feedback! I will adjust the PR accordingly.

@s-urbaniak
Author

s-urbaniak commented Jul 5, 2017

@radeksimko PTAL, I addressed the review comments to the best of my knowledge and tested locally.

Side note: the snippet at https://github.com/terraform-providers/terraform-provider-aws/pull/1053/files#diff-0485863f9f2c42f2a162b547c419db50R186 looks very suspicious; the error is checked there even though it was already checked above.

@s-urbaniak
Author

@radeksimko ping, did you have a chance to look at it? thanks! :-)

@radeksimko radeksimko removed the waiting-response Maintainers are waiting on response from community or contributor. label Jul 17, 2017
@radeksimko
Member

Hey @s-urbaniak
sorry for the delay, we had a company summit last week.

Functionally this looks good; I just took the liberty of cleaning up and simplifying the code, e.g. removing the switch statements, as they seemed a bit too verbose in this (boolean) context.

I hope you don't mind - it was merely to speed things up 😄

@radeksimko radeksimko merged commit ef28258 into hashicorp:master Jul 18, 2017
@s-urbaniak
Author

@radeksimko no worries and thanks for merging! Admittedly I liked the switch case better (less if-else nesting), but that is also perfectly fine ;-)

@ghost

ghost commented Apr 11, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Apr 11, 2020
Labels
bug Addresses a defect in current functionality.
Development

Successfully merging this pull request may close these issues.

aws flake aws_nat_gateway.nat_gw[n]: index n out of range
3 participants