-
Notifications
You must be signed in to change notification settings - Fork 9.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
r/resource_aws_eip: implement retry reading EIPs #1053
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally feeling positive about this PR. I believe this is happening and I did observe some of our EIP nightly tests to be a bit flaky, so 👍 for having retry in place.
Admittedly the error message here is a bit confusing
* module.vpc.aws_nat_gateway.nat_gw[2]: index 2 out of range for list aws_eip.nat_eip.*.id (max 2) in: ${aws_eip.nat_eip.*.id[count.index]}
but I think it reflects the test failures I have seen. Basically what's happening here is that Terraform follows Create() -> Update() -> Read()
until it actually returns the resource to the user as "Created".
Because Read()
is also used during refresh
(which by default runs prior to plan
/apply
/destroy
) and it's our promise to reflect the reality in the state, we have to wipe that resource from state (via d.SetId("")
) if it genuinely doesn't exist.
I left you some comments there on how to address this situation. I hope it's helpful.
Let me know if anything's unclear.
aws/resource_aws_eip.go
Outdated
Delay: 10 * time.Second, | ||
Refresh: describeAddressesFunc(ec2conn, req), | ||
NotFoundChecks: 90, | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we're retrying on errors, not checking an actual state returned by the API I think resource.Retry()
would suffice. It has much simpler interface than StateChangeConf
here and we don't have to make up state names like notfound
and found
. 😉
While 15mins seems fairly high, I'd be ok with such timeout if we only waited in the context of creation, not in Read
. Only right after creation we can be almost sure the resource should exist.
That d.SetId("")
is there for a good reason.
It's a promise of Terraform to recover (ideally as quickly as possible) from situation where the user or another tool steps in and deletes resource Terraform has created before.
You can wrap the waiter in d.IsNewResource()
to satisfy both use cases. 🙂
FYI https://github.com/terraform-providers/terraform-provider-aws/pull/1039/files#diff-1ea983c00dd6493ff49f61e711e647c0R138 - I think it's very similar problem, just different resource. |
@radeksimko Thank you a lot for the valuable feedback! I will adjust the PR accordingly. |
@radeksimko PTAL, addressed review comments to my best knowledge and tested locally. Side-note: the snippet in https://github.com/terraform-providers/terraform-provider-aws/pull/1053/files#diff-0485863f9f2c42f2a162b547c419db50R186 looks very suspicious. The error is checked, despite the fact that it is already checked above. |
@radeksimko ping, did you have a chance to look at it? thanks! :-) |
Hey @s-urbaniak This functionally looks good, I just took the liberty and cleaned/simplified the code - e.g. removed I hope you don't mind - it was merely to speed things up 😄 |
@radeksimko no worries and thanks for merging! Admittedly I liked the switch case better (less if-else nesting), but that is also perfectly fine ;-) |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks! |
We experience a significant amount of flakes when destroying a cluster. The error that occurs is as follows:
CI jenkins log, filtered by the relationship between
aws_nat_gateway
andnat_eip
:Noteful observation. While the first two EIPs (
nat_eip.0
,nat_eip.1
) get allocation IDs from AWS printed out, in this example sometimesnat_eip.2
(or any other EIP) does not get a allocation ID printed out:There is currently exactly one code place, where the ID set to empty string explicitely: https://github.com/hashicorp/terraform/blob/v0.9.9/builtin/providers/aws/resource_aws_eip.go#L137-L140
The read of EIPs happens immediately after a create, but internally the AWS control plane might need longer to propagate causing the above effect and flake.
This PR makes the read of EIPs retryable. Setting the ID to empty string is not meaningful, because the resource then would not exist from a dependency graph perspective. We did not observe this flake any more with this patch.
Fixes coreos/tectonic-installer#1246
@radeksimko this change is a bit more intrusive. Please advise if this is a good direction. Thanks a lot!
/cc @jasminSPC