resource/aws_route_table_association: error reading Route Table Association (rtbassoc-xxx): Empty result #21629
Comments
I experience the same behavior using TF 0.13.7 and AWS provider 3.62.0. Before upgrading to 3.62.0 I saw the problem with route tables, which is now fixed (#19985). The route table problem is gone, but I still often get errors for route table associations, as described above, when running tests. Note that this issue is nondeterministic and may take a few runs to reproduce. Sample error below:
|
We're hitting this as well, seemingly randomly, on version |
Hi all 👋 PR #21710 has been merged to hopefully address this nondeterministic issue. Any findings from those who upgrade to the new provider that will be out later today (v3.65.0) would be greatly appreciated! |
Hi @anGie44, unfortunately this issue is still present in v3.65.0. I analysed the code and it seems to me that the FindRouteTableAssociationByID function is the problem. My suspicion is that it starts returning the expected results in the StatusRouteTableAssociationState check in WaitRouteTableAssociationUpdated, but then for some reason it does not return the same results in the subsequent resourceRouteTableAssociationRead. So either there is a bug in this function (in which case it needs to be fixed) or AWS is returning inconsistent results (in which case maybe WaitRouteTableAssociationUpdated should wait for more than one successful find invocation). BTW it looks like there is an IGW regression in v3.65.0 which affects Terraform destroy; I reported it as a separate issue: #21792 |
FTR I tested this on 3.66.0 version and now I see routeTable issues like below (IIRC I saw route_table_association issues but much less often than routeTable): Error: error reading Route Table (rtb-0fa601bb520958930): couldn't find resource
For now I decided to rollback to 3.62.0 on prod, seems to me that route table issues occur less often on that version. |
I am experiencing the same issue. I am using pulumi 4.31.0, itself using aws terraform provider 3.68.0 |
Seeing the same issue on |
@artem-nefedov same. Not sure what your setup looks like, but we've found that wrapping the process in our own retry blocks seems to give better results. It's unfortunate that we have to do so, but we also can't sit idly by waiting for this type of thing to be fixed, which has been ongoing seemingly forever now. |
@jasonkinsella the retry blocks are not being done within terraform but outside using a python script. Basically we are retrying applying the terraform over and over again until it succeeds. It's not a great solution but it mostly works. |
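For anyone wanting to try the same workaround, the outer retry loop described above might look roughly like the following. This is only a sketch, not the script used by the commenter: the function name `apply_with_retries`, the `run` and `sleep` hooks, and the default attempt count and delay are all assumptions made for illustration. The only real CLI call is `terraform apply -auto-approve`.

```python
import subprocess
import time


def apply_with_retries(run=None, attempts=5, delay_seconds=30.0, sleep=time.sleep):
    """Re-run `terraform apply` until it exits 0 or the attempts run out.

    `run` is a hypothetical hook that returns a process exit code. It
    defaults to invoking the real CLI and exists so the loop can be tested
    without Terraform installed.
    """
    if run is None:
        def run():
            # Assumes the working directory is an initialised Terraform root.
            return subprocess.run(["terraform", "apply", "-auto-approve"]).returncode

    for attempt in range(1, attempts + 1):
        if run() == 0:
            return attempt  # how many attempts it took to succeed
        if attempt < attempts:
            sleep(delay_seconds)  # give the EC2 API time to settle before retrying
    raise RuntimeError(f"terraform apply still failing after {attempts} attempts")
```

Calling `apply_with_retries()` with no arguments runs the real CLI. Because a failed apply leaves already-created resources in state, each retry only has to pick up what is missing, which is presumably why this blunt approach "mostly works".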
@jasonkinsella you and me both. These errors have taken so many different forms and variations and have been really really really frustrating for us the past 6 months or so. |
FTR I tried to briefly analyse that problem some time ago (version 3.66.0). It manifests either as an empty result for a route table lookup or for a route table association lookup. Both use the same read API (route table find). What happens is that during create, the route is created, then create waits until read returns the route (this succeeds), and then read is invoked once again to read the resource, but for some reason the query does not return it. So it seems to me that either AWS is inconsistent with the routes query or there is some issue in the AWS provider that reads the AWS response inconsistently (e.g. pagination or something else). |
@mgusiew-guide From official AWS documentation
Basically, this is not a bug from AWS perspective, and it's up to client to implement retries. |
@artem-nefedov AFAIR wait has retries and waits for the result to appear; the problem is that once wait succeeds there is another read, and this read does not find the route. So it is a bit like DNS propagation: after it returns the first correct result, it may happen that it will still return stale results for a while (this is due to the distributed nature of the service). I would expect that once the route read query returns the specific route, the subsequent read query will also return it. For some reason that is not the case. Note that I haven't analysed it deeply, so my theory may be wrong. To verify this it would be good to set up the provider in trace mode, where AWS commands are logged, and try to reproduce. Unfortunately right now I don't have capacity to do that. |
Yes, that's exactly what "eventual consistency" means. Suppose you have a cluster of 3 nodes handling your request, and you perform a write operation. That request goes to one of the nodes, which then asynchronously propagates the change to the other nodes (which takes some time). During this time, you can make a read request, and it may go to the original node (the one you wrote to), and it will return the data successfully. Then you can make a second request, and it can potentially go to another node, which potentially doesn't have the data propagated yet, so you will get a failure. This is exactly why the fix for a similar problem with Security Groups says:
|
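The three-node scenario above can be made concrete with a toy model. This is purely illustrative: the `Cluster` class and its methods are invented for this sketch and say nothing about how the EC2 backend is actually built.

```python
class Cluster:
    """Toy model: a write lands on one node and reaches the others later."""

    def __init__(self, node_count=3):
        self.nodes = [set() for _ in range(node_count)]

    def write(self, key, node=0):
        self.nodes[node].add(key)  # only the receiving node has the key so far

    def replicate(self):
        # Stand-in for asynchronous propagation finally completing.
        everything = set().union(*self.nodes)
        for node in self.nodes:
            node.update(everything)

    def read(self, key, node):
        return key in self.nodes[node]


cluster = Cluster()
cluster.write("rtbassoc-xxx", node=0)
first = cluster.read("rtbassoc-xxx", node=0)   # served by the node that took the write
second = cluster.read("rtbassoc-xxx", node=1)  # served by a node that is still behind
cluster.replicate()
third = cluster.read("rtbassoc-xxx", node=1)   # same node, after propagation
```

Here `first` is `True`, `second` is `False`, and `third` is `True`: a successful read followed by an "empty result", which matches the wait-then-read sequence reported in this thread.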
@artem-nefedov I agree with your train of thought. I think that adding ContinuousTargetOccurence: 3 (or higher value) to resource.StateChangeConf in WaitRouteTableAssociationCreated and WaitRouteTableReady in ec2/wait.go could solve this issue. Unfortunately I don't see that change in main branch. Hopefully somebody will pick it up and this gets resolved... |
I am sorry WaitRouteTableReady was not affected. Just noticed that ContinuousTargetOccurence: 2 is already present in WaitRouteReady since 11/06/2021. Hopefully someone will have a closer look. |
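To illustrate what requiring several consecutive hits buys you, here is a minimal sketch of the idea in Python. It is not the SDK's `StateChangeConf` implementation; `wait_for_stable_read` and all of its parameters are hypothetical names chosen for this example.

```python
import time


def wait_for_stable_read(find, required_consecutive=3, max_polls=20,
                         poll_seconds=1.0, sleep=time.sleep):
    """Poll `find` until it returns True `required_consecutive` times in a row.

    `find` is a hypothetical zero-argument callable that returns True when
    the resource is visible. Returns False if `max_polls` is exhausted.
    """
    streak = 0
    for _ in range(max_polls):
        if find():
            streak += 1
            if streak >= required_consecutive:
                return True
        else:
            streak = 0  # a miss suggests a stale replica answered; start over
        sleep(poll_seconds)
    return False
```

A single successful read only shows that one replica has the data. Requiring a streak makes it less likely that the very next read lands on a replica that is still behind, although it cannot rule that out.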
Apologies for the noise; in case it's helpful: this still happens on 3.72.0 and 3.73.0. |
This is a far from negligible occurrence pattern; in a dozen applies across 2 days I've seen it fail twice already (earlier it felt very rare). This impacts terraform-aws-modules/vpc/aws directly. And obviously it completely breaks any CD pipelines that create and immediately manipulate route tables (i.e. everyone who creates a VPC?). It is so breaking, in fact, that I have to ask myself if AWS changed something in their backend to make this happen more often. Bumping up literals like Retrying the entire script is like using a cannon as a fly swatter. I'd be glad to test any potential fixes as I'm running into this on the daily. |
@mritalian you said what we're all thinking brother ... but they seem to only want to provide Band-Aids and not complete solutions |
In the past 3 months we have gone from almost zero failures to about 5 per day across about 100 deployments. We had our provider pinned to 3.64 during this whole time. We've now got a dedicated customer channel reporting launch failures! I suspect there has been some AWS backend change that's amplified this weakness. |
Disclaimers:
Now some comments:
FTR I agree that neither Terraform nor the Terraform AWS provider is perfect; there are gaps in coverage, and that results in regressions. However, what I noticed is that when I report a problem and provide the details, somebody will look at it, e.g. #21792 (comment). So maybe as users we could at least try to collect some more data in order to move this one forward. |
This functionality has been released in v4.0.0 of the Terraform AWS Provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. For further feature requests or bug reports with this functionality, please create a new GitHub issue following the template. Thank you! |
v4.0.0 has breaking changes. Would it be possible to mitigate this in a v3.x version? |
FTR I ran my test suite on 4.0.0 (fortunately I am not affected by backwards incompatible changes) and I still receive route table errors like the one below (I use VPC module from terraform-aws-modules). Therefore I am not sure if this issue is completely fixed. on .terraform/modules/vpc.vpc.vpc/main.tf line 203, in resource "aws_route_table" "public": |
Upgrade the aws terraform provider to v4.5.0 to bring in the fix for the empty result error when reading route table associations [1]. https://bugzilla.redhat.com/show_bug.cgi?id=2064969 [1] hashicorp/terraform-provider-aws#21629
Hi, I am using hashicorp/aws v4.9.0 and I see the same error when I try to create aws_route_table_association
I see my route table, associated subnets and subnet associations created on AWS. |
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. |
Community Note
Terraform CLI and Terraform AWS Provider Version
terraform version -
0.12.31
provider-aws version -
3.54.0
Affected Resource(s)
Terraform Configuration Files
Please include all Terraform configurations required to reproduce the bug. Bug reports without a functional reproduction may be closed without investigation.
Debug Output
Panic Output
Expected Behavior
Actual Behavior
Steps to Reproduce
terraform apply the configuration from above
On a heavily used AWS account, it may fail with the above error:
Maybe also cloud provider request limits and throttling can lead to this error?
Can this issue be related to the eventual consistency model of the AWS EC2 API (hence related to #16796)?
Important Factoids
References