Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Increases the timeout for TransitGateway routes #8417

Closed
wants to merge 1 commit into from
Closed

fix: Increases the timeout for TransitGateway routes #8417

wants to merge 1 commit into from

Conversation

ghost
Copy link

@ghost ghost commented Apr 23, 2019

When creating a lot of routes (+4), idempotency is broken. This is what happen:

  • the routes creation stop exactly after 1m;
  • the apply finishes without apparent issue;
  • the routes are indeed created on AWS;
  • the state file does not contain the routes (!);
  • subsequent applies try to create the route again (because it’s not in the state file);
  • subsequent applies fail.

There maybe a better solution to this problem, but this quick fix works.

Community Note

  • Please vote on this pull request by adding a 👍 reaction to the original pull request comment to help the community and maintainers prioritize this request
  • Please do not leave "+1" comments, they generate extra noise for pull request followers and do not help prioritize the request

@ghost ghost added size/XS Managed by automation to categorize the size of a PR. service/ec2 Issues and PRs that pertain to the ec2 service. labels Apr 23, 2019
@bflad bflad added the bug Addresses a defect in current functionality. label Apr 24, 2019
@bflad
Copy link
Contributor

bflad commented Apr 24, 2019

Hi @gui-don 👋 Thanks for submitting this and sorry for any trouble caused here.

Would you be able to quickly verify which error is being returned by EC2 during the 1 minute retry loop? This can be seen by enabling Terraform debug logging, e.g. TF_LOG=debug terraform apply. Seeing which [WARN] log message you receive would be very helpful. If this is an easily reproducible issue, we would also prefer to have a minimal configuration so we can implement a covering acceptance test to prevent potential regressions in the future.

If the EC2 error is something related to a throttling error or rate limiting error, then I believe a fix like #7873 is a little more appropriate because it removes the time bounding problem where any amount of timeout increases may only help in some situations.

Thanks and hopefully we can get to the bottom of this.

@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Apr 24, 2019
@ghost
Copy link
Author

ghost commented Apr 25, 2019

Hi @bflad. Sure, I understand.

I’ll do the test today.

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Apr 25, 2019
@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label Apr 25, 2019
@ghost
Copy link
Author

ghost commented Apr 25, 2019

Here are the WARNs I get, confirming the issue. I have as many logs like this as the number of route I create:

2019-04-25T10:47:58.150-0400 [DEBUG] plugin.terraform-provider-aws_v2.7.0_x4: 2019/04/25 10:47:58 [WARN] WaitForState timeout after 1m0s
2019-04-25T10:47:58.150-0400 [DEBUG] plugin.terraform-provider-aws_v2.7.0_x4: 2019/04/25 10:47:58 [WARN] WaitForState starting 30s refresh grace period
2019-04-25T10:47:58.150-0400 [DEBUG] plugin.terraform-provider-aws_v2.7.0_x4: 2019/04/25 10:47:58 [WARN] EC2 Transit Gateway Route (tgw-rtb-00b582c9e05bbc2bd_192.168.4.0/24) not found, removing from state

Here is the “minimal” configuration to reproduce the issue. Oddly, I also noted that the aws_ec2_transit_gateway_vpc_attachment is needed to reproduce the issue:

data "aws_vpc" "default" {
  default = true
}

data "aws_subnet_ids" "all" {
  vpc_id = "${data.aws_vpc.default.id}"
}

locals {
  routes = [
    "192.168.10.0/24",
    "192.168.11.0/24",
    "192.168.12.0/24",
    "192.168.13.0/24",
    "192.168.14.0/24",
    "192.168.15.0/24",
    "192.168.16.0/24",
    "192.168.17.0/24",
    "192.168.18.0/24",
    "192.168.19.0/24",
    "192.168.20.0/24",
    "192.168.21.0/24",
  ]
}

resource "aws_ec2_transit_gateway" "this" {
  auto_accept_shared_attachments  = "enable"
}

resource "aws_customer_gateway" "this" {
  bgp_asn    = "65000"
  ip_address = "24.25.25.25"
  type       = "ipsec.1"
}

resource "aws_vpn_connection" "this" {
  transit_gateway_id  = "${aws_ec2_transit_gateway.this.id}"
  customer_gateway_id = "${aws_customer_gateway.this.id}"
  static_routes_only  = "true"
  type                = "ipsec.1"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "this" {
  subnet_ids         = ["${element(data.aws_subnet_ids.all.*.id, 0)}"]
  transit_gateway_id = "${aws_ec2_transit_gateway.this.id}"
  vpc_id             = "${data.aws_vpc.default.id}"
}

resource "aws_ec2_transit_gateway_route" "this_vpn" {
  count = "10"

  destination_cidr_block = "${element(local.routes, count.index)}"

  transit_gateway_attachment_id  = "${aws_vpn_connection.this.transit_gateway_attachment_id}"
  transit_gateway_route_table_id = "${aws_ec2_transit_gateway.this.association_default_route_table_id}"
}

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label Apr 25, 2019
@bflad
Copy link
Contributor

bflad commented May 13, 2019

Hi again @gui-don 👋

I'm still having trouble replicating this issue with our acceptance testing frameworking. Using a configuration similar to yours above passes the testing:

// Reference: https://github.com/terraform-providers/terraform-provider-aws/pull/8417
// This acceptance test is for verifying concurrent route creation
func TestAccAWSEc2TransitGatewayRoute_concurrency(t *testing.T) {
	var transitGatewayRoute ec2.TransitGatewayRoute
	rBgpAsn := acctest.RandIntRange(64512, 65534)

	resource.ParallelTest(t, resource.TestCase{
		PreCheck:     func() { testAccPreCheck(t); testAccPreCheckAWSEc2TransitGateway(t) },
		Providers:    testAccProviders,
		CheckDestroy: testAccCheckAWSEc2TransitGatewayRouteDestroy,
		Steps: []resource.TestStep{
			{
				Config: testAccAWSEc2TransitGatewayRouteConfigConcurrency(rBgpAsn),
				Check: resource.ComposeTestCheckFunc(
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.0", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.1", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.2", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.3", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.4", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.5", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.6", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.7", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.8", &transitGatewayRoute),
					testAccCheckAWSEc2TransitGatewayRouteExists("aws_ec2_transit_gateway_route.test.9", &transitGatewayRoute),
				),
			},
		},
	})
}

func testAccAWSEc2TransitGatewayRouteConfigConcurrency(rBgpAsn int) string {
	return fmt.Sprintf(`
data "aws_availability_zones" "available" {
  # IncorrectState: Transit Gateway is not available in availability zone us-west-2d
  blacklisted_zone_ids = ["usw2-az4"]
  state                = "available"
}

resource "aws_vpc" "test" {
  cidr_block = "10.0.0.0/16"

  tags = {
    Name = "tf-acc-test-ec2-transit-gateway-route"
  }
}

resource "aws_subnet" "test" {
  availability_zone = "${data.aws_availability_zones.available.names[0]}"
  cidr_block        = "10.0.0.0/24"
  vpc_id            = "${aws_vpc.test.id}"

  tags = {
    Name = "tf-acc-test-ec2-transit-gateway-route"
  }
}

resource "aws_ec2_transit_gateway" "test" {
  auto_accept_shared_attachments = "enable"
}

resource "aws_ec2_transit_gateway_vpc_attachment" "test" {
  subnet_ids         = ["${aws_subnet.test.id}"]
  transit_gateway_id = "${aws_ec2_transit_gateway.test.id}"
  vpc_id             = "${aws_vpc.test.id}"
}

resource "aws_customer_gateway" "test" {
  bgp_asn    = %[1]d
  ip_address = "178.0.0.1"
  type       = "ipsec.1"

  tags = {
    Name = "tf-acc-test-ec2-transit-gateway-route"
  }
}

resource "aws_vpn_connection" "test" {
  customer_gateway_id = "${aws_customer_gateway.test.id}"
  static_routes_only  = true
  transit_gateway_id  = "${aws_ec2_transit_gateway.test.id}"
  type                = "${aws_customer_gateway.test.type}"
}

resource "aws_ec2_transit_gateway_route" "test" {
  count = 10

  destination_cidr_block         = "10.${count.index}.0.0/16"
  transit_gateway_attachment_id  = "${aws_vpn_connection.test.transit_gateway_attachment_id}"
  transit_gateway_route_table_id = "${aws_ec2_transit_gateway.test.association_default_route_table_id}"
}
`, rBgpAsn)
}
$ TF_ACC=1 go test ./aws -v -timeout 120m -parallel 20 -run='TestAccAWSEc2TransitGatewayRoute_concurrency'
=== RUN   TestAccAWSEc2TransitGatewayRoute_concurrency
=== PAUSE TestAccAWSEc2TransitGatewayRoute_concurrency
=== CONT  TestAccAWSEc2TransitGatewayRoute_concurrency
--- PASS: TestAccAWSEc2TransitGatewayRoute_concurrency (484.70s)

I do notice something odd about your output above, that the destination CIDR reference in the warning log does not match your local:

locals {
  routes = [
    "192.168.10.0/24",
    "192.168.11.0/24",
    "192.168.12.0/24",
    "192.168.13.0/24",
    "192.168.14.0/24",
    "192.168.15.0/24",
    "192.168.16.0/24",
    "192.168.17.0/24",
    "192.168.18.0/24",
    "192.168.19.0/24",
    "192.168.20.0/24",
    "192.168.21.0/24",
  ]
}

# 2019-04-25T10:47:58.150-0400 [DEBUG] plugin.terraform-provider-aws_v2.7.0_x4: 2019/04/25 10:47:58 [WARN] EC2 Transit Gateway Route (tgw-rtb-00b582c9e05bbc2bd_192.168.4.0/24) not found, removing from state

Are there other contributing factors that might be present?

@bflad bflad added the waiting-response Maintainers are waiting on response from community or contributor. label May 13, 2019
@ghost
Copy link
Author

ghost commented May 15, 2019

Hi!

Our issue was also fixed auto-magically. It’s been a week since I last hit this problem.
I suspect AWS changed something on its side, allowing route to be created much faster than before.

The problem is that it might appear again if the the creation time hit the threshold of 1m again. It is indeed very hard to come up with a test that is replicable with a moving target like the time required to create a resource…

I would suggest to modify the code to allow more than 1m to create the routes but at the same time the problem appears to be fixed.

@ghost ghost removed the waiting-response Maintainers are waiting on response from community or contributor. label May 15, 2019
@ryndaniels ryndaniels self-assigned this May 21, 2019
@ryndaniels
Copy link
Contributor

Hi @gui-don - I'm glad to hear this issue has resolved itself for you! Since that's the case, and because I added an additional retry with #8726 that should help in general, I'm going to go ahead and close this PR. However, if this problem comes back up for you, please feel free to open a new issue and we can take a further look.

@ryndaniels ryndaniels closed this May 21, 2019
@ghost ghost deleted the bugfix/TransitGatewayRouteTimeout branch June 13, 2019 17:59
@ghost
Copy link

ghost commented Nov 3, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. Thanks!

@ghost ghost locked and limited conversation to collaborators Nov 3, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Addresses a defect in current functionality. service/ec2 Issues and PRs that pertain to the ec2 service. size/XS Managed by automation to categorize the size of a PR.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants