Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: TF often tries to create duplicate security group rules #29797

Open
speller opened this issue Mar 6, 2023 · 9 comments
Open

[Bug]: TF often tries to create duplicate security group rules #29797

speller opened this issue Mar 6, 2023 · 9 comments
Labels
bug Addresses a defect in current functionality. service/vpc Issues and PRs that pertain to the vpc service.

Comments

@speller
Copy link
Contributor

speller commented Mar 6, 2023

Terraform Core Version

1.3.7

AWS Provider Version

4.57.0

Affected Resource(s)

In my infra, TF often tried to create duplicate security group resources. On the same security group. When I delete them manually, it creates them again and succeeds. But on the next run it may fail again. But may not. It also happens on the deployments created from scratch earlier by the same configuration.

Expected Behavior

Successful apply

Actual Behavior

│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 8080, to port: 8080, ALLOW" already exists
│ 	status code: 400, request id: d447b3b1-7324-4e26-a494-d1c84ed0d0f0
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-magi,
│   on service-magi/magi-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-magi":
│   11: resource "aws_security_group_rule" "vpc-http-magi" {
│ 
╵
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 8082, to port: 8082, ALLOW" already exists
│ 	status code: 400, request id: 0c54c265-403f-40c3-9c5b-4b293b8e4be1
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-sim,
│   on service-magi/sim-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-sim":
│   11: resource "aws_security_group_rule" "vpc-http-sim" {
│ 
╵
╷
│ Error: [WARN] A duplicate Security Group rule was found on (sg-091a5bd0ced7044a9). This may be
│ a side effect of a now-fixed Terraform issue causing two security groups with
│ identical attributes but different source_security_group_ids to overwrite each
│ other in the state. See https://github.com/hashicorp/terraform/pull/2376 for more
│ information and instructions for recovery. Error: InvalidPermission.Duplicate: the specified rule "peer: 10.0.0.0/16, TCP, from port: 4000, to port: 4000, ALLOW" already exists
│ 	status code: 400, request id: 0eec0b6e-be4c-4673-8021-83e73dd86dde
│ 
│   with module.service-magi.aws_security_group_rule.vpc-http-sim-ui,
│   on service-magi/sim-ui-security-group-rules.tf line 11, in resource "aws_security_group_rule" "vpc-http-sim-ui":
│   11: resource "aws_security_group_rule" "vpc-http-sim-ui" {

Relevant Error/Panic Output Snippet

No response

Terraform Configuration Files

Let me know how can I share my state and failing plan with you securely.

The problematic resources are here:

resource "aws_security_group_rule" "lb-http-sim" {
  description = "Allow SIM HTTP server access from load balancer"
  from_port = var.sim.port
  to_port = var.sim.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-sim" {
  description = "Allow SIM HTTP access from VPC"
  from_port = var.sim.port
  to_port = var.sim.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-http-sim-ui" {
  description = "Allow SIM UI HTTP server access from load balancer"
  from_port = var.sim_ui.port
  to_port = var.sim_ui.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-sim-ui" {
  description = "Allow SIM UI HTTP access from VPC"
  from_port = var.sim_ui.port
  to_port = var.sim_ui.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-http-magi" {
  description = "Allow Magi HTTP server access from load balancer"
  from_port = var.magi.port
  to_port = var.magi.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  source_security_group_id = var.lb.security_group_id
  type = "ingress"
}

resource "aws_security_group_rule" "vpc-http-magi" {
  description = "Allow Magi HTTP access from VPC"
  from_port = var.magi.port
  to_port = var.magi.port
  protocol = "tcp"
  security_group_id = module.magi.security_group_id
  cidr_blocks = [data.aws_vpc.vpc.cidr_block]
  type = "ingress"
}

resource "aws_security_group_rule" "lb-from-magi" {
  description = "Allow access to LB from Magi"
  from_port = 0
  to_port = 0
  protocol = "-1"
  security_group_id = var.lb.security_group_id
  source_security_group_id = module.magi.security_group_id
  type = "ingress"
}

Only the three rules referring to the vpc cidr block are affected. Rules next to them referring to another security group are always fine. There are no duplicate definitions in the configuration (otherwise it will fail always).

In the plan, TF can not get the VPC data and can not determine the VPC cidr block and wants to update it. It eventually updates to the same value and fails:

  # module.service-magi.data.aws_vpc.vpc will be read during apply
  # (depends on a resource or a module with changes pending)
 <= data "aws_vpc" "vpc" {
      + arn                                  = (known after apply)
      + cidr_block                           = (known after apply)
      + cidr_block_associations              = (known after apply)
      + default                              = (known after apply)
      + dhcp_options_id                      = (known after apply)
      + enable_dns_hostnames                 = (known after apply)
      + enable_dns_support                   = (known after apply)
      + enable_network_address_usage_metrics = (known after apply)
      + id                                   = "vpc-xxxx"
      + instance_tenancy                     = (known after apply)
      + ipv6_association_id                  = (known after apply)
      + ipv6_cidr_block                      = (known after apply)
      + main_route_table_id                  = (known after apply)
      + owner_id                             = (known after apply)
      + state                                = (known after apply)
      + tags                                 = (known after apply)
      + timeouts {
          + read = (known after apply)
        }
    }
  # module.service-magi.aws_security_group_rule.vpc-http-magi must be replaced
+/- resource "aws_security_group_rule" "vpc-http-magi" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-4286748621" -> (known after apply)
      ~ security_group_rule_id   = "sgr-043d1f8924cdac4a5" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }
  # module.service-magi.aws_security_group_rule.vpc-http-sim must be replaced
+/- resource "aws_security_group_rule" "vpc-http-sim" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-4144434432" -> (known after apply)
      ~ security_group_rule_id   = "sgr-053c035d3bdc5e5a6" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }
  # module.service-magi.aws_security_group_rule.vpc-http-sim-ui must be replaced
+/- resource "aws_security_group_rule" "vpc-http-sim-ui" {
      ~ cidr_blocks              = [
          - "10.0.0.0/16",
        ] -> (known after apply) # forces replacement
      ~ id                       = "sgrule-1254889801" -> (known after apply)
      ~ security_group_rule_id   = "sgr-0e04afca151ba885b" -> (known after apply)
      + source_security_group_id = (known after apply)
        # (7 unchanged attributes hidden)
    }

In the plan, it reads a lot of data but can not read the data object to determine its values before apply? Why is it going to recreate rules which already exist?

There are other modules in my configuration built in the same way but all is fine with them.

Steps to Reproduce

Debug Output

No response

Panic Output

No response

Important Factoids

No response

References

#29393

Would you like to implement a fix?

None

@speller speller added bug Addresses a defect in current functionality. needs-triage Waiting for first response or review from a maintainer. labels Mar 6, 2023
@github-actions
Copy link

github-actions bot commented Mar 6, 2023

Community Note

Voting for Prioritization

  • Please vote on this issue by adding a 👍 reaction to the original post to help the community and maintainers prioritize this request.
  • Please see our prioritization guide for information on how we prioritize.
  • Please do not leave "+1" or other comments that do not add relevant new information or questions, they generate extra noise for issue followers and do not help prioritize the request.

Volunteering to Work on This Issue

  • If you are interested in working on this issue, please leave a comment.
  • If this would be your first contribution, please review the contribution guide.

@github-actions github-actions bot added the service/vpc Issues and PRs that pertain to the vpc service. label Mar 6, 2023
@justinretzolk
Copy link
Member

Hey @speller 👋 I left a comment over on #29393 that I believe will also apply here, in that I believe this comes down to the way data sources behave when they're dependent on another resource or module. If the answer I left on the other issue doesn't help, let me know and I'd be happy to help look around some more.

@justinretzolk justinretzolk added waiting-response Maintainers are waiting on response from community or contributor. and removed needs-triage Waiting for first response or review from a maintainer. labels Mar 6, 2023
@speller
Copy link
Contributor Author

speller commented Mar 7, 2023

@justinretzolk I understand what you say. But the data object is configured by the vpc_id. And this id is not changed between runs. It comes from an external variable and then propagated to many modules without changes. So the vpc_id of the data resource should be known in the planning stage.

data "aws_vpc" "vpc" {
  id = var.aws.vpc_id
}

@github-actions github-actions bot removed the waiting-response Maintainers are waiting on response from community or contributor. label Mar 7, 2023
@speller
Copy link
Contributor Author

speller commented Mar 13, 2023

@justinretzolk One more thing. I have other security groups in different modules which are defined in the same way as the problematic ones. They also use the VPC CIDR block. They're also marked as to be recreated but no issues during the apply stage. Only there three rules in this specific security group are often failing in different deployments. Only these three.

@speller
Copy link
Contributor Author

speller commented Mar 13, 2023

@justinretzolk Here is the failing state and plan. After manual deletion of the rules, it is applied with no issues. And later can be applied ok many times. This state is from the second run after creating the deployment from scratch.
Duplicate SGR.zip.gpg.zip
I've added the zip extension to the gpg file to make it uploaded to github.

@speller
Copy link
Contributor Author

speller commented Mar 13, 2023

Here are more details. Dir 1 - failing plan. Dir 2 - manually deleted duplicate rules and succeeded the apply. But, again, this doesn't prevent the issue from occurring next time. I also added the configuration.
Duplicate SGR2.zip.gpg.zip

@speller
Copy link
Contributor Author

speller commented Mar 13, 2023

@justinretzolk
Here I have 2 failures in a row (the configuration is the same, just different variables). 1st run - faced the issue. 2nd run - fixed manually. 3rd run - faced the same again. 4 - fixed again.
Duplicate SGR3.zip.gpg.zip

@good92
Copy link

good92 commented Mar 28, 2023

Why not naming these security groups?

@bryan-bar
Copy link

bryan-bar commented Feb 21, 2024

I am facing the same issue on repeated applys with terraform v1.5.5. In my case, this didn't start popping up until I used the http data resource to dynamically set the controller's ip.

My temporary fix was to expand each rule to only define 1 cidrblock per rule. This allows terraform to succeed but now plan returns 2 different results:

  • Shows that at-least 1 rule is created and none are destroyed, but I can see that the old rule is overwritten. Note how it is re-creating 0.0.0.0/0, which is hardcoded, instead of the dynamic ip.
# module.security_us_west_2.aws_security_group_rule.rule["tcp_22_null_ingress_0.0.0.0/0"] will be created
  + resource "aws_security_group_rule" "rule" {
      + cidr_blocks              = [
          + "0.0.0.0/0",
        ]
      + description              = "SSH"
      + from_port                = 22
      + id                       = (known after apply)
      + protocol                 = "tcp"
      + security_group_id        = "sg-05ed6e13e4337d822"
      + security_group_rule_id   = (known after apply)
      + self                     = false
      + source_security_group_id = (known after apply)
      + to_port                  = 22
      + type                     = "ingress"
    }
...
Apply complete! Resources: 1 added, 0 changed, 0 destroyed.
  • Shows that no changes are needed
No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration and found no differences, so no changes are needed.

Apply complete! Resources: 0 added, 0 changed, 0 destroyed.

This is also 2 steps back, 1 forward, since I first remove duplicate cidrblocks and now need to expand it back out per cidrblock. This also slowed down terraform a bit and clutters the plan, however, I prefer this over needing to manually remove resources with awscli or taint resources with terraform.

** Update: With a larger configuration, around ~500+ rules, this failed as well with a duplicate error for some rules, but others succeeded.


I see that there are 2 new resources listed as a note for the aws_security_group_rule resource.

... Both of these resource were added before AWS assigned a [security group rule unique ID](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/security-group-rules.html), and they do not work well in all scenarios using thedescription and tags attributes, which rely on the unique ID. The [aws_vpc_security_group_egress_rule](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_egress_rule) and [aws_vpc_security_group_ingress_rule](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/vpc_security_group_ingress_rule) resources have been added to address these limitations and should be used for all new security group rules...

The issue with the 2 new resources is that it does not accept a list of cidrs, which I would prefer. It also shows it is creating duplicates so it does not solve this issue.

bryan-bar added a commit to EnterpriseDB/edb-terraform that referenced this issue Feb 21, 2024
…re are no changes.

This started after adding the http data source but the bug is inconsistent as it shows the hardcoded value being re-created and not the dynamically set ip address.

Ref: hashicorp/terraform-provider-aws#29797

FIX: AWS security groups module - create a rule per cidrblock.
This still shows that a duplicate rule will be created but terraform does not fail as aws overwrites the old rule.
On a third apply, terraform shows no changes in the plan.
bryan-bar added a commit to EnterpriseDB/edb-terraform that referenced this issue Feb 21, 2024
…re are no changes.

This started after adding the http data source but the bug is inconsistent as it shows the hardcoded value being re-created and not the dynamically set ip address.

Ref: hashicorp/terraform-provider-aws#29797

FIX: AWS security groups module - create a rule per cidrblock.
This still shows that a duplicate rule will be created but terraform does not fail as aws overwrites the old rule.
On a third apply, terraform shows no changes in the plan.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Addresses a defect in current functionality. service/vpc Issues and PRs that pertain to the vpc service.
Projects
None yet
Development

No branches or pull requests

4 participants