
Cycle error when removing a resource along with create_before_destroy #26226

Closed
Nuru opened this issue Sep 12, 2020 · 13 comments
Labels: bug, confirmed (a Terraform Core team member has reproduced this issue), v0.13 (issues (primarily bugs) reported against v0.13 releases)

Comments

Nuru commented Sep 12, 2020

Terraform fails to apply a plan, citing a dependency cycle, but I think that is wrong.

I am not positive, because I do not quite understand how to parse the error message I am getting; maybe if I could understand the message better I would change my mind.

Terraform Version

Terraform v0.13.2

Terraform Configuration Files

I have a chain of modules that call each other using for_each. The actual configuration is far too complicated to post here, but here is a skeleton.

Skeleton code:
locals {
  config = {
    main = ["red", "white", "blue"]
    alt = ["green", "yellow"]
  }
}

module "x" {
  for_each = local.config
  source = "./x"

  clones = each.value
}

# ./x/main.tf
variable "clones" {
  type = list(string)
}

module "y" {
  for_each = toset(var.clones)
  source = "../y"

  color = each.value
}

# ./y/main.tf
variable "color" {
  type = string
}

module "z" {
  source = "../z"
  size = 3
  weight = 4
  color = var.color
}

# insanely complicated module "z" intentionally omitted

Expected Behavior

terraform plan produces a plan, so I was expecting terraform apply planfile to succeed.

Actual Behavior

Plan works, but apply yields this error:

Error: Cycle: 
module.x.module.y.module.z.local.launch_template_id (expand), 
module.x["main"].module.y["red"].module.z.random_pet.cbd[0], 
module.x["main"].module.y["red"].module.z.aws_eks_node_group.cbd[0], 
module.x["main"].module.y["red"].module.z.aws_eks_node_group.cbd[0] (destroy deposed fcfa883e), 
module.x["main"].module.y["red"].module.z.random_pet.cbd[0] (destroy deposed e321685c), 
module.x["main"].module.y["red"].module.z.aws_security_group.remote_access[0] (destroy), 
module.x["main"].module.y["red"].module.z.aws_launch_template.default[0], 
module.x.module.y.module.z.local.launch_template_version (expand)

I cannot understand this error. In particular, I do not know why the first and last lines are lacking keys. Why is it module.x.module.y and not module.x["main"].module.y["red"]? That looks like a bug in the error output to me.

In module z, random_pet, aws_eks_node_group, and aws_launch_template all have create_before_destroy = true. I do not think there is a genuine cycle, but I also do not know how "keepers" factor in. The plan calls for the following (sketched in code after this list):

  1. launch_template to be updated in place, which means its ID will not change
  2. security_group is no longer needed, so it will be destroyed
  3. security_group is one of random_pet's keepers, so random_pet will be replaced
  4. random_pet.id is part of the aws_eks_node_group name, so the node group will be replaced
  5. launch_template's version is listed as "known after apply", so even though it is just the current version + 1, I can treat it as opaquely derived; even so, I do not see the cycle.
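
To make that chain concrete, here is a rough sketch of how I believe the relevant pieces are wired together inside module z. This is only an illustration based on the description above, not the real module source: the variable names and most attribute values are made up, and only the resource types, the keepers relationship, and create_before_destroy come from what I described.

resource "random_pet" "cbd" {
  keepers = {
    # replacing or removing the security group forces a new pet name
    remote_access_sg = join("", aws_security_group.remote_access[*].id)
  }

  lifecycle {
    create_before_destroy = true
  }
}

resource "aws_eks_node_group" "cbd" {
  # the pet name is part of the node group name, so a new pet forces replacement
  node_group_name = "ng-${random_pet.cbd.id}"
  cluster_name    = var.cluster_name  # made-up variable
  node_role_arn   = var.node_role_arn # made-up variable
  subnet_ids      = var.subnet_ids    # made-up variable

  launch_template {
    id      = aws_launch_template.default.id
    version = local.launch_template_version
  }

  scaling_config {
    desired_size = 2
    max_size     = 3
    min_size     = 1
  }

  lifecycle {
    create_before_destroy = true
  }
}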

If I were doing this by hand, I could follow the above steps like this:

  1. update the launch template
  2. delete the security group
  3. create a new random_pet
  4. get the updated launch template version
  5. create the new aws_eks_node_group, since I now have all the input data: random_pet name, launch_template id and version. That is it.

Additional Context

It would really help to have better documentation about how derived values are handled with create_before_destroy. I have something like

resource "aws_launch_template" "gen" {
  name_prefix = "lt-gen"
  image_id    = var.ami_image_id

  update_default_version = true

  lifecycle {
    create_before_destroy = true
  }
}

locals {
  launch_template_version = aws_launch_template.gen.latest_version
}

My expectation is that internally, Terraform actually keeps 2 versions of local.launch_template_version around, one with the value for the template that exists at the beginning of the run and one with the newly created version, so that it can use the right version in the right places.

Similarly, I would hope that random_pet only changes once. As soon as any of its keepers change, a new pet is created, and its keepers are not saved until the end of the run.

References

I do not know if this is related to #26150 or not. That issue had a genuine cycle; while I do not think I have a genuine cycle, even if I do, I think the error message is buggy.

This may or may not be related to/duplicate of #26166 but since the documentation on that is minimal, I cannot tell.

Nuru added the bug and new (new issue not yet triaged) labels Sep 12, 2020
jbardin (Member) commented Sep 14, 2020

Hi @Nuru

Without the full configuration it's hard to determine whether there is a legitimate cycle or not. Would it be possible to supply all the configuration from module.z?

Many graph nodes are roughly analogous to the configuration constructs, which represent objects before they are "expanded" into individual instances (currently tagged with (expand) for lack of better terminology). While the graph cycle is difficult to translate back into something recognizable by the user, the output itself here is correct.

jbardin added the waiting for reproduction (unable to reproduce issue without further information) and waiting-response (an issue/pull request is waiting for a response from the community) labels Sep 14, 2020
Nuru (Author) commented Sep 14, 2020

@jbardin wrote:

While the graph cycle is difficult to translate back into something recognizable by the user, the output itself here is correct.

Would you please help me understand the output by narrating what it means, and explain why some lines are missing keys? The way I read it, it says that when trying to come up with a value for launch_template_id, Terraform needs launch_template_version, and when trying to come up with a value for launch_template_version it needs launch_template_id. This is not in fact the case. Both can be supplied via variables, but in this case the variables are null and both values derive from data "aws_launch_template".

I feel pretty sure this is a bug in Terraform, if not #26166 then something like it. When do you expect that bug fix to be available? I would like to see if that bug fix resolves this issue first, because creating a simplified reproducible case could take all day.

ghost removed the waiting-response (an issue/pull request is waiting for a response from the community) label Sep 14, 2020
jbardin (Member) commented Sep 14, 2020

The start and end nodes in that output are not significant; the error can't tell what caused the cycle. Since a cycle has no beginning and end, we can't attribute any significance to launch_template_id and launch_template_version other than that they are involved. The graph isn't a direct translation of the config; it's the internal representation of the order of processing, with many steps derived from the combination of the configuration and state. The expansion nodes have no indexes because they correlate to configuration elements and need to be evaluated to generate individual instances, e.g. module.x.module.y.module.z.local.launch_template_version is evaluated to create local value nodes for all the parent module index combinations. We really need the configuration to understand how this cycle may have been generated; otherwise it appears to be a legitimate cycle.

I doubt it's related to #26166, since there are no module (close) nodes involved, which are the key type involved in #26186. It's still possible it's a similar bug, but we need a way to reproduce this to know for sure.

Nuru (Author) commented Sep 14, 2020

@jbardin Thank you for the explanation. While it is true there are no (close) annotations in the error, there are resources being destroyed and modules using create_before_destroy = true and depends_on. Here is a related but different cycle error I got when using this module as module.z

module.x.module.y.module.z.local.launch_template_id (expand),
module.x["main"].module.y["red"].module.z.random_pet.cbd[0],
module.x["main"].module.y["red"].module.z.aws_eks_node_group.cbd[0] (destroy deposed c05ea9c8),
module.x["main"].module.y["red"].module.z.random_pet.cbd[0] (destroy deposed b2505425),
module.x["main"].module.y["red"].module.z.aws_security_group.remote_access[0] (destroy),
module.x["main"].module.y["red"].module.z.aws_launch_template.default[0],
module.x.module.y.module.z.local.launch_template_version (expand),
module.x["main"].module.y["red"].module.z.aws_eks_node_group.cbd[0]

This error was from applying a successfully generated plan after changing var.ec2_ssh_key from non-empty to empty, removing the need for the remote_access security group.

In this case, resources_to_tag was non-empty, so local.features_require_launch_template is true both before and after. Likewise create_before_destroy was true both before and after.

I hope that is enough for you to go on. If not, I will need some help from you figuring out next steps. The module is complex, and the infrastructure around it is also complex, and I do not know how I can translate this into something you can reproduce on your own. Maybe if you already have a test bed in AWS with an EKS cluster to which you can add a managed node group, you can reproduce the bug with the module I linked to. Please let me know.

jbardin (Member) commented Sep 15, 2020

Thanks @Nuru, I'll see what I can glean from the module source.
Is 0.11.3 the correct module version you used?
Is it possible to provide the trace log from the apply?

Nuru (Author) commented Sep 15, 2020

Is 0.11.3 the correct module version you used?

Yes. I have seen the error with other similar modules, but this module at tag 0.11.3 is the one that produced the error in #26226 (comment) and is the easiest to understand.

Is it possible to provide the trace log from the apply?

I gave up on trace logging as the logs were in excess of 40,000 lines and I did not know how to redact them. I am willing to continue to work on this with you, but I have limited time available and could use your guidance on how to be more efficient in helping you track down this bug without compromising confidential information.

jbardin (Member) commented Sep 15, 2020

Thanks again @Nuru. I have a good hunch on what is going wrong here, but still need to recreate the problem somehow.

I believe the aws_security_group.remote_access[0] (destroy) is key here, as it is indirectly referenced from a cbd resource that is being replaced, and therefore needs to be handled with the same ordering constraints. Can you manually find the aws_security_group.remote_access resource in the state file, and see if the instance has "create_before_destroy": true?

That should have been recorded in the state because it's a dependency of another resource which is create_before_destroy, but I want to try and figure out if it wasn't properly detected, or if we did detect it but aren't acting upon that information correctly.
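
If it helps, something along these lines should show what is stored (this assumes the current JSON state layout and that you have jq available; adjust the filter for the module path as needed):

$ terraform state pull | jq '.resources[] | select(.type == "aws_security_group" and .name == "remote_access") | {module, instances: [.instances[] | {index_key, create_before_destroy}]}'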

One odd thing about this module is the attempt to make create_before_destroy configurable. Normally the requirements of the resources themselves should define if create_before_destroy needs to be used, meaning there is usually only one correct option, and the module consumer would not be able to make that decision. That shouldn't be causing this bug in any way, but it definitely makes the module more complicated and harder to debug.

Nuru (Author) commented Sep 15, 2020

@jbardin Thanks again for digging into this.

No, aws_security_group.remote_access does not have create_before_destroy set in the state file.

As for making create_before_destroy optional for the node group, there are a few issues at play here. With the default "destroy before create" behavior, a node group is destroyed before it is replaced, causing a service outage. That is obviously undesirable in a production environment, so we would always set create_before_destroy, but there are several reasons we cannot:

  1. As you know, you cannot change a resource from default to create-before-destroy without destroying it, so forcing the change on users would impose exactly the outage we are trying to avoid.
  2. One stated reason create-before-destroy is not the Terraform default is because some resources need to have unique identifiers, and create-before-destroy would fail if it tried to use the same identifier. Well, node groups are one of those resources. Create before destroy fails unless we do something like random_pet to give each node group a unique name.
  3. There is no feature in Terraform that lets us trigger a change of identifier based on whether or not the node group would be replaced, so we cannot easily create a random_pet to support create-before-destroy. Making random_pet dependent on node_group would create a genuine cycle, and changing random_pet unnecessarily would trigger unnecessary node group replacements, which are also undesirable: It takes a long time to replace a node group and is disruptive, causing all the pods to lose their local state.

Terraform could offer an attribute of a resource that says "needs_replacement", although that alone would not be sufficient to use as a "keeper" for random_pet. The lack of such a feature means that we can neither guarantee that random_pet will change when it needs to change nor that it will not change unnecessarily.

  4. Another issue with create-before-destroy on node groups is that, while the replacement is in progress, you will be running 2 node groups for a time. This can fail due to AWS Service Quotas. For GPU instances especially, it is not uncommon for people to not have sufficient available quota to run 2 node groups at once.

So create-before-destroy for node groups comes with its own set of problems. It is an engineering trade-off each user should decide for their own situation, so we made it an option.

jbardin added the confirmed (a Terraform Core team member has reproduced this issue) and v0.13 (issues (primarily bugs) reported against v0.13 releases) labels and removed the waiting for reproduction (unable to reproduce issue without further information) and new (new issue not yet triaged) labels Sep 15, 2020
jbardin (Member) commented Sep 15, 2020

Thanks for the update. I have a minimal reproduction here:

variable "remove" {
  default = false
}

resource "null_resource" "a" {
  triggers = {
    removable = local.removable_ids
    updateable = local.c
  }
  lifecycle {
    create_before_destroy = true
  }
}

resource "null_resource" "b" {
  triggers = {
    a = null_resource.a.id
  }
  lifecycle {
    create_before_destroy = true
  }
}

resource "null_resource" "to_remove" {
  count = var.remove ? 0 : 1
}

resource "null_resource" "c" {
  triggers = {
    removable = local.removable_ids
  }
}

locals {
  c = null_resource.c.id
  removable_ids = join("", null_resource.to_remove[*].id)
}

This will trigger the same cycle with

$ terraform apply
$ terraform apply -var remove=true
...
Error: Cycle: 
null_resource.a (destroy deposed e099b5d9),
null_resource.b,
null_resource.b (destroy deposed 6216808d),
null_resource.c (destroy deposed b7b22b9c),
null_resource.to_remove[0] (destroy),
null_resource.c,
local.c (expand),
null_resource.a

jbardin changed the title from "Cycle error with module for_each is missing keys, may be completely wrong" to "Cycle error when removing a resource along with create_before_destroy" Sep 15, 2020
jbardin (Member) commented Sep 16, 2020

The cycle can be taken care of by the fix in #26263. This does not completely solve the problem, however, since the stale value could now be evaluated. Closing this since it gets past the cycle, and we can now mark this as a duplicate of #25631.

jbardin closed this as completed Sep 16, 2020
Nuru (Author) commented Sep 16, 2020

@jbardin Thank you for taking quick action on this bug. I have some remaining questions:

  1. Are you sure this fixes the bug as I experienced it? How can I check? The reason I ask is that my reading of #26263 (always check the CreateBeforeDestroy value from state) is that it requires aws_security_group.remote_access to have create_before_destroy set in the state file, which it does not.
  2. Do you have a workaround for #25631 (Incorrect "Provider produced inconsistent final plan" error when changing count or for_each of resources with create_before_destroy)? Or do you expect a fix for it to be included in Terraform 0.13.3?

jbardin (Member) commented Sep 16, 2020

aws_security_group.remote_access must be treated as create_before_destroy, because it is referenced from other resources that are create_before_destroy. The cycle was happening because that status was lost in this particular case, and the order of operations was not updated accordingly. This is a fix specifically for the cycle, and nothing else. You should be able to check this with the next 0.13 release, but it may not appear to work correctly with the existing state, because of #25473.
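
As a simplified illustration of that propagation (the resource types and names here are made up, not taken from your configuration): when a create_before_destroy resource references another resource, that dependency has to follow create-before-destroy ordering too, even though its own lifecycle block does not declare it, and 0.13 records that fact on the dependency's state entry.

resource "aws_security_group" "upstream" {
  # no lifecycle block here, but because "downstream" below is
  # create_before_destroy and references this resource, replacing
  # "upstream" must also be ordered create-before-destroy
  name_prefix = "example-"
}

resource "aws_launch_template" "downstream" {
  name_prefix            = "example-"
  vpc_security_group_ids = [aws_security_group.upstream.id]

  lifecycle {
    create_before_destroy = true
  }
}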

ghost commented Oct 17, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

ghost locked as resolved and limited conversation to collaborators Oct 17, 2020