Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make it harder to destroy resources #24658

Open
danawillow opened this issue Apr 14, 2020 · 2 comments
Open

Make it harder to destroy resources #24658

danawillow opened this issue Apr 14, 2020 · 2 comments

Comments

@danawillow
Copy link
Contributor

danawillow commented Apr 14, 2020

(I'm sorry for not filling out the template exactly, but my thoughts didn't really follow that format).

Destroying resources is too easy

tl;dr: Terraform makes it easy to destroy infrastructure and hard to ensure it sticks around. For production systems, the reverse would be preferable.

Background

Each field in a resource has two options to describe what to do if it changes:

  • ForceNew: false if it can be updated in place (usually by calling a PUT or PATCH method on the resource API)
  • ForceNew: true if it cannot be updated in place

If the field cannot be updated in place (it is marked ForceNew), any change to it will destroy and recreate the resource.

Resources are also destroyed if they are removed from the config but still present in the state file.

Problems

This can fail users in several ways:

  1. A field was added to Terraform without update support, and update support is added later. This is a fairly minor issue- the user can just upgrade to the version that has that support. Still, sometimes the person performing the upgrade and the person making the change are not the same person, adding some organizational complexity. Likewise, major version upgrades might take an organization a long time to complete safely, causing them to be stuck at a lower version for longer.
  2. A change was made on the API side (such as a default value changing), which will cause users to have to update their plans to the new value to avoid diffs. Again, this is doable, but it's a scary plan to see with no context (vs something like an error message), especially if it happens at a time where the user modified a different resource and then ran Terraform.
  3. Human error may cause resources to be removed from the config and present in state. https://www.youtube.com/watch?v=ix0Tw8uinWs (the first half) shows an example of two PRs being merged out of order, causing infrastructure to get destroyed. In this case, they didn't quite get the order of their PRs right, so they imported a resource into state before merging the config that included it, and then made a change to other resources in that same config shortly after, which caused Terraform to destroy the newly-imported resource.
  4. Values of interpolated, computed fields could change. See https://coderbook.com/@marcus/prevent-terraform-from-recreating-or-deleting-resource/ for an example.

This behavior becomes devastating in cases where Terraform is being run in a loop for continuous actuation. Although the official recommendation is to never run Terraform without looking at a plan, there are many cases why someone might want to have an auto-approve loop, with plan only being run when a user actually makes a change to their config. For example, users may want to ensure they're detecting and managing drift quickly, and waiting for the next terraform plan to run wouldn't be fast enough.

All of these situations can be avoided when people use Terraform perfectly, making sure they look carefully at all plans before they're applied, and that they understand the correct order to perform non-plan/apply actions (like import). Regardless of the exact reasons why people end up in these situations, though, we've heard from our customers that they'd like more safeguards around accidentally destroying infrastructure, especially pieces that should be long-lived, like databases.

Potential Solutions

The following are a few ideas I have (ones I like and ones I don't), mainly put out here for discussion and consolidation.

Encourage lifecycle.prevent_destroy on everything

This is basically where we're at right now, but it's not great. It requires users to know about the feature at all, and to put it on every single resource (since there's no way to specify it to default to true globally). It has several flaws in its current implementation that make it hard and/or dangerous to use (#16392, #3874, #17599).

Let providers handle it

There's a few options for this:

  1. The provider decides on a per-resource basis whether it can be destroy/recreated, finds out from the SDK whether the destroy was explicit or implicit, and returns an error if it was implicit (this is the proposal in https://github.com/hashicorp/terraform/issues/23330). This requires some decision making on the provider side and is less flexible for our users who expect that behavior on a different set of resources than we have it on.
  2. We remove the ForceNew: true from fields that can't be updated, and add CustomizeDiff behavior to the resource to fail if there is a diff on those fields. This requires no changes to Terraform core/SDK, but requires there to be a special customizediff on basically every resource in the provider. It also means that different providers will behave inconsistently, which isn't a great UX.
  3. Add a field to each resource (or a top-level field in the provider block), "allow_destroy", which would fail the destroy step if set to false. Once again, it's not a consistent experience between providers. This one also would have to be done at apply-time only, meaning users' plans would succeed while the apply fails, which isn't great.

Deprecate lifecycle.prevent_destroy, make that the default behavior, and add lifecycle.allow_recreate

This would make deleting resources something that can only happen by removing it from the config or running terraform destroy, or by opting into an allow_recreate. It would be a breaking change, so have to be rolled out carefully, but it adds safeguards.

This makes it much more explicit that removing a resource from your config will destroy it. It doesn't prevent the human error case of running terraform apply when the resource isn't in the config but is in state, but it at least prevents unintentional destroy/recreates.

This also could be extended to have several different destroy options: in addition to allow_recreate, users could also have abandon (destroy removes from state but does not call Destroy()) or a stricter prevent_all_destroy (which wouldn't allow a destroy/create, a removal from state, or a terraform destroy to succeed).

Add the option to specify default destroy behavior in a terraform{ } or provider{ } block

This lets users specify what they want their default to be across all resources in their configs, so it doesn't need to be specified on each one. This could be combined with the previous solution.

Add a command-line flag --allow-destroy

This would make plans that include destroying a resource fail if the flag is not set. This is similar to making prevent_destroy the default, but with making the destroy allowed on a run of terraform rather than as a property on the resource. This makes it much more explicit what the behavior is when removing a resource from config, and you don't have to worry about comparing with a previous version of state. Users would have to adjust any CI systems to adapt to the new flag, but other than that wouldn't have to make any config changes. However, it may be tedious for users that need to frequently, but not always, replace resources that cannot be updated in-place.

@paultyng
Copy link
Contributor

An additional option here for letting the providers handle it: resources could specify their destroy behavior as part of schema (ie. default to prevention or allowing), basically controlling a default for the lifecycle flag, but preserving the existing lifecycle syntax for overriding it.

@erSitzt
Copy link

erSitzt commented May 12, 2020

I like @danawillow CLI Flag... default behavior should never destroy anything except stated explicitly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants