GKE creation error "Error waiting for resuming GKE cluster: Failed to create cluster" #11431

lucasgrvarela · 2022-04-06T20:14:46Z

Community Note

Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
If you are interested in working on this issue or have submitted a pull request, please leave a comment or link the pull request to this issue.

Terraform Version

terraform -v
Terraform v1.1.7 
on linux_amd64

Affected Resource(s)

google_container_cluster

Terraform Configuration Files (if applicable)

provider "google" {
    version = "4.13.0"
    project = "my-project"
}

resource "google_container_cluster" "primary" {
    name = "my-tf-tes-2"
    location = "us-central1-a"

    initial_node_count = 1

    workload_identity_config {
        workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
    }
}

data "google_project" "project" {
    project_id = "my-project"
}

# Note that using the "google_project" data source together with workload identity is a known way to make cluster creation fail. Any other failed CREATE_CLUSTER operation could lead to the same issue.

Issue Description

For any cluster on any version, if a CREATE_CLUSTER operation is initiated, the operation is stored in the state for that resource.

If the CREATE_CLUSTER operation fails and the operation stays in the state (which I think only happens if TGP doesn't try to delete a cluster that failed creation), TGP tries to "resume" it, i.e. it keeps waiting for the operation to finish.

That immediately fails because the operation is no longer in a pending state (PENDING or RUNNING).

terraform-provider-google/google/container_operation.go

Line 37 in 627220d

for _, pending := range w.PendingStates() {

terraform-provider-google/google/container_operation.go

Line 97 in 627220d

func (w *ContainerOperationWaiter) PendingStates() []string {

It shouldn't fail when the operation is already in a DONE state, though, which is the case here.
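To make the expected behavior concrete, here is a minimal Go sketch of the distinction described above. It is not the provider's code and not necessarily how any fix is implemented: the `Decision` type and the `decide` function are hypothetical names introduced only for illustration, and only the PENDING, RUNNING and DONE statuses come from this report.

```go
package main

import "fmt"

// Decision is a hypothetical type describing what to do with an
// operation name that was found in the resource's state.
type Decision int

const (
	KeepWaiting Decision = iota // operation is still in flight
	ClearStored                 // operation already finished; nothing to resume
	Unexpected                  // status this sketch does not know how to handle
)

// decide maps an operation status to a Decision.
// PENDING and RUNNING mirror the pending states mentioned in this issue;
// DONE is the case that should not be treated as a resume failure.
func decide(status string) Decision {
	switch status {
	case "PENDING", "RUNNING":
		return KeepWaiting
	case "DONE":
		return ClearStored
	default:
		return Unexpected
	}
}

func main() {
	for _, s := range []string{"RUNNING", "DONE", "ABORTING"} {
		switch decide(s) {
		case KeepWaiting:
			fmt.Println(s, "-> keep waiting on the stored operation")
		case ClearStored:
			fmt.Println(s, "-> nothing to resume; clear the stored operation")
		default:
			fmt.Println(s, "-> unexpected status; surface an error")
		}
	}
}
```

The point of the sketch is only that a finished operation is a separate case from an in-flight one, so the resume path can drop the stored operation instead of failing the whole apply.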

Reproduction steps:

Get project number

gcloud projects list | grep my-project

Remove permissions from the GKE Service Agent

gcloud projects remove-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/container.serviceAgent

Reapply some permissions.

gcloud projects add-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/editor
Run terraform apply for the tf code above. After 40s the creation should fail because of the missing permissions. It will return an internal error from the GKE API, but that doesn't matter: any failed CREATE_CLUSTER operation can lead to this issue. During the creation, the operation is stored in the state for the cluster resource.
Make Terraform exit (Ctrl+C) before it tries to delete the cluster itself.
Make sure the CREATE_CLUSTER operation is stored in the resource's state:
cat terraform.tfstate | grep operation
This should return something like:
"operation": "operation-<some-hash>"
If it returns "operation": null, the operation was not stored. Try again by making Terraform exit after 20s instead.
Rerun terraform apply and you'll get the error below:

Error: Error waiting for resuming GKE cluster: Failed to create cluster
│
│ with google_container_cluster.primary,
│ on main.tf line 6, in resource "google_container_cluster" "primary":
│ 6: resource "google_container_cluster" "primary" {

The error is from here

terraform-provider-google/google/resource_container_cluster.go

Line 1471 in 627220d

    
waitErr := containerOperationWait(config, op, project, location, "resuming GKE cluster", userAgent, d.Timeout(schema.TimeoutRead))

The problem is that the operation is actually DONE, so the provider shouldn't be trying to "resume"/keep waiting on an operation that has already finished.

The operation is stored in the resource's state when the creation begins, here:

terraform-provider-google/google/resource_container_cluster.go

Line 1379 in 627220d

if err := d.Set("operation", op.Name); err != nil {
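As a rough alternative to grepping the state file, the following Go sketch looks up the stored operation for a given cluster resource. It assumes the JSON layout of state format version 4 (top-level `resources`, each with `instances` and `attributes`); the `tfState` type and the `storedOperation` helper are hypothetical and written only for this example, and the sample state in `main` is made up.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfState models only the parts of a state file this sketch needs.
// The layout is an assumption based on state format version 4.
type tfState struct {
	Resources []struct {
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			Attributes map[string]interface{} `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

// storedOperation returns the "operation" attribute of the named
// google_container_cluster resource, or "" if it is absent or null.
func storedOperation(state []byte, name string) (string, error) {
	var s tfState
	if err := json.Unmarshal(state, &s); err != nil {
		return "", fmt.Errorf("parsing state: %w", err)
	}
	for _, r := range s.Resources {
		if r.Type != "google_container_cluster" || r.Name != name {
			continue
		}
		for _, inst := range r.Instances {
			// A JSON null fails the string assertion and is skipped.
			if op, ok := inst.Attributes["operation"].(string); ok && op != "" {
				return op, nil
			}
		}
	}
	return "", nil
}

func main() {
	// Made-up sample; a real check would read terraform.tfstate instead.
	sample := []byte(`{"resources":[{"type":"google_container_cluster","name":"primary",
		"instances":[{"attributes":{"operation":"operation-example"}}]}]}`)

	op, err := storedOperation(sample, "primary")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	if op == "" {
		fmt.Println("no operation stored")
		return
	}
	fmt.Println("stored operation:", op)
}
```

An empty result corresponds to the "operation": null case in the reproduction steps, meaning Terraform exited before the operation was written to state.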

Important Facts

The GKE cluster/master version doesn't matter here. We can see from the code references above that this is behavior that was intentionally built into the Google provider to deal with the Terraform process exiting prematurely.

References

All in issue description


rileykarson · 2022-04-07T16:28:32Z

Related internal issue: b/228111747

rileykarson · 2022-04-07T18:26:59Z

The conditions under which this feature gets used should be fairly uncommon, so I'm surprised you're triggering it. My understanding was that it only came into play when the Terraform process was killed. Is that not the case here?

lucasgrvarela · 2022-04-09T00:09:40Z

@rileykarson yes I believe that was the case

github-actions · 2023-02-26T02:22:02Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

lucasgrvarela added the bug label Apr 6, 2022

rileykarson added the waiting-response label Apr 7, 2022

github-actions bot removed the waiting-response label Apr 9, 2022

rileykarson added the service/container label Jul 22, 2022

trodge mentioned this issue Jan 26, 2023

Fix an issue with resuming a failed container cluster creation GoogleCloudPlatform/magic-modules#7121

Merged

5 tasks

This was referenced Jan 26, 2023

Fix an issue with resuming a failed container cluster creation hashicorp/terraform-provider-google-beta#5136

Merged

Fix an issue with resuming a failed container cluster creation #13580

Merged

modular-magician closed this as completed in hashicorp/terraform-provider-google-beta#5136 Jan 26, 2023

modular-magician mentioned this issue Jan 26, 2023

Fix an issue with resuming a failed container cluster creation GoogleCloudPlatform/terraform-validator#1318

Merged

5 tasks

github-actions bot locked as resolved and limited conversation to collaborators Feb 26, 2023
