GKE creation error "Error waiting for resuming GKE cluster: Failed to create cluster" #11431
Comments
Related internal issue: b/228111747
The conditions under which this feature gets used should be fairly uncommon, so I'm surprised you're triggering it. My understanding was that it only came into play when the Terraform process was killed. Is that not the case here?
@rileykarson yes, I believe that was the case.
Community Note
Terraform Version
Affected Resource(s)
Terraform Configuration Files (if applicable)
Issue Description
For any cluster on any version, when a CREATE_CLUSTER operation is initiated, the operation is stored in the state for that resource.
If the CREATE_CLUSTER operation fails and the operation stays in the state (which I think only happens if TGP doesn't get to delete the cluster that failed creation), TGP tries to "resume" the operation, i.e. it keeps waiting for the operation to finish.
That immediately fails because the operation is not in a pending state (PENDING or RUNNING).
terraform-provider-google/google/container_operation.go
Line 37 in 627220d
terraform-provider-google/google/container_operation.go
Line 97 in 627220d
But it shouldn't fail when the operation is in a DONE state, which is the case here.
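The distinction between a pending and a finished operation can be sketched in Go. This is a minimal illustration, not the provider's actual code: `resumeAction`, `classify`, and the action names are hypothetical, and only the status strings PENDING, RUNNING, and DONE come from this issue.

```go
package main

import "fmt"

// resumeAction is a hypothetical type describing what to do with an
// operation ID that was found in the resource's state.
type resumeAction int

const (
	keepWaiting   resumeAction = iota // operation is still in flight
	stopWaiting                       // operation finished; there is nothing to resume
	unknownStatus                     // any status this sketch does not handle
)

// String makes the actions readable when printed.
func (a resumeAction) String() string {
	switch a {
	case keepWaiting:
		return "keep waiting"
	case stopWaiting:
		return "stop waiting"
	default:
		return "unknown status"
	}
}

// classify maps an operation status to an action. PENDING and RUNNING mean
// the operation can still be waited on; DONE means it has already finished
// (successfully or not), so it should not be treated as a failed resume.
func classify(status string) resumeAction {
	switch status {
	case "PENDING", "RUNNING":
		return keepWaiting
	case "DONE":
		return stopWaiting
	default:
		return unknownStatus
	}
}

func main() {
	for _, s := range []string{"PENDING", "RUNNING", "DONE"} {
		fmt.Printf("%s: %s\n", s, classify(s))
	}
}
```

Under this reading, a DONE operation left behind in state would lead to inspecting the operation's result rather than to an error about resuming.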
Reproduction steps:
Get project number
gcloud projects list | grep my-project
Remove permissions from the GKE Service Agent
gcloud projects remove-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/container.serviceAgent
Reapply some permissions.
gcloud projects add-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/editor
Run terraform apply for the Terraform code above. After about 40s the creation should fail because of the missing permissions. It returns an internal error from the GKE API, but that doesn't matter: any failed CREATE_CLUSTER operation can lead to this issue. During the creation, the operation is stored in the state for the cluster resource.
Make Terraform exit (Ctrl+C) before it tries to delete the cluster itself.
Make sure the CREATE_CLUSTER operation is stored in the resource’s state
cat terraform.tfstate | grep operation
This should return something like:
"operation": "operation-<some-hash>"
If it returns "operation": null, the operation was not stored. Try again by making Terraform exit after 20s instead.
Rerun terraform apply and you'll get the error below
The error is from here
terraform-provider-google/google/resource_container_cluster.go
Line 1471 in 627220d
The problem is that the operation is actually DONE, so the provider shouldn't be trying to "resume" it (keep waiting for it to finish), since it's already done.
The operation is stored in the state of the resource when the creation begins, here:
terraform-provider-google/google/resource_container_cluster.go
Line 1379 in 627220d
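To check whether a stale operation is still recorded in state, the grep from the reproduction steps could also be done programmatically. The following is a rough sketch, assuming the version 4 state file layout (`resources[].instances[].attributes`); the `tfState` type and `storedOperations` function are hypothetical names made up for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfState models only the parts of a state file that this sketch reads.
// The layout is an assumption based on the version 4 state format.
type tfState struct {
	Resources []struct {
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			Attributes map[string]interface{} `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

// storedOperations returns, for each google_container_cluster resource,
// the non-empty "operation" attribute found in the given state JSON.
// A null or missing attribute is skipped.
func storedOperations(raw []byte) (map[string]string, error) {
	var st tfState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, fmt.Errorf("parsing state: %w", err)
	}
	ops := make(map[string]string)
	for _, r := range st.Resources {
		if r.Type != "google_container_cluster" {
			continue
		}
		for _, inst := range r.Instances {
			if op, ok := inst.Attributes["operation"].(string); ok && op != "" {
				ops[r.Type+"."+r.Name] = op
			}
		}
	}
	return ops, nil
}

func main() {
	// Inline sample standing in for terraform.tfstate; the resource name
	// and operation ID are placeholders.
	sample := []byte(`{
	  "version": 4,
	  "resources": [{
	    "type": "google_container_cluster",
	    "name": "primary",
	    "instances": [{"attributes": {"operation": "operation-example"}}]
	  }]
	}`)
	ops, err := storedOperations(sample)
	if err != nil {
		panic(err)
	}
	for addr, op := range ops {
		fmt.Printf("%s still has operation %s in state\n", addr, op)
	}
}
```

For a real check, the sample bytes would be replaced with the contents of terraform.tfstate, for example via `os.ReadFile`.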
Important Facts
The GKE cluster/master version doesn't matter here. We can see from the code references above that this is behavior of the Google provider that was intentionally added to deal with the Terraform process exiting prematurely.
References
All in issue description