GKE creation error "Error waiting for resuming GKE cluster: Failed to create cluster" #11431

@lucasgrvarela

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment or link the pull request to this issue.

Terraform Version

terraform -v
Terraform v1.1.7 
on linux_amd64

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files (if applicable)

provider "google" {
    version = "4.13.0"
    project = "my-project"
}

resource "google_container_cluster" "primary" {
    name = "my-tf-tes-2"
    location = "us-central1-a"

    initial_node_count = 1

    workload_identity_config {
        workload_pool = "${data.google_project.project.project_id}.svc.id.goog"
    }
}

data "google_project" "project" {
    project_id = "my-project"
}

# Note: using the "google_project" data source together with workload identity is a known way to make cluster creation fail. Any other failed CREATE_CLUSTER operation could lead to the same result.

Issue Description

For any cluster on any version, if a CREATE_CLUSTER operation is initiated, the operation is stored in the state for that resource.

If the CREATE_CLUSTER operation fails and the operation stays in the state (which I think only happens if the Terraform Google provider (TGP) doesn't try to delete a cluster that failed creation), TGP tries to "resume" the operation, i.e. it keeps waiting for it to finish.

That immediately fails because the operation is not in a pending state (PENDING or RUNNING):

for _, pending := range w.PendingStates() {

func (w *ContainerOperationWaiter) PendingStates() []string {

It shouldn't fail when the operation is in a DONE state, though, which is the case here.
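To illustrate the expected behavior, here is a minimal sketch of a status check that treats DONE as a terminal state rather than an unexpected one. It is not the provider's code: the `operation` type, `checkOperation`, `errStillRunning`, and the error messages are hypothetical stand-ins, and only the status names PENDING, RUNNING, and DONE come from the description above.

```go
package main

import (
	"errors"
	"fmt"
)

// operation is a hypothetical stand-in for a GKE operation;
// it is not the provider's or the API client's real type.
type operation struct {
	Name   string
	Status string // e.g. "PENDING", "RUNNING", "DONE"
	ErrMsg string // non-empty when the operation finished with an error
}

// errStillRunning tells the caller to keep polling.
var errStillRunning = errors.New("operation still in progress")

// checkOperation classifies an operation's status:
//   - PENDING or RUNNING: keep waiting
//   - DONE: terminal; surface the operation's own error, if any
//   - anything else: unexpected state
func checkOperation(op operation) error {
	switch op.Status {
	case "PENDING", "RUNNING":
		return errStillRunning
	case "DONE":
		if op.ErrMsg != "" {
			return fmt.Errorf("operation %s failed: %s", op.Name, op.ErrMsg)
		}
		return nil
	default:
		return fmt.Errorf("operation %s in unexpected state %q", op.Name, op.Status)
	}
}

func main() {
	ops := []operation{
		{Name: "operation-a", Status: "RUNNING"},
		{Name: "operation-b", Status: "DONE"},
		{Name: "operation-c", Status: "DONE", ErrMsg: "internal error"},
	}
	for _, op := range ops {
		fmt.Printf("%s: %v\n", op.Name, checkOperation(op))
	}
}
```

With a check like this, a stored operation that already finished would either be cleared or report its own failure, instead of being rejected for not being in a pending state.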

Reproduction steps:

Get project number

  • gcloud projects list | grep my-project

Remove permissions from the GKE Service Agent

  • gcloud projects remove-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/container.serviceAgent

Reapply some permissions.

  • gcloud projects add-iam-policy-binding my-project --member serviceAccount:service-<project-number>@container-engine-robot.iam.gserviceaccount.com --role roles/editor

  • Run terraform apply for the tf code above. After 40s the creation should fail because of the missing permissions. It returns an internal error from the GKE API, but the specific error doesn't matter: any failed CREATE_CLUSTER operation can lead to this issue. During the creation, the operation is stored in the state for the cluster resource.
    Make Terraform exit (Ctrl+C) before it tries to delete the cluster itself.
    Make sure the CREATE_CLUSTER operation is stored in the resource's state:
    cat terraform.tfstate | grep operation
    This should return something like:
    "operation": "operation-<some-hash>"
    If it returns "operation": null, the operation was not stored. Try again, making Terraform exit after 20s instead.

  • Rerun terraform apply and you'll get the error below

Error: Error waiting for resuming GKE cluster: Failed to create cluster
│
│ with google_container_cluster.primary,
│ on main.tf line 6, in resource "google_container_cluster" "primary":
│ 6: resource "google_container_cluster" "primary" {
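The grep check in the reproduction steps can also be done structurally. The sketch below is a hedged example, not part of the provider: `tfState` and `storedOperations` are hypothetical names, and the `resources[].instances[].attributes` layout is an assumption based on the version 4 state file format. It reports the stored operation for each google_container_cluster resource, using an inline sample unless a state file path is passed as the first argument.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// tfState models only the parts of a state file this check needs.
// The layout is an assumption based on the version 4 state format.
type tfState struct {
	Resources []struct {
		Type      string `json:"type"`
		Name      string `json:"name"`
		Instances []struct {
			Attributes map[string]interface{} `json:"attributes"`
		} `json:"instances"`
	} `json:"resources"`
}

// storedOperations maps "type.name" to the stored operation for every
// google_container_cluster instance whose "operation" attribute is a
// non-empty string. A null or missing attribute is skipped.
func storedOperations(raw []byte) (map[string]string, error) {
	var st tfState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	found := map[string]string{}
	for _, r := range st.Resources {
		if r.Type != "google_container_cluster" {
			continue
		}
		for _, inst := range r.Instances {
			if op, ok := inst.Attributes["operation"].(string); ok && op != "" {
				found[r.Type+"."+r.Name] = op
			}
		}
	}
	return found, nil
}

func main() {
	// Inline sample so the sketch runs without a real state file.
	raw := []byte(`{"resources":[{"type":"google_container_cluster","name":"primary",
		"instances":[{"attributes":{"operation":"operation-<some-hash>"}}]}]}`)
	if len(os.Args) > 1 {
		data, err := os.ReadFile(os.Args[1])
		if err != nil {
			fmt.Fprintln(os.Stderr, "read state:", err)
			os.Exit(1)
		}
		raw = data
	}
	ops, err := storedOperations(raw)
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse state:", err)
		os.Exit(1)
	}
	if len(ops) == 0 {
		fmt.Println("no stored operation found")
	}
	for addr, op := range ops {
		fmt.Printf("%s: %s\n", addr, op)
	}
}
```

Passing `terraform.tfstate` as the argument checks the local state from the steps above; an empty result corresponds to the "operation": null case.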

The error comes from here:

waitErr := containerOperationWait(config, op, project, location, "resuming GKE cluster", userAgent, d.Timeout(schema.TimeoutRead))

The problem is that the operation is actually DONE, so the provider shouldn't be trying to "resume" it or keep waiting for it to finish, since it's done.

The operation is stored in the resource's state when the creation begins, here:

if err := d.Set("operation", op.Name); err != nil {

Important Facts

The GKE cluster/master version doesn't matter here. We can see from the code references above that this behavior was intentionally built into the Google provider to deal with the Terraform process exiting prematurely.

References

All in issue description

@rileykarson
Collaborator

Related internal issue: b/228111747

@rileykarson
Collaborator

The conditions under which this feature gets used should be fairly uncommon, so I'm surprised you're triggering it. My understanding was that it only came into play when the Terraform process was killed. Is that not the case here?

@lucasgrvarela
Author

@rileykarson yes I believe that was the case

@github-actions

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Feb 26, 2023