Container Cluster fails to Create when the call to remove default node pool times out #3763

Closed
ejschoen opened this issue Jun 1, 2019 · 3 comments · Fixed by GoogleCloudPlatform/magic-modules#1867
ejschoen commented Jun 1, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

terraform version
Terraform v0.12.0
+ provider.google v2.7.0
+ provider.helm v0.9.1
+ provider.kubernetes v1.7.0
+ provider.template v2.1.2

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files

provider "google" {
  project     = "${var.google_project}"
  region      = "${var.google_region}"
  zone        = "${var.google_zone}"
}

resource "google_container_cluster" "cluster" {
  project                  = "${var.google_project}"
  name                     = "${var.cluster_name}"
  location                 = "${var.google_zone}"

  remove_default_node_pool = true
  initial_node_count       = 1

  master_auth {
    username = ""
    password = ""
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }

  lifecycle {
    ignore_changes = [ "master_auth", "network" ]
  }

}

resource "google_container_node_pool" "cluster_nodes" {
  depends_on = [
    "google_container_cluster.cluster"
  ]
  name       = "${var.cluster_name}-node-pool"
  cluster    = "${google_container_cluster.cluster.name}"
  node_count = "${var.cluster_node_count}"

  node_config {
    preemptible  = "${var.preemptible}"
    disk_size_gb = "${var.disk_size_gb}"
    disk_type    = "${var.disk_type}"
    machine_type = "${var.machine_type}"
    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
    metadata = {
      disable-legacy-endpoints = true
      //creator = "${data.google_client_openid_userinfo.me.email}"
    }
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }

}

// The master_auth outputs are used for downstream authentication in kubectl and helm

output "client_certificate" {
  value = "${google_container_cluster.cluster.master_auth.0.client_certificate}"
}

output "client_key" {
  value = "${google_container_cluster.cluster.master_auth.0.client_key}"
}

output "cluster_ca_certificate" {
  value = "${google_container_cluster.cluster.master_auth.0.cluster_ca_certificate}"
}

output "host" {
  value = "${google_container_cluster.cluster.endpoint}"
}

provider "kubernetes" {
  host                   = "${google_container_cluster.cluster.endpoint}"
  client_certificate     = "${base64decode(google_container_cluster.cluster.master_auth.0.client_certificate)}"
  client_key             = "${base64decode(google_container_cluster.cluster.master_auth.0.client_key)}"
  cluster_ca_certificate = "${base64decode(google_container_cluster.cluster.master_auth.0.cluster_ca_certificate)}"
}


// Provision a service account for Tiller

resource "kubernetes_service_account" "helm_account" {
  depends_on = [
    "google_container_node_pool.cluster_nodes",
    "google_container_cluster.cluster",
  ]
  metadata {
    name      = "${var.helm_account_name}"
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "helm_role_binding" {
  depends_on = [
    "kubernetes_service_account.helm_account"
  ]
  metadata {
    name = "${var.helm_account_name}"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }
  subject {
    api_group = ""
    kind      = "ServiceAccount"
    name      = "${var.helm_account_name}"
    namespace = "kube-system"
  }
  provisioner "local-exec" {
    command = "sleep 15"
  }
}

// Install the application via Helm.

provider "helm" {
  service_account = "${kubernetes_service_account.helm_account.metadata.0.name}"
  tiller_image = "gcr.io/kubernetes-helm/tiller:${var.tiller_version}"
  kubernetes {
    host                   = "${google_container_cluster.cluster.endpoint}"
    client_certificate     = "${base64decode(google_container_cluster.cluster.master_auth.0.client_certificate)}"
    client_key             = "${base64decode(google_container_cluster.cluster.master_auth.0.client_key)}"
    cluster_ca_certificate = "${base64decode(google_container_cluster.cluster.master_auth.0.cluster_ca_certificate)}"
  }
}

Debug Output

https://gist.github.com/ejschoen/24b2178bed67d5538e630a41b6f6dfec

Expected Behavior

Cluster creation should have succeeded, and cluster delete/recreate should have succeeded.

Actual Behavior

Timeout from client.

Steps to Reproduce

This is not reliably reproducible. The behavior is subtly different each time it happens, but the common thread across all of these failures is that the GCP API call appears not to return within the time allotted by the provider client.

  1. terraform apply

Important Factoids

I recently upgraded to Terraform 0.12, and noticed now that upon failed create attempts, subsequent terraform apply marks the existing cluster (which did finish creating) as tainted, and recreates it. In other cases, terraform apply will try to create the cluster without deleting it, resulting in a 409 error from GCP. This hasn't happened enough for me to tell if the issue is related to when the timeout failure occurred during the initial creation--i.e., when creating the initial cluster or deleting the default node pool.

I'm submitting this as a new issue, but it's related to #3168 and Hashibot wants a new issue linked back to it. I noticed this issue (#3752) and wonder if it's related, too.

@ghost ghost added the bug label Jun 1, 2019
@chrisst chrisst self-assigned this Jun 3, 2019

chrisst commented Jun 3, 2019

From your debug logs (thanks for providing them!) it looks like you caught a timeout error when trying to delete the default node pool. It also looks like we haven't wrapped that particular call in retry logic yet 😳 so it will stop the Create call if it fails. I'll add the retry wrapper shortly.

RE tainting: Since the cluster failed to remove the default node pool during Create I think that tainting + recreating is the correct behavior. In theory we should be able to catch most of the failures and persist the id to state so that Terraform will know that the resource was created in a partial state.

@chrisst chrisst changed the title GCP API operation timeouts Container Cluster fails to Create when the call to remove default node pool times out Jun 3, 2019

ejschoen commented Jun 4, 2019

Thanks! For what it's worth, I was having intermittent connectivity issues, most likely related to Google's cloud outage this past weekend. But I've seen the timeout before when all of their services were ostensibly running well.


ghost commented Jul 4, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Jul 4, 2019