Container Cluster fails to Create when the call to remove default node pool times out #3763

Closed
ejschoen opened this issue Jun 1, 2019 · 3 comments · Fixed by GoogleCloudPlatform/magic-modules#1867
ejschoen commented Jun 1, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment
  • If an issue is assigned to the "modular-magician" user, it is either in the process of being autogenerated, or is planned to be autogenerated soon. If an issue is assigned to a user, that user is claiming responsibility for the issue. If an issue is assigned to "hashibot", a community member has claimed the issue already.

Terraform Version

terraform version
Terraform v0.12.0
+ provider.google v2.7.0
+ provider.helm v0.9.1
+ provider.kubernetes v1.7.0
+ provider.template v2.1.2

Affected Resource(s)

  • google_container_cluster

Terraform Configuration Files

provider "google" {
  project     = "${var.google_project}"
  region      = "${var.google_region}"
  zone        = "${var.google_zone}"
}

resource "google_container_cluster" "cluster" {
  project                  = "${var.google_project}"
  name                     = "${var.cluster_name}"
  location                 = "${var.google_zone}"

  remove_default_node_pool = true
  initial_node_count       = 1

  master_auth {
    username = ""
    password = ""
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }

  lifecycle {
    ignore_changes = [ "master_auth", "network" ]
  }

}

resource "google_container_node_pool" "cluster_nodes" {
  depends_on = [
    "google_container_cluster.cluster"
  ]
  name       = "${var.cluster_name}-node-pool"
  cluster    = "${google_container_cluster.cluster.name}"
  node_count = "${var.cluster_node_count}"

  node_config {
    preemptible  = "${var.preemptible}"
    disk_size_gb = "${var.disk_size_gb}"
    disk_type    = "${var.disk_type}"
    machine_type = "${var.machine_type}"
    oauth_scopes = [
      "https://www.googleapis.com/auth/devstorage.read_only",
      "https://www.googleapis.com/auth/logging.write",
      "https://www.googleapis.com/auth/monitoring",
      "https://www.googleapis.com/auth/service.management.readonly",
      "https://www.googleapis.com/auth/servicecontrol",
      "https://www.googleapis.com/auth/trace.append",
      "https://www.googleapis.com/auth/compute",
      "https://www.googleapis.com/auth/cloud-platform",
    ]
    metadata = {
      disable-legacy-endpoints = true
      //creator = "${data.google_client_openid_userinfo.me.email}"
    }
  }

  timeouts {
    create = "20m"
    update = "15m"
    delete = "15m"
  }

}

// The master_auth outputs are used for downstream authentication in kubectl and helm

output "client_certificate" {
  value = "${google_container_cluster.cluster.master_auth.0.client_certificate}"
}

output "client_key" {
  value = "${google_container_cluster.cluster.master_auth.0.client_key}"
}

output "cluster_ca_certificate" {
  value = "${google_container_cluster.cluster.master_auth.0.cluster_ca_certificate}"
}

output "host" {
  value = "${google_container_cluster.cluster.endpoint}"
}

provider "kubernetes" {
  host                   = "${google_container_cluster.cluster.endpoint}"
  client_certificate     = "${base64decode(google_container_cluster.cluster.master_auth.0.client_certificate)}"
  client_key             = "${base64decode(google_container_cluster.cluster.master_auth.0.client_key)}"
  cluster_ca_certificate = "${base64decode(google_container_cluster.cluster.master_auth.0.cluster_ca_certificate)}"
}


// Provision a service account for Tiller

resource "kubernetes_service_account" "helm_account" {
  depends_on = [
    "google_container_node_pool.cluster_nodes",
    "google_container_cluster.cluster",
  ]
  metadata {
    name      = "${var.helm_account_name}"
    namespace = "kube-system"
  }
}

resource "kubernetes_cluster_role_binding" "helm_role_binding" {
  depends_on = [
    "kubernetes_service_account.helm_account"
  ]
  metadata {
    name = "${var.helm_account_name}"
  }
  role_ref {
    api_group = "rbac.authorization.k8s.io"
    kind      = "ClusterRole"
    name      = "cluster-admin"
  }
  subject {
    api_group = ""
    kind      = "ServiceAccount"
    name      = "${var.helm_account_name}"
    namespace = "kube-system"
  }
  provisioner "local-exec" {
    command = "sleep 15"
  }
}

// Install the application via Helm.

provider "helm" {
  service_account = "${kubernetes_service_account.helm_account.metadata.0.name}"
  tiller_image = "gcr.io/kubernetes-helm/tiller:${var.tiller_version}"
  kubernetes {
    host                   = "${google_container_cluster.cluster.endpoint}"
    client_certificate     = "${base64decode(google_container_cluster.cluster.master_auth.0.client_certificate)}"
    client_key             = "${base64decode(google_container_cluster.cluster.master_auth.0.client_key)}"
    cluster_ca_certificate = "${base64decode(google_container_cluster.cluster.master_auth.0.cluster_ca_certificate)}"
  }
}

Debug Output

https://gist.github.com/ejschoen/24b2178bed67d5538e630a41b6f6dfec

Expected Behavior

Cluster creation should have succeeded, and cluster delete/recreate should have succeeded.

Actual Behavior

Timeout from client.

Steps to Reproduce

This is not reliably reproducible. The behavior is subtly different each time it happens, but the common thread across all of these failures is that the GCP API call appears not to return within the time allotted by the provider client.

  1. terraform apply

Important Factoids

I recently upgraded to Terraform 0.12, and noticed now that upon failed create attempts, subsequent terraform apply marks the existing cluster (which did finish creating) as tainted, and recreates it. In other cases, terraform apply will try to create the cluster without deleting it, resulting in a 409 error from GCP. This hasn't happened enough for me to tell if the issue is related to when the timeout failure occurred during the initial creation--i.e., when creating the initial cluster or deleting the default node pool.

I'm submitting this as a new issue, but it's related to #3168 and Hashibot wants a new issue linked back to it. I noticed this issue (#3752) and wonder if it's related, too.

@ghost ghost added the bug label Jun 1, 2019
@chrisst chrisst self-assigned this Jun 3, 2019

chrisst commented Jun 3, 2019

From your debug logs (thanks for providing them!) it looks like you caught a timeout error when trying to delete the default node pool. It also looks like we haven't wrapped that particular call in retry logic yet 😳 so it will stop the Create call if it fails. I'll add the retry wrapper shortly.

RE tainting: Since the cluster failed to remove the default node pool during Create I think that tainting + recreating is the correct behavior. In theory we should be able to catch most of the failures and persist the id to state so that Terraform will know that the resource was created in a partial state.

@chrisst chrisst changed the title GCP API operation timeouts Container Cluster fails to Create when the call to remove default node pool times out Jun 3, 2019

ejschoen commented Jun 4, 2019

Thanks! For what it's worth, I was having intermittent connectivity issues, most likely related to Google's cloud outage this past weekend. But I've seen the timeout before when all of their services were ostensibly running well.


ghost commented Jul 4, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost ghost locked and limited conversation to collaborators Jul 4, 2019