
Extend google_container_node_pool documentation on nodepool changes #17522

Open
nemethloci opened this issue Mar 8, 2024 · 12 comments


nemethloci commented Mar 8, 2024

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request.
  • Please do not leave +1 or me too comments, they generate extra noise for issue followers and do not help prioritize the request.
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment.
  • If an issue is assigned to a user, that user is claiming responsibility for the issue.
  • Customers working with a Google Technical Account Manager or Customer Engineer can ask them to reach out internally to expedite investigation and resolution of this issue.

Terraform Version

v1.7.4

Affected Resource(s)

google_container_node_pool

Terraform Configuration

# Before

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "my-node-pool"
  cluster    = google_container_cluster.primary.id
  node_count = 1

  node_config {
    preemptible  = true
    machine_type = "e2-medium"

    # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
    service_account = google_service_account.default.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

# After

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name       = "my-node-pool"
  cluster    = google_container_cluster.primary.id
  node_count = 1

  node_config {
    preemptible  = true
    machine_type = "e2-standard-2"

    # Google recommends custom service accounts that have cloud-platform scope and permissions granted via IAM Roles.
    service_account = google_service_account.default.email
    oauth_scopes = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
  }
}

Debug Output

No response

Expected Behavior

In case of a node pool change that requires changing the underlying template (which is immutable), the provider would ideally:

  • create a new nodepool with the new template
  • drain the old pool
  • delete the old pool

Unfortunately this is not the current behavior, probably because the above process would require the 'interaction' of multiple providers (the google provider to manage the node pools and the kubernetes provider to drain the nodes; I am not sure whether the node pool resize operation in the GCP API could be abused for draining). Considering that node pool version upgrades already work as expected following the above approach, I think it would be worth emphasizing in the documentation what happens in case of non-version-related changes to the node pool resource.

This would be especially important because not being aware of the above can cause a service outage (see Actual Behavior for details).

Actual Behavior

Without the create_before_destroy flag, the node pool is deleted and then recreated, resulting in a complete outage of the services bound to the node pool. If the flag is set (see the sketch after this list), there will still be an outage of all services because:

  • the old pool is not drained but terminated abruptly (scheduling appears to be disabled immediately once a pool is marked for deletion, but pods are not given time to migrate to new pools while respecting their PodDisruptionBudget; they are killed immediately, and the autoscaler keeps trying to create new nodes in the pool under deletion, which get deleted as well)
  • the new node pool's initial node count might be significantly lower than the current (autoscaled) size of the old one, so pods would have to wait for the autoscaler before they could be scheduled
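
For reference, a minimal sketch of the create_before_destroy variant described above (an illustrative configuration, not taken from this issue; note that the replacement pool needs a distinct name, e.g. via name_prefix, because node pool names must be unique):

resource "google_container_node_pool" "primary_preemptible_nodes" {
  # name_prefix instead of name, so the replacement pool can coexist
  # with the old one while it is being created.
  name_prefix = "my-node-pool-"
  cluster     = google_container_cluster.primary.id
  node_count  = 1

  node_config {
    preemptible  = true
    machine_type = "e2-standard-2"
  }

  lifecycle {
    # The replacement pool is created before the old one is destroyed,
    # but the old pool is still deleted without a graceful drain.
    create_before_destroy = true
  }
}

Even with this setup the outage described above still occurs, because the old pool is never drained.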

Steps to reproduce

  1. terraform apply

Important Factoids

No response

References

#10895

b/331218437

@nemethloci nemethloci added the bug label Mar 8, 2024
@github-actions github-actions bot added forward/review In review; remove label to forward service/container labels Mar 8, 2024
@nemethloci (Author)

Just to be clear: this is a request to improve the documentation. Solving the underlying issue goes beyond the scope of this ticket and probably requires cooperation between the GCP Terraform provider developers and GCP itself.

ggtisc (Collaborator) commented Mar 8, 2024

Hi @nemethloci, you can find an alternative for this scenario documented here:
https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/container_node_pool

You can use features like strategy and blue_green_settings for upgrades, adjusting them according to your requirements, as sketched below.
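
For reference, a rough sketch of such an upgrade configuration (assuming a blue/green strategy with a standard rollout policy; field names as documented for google_container_node_pool, but treat this as an illustrative example rather than a tested configuration):

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name    = "my-node-pool"
  cluster = google_container_cluster.primary.id

  upgrade_settings {
    strategy = "BLUE_GREEN"

    blue_green_settings {
      # Keep the old ("blue") nodes around for 10 minutes after the
      # new ("green") nodes are ready before tearing them down.
      node_pool_soak_duration = "600s"

      standard_rollout_policy {
        batch_percentage    = 0.5
        batch_soak_duration = "120s"
      }
    }
  }

  node_config {
    machine_type = "e2-medium"
  }
}

As the next comment points out, this only applies to version upgrades, not to other node_config changes.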

@nemethloci (Author)

Hi @ggtisc, the feature you referenced is valid only with regard to upgrades and unfortunately does not apply when changing other aspects of the node pool, which is what my request is about (e.g. disk type, machine type, ...).

@ggtisc ggtisc assigned NickElliot and unassigned NickElliot Mar 14, 2024
ggtisc (Collaborator) commented Mar 14, 2024

@nemethloci could you please provide the code of the resource google_container_cluster.primary?

@nemethloci (Author)

Here you go:

resource "google_container_cluster" "primary" {
  provider = google-beta
  project  = var.project_id
  location = var.region
  name     = var.cluster_name

  network    = var.cluster_network
  subnetwork = var.cluster_subnetwork

  logging_service    = var.logging_service
  monitoring_service = var.monitoring_service
  min_master_version = local.kubernetes_version
  release_channel {
    channel = "UNSPECIFIED"
  }

  enable_shielded_nodes = var.enable_shielded_nodes

  dynamic "master_authorized_networks_config" {
    for_each = var.master_authorized_networks_config
    content {
      dynamic "cidr_blocks" {
        for_each = lookup(master_authorized_networks_config.value, "cidr_blocks", [])
        content {
          cidr_block   = cidr_blocks.value.cidr_block
          display_name = lookup(cidr_blocks.value, "display_name", null)
        }
      }
    }
  }

  ip_allocation_policy {
    cluster_secondary_range_name  = var.cluster_ip_range_pods
    services_secondary_range_name = var.cluster_ip_range_services
  }

  # We can't create a cluster with no node pool defined, but we want to only use
  # separately managed node pools. So we create the smallest possible default
  # node pool and immediately delete it.
  node_pool {
    name               = "default-pool"
    initial_node_count = 1
  }
  remove_default_node_pool = true

  # We can optionally control access to the cluster
  # See https://cloud.google.com/kubernetes-engine/docs/how-to/private-clusters
  private_cluster_config {
    enable_private_endpoint = var.disable_public_endpoint
    enable_private_nodes    = var.enable_private_nodes
    master_ipv4_cidr_block  = var.master_ipv4_cidr_block
  }

  master_auth {
    client_certificate_config {
      issue_client_certificate = false
    }
  }

  dynamic "authenticator_groups_config" {
    for_each = var.enable_authenticator_groups_config ? [1] : []
    content {
      security_group = var.authenticator_groups_config_security_group
    }
  }

  binary_authorization {
    evaluation_mode = var.enable_binary_authorization
  }

  pod_security_policy_config {
    enabled = var.enable_pod_security_policy_config
  }

  // Configure the workload identity "identity namespace".  Requires additional
  // configuration on the node pool for workload identity to function.
  // See PATT-24
  workload_identity_config {
    workload_pool = "${var.project_id}.svc.id.goog"
  }

  addons_config {

    http_load_balancing {
      disabled = !var.http_load_balancing
    }

    horizontal_pod_autoscaling {
      disabled = !var.horizontal_pod_autoscaling
    }

    network_policy_config {
      disabled = !var.enable_network_policy
    }

    istio_config {
      disabled = !var.enable_managed_istio
      auth     = var.managed_istio_auth
    }

    config_connector_config {
      enabled = var.enable_config_connector
    }

  }

  network_policy {
    # Based on: https://github.com/terraform-google-modules/terraform-google-kubernetes-engine/issues/656#issuecomment-910108440
    # Enabling NetworkPolicy for clusters with DatapathProvider=ADVANCED_DATAPATH is not allowed (yields error)
    enabled = var.enable_dataplane_v2 ? false : (var.enable_network_policy ? true : false)

    # Tigera (Calico Felix) is the only provider
    # CALICO provider overrides datapath_provider setting, leaving Dataplane v2 disabled
    provider = var.enable_dataplane_v2 ? "PROVIDER_UNSPECIFIED" : (var.enable_network_policy ? "CALICO" : "PROVIDER_UNSPECIFIED")
  }

  # This is where Dataplane V2 is enabled.
  datapath_provider = var.enable_dataplane_v2 ? "ADVANCED_DATAPATH" : "DATAPATH_PROVIDER_UNSPECIFIED"

  dynamic "resource_usage_export_config" {
    for_each = var.enable_metering ? [1] : []

    content {
      enable_network_egress_metering       = true
      enable_resource_consumption_metering = true

      bigquery_destination {
        dataset_id = google_bigquery_dataset.cluster_resource_usage[0].dataset_id
      }
    }
  }

  lifecycle {
    ignore_changes = [node_pool]
  }

  maintenance_policy {
    daily_maintenance_window {
      start_time = var.maintenance_start_time
    }
  }

  timeouts {
    create = "30m"
    update = "30m"
    delete = "30m"
  }
}

@nemethloci (Author)

One more update: Google support has confirmed that it is expected behavior that, both upon node pool deletion and upon manually draining all nodes of a node pool, new nodes will be spun up by the autoscaler without being cordoned, unless the autoscaler is disabled first. As such, the 'workaround' process to use is (a rough Terraform sketch follows the list):

  • spin up a new node pool and, at the same time, disable autoscaling for the old one
  • drain the old node pool's nodes manually (e.g. via kubectl) and wait for the drain to finish
  • delete the old node pool
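
A rough Terraform sketch of the first step (assuming the old pool previously had an autoscaling block; the drain and the final deletion still happen outside of, or after, Terraform, and the resource names here are only placeholders):

# Step 1: create the replacement pool and disable autoscaling on the old one.
resource "google_container_node_pool" "old" {
  name       = "my-node-pool"
  cluster    = google_container_cluster.primary.id
  node_count = 3

  # The autoscaling block that used to be here has been removed, so the
  # autoscaler no longer replaces nodes that are drained manually.
  # autoscaling {
  #   min_node_count = 1
  #   max_node_count = 5
  # }

  node_config {
    machine_type = "e2-medium"
  }
}

resource "google_container_node_pool" "new" {
  name               = "my-node-pool-v2"
  cluster            = google_container_cluster.primary.id
  initial_node_count = 3

  autoscaling {
    min_node_count = 1
    max_node_count = 5
  }

  node_config {
    machine_type = "e2-standard-2"
  }
}

# Step 2 (outside Terraform): cordon and drain the old pool's nodes, e.g.
#   kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
# Step 3: remove the "old" resource from the configuration and apply again,
#   so Terraform deletes the now-empty pool.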

ggtisc (Collaborator) commented Mar 25, 2024

@nemethloci is this the same issue as #17668?

@melinath (Collaborator)

Note from triage: this doesn't look like a duplicate - that issue was raised by the same author.

@melinath melinath added size/s and removed forward/review In review; remove label to forward labels Mar 25, 2024
@melinath melinath added this to the Goals milestone Mar 25, 2024
@nemethloci (Author)

@nemethloci is this the same issue as #17668?

Yes and no:

  • yes: both relate to the same workflow (deleting/changing a node pool)
  • no: the other item intends to solve one problem of many (by turning off autoscaling before deletion), while this one is about documenting how deletion of a node pool should be done in general, given that and other deficiencies that cannot be solved without Google changing their APIs and behavior. If the other bug report were implemented, there would be one fewer issue to include in the updated documentation.

Why I raised two tickets: when I originally reported this documentation-related ticket, I thought (based on some early feedback from Google engineers) that Google would acknowledge this as a bug and fix it themselves, hence there would be no need to alter the Terraform provider itself.

@nemethloci (Author)

FYI: Google has created a feature request (without any commitment regarding the timeline):
https://issuetracker.google.com/issues/331403526

wyardley commented Oct 21, 2024

I think GoogleCloudPlatform/magic-modules#12014 will at least partially help, by allowing in-place updates (without recreating the node pool) of at least some of the fields in google_container_cluster.node_config

I didn't mark this as being "fixed by" that, because I'm not 100% sure it solves all cases or whether it's the same underlying issue, but feel free to update that and/or comment here if you think it should fix this.

roaks3 (Collaborator) commented Oct 23, 2024

Agreed that in-place updates help, but there are still some ForceNew fields within node_config, so I think we should keep this open as a request to mention the dangers (like abrupt node termination) of changing those fields on this resource.
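
For readers looking for a guard rail until the documentation is updated, one generic Terraform pattern (not something prescribed in this issue) is to set prevent_destroy on the node pool, so that a plan touching a ForceNew field fails instead of silently replacing the pool:

resource "google_container_node_pool" "primary_preemptible_nodes" {
  name    = "my-node-pool"
  cluster = google_container_cluster.primary.id

  node_config {
    machine_type = "e2-medium"
  }

  lifecycle {
    # Any change that would delete and recreate the pool now aborts the
    # apply; remove the flag deliberately when a replacement is intended.
    prevent_destroy = true
  }
}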
