Roughly stack-ranked by importance and ease of implementation.
You can work your way down this list linearly for maximum return on investment for however much time you have.
Many config templates and examples are available in the excellent HariSekhon/Kubernetes-configs repo referenced throughout this page - for faster implementation, see what has actually worked and just edit a few lines specific to your environment.
- Healthchecks
- Horizontal Pod Autoscaler
- Pod Disruption Budget
- Pod Anti-Affinity
- Ingress
- Applications
- DNS - Automatic DNS Records for Apps
- Secrets - Automated Secrets
- Namespaces
- Pod Security Policies
- Governance, Security & Best Practices
- Find Deprecated API objects to replace
- Helm
Readiness / liveness probes are critically important for the following reasons:
- Readiness Probes
- only direct traffic to pods which are fully initialized and functioning
- don't let users see errors from pods which have been recently migrated / restarted, which happens frequently on Kubernetes clusters
- Liveness Probes
- restart pods which are stuck after encountering state errors either at runtime or at initialization time (eg. a pull from a config source or a database connection failing to establish during startup)
- this is the only probe that will restart the pod to reset its state to overcome such issues
- Startup Probes
- newer versions of Kubernetes provide a dedicated check for startup. This is useful for apps with long initialization times where you don't want to set high timings on Readiness probes, since high readiness timings would delay dropping later-malfunctioning pods out of the Kubernetes internal service load balancing in good time, sending them requests in the interim which may be surfaced as errors to users
See the deployment and statefulset templates:
HariSekhon/Kubernetes-configs - deployment.yaml
HariSekhon/Kubernetes-configs - statefulset.yaml
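As a rough illustration, a hedged sketch of what the probe section of a container spec can look like - the `/healthz` path, port and timings are placeholders, use whatever endpoint and timings your app actually needs:

```yaml
# fragment of a pod template's containers section - paths, ports and timings are illustrative
containers:
  - name: myapp
    image: myapp:1.2.3
    ports:
      - containerPort: 8080
    startupProbe:             # gives slow-starting apps time without inflating the readiness timings
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 30    # 30 x 10s = up to 5 minutes to finish initializing
    readinessProbe:           # only send traffic to pods that are fully initialized and functioning
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 5
      failureThreshold: 3
    livenessProbe:            # restart pods that get stuck to reset their state
      httpGet:
        path: /healthz
        port: 8080
      periodSeconds: 10
      failureThreshold: 3
```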
Make sure your pods scale up to meet traffic demands and scale down off-peak so as not to waste resources and cloud costs.
HariSekhon/Kubernetes-configs - horizontal-pod-autoscaler.yaml
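A minimal sketch of an autoscaling/v2 HPA scaling a Deployment on CPU utilization - the name, replica bounds and 70% target are placeholders to size to your own traffic profile:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 3          # keep enough replicas for HA off-peak
  maxReplicas: 10         # cap peak scale and cloud spend
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```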
Ensure the Kubernetes scheduler doesn't take down more pods than you can afford to lose - whether for High Availability or for scaling capacity - so you can still serve full traffic at the current scaling level.
Set your pod disruption budget according to your capacity and app's ability to handle a certain number of pods being unavailable at a given time due to being migrated (killed and restarted on another node):
This is doubly important if you're running:
- apps with strict quorum requirements
- apps with sharded replicas (common with NoSQL systems)
- eg. Elasticsearch, SolrCloud, Cassandra, MongoDB, Couchbase, where an outage of 2 nodes could often cause partial outages via shard unavailability, incomplete results or query failures
HariSekhon/Kubernetes-configs - pod-disruption-budget.yaml
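A minimal sketch - the label selector and threshold are placeholders, set them to what your app's quorum or capacity can actually tolerate losing at once:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: myapp
spec:
  maxUnavailable: 1      # or use minAvailable, e.g. "80%", for larger fleets
  selector:
    matchLabels:
      app: myapp
```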
Ensure your pod replicas are spread across nodes for maximum availability and stability.
By default the Kubernetes scheduler will attempt to do a basic spread of pods across nodes but Pod Anti-Affinity rules enhance this in the following ways:
- spread across different servers to protect against random hardware failure of a single server causing an outage
- spread across different cloud availability zones to protect against a single datacenter outage eg. power failure / fire / flood / networking issue
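A hedged sketch of pod anti-affinity rules in a pod template implementing both of the spreads above - labels and weights are placeholders, and `preferred` rather than `required` scheduling is used so pods still schedule when a perfect spread isn't possible:

```yaml
# fragment of a pod template spec
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:                            # spread replicas across different servers
          labelSelector:
            matchLabels:
              app: myapp
          topologyKey: kubernetes.io/hostname
      - weight: 50
        podAffinityTerm:                            # spread replicas across availability zones
          labelSelector:
            matchLabels:
              app: myapp
          topologyKey: topology.kubernetes.io/zone
```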
In the cloud, choose between running your pods on full-priced on-demand nodes or on discounted preemptible / spot instances, which are much cheaper.
If your application can take random pod migrations such as a horizontally scaled web farm, then use preemptible or spot instances to save significant money on your cloud budget.
This is part of basic best practice cloud cost optimization.
If you have an app like Jenkins server which is a single point of failure then you should definitely run it on stable on-demand nodes unless you like having several minute outages of your Jenkins UI and job scheduler while the Jenkins server pod is restarted on another node.
Jenkins, for example, takes several minutes to start up - you don't want this happening every day on GCP preemptible nodes or randomly on AWS spot instances.
Some apps like coordination services or clustered shared data services may not fare well if randomly restarted in uncontrolled numbers, as spot instances may do to them.
Pod Disruption Budgets can't help here as they only control the Kubernetes scheduler's decision about how many pods to reap and redeploy elsewhere at one time. The Kubernetes scheduler, and therefore Pod Disruption Budgets, have no control over the cloud's lower-level decision to reap spot instances at any time, meaning any number of nodes could be taken out at once upon a surge in demand for spot instances.
Do not run quorum coordination services on spot / preemptible instances for this reason as you could lose too many of them at the same time, causing a complete quorum outage and impacting all other applications depending on them for coordination.
No spot / preemptible for:
- Coordination Services:
- NoSQL data sharding services:
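One hedged way to keep such workloads on stable nodes is a node affinity against the spot / preemptible node labels - the label keys below are the ones GKE applies, with an EKS equivalent shown in the comment, but verify the labels your own node pools actually carry:

```yaml
# fragment of a pod template spec - pin single-point-of-failure / quorum apps to on-demand nodes
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-spot           # GKE spot nodes carry this label
              operator: NotIn
              values: ["true"]
            - key: cloud.google.com/gke-preemptible    # legacy GKE preemptible nodes
              operator: NotIn
              values: ["true"]
            # on EKS managed node groups the equivalent would be:
            #   - key: eks.amazonaws.com/capacityType
            #     operator: In
            #     values: ["ON_DEMAND"]
```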
You may also choose to ensure certain apps are not deployed alongside other performance-hungry apps, to optimize the performance available to them.
HariSekhon/Kubernetes-configs - deployment.yaml
HariSekhon/Kubernetes-configs - statefulset.yaml
Set up a stable HTTPS entrypoint to your apps with DNS and SSL.
Set up Cert Manager for Automatic Certificate Management using the popular free Let's Encrypt certificate authority.
You can also use your cloud certificate authority if your corporate policy dictates.
HariSekhon/Kubernetes-configs - cert-manager
Ensure each app has an ingress address to be reachable via a URL.
Otherwise you'll have to waste time kubectl port-forward tunneling to access it each time.
If you are stuck doing that because you haven't gotten all your Ingress magic set up yet, then you may want to use HariSekhon/DevOps-Bash-tools - kubectl_port_forward.sh.
In some cases this can't be avoided, such as Spark jobs launched by Informatica due to having the UI on randomly launched job driver pods.
If your ingress controllers are working, set up your app ingresses by editing this config:
HariSekhon/Kubernetes-configs - ingress.yaml
See also the various app-specific ingresses already configured in the HariSekhon/Kubernetes-configs repo under */overlay/ingress.yaml.
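A minimal networking.k8s.io/v1 Ingress sketch - the hostname, ingress class, TLS secret name and backend service are placeholders, and the cert-manager annotation assumes you have created a ClusterIssuer called `letsencrypt` as per the Cert Manager setup above:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt   # have Cert Manager issue and renew the TLS cert
spec:
  ingressClassName: nginx
  tls:
    - hosts:
        - myapp.example.com
      secretName: myapp-tls
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```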
Set up ArgoCD to automatically deploy, update and repair your Kubernetes configs from the saved good config in git ie. 'GitOps'.
HariSekhon/Kubernetes-configs - argocd
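A hedged sketch of an ArgoCD Application pointing at a directory in your config repo - the repo URL, path and namespaces are placeholders for your own GitOps layout:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: myapp
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/myorg/kubernetes-configs   # your GitOps repo
    targetRevision: HEAD
    path: myapp/overlay
  destination:
    server: https://kubernetes.default.svc
    namespace: myapp
  syncPolicy:
    automated:
      prune: true        # delete live objects that were removed from git
      selfHeal: true     # revert live config drift back to the git config
```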
Setting appropriate resource requests and limits is critical to both performance and reliability.
Otherwise, apps will end up over-contended - degrading their performance or being outright killed by Linux's OOM Killer to save the host from crashing - resulting in sudden pod recreations on other nodes and possible service disruptions.
See resources sections in
HariSekhon/Kubernetes-configs - deployment.yaml
HariSekhon/Kubernetes-configs - statefulset.yaml
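For reference, a hedged sketch of the shape of those resources sections - the numbers are placeholders to be tuned with Goldilocks as described below:

```yaml
# fragment of a pod template's containers section
containers:
  - name: myapp
    image: myapp:1.2.3
    resources:
      requests:           # what the scheduler reserves for the pod on a node
        cpu: 100m
        memory: 256Mi
      limits:             # hard caps - exceeding the memory limit gets the container OOM-killed
        cpu: "1"
        memory: 512Mi
```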
But what to set your resource requests and limits to?
Install Goldilocks to generate VPAs for resource recommendations with a nice dashboard.
It will tell you exactly how much your app is using so you can tune its resource requests and limits after setting an initial estimate of your best guess.
HariSekhon/Kubernetes-configs - Goldilocks
Install External DNS to automatically create DNS records for your apps.
It integrates with many popular DNS providers such as Cloudflare, AWS Route53, GCP Cloud DNS etc.
HariSekhon/Kubernetes-configs - External DNS
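Once External DNS is running it picks up hostnames from your Ingress rules automatically, or from an annotation on a Service - a hedged sketch, with the hostname as a placeholder:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: myapp
  annotations:
    external-dns.alpha.kubernetes.io/hostname: myapp.example.com   # DNS record External DNS will create
spec:
  type: LoadBalancer
  selector:
    app: myapp
  ports:
    - port: 80
      targetPort: 8080
```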
Install one of the following:
External Secrets integrates with and pulls secrets from external secret management backends:
HariSekhon/Kubernetes-configs - External Secrets
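A hedged sketch of an ExternalSecret pulling a value from an external store into a native Kubernetes secret - the `external-secrets.io/v1beta1` API version, store name and key paths are assumptions to adjust to your installation and backend:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: myapp-db
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: my-secret-store        # a SecretStore / ClusterSecretStore defined separately for your backend
    kind: ClusterSecretStore
  target:
    name: myapp-db               # the Kubernetes Secret object to create
  data:
    - secretKey: password        # key inside the generated Kubernetes Secret
      remoteRef:
        key: prod/myapp/db       # path of the secret in the external backend
        property: password
```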
Sealed Secrets by Bitnami is a simpler solution in which you encrypt a secret using a private key unique to the cluster which results in a blob that is safe to store in Git because it can only be decrypted by the cluster to regenerate the Kubernetes secret object.
The drawback of this approach is that the secret must be generated for each cluster, whereas External Secrets config can be inherited across clusters. Additionally, if a Sealed Secrets cluster (or more accurately the Sealed Secrets installation with the private key on that cluster) is destroyed and recreated, the sealed secrets are unrecoverable and you must regenerate all the secrets.
This makes it no good for fast DR or recreation of Kubernetes clusters unless you can also back up and restore the Sealed Secrets private keys for the cluster.
HariSekhon/Kubernetes-configs - Sealed Secrets
On multi-tenant Kubernetes clusters, create a namespace for each app / team and limit the amount of CPU and RAM resources they are allowed to request from the cluster's Kubernetes scheduler in their app resource requests.
This will prevent one team or app from greedily using up all the cluster resources and allow for better resource planning.
HariSekhon/Kubernetes-configs - resource-quota.yaml
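A minimal sketch of a per-namespace quota - the numbers are placeholders to size against your cluster capacity and the team's realistic needs:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-quota
  namespace: team-a
spec:
  hard:
    requests.cpu: "20"
    requests.memory: 64Gi
    limits.cpu: "40"
    limits.memory: 128Gi
    pods: "100"
```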
Limit Ranges set default resource requests and limits for apps within the namespace.
Make these frugal and force people to right-size their apps in a couple of quick iterations at deployment time using Goldilocks.
HariSekhon/Kubernetes-configs - limit-range.yaml
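A hedged sketch of a frugal default - the values are placeholders, deliberately small so teams are pushed to declare their own right-sized requests and limits:

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: team-a
spec:
  limits:
    - type: Container
      defaultRequest:       # applied when a container declares no requests
        cpu: 100m
        memory: 128Mi
      default:              # applied when a container declares no limits
        cpu: 500m
        memory: 256Mi
```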
Restrict communications between namespaces containing different apps and teams.
This is equivalent to old school internal firewalling between different LAN subnets inside the Kubernetes cluster.
If one app in one namespace were to get compromised, there is no reason to allow it to be used as a launching pad to attack adjacent apps in the cluster.
This will also force teams to document the network connections and services their app is using in order for you to permit their network access.
HariSekhon/Kubernetes-configs - network-policy.yaml
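A hedged sketch of a policy that only allows ingress traffic from pods within the same namespace - you would typically add further rules for anything you explicitly permit, such as your ingress controller's namespace:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: team-a
spec:
  podSelector: {}           # applies to every pod in the namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}   # allow traffic only from pods in this same namespace
```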
Deprecated in newer versions of Kubernetes and removed entirely in Kubernetes 1.25 in favour of Pod Security Admission.
HariSekhon/Kubernetes-configs - pod-security-policy.yaml
Install Polaris for a recommendations dashboard full of best practices.
HariSekhon/Kubernetes-configs - Polaris
Run Pluto against your cluster before Kubernetes cluster upgrades.
The following scripts in the popular DevOps Bash Tools repo are useful:
- pluto_detect_kustomize_materialize.sh - recursively materializes all kustomization.yaml and runs Pluto on each directory to work around this issue
- pluto_detect_helm_materialize.sh - recursively materializes all helm Chart.yaml and runs Pluto on each directory to work around this issue
- pluto_detect_kubectl_dump_objects.sh - dumps all live Kubernetes objects to /tmp and runs Pluto on them to detect deprecated API objects on the cluster from any source
People who deploy directly from the Helm CLI should be aware that this is PoC territory.
You must wrap Helm in Kustomize or ArgoCD or similar to detect live object config drift!
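One hedged way of doing the Kustomize wrapping is the `helmCharts` field in a kustomization.yaml, which inflates the chart so the rendered objects are tracked like any other config (requires building with `--enable-helm`; the chart name, repo and version here are placeholders):

```yaml
# kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: monitoring
helmCharts:
  - name: kube-prometheus-stack
    repo: https://prometheus-community.github.io/helm-charts
    version: 55.0.0                    # pin the chart version - see the update script below
    releaseName: kube-prometheus-stack
    valuesFile: values.yaml
```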
Use kustomize_update_helm_chart_versions.sh in the popular DevOps Bash Tools repo.
Migrated from HariSekhon/Kubernetes-configs repo