The base cluster configs are derived from the onedr0p cluster template and then customized with my own modifications and additions to fit my home infrastructure and needs.
This cluster is based on Talos Linux, with configuration managed through Git and applied through Flux CD. The cluster includes external-dns to update external and internal DNS records, and ingress-nginx for ingress with SSL via Cloudflare. Cloudflare Tunnel is also included to provide external access to certain applications deployed in the cluster. A Postgres database is deployed using cloudnative-pg, and a Redis-compatible database (dragonflydb) is deployed using dragonfly-operator.
- Components: ingress-nginx, external-dns, cloudflared, flux, cert-manager, spegel, reloader, system-upgrade-controller, openebs, cilium, rook-ceph, cloudnative-pg.
Other features include:
Renovate is a tool that automates dependency management. It is designed to scan your repository around the clock and open PRs for out-of-date dependencies it finds. Common dependencies it can discover are Helm charts, container images, GitHub Actions, Ansible roles... even Flux itself!
Merging a PR will cause Flux to apply the update to your cluster.
The base Renovate configuration in your repository can be viewed at `.github/renovate.json5`.
GitHub Actions with helpful workflows.
Note: All nodes are able to run workloads, including the controller nodes. No dedicated workers are deployed in my cluster at the moment. All nodes are deployed on vSphere 8.0.
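Allowing workloads on the control-plane nodes is a Talos machine config setting; a minimal config patch for it (a sketch, not taken from this repo's configs) looks like:

```yaml
# Talos machine config patch: let regular workloads schedule on
# control-plane nodes (removes the control-plane NoSchedule taint)
cluster:
  allowSchedulingOnControlPlanes: true
```
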
- I started with the Talos Linux VMware deploy script and customized it to deploy the required VMs using the configuration files generated in the steps below. govc needs to be installed on the system prior to this step. I chose to deploy 5 control-plane VMs with 8c/32G/500G (system)/300G (Rook/Ceph).
Continue on to Getting Started.
A Dev Container is used to provide an environment with all the necessary tools. It requires Docker and VSCode to be installed.

- Start Docker and open the repository in VSCode. A pop-up will ask you to reopen in the `devcontainer`; click the button to start using it.
- Create talos secrets

```sh
task talos:bootstrap-gensecret
task talos:bootstrap-genconfig
```
- Deploy talos VMs with the configs generated in step 1.

```sh
./vmware.sh upload_ova
./vmware.sh create
```
- Bootstrap talos and get the kubeconfig

```sh
task talos:bootstrap-install
task talos:fetch-kubeconfig
```
- Install cilium and kubelet-csr-approver into the cluster

```sh
task talos:bootstrap-apps
```
- Apply the GPU patch to the GPU node(s).

```sh
cd kubernetes/talos/clusterconfig
talosctl -n <node-ip> patch mc --patch @gpu-patch.yaml
```
- Upgrade talos to the correct schematic generated by the Talos Image Factory, since the stock OVA doesn't include the extensions required for this repo. The GPU node has a different schematic ID than a regular node due to the added siderolabs/nonfree-kmod-nvidia and siderolabs/nvidia-container-toolkit extensions.

```sh
talosctl -n <node-ip> upgrade --image factory.talos.dev/installer/<schematic-id>:<talos-ver>
```
Note: I applied the upgrade to the GPU node(s) with the GPU-specific schematic ID, and to the regular nodes with the regular schematic ID containing just the intel-ucode and iscsi extensions.
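A schematic ID is obtained by submitting a customization file to the Talos Image Factory. A sketch of what the GPU-node schematic might contain, assuming the extension names mentioned above (the iscsi extension's official name is assumed to be siderolabs/iscsi-tools):

```yaml
# Talos Image Factory schematic for a GPU node (assumed extension set);
# regular nodes would list only intel-ucode and iscsi-tools
customization:
  systemExtensions:
    officialExtensions:
      - siderolabs/intel-ucode
      - siderolabs/iscsi-tools
      - siderolabs/nonfree-kmod-nvidia
      - siderolabs/nvidia-container-toolkit
```
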
- Verify the NVIDIA kernel modules and extensions are loaded

```sh
talosctl -n <node-ip> read /proc/modules
# nvidia_uvm 1146880 - - Live 0xffffffffc2733000 (PO)
# nvidia_drm 69632 - - Live 0xffffffffc2721000 (PO)
# nvidia_modeset 1142784 - - Live 0xffffffffc25ea000 (PO)
# nvidia 39047168 - - Live 0xffffffffc00ac000 (PO)
```

```sh
talosctl -n <node-ip> get extensions
# NODE           NAMESPACE   TYPE              ID                                                             VERSION   NAME                       VERSION
# 172.31.41.27   runtime     ExtensionStatus   000.ghcr.io-frezbo-nvidia-container-toolkit-510.60.02-v1.9.0             nvidia-container-toolkit   510.02-v1
```

```sh
talosctl -n <node-ip> read /proc/driver/nvidia/version
# NVRM version: NVIDIA UNIX x86_64 Kernel Module  510.60.02  Wed Mar 16 11:24:05 UTC 2022
# GCC version:  gcc version 11.2.0 (GCC)
```
- Create the nvidia runtime class

```sh
kubectl apply -f nvidia-runtime.yaml
```
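The contents of `nvidia-runtime.yaml` are not shown here; with the nvidia-container-toolkit extension, the runtime class is typically a minimal RuntimeClass pointing at the `nvidia` handler, along these lines:

```yaml
# Hypothetical contents of nvidia-runtime.yaml: a RuntimeClass that
# tells containerd to use the nvidia container runtime handler
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
```

GPU workloads then opt in by setting `runtimeClassName: nvidia` in their pod spec.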
- Create the secret for the vmtools daemonset

```sh
# create new talos API credentials
talosctl -n <cp-node-ip> config new vmtoolsd-secret.yaml --roles os:admin
# import API credentials into K8s
kubectl -n kube-system create secret generic talos-vmtoolsd-config --from-file=talosconfig=./vmtoolsd-secret.yaml
# delete temporary credentials file
rm vmtoolsd-secret.yaml
```
- Install the vmtools daemonset from the manifest

```sh
kubectl apply -f https://raw.githubusercontent.com/siderolabs/talos-vmtoolsd/master/deploy/latest.yaml
```
- Verify flux can be installed

```sh
flux check --pre
# ► checking prerequisites
# ✔ kubectl 1.27.3 >=1.18.0-0
# ✔ Kubernetes 1.27.3+k3s1 >=1.16.0-0
# ✔ prerequisites checks passed
```
- Install flux and sync the cluster to the Git repository

```sh
task flux:bootstrap
# namespace/flux-system configured
# customresourcedefinition.apiextensions.k8s.io/alerts.notification.toolkit.fluxcd.io created
# ...
```
- Verify flux components are running in the cluster

```sh
kubectl -n flux-system get pods -o wide
# NAME                                       READY   STATUS    RESTARTS   AGE
# helm-controller-5bbd94c75-89sb4            1/1     Running   0          1h
# kustomize-controller-7b67b6b77d-nqc67      1/1     Running   0          1h
# notification-controller-7c46575844-k4bvr   1/1     Running   0          1h
# source-controller-7d6875bcb4-zqw9f         1/1     Running   0          1h
```
The `external-dns` application created in the `network` namespace will handle creating public and private DNS records.
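For illustration, external-dns discovers hostnames from ingress resources; a hypothetical ingress (all names, hosts, and the ingress class below are assumptions, not from this repo) that would get a public record might look like:

```yaml
# Hypothetical ingress; external-dns creates a DNS record for the host
# below, and the optional target annotation overrides what it points at
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: echo-server
  namespace: default
  annotations:
    external-dns.alpha.kubernetes.io/target: external.example.com
spec:
  ingressClassName: external
  rules:
    - host: echo.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: echo-server
                port:
                  number: 8080
```
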
Cert-Manager is configured with Cloudflare for DNS validation, and the ACME certs are issued by Google Public CA (GTS, Google Trust Services) attached to a gCloud project. LetsEncrypt is also defined as a cluster issuer as a backup.
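A sketch of such a ClusterIssuer, assuming standard cert-manager fields (the names and secret references here are placeholders; Google Public CA requires an External Account Binding key generated in the gCloud project):

```yaml
# Sketch: cert-manager ClusterIssuer for Google Public CA (GTS) with
# Cloudflare DNS-01 validation; all names and secrets are assumed
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: google-trust-services
spec:
  acme:
    server: https://dv.acme-v02.api.pki.goog/directory
    email: admin@example.com
    privateKeySecretRef:
      name: gts-acme-account-key
    externalAccountBinding:
      keyID: <eab-key-id>   # from the gCloud project's EAB key
      keySecretRef:
        name: google-eab-secret
        key: b64MacKey
    solvers:
      - dns01:
          cloudflare:
            apiTokenSecretRef:
              name: cloudflare-api-token
              key: api-token
```
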
By default, Flux will periodically check your Git repository for changes. In order to have Flux reconcile on `git push`, GitHub should be configured to send push events to Flux.
- Obtain the webhook path

```sh
kubectl -n flux-system get receiver github-receiver -o jsonpath='{.status.webhookPath}'
```

Note: The hook ID and path should look like `/hook/12ebd1e363c641dc3c2e430ecf3cee2b3c7a5ac9e1234506f6f5f3ce1230e123`
- Piece together the full URL with the webhook path appended

```text
https://flux-webhook.${bootstrap_cloudflare_domain}/hook/12ebd1e363c641dc3c2e430ecf3cee2b3c7a5ac9e1234506f6f5f3ce1230e123
```
- Navigate to the settings of your repository on GitHub, and under "Settings > Webhooks" press the "Add webhook" button. Fill in the webhook URL and the `bootstrap_flux_github_webhook_token` secret, then save.
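The `github-receiver` queried above is a Flux notification-controller Receiver; a sketch of what it might look like (the webhook-token secret name and watched resources are assumptions):

```yaml
# Sketch of the Flux Receiver behind the webhook path; on a valid
# GitHub push event it triggers reconciliation of the listed sources
apiVersion: notification.toolkit.fluxcd.io/v1
kind: Receiver
metadata:
  name: github-receiver
  namespace: flux-system
spec:
  type: github
  events:
    - ping
    - push
  secretRef:
    name: github-webhook-token   # must match the token entered on GitHub
  resources:
    - apiVersion: source.toolkit.fluxcd.io/v1
      kind: GitRepository
      name: flux-system
```
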
There might be a situation which necessitates starting from scratch. This will completely destroy the cluster and the VMs. The cluster's databases and volumes are synchronized to S3-based repos, and the cluster can be bootstrapped from those backups to restore state.
```sh
# Nuke cluster
./vmware.sh destroy
```
Inspiration for my repo came from the repos below and the open-source community.
- onedr0p/home-ops - This is a mono repository for my home infrastructure and Kubernetes cluster. I try to adhere to Infrastructure as Code (IaC) and GitOps practices using tools like Ansible, Terraform, Kubernetes, Flux, Renovate, and GitHub Actions.
- bjw-s/home-ops - Welcome to my Home Operations repository. This is a mono repository for my home infrastructure and Kubernetes cluster. I try to adhere to Infrastructure as Code (IaC) and GitOps practices using tools like Ansible, Terraform, Kubernetes, Flux, Renovate and GitHub Actions.