This repository has been archived by the owner on Jun 29, 2022. It is now read-only.

baremetal: integrate automated (re-)provisioning logic
The bare-metal platform currently has no way of knowing whether an
instance is actually running the configuration that lokoctl put into
matchbox: when the configuration is updated, the user gets no
notification that the instance has to be PXE booted again.
It is also unclear to the user when to boot from PXE, because the PXE
boot must happen after lokoctl has populated matchbox with the (new)
configuration but before any other steps time out.
The goal of this patch is to bring the bare-metal platform to an
actually usable level and to support automated provisioning and
reprovisioning in a configurable way, regardless of whether IPMI is
used or VMs are created. In addition, we don't want to require a
complicated PXE boot for each configuration update, because it is
slow, fragile, and needs special DHCP infrastructure which may be
lacking in a production environment.

Add user-defined commands in the "pxe_commands" variable to perform
automated PXE provisioning at the right time, i.e., initially on the
first run or when recreating a node.
To address the problem that PXE booting is a long and possibly even
manual process, or outright impossible at a production site, we can
rely on Ignition to simulate reprovisioning: we create the
"first_boot" flag file via SSH and issue a reboot, which makes
Ignition fetch the configuration from matchbox, and if we make sure to
clean the root filesystem by formatting it, the result is the same as
if reprovisioning had been done with a PXE boot.
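The SSH-based reprovisioning trick can be sketched roughly as below. This is a simplified, dry-run sketch only: the host `core@node1.example.com`, the matchbox URL, and the MAC address are placeholder assumptions, and the real helper script (pxe-helper.sh.tmpl in this commit) adds retry loops and error reporting.

```shell
#!/bin/sh
# Sketch of SSH-triggered Ignition reprovisioning (placeholder host/URL/MAC).
# RUN defaults to a dry run that only prints the remote commands; set
# RUN='ssh core@node1.example.com' to actually execute them on a node.
reprovision_via_ssh() {
  RUN="${RUN:-echo ssh core@node1.example.com}"

  # 1. Make Ignition run again on the next boot: Flatcar checks for this
  #    flag file on the boot partition.
  $RUN sudo touch /boot/flatcar/first_boot

  # 2. The GRUB kernel parameters are not part of the Ignition config, so
  #    point them at matchbox directly via the OEM partition's grub.cfg.
  $RUN "printf 'set linux_append=\"ignition.config.url=http://matchbox.example.com/ignition?mac=aa:bb:cc:dd:ee:ff&os=installed\"\n' | sudo tee /usr/share/oem/grub.cfg"

  # 3. Reboot: Ignition fetches the new config from matchbox, and because
  #    that config formats the ROOT filesystem, the node comes up as if it
  #    had been reprovisioned via PXE.
  $RUN sudo systemctl reboot
}

reprovision_via_ssh
```

In real use each of these remote commands can be retried independently without side effects, which is why the actual helper runs them as separate SSH invocations.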
The logic is implemented by a "null_resource" in Terraform that
executes a helper script which either does a PXE boot or uses SSH to
trigger reprovisioning with Ignition. It also handles ignoring
userdata changes for controller nodes to prevent losing etcd state.
Since there is no notion of a bare-metal node on the Terraform level
(reminder: this whole exercise is needed because we don't have a
Terraform provider doing this for us), a local flag file is created
under the asset directory on the machine which runs lokoctl. If it
exists, the node was provisioned with PXE and SSH will be used for
reprovisioning; if it does not exist, the node will be provisioned
with PXE, either during initial setup or for the next reprovisioning
because the user forced recreating the node by deleting the flag file.
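That flag-file bookkeeping can be sketched as follows (a simplified sketch with hypothetical arguments; the real logic lives in pxe-helper.sh.tmpl below):

```shell
#!/bin/sh
# Sketch of the local flag-file bookkeeping on the machine running lokoctl.
# The file "$asset_dir/$mac" stores the domain of the node that was last
# successfully PXE-provisioned for that MAC address.
decide_provisioning() {
  asset_dir=$1; mac=$2; domain=$3
  if [ -f "$asset_dir/$mac" ] && [ "$(cat "$asset_dir/$mac")" = "$domain" ]; then
    echo ssh  # node already provisioned: reprovision via SSH + first_boot flag
  else
    echo pxe  # first run, or the user deleted the flag file: force a PXE boot
  fi
}
```

After a successful (re)provisioning the helper writes the domain into that file, so deleting it forces a fresh PXE install on the next run.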
Another flag file, on the node itself, is used to check whether a node
was successfully reprovisioned.
When SSH is used to reprovision, the kernel parameters for GRUB are
updated directly because they are not part of the Ignition
configuration.
The "copy-controller-secrets" step is rerun after recreating a
controller node; again, since there is no notion of a node object,
this is solved by depending on the variables themselves which define
the node state.
Also add user-defined commands in the "install_pre_reboot_cmds"
variable to run after the PXE OS installation and before booting into
the final OS; this is needed to set up persistent booting from disk
again after "pxe_commands" configured PXE booting.
The whole patch is used by Racker (https://github.com/kinvolk/racker)
and can be tested either with the "bootstrap/prepare.sh" script to
create VMs with lokoctl, or by running Racker in the QEMU IPMI
simulator environment through the "racker-sim/ipmi-env.sh" script with
a Racker Docker image built with "installer/conf.yaml" pointing to
this Lokomotive branch.
pothos committed Jun 22, 2021
1 parent f5965b1 commit bbb13a6
Showing 17 changed files with 264 additions and 21 deletions.
@@ -13,8 +13,22 @@ module "controller" {
set_standard_hostname = false
clc_snippets = concat(lookup(var.clc_snippets, var.controller_names[count.index], []), [
<<EOF
filesystems:
- name: root
mount:
device: /dev/disk/by-label/ROOT
format: ext4
wipe_filesystem: true
label: ROOT
storage:
files:
- path: /ignition_ran
filesystem: root
mode: 0644
contents:
inline: |
Flag file indicating that Ignition ran.
Should be deleted by the SSH step that checks it.
- path: /etc/hostname
filesystem: root
mode: 0644
@@ -1,8 +1,10 @@
module "controller_profile" {
source = "../../../matchbox-flatcar"
count = length(var.controller_names)
asset_dir = var.asset_dir
node_name = var.controller_names[count.index]
node_mac = var.controller_macs[count.index]
node_domain = var.controller_domains[count.index]
download_protocol = var.download_protocol
os_channel = var.os_channel
os_version = var.os_version
@@ -17,4 +19,7 @@ module "controller_profile" {
ignition_clc_config = module.controller[count.index].clc_config
cached_install = var.cached_install
wipe_additional_disks = var.wipe_additional_disks
ignore_changes = true
pxe_commands = var.pxe_commands
install_pre_reboot_cmds = var.install_pre_reboot_cmds
}
@@ -81,7 +81,12 @@ resource "null_resource" "copy-controller-secrets" {
]
}


# Triggered when the Ignition Config changes (used to recreate a controller)
triggers = {
clc_config = module.controller[count.index].clc_config
kernel_console = join(" ", var.kernel_console)
kernel_args = join(" ", var.kernel_args)
etcd_ca_cert = module.bootkube.etcd_ca_cert
etcd_server_cert = module.bootkube.etcd_server_cert
etcd_peer_cert = module.bootkube.etcd_peer_cert
@@ -227,3 +227,15 @@ variable "wipe_additional_disks" {
description = "Wipes any additional disks attached, if set to true"
default = false
}

variable "pxe_commands" {
type = string
default = "echo 'you must (re)provision the node by booting via iPXE from http://MATCHBOX/boot.ipxe'; exit 1"
description = "shell commands to execute for PXE (re)provisioning, with access to the variables $mac (the MAC address), $name (the node name), and $domain (the domain name), e.g., 'bmc=bmc-$domain; ipmitool -H $bmc power off; ipmitool -H $bmc chassis bootdev pxe; ipmitool -H $bmc power on'"
}

variable "install_pre_reboot_cmds" {
type = string
default = "true"
description = "shell commands to execute on the provisioned host after installation finished and before reboot, e.g., docker run --privileged --net host --rm debian sh -c 'apt update && apt install -y ipmitool && ipmitool chassis bootdev disk options=persistent'"
}
@@ -12,8 +12,22 @@ module "worker" {
set_standard_hostname = false
clc_snippets = concat(lookup(var.clc_snippets, var.worker_names[count.index], []), [
<<EOF
filesystems:
- name: root
mount:
device: /dev/disk/by-label/ROOT
format: ext4
wipe_filesystem: true
label: ROOT
storage:
files:
- path: /ignition_ran
filesystem: root
mode: 0644
contents:
inline: |
Flag file indicating that Ignition ran.
Should be deleted by the SSH step that checks it.
- path: /etc/hostname
filesystem: root
mode: 0644
@@ -1,8 +1,10 @@
module "worker_profile" {
source = "../../../matchbox-flatcar"
count = length(var.worker_names)
asset_dir = var.asset_dir
node_name = var.worker_names[count.index]
node_mac = var.worker_macs[count.index]
node_domain = var.worker_domains[count.index]
download_protocol = var.download_protocol
os_channel = var.os_channel
os_version = var.os_version
@@ -17,4 +19,6 @@ module "worker_profile" {
ignition_clc_config = module.worker[count.index].clc_config
cached_install = var.cached_install
wipe_additional_disks = var.wipe_additional_disks
pxe_commands = var.pxe_commands
install_pre_reboot_cmds = var.install_pre_reboot_cmds
}
2 changes: 2 additions & 0 deletions assets/terraform-modules/matchbox-flatcar/profiles.tf
@@ -33,6 +33,7 @@ data "ct_config" "install-ignitions" {
kernel_console = join(" ", var.kernel_console)
kernel_args = join(" ", var.kernel_args)
wipe_additional_disks = var.wipe_additional_disks
install_pre_reboot_cmds = var.install_pre_reboot_cmds
# only cached-container-linux profile adds -b baseurl
baseurl_flag = ""
mac_address = var.node_mac
@@ -80,6 +81,7 @@ data "ct_config" "cached-install-ignitions" {
kernel_console = join(" ", var.kernel_console)
kernel_args = join(" ", var.kernel_args)
wipe_additional_disks = var.wipe_additional_disks
install_pre_reboot_cmds = var.install_pre_reboot_cmds
# profile uses -b baseurl to install from matchbox cache
baseurl_flag = "-b ${var.http_endpoint}/assets/flatcar"
mac_address = var.node_mac
87 changes: 87 additions & 0 deletions assets/terraform-modules/matchbox-flatcar/pxe-helper.sh.tmpl
@@ -0,0 +1,87 @@
# (executed in-line, #!/... would be ignored)
# Terraform template variable substitution:
name=${name}
domain=${domain}
mac=${mac}
asset_dir=${asset_dir}
ignore_changes=${ignore_changes}
kernel_args="${kernel_args}"
kernel_console="${kernel_console}"
ignition_endpoint="${ignition_endpoint}"
# From now on use $var for dynamic shell substitution

if test -f "$asset_dir/$mac" && [ "$(cat "$asset_dir/$mac")" = "$domain" ]; then
echo "found $asset_dir/$mac containing $domain, skipping PXE install"
node_exists=yes
else
echo "$asset_dir/$mac does not contain $domain, forcing PXE install"
node_exists=no
fi

if [ $node_exists = yes ]; then
if $ignore_changes ; then
echo "Keeping old config because 'ignore_changes' is set."
exit 0
else
# run single commands that can be retried without a side effect in case the connection got disrupted
count=30
while [ $count -gt 0 ] && ! ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NumberOfPasswordPrompts=0 core@$domain sudo touch /boot/flatcar/first_boot; do
sleep 1
count=$((count - 1))
done
if [ $count -eq 0 ]; then
echo "error reaching $domain via SSH, please remove the $asset_dir/$mac file to force a PXE install"
exit 1
fi
echo "created the first_boot flag file to reprovision $domain"
count=5
while [ $count -gt 0 ] && ! ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NumberOfPasswordPrompts=0 core@$domain "printf 'set linux_append=\"$kernel_args ignition.config.url=$ignition_endpoint?mac=$mac&os=installed\"\\nset linux_console=\"$kernel_console\"\\n' | sudo tee /usr/share/oem/grub.cfg"; do
sleep 1
count=$((count - 1))
done
if [ $count -eq 0 ]; then
echo "error reaching $domain via SSH, please retry"
exit 1
fi
count=5
while [ $count -gt 0 ] && ! ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NumberOfPasswordPrompts=0 core@$domain sudo systemctl reboot; do
sleep 1
count=$((count - 1))
done
if [ $count -eq 0 ]; then
echo "error reaching $domain via SSH, please reboot manually"
exit 1
fi
echo "rebooted the $domain"
fi
else
# the user may provide ipmitool commands or any other logic for forcing a PXE boot
${pxe_commands}
fi

echo "checking that $domain comes up"
count=600
# check that we can reach the node and that it has the flag file which we remove here, indicating a reboot happened which prevents a race when issuing the reboot takes longer (both the systemctl reboot and PXE case)
# Just in case the connection breaks and SSH may report an error code but still execute successfully, we will first check file existence and then delete with "rm -f" to be able to rerun both commands.
# This sequence gives us the same error reporting as just running "rm" once.
while [ $count -gt 0 ] && ! ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NumberOfPasswordPrompts=0 core@$domain test -f /ignition_ran; do
sleep 1
count=$((count - 1))
done
if [ $count -eq 0 ]; then
echo "error: failed verifying with SSH if $domain came up by checking the /ignition_ran flag file"
exit 1
fi
count=5
while [ $count -gt 0 ] && ! ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o NumberOfPasswordPrompts=0 core@$domain sudo rm -f /ignition_ran; do
sleep 1
count=$((count - 1))
done
if [ $count -eq 0 ]; then
echo "error: failed to remove the /ignition_ran flag file on $domain"
exit 1
else
echo "$domain came up again"
fi
# only write the state file once the system is up; this allows rerunning lokoctl to try again if the first PXE boot did not work
echo $domain > "$asset_dir/$mac"
14 changes: 14 additions & 0 deletions assets/terraform-modules/matchbox-flatcar/ssh.tf
@@ -0,0 +1,14 @@
resource "null_resource" "reprovision-node-when-ignition-changes" {
# Triggered when the Ignition Config changes
triggers = {
ignition_config = matchbox_profile.node.raw_ignition
kernel_args = join(" ", var.kernel_args)
kernel_console = join(" ", var.kernel_console)
}
# Wait for the new Ignition config object to be ready before rebooting
depends_on = [matchbox_group.node]
# Trigger running Ignition on the next reboot (first_boot flag file) and reboot the instance, or, if the instance needs to be (re)provisioned, run external commands for PXE booting (also runs on the first provisioning)
provisioner "local-exec" {
command = templatefile("${path.module}/pxe-helper.sh.tmpl", { domain = var.node_domain, name = var.node_name, mac = var.node_mac, pxe_commands = var.pxe_commands, asset_dir = var.asset_dir, kernel_args = join(" ", var.kernel_args), kernel_console = join(" ", var.kernel_console), ignition_endpoint = format("%s/ignition", var.http_endpoint), ignore_changes = var.ignore_changes })
}
}
@@ -67,6 +67,7 @@ storage:
echo 'set linux_append="${kernel_args} ignition.config.url=${ignition_endpoint}?mac=${mac_address}&os=installed"' >> /tmp/oemfs/grub.cfg
echo 'set linux_console="${kernel_console}"' >> /tmp/oemfs/grub.cfg
umount /tmp/oemfs
${install_pre_reboot_cmds}
systemctl reboot
passwd:
users:
28 changes: 28 additions & 0 deletions assets/terraform-modules/matchbox-flatcar/variables.tf
@@ -88,3 +88,31 @@ variable "wipe_additional_disks" {
description = "Wipes any additional disks attached, if set to true"
default = false
}

variable "ignore_changes" {
description = "When set to true, ignores the reprovisioning of the node (unless the MAC address flag file is removed to force a PXE install)."
type = bool
default = false
}

variable "asset_dir" {
description = "Path to a directory where generated assets should be placed (contains secrets)"
type = string
}

variable "node_domain" {
type = string
description = "Node FQDN (e.g node1.example.com)."
}

variable "pxe_commands" {
type = string
description = "shell commands to execute for PXE (re)provisioning, with access to the variables $mac (the MAC address), $name (the node name), and $domain (the domain name), e.g., 'bmc=bmc-$domain; ipmitool -H $bmc power off; ipmitool -H $bmc chassis bootdev pxe; ipmitool -H $bmc power on'."
default = "echo 'you must (re)provision the node by booting via iPXE from http://MATCHBOX/boot.ipxe'; exit 1"
}

variable "install_pre_reboot_cmds" {
type = string
description = "shell commands to execute on the provisioned host after installation finished and before reboot, e.g., docker run --privileged --net host --rm debian sh -c 'apt update && apt install -y ipmitool && ipmitool chassis bootdev disk options=persistent'."
default = "true"
}
1 change: 1 addition & 0 deletions ci/baremetal/baremetal-cluster.lokocfg.envsubst
@@ -31,6 +31,7 @@ cluster "bare-metal" {
"node2",
"node3",
]
pxe_commands = "true" # The VMs are booted up outside of the CI Docker image at the right time already; we will not reprovision, nor could we, because the VMs are managed at another level
# Adds oidc flags to API server with default values.
# Acts as a smoke test to check if API server is functional after addition
# of extra flags.
14 changes: 14 additions & 0 deletions docs/configuration-reference/platforms/baremetal.md
@@ -132,6 +132,8 @@ cluster "bare-metal" {
network_ip_auto_detection = "can-reach=172.18.169.0"
wipe_additional_disks = true
pxe_commands = "bmc=bmc-$node; ipmitool -H $bmc power off; ipmitool -H $bmc chassis bootdev pxe; ipmitool -H $bmc power on"
}
```

@@ -158,6 +160,16 @@ os_version = var.custom_default_os_version
```

You should set a valid `pxe_commands` value to automate the provisioning, e.g., with IPMI as sketched above, or a bit more reliably
as done in [Racker here](https://github.com/kinvolk/racker/blob/0.1.5/bootstrap/pxe-boot.sh), or for libvirt VMs with some
`virsh …; virt-install …` commands as done in [Racker here](https://github.com/kinvolk/racker/blob/0.1.5/bootstrap/prepare.sh#L637).
PXE boots are normally only needed for the first OS installation and as long as SSH works they will be skipped for regular reprovisioning.
This is controlled by the existence of the MAC address flag file under `cluster-assets` and you can delete it to force a PXE installation
when the next reprovisioning takes place after a userdata change.

Depending on how the PXE boot was forced, you should ensure that persistent booting from disk is configured again once the PXE
installation is done, by setting an appropriate value for `install_pre_reboot_cmds`.

## Attribute reference


@@ -201,6 +213,8 @@ os_version = var.custom_default_os_version
| `oidc.client_id` | A client id that all tokens must be issued for. | "clusterauth" | string | false |
| `oidc.username_claim` | JWT claim to use as the user name. | "email" | string | false |
| `oidc.groups_claim` | JWT claim to use as the user’s group. | "groups" | string | false |
| `pxe_commands` | Shell commands to execute for PXE (re)provisioning, with access to the variables $mac (the MAC address), $name (the node name), and $domain (the domain name), e.g., `bmc=bmc-$domain; ipmitool -H $bmc power off; ipmitool -H $bmc chassis bootdev pxe; ipmitool -H $bmc power on` | "echo 'you must (re)provision the node by booting via iPXE from http://MATCHBOX/boot.ipxe'; exit 1" | string | false |
| `install_pre_reboot_cmds` | shell commands to execute on the provisioned host after installation finished and before reboot, e.g., `docker run --privileged --net host --rm debian sh -c 'apt update && apt install -y ipmitool && ipmitool chassis bootdev disk options=persistent'` | "true" (a no-op) | string | false |
| `conntrack_max_per_core` | Maximum number of entries in conntrack table per CPU on all nodes in the cluster. If you require more fine-grained control over this value, set it to 0 and add a CLC snippet setting the `net.netfilter.nf_conntrack_max` sysctl per node pool. See [Flatcar documentation about sysctl](https://docs.flatcar-linux.org/os/other-settings/#tuning-sysctl-parameters) for more details. | 32768 | number | false |
| `wipe_additional_disks` | Wipes any additional disks attached to the machine. | false | bool | false |

14 changes: 11 additions & 3 deletions docs/quickstarts/baremetal.md
@@ -8,21 +8,24 @@ weight: 10
This quickstart guide walks through the steps needed to create a Lokomotive cluster on bare metal with
Flatcar Container Linux utilizing PXE.

We recommend you to check out [Racker](https://github.com/kinvolk/racker) for an integrated solution
based on the Lokomotive bare metal platform.

By the end of this guide, you'll have a working Kubernetes cluster with 1 controller node and 2
worker nodes.

## Requirements

* Basic understanding of Kubernetes concepts.
* Terraform v0.13.x installed locally.
* Machines with at least 2GB RAM, 30GB disk, PXE-enabled NIC and IPMI.
* Machines with at least 2.5GB RAM, 30GB disk, PXE-enabled NIC and IPMI.
* PXE-enabled [network boot](https://coreos.com/matchbox/docs/latest/network-setup.html) environment.
* Matchbox v0.6+ deployment with API enabled.
* Matchbox credentials `client.crt`, `client.key`, `ca.crt`.
* An SSH key pair for management access.
* `kubectl` installed locally to access the Kubernetes cluster.

Note that the machines should only be powered on after starting the installation, see below.
Note that without a proper `pxe_commands` value the machines should only be powered on manually after starting the installation; see below.

## Steps

@@ -171,6 +174,9 @@ cluster "bare-metal" {
"node2",
"node3",
]
# Automation to force a PXE boot, a dummy sleep for now as you will do it manually.
pxe_commands = "sleep 300"
}
```
@@ -197,7 +203,9 @@ Run the following command to create the cluster:
lokoctl cluster apply
```

**Proceed to Power on the PXE machines while this loops.**
**Proceed to Power on the PXE machines while this loops, but only after Matchbox has the configuration ready.**
See the [configuration reference](../configuration-reference/platforms/baremetal.md) on how to properly configure `pxe_commands`
to automate the provisioning reliably.

Once the command finishes, your Lokomotive cluster details are stored in the path you've specified
under `asset_dir`.
