
Fix etcd not starting up when using a custom access address #11388

Merged
merged 1 commit into kubernetes-sigs:master on Jul 26, 2024

Conversation

derselbst
Contributor

@derselbst derselbst commented Jul 17, 2024

What type of PR is this?

/kind bug

What this PR does / why we need it:

It is already possible to override the address of etcd via variables etcd_address and etcd_access_address. When configuring etcd to run in a DHCP network (nodes with dynamic IPs), it might be desirable to set

etcd_address: "0.0.0.0" # i.e. bind etcd to any address
etcd_access_address: "{{ ansible_fqdn }}" # access etcd via the host's FQDN

Currently, this results in a TLS error, because etcd's certificate only includes the bare hostname and IPs; it does not include etcd_access_address as a subject alternative name (SAN). Hence, when etcd attempts to communicate via a custom access address, the certificate is untrusted and etcd fails to start up.

This PR fixes this by including etcd_access_address as a SAN.
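
For illustration only (node1, node1.example.com, and 10.0.0.5 are made-up values, not taken from this PR), the [alt_names] section of the generated etcd openssl config would then contain roughly the following, with the FQDN from etcd_access_address present as an additional DNS entry:

[alt_names]
DNS.1 = localhost
# the node's own hostname
DNS.2 = node1
# etcd_access_address (here the host's FQDN) - the entry this PR adds
DNS.3 = node1.example.com
IP.1 = 127.0.0.1
# the node's IP address(es)
IP.2 = 10.0.0.5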

Which issue(s) this PR fixes:

None I'm aware of.

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

NONE

Fix etcd not starting up when using a custom access address

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Jul 17, 2024
@k8s-ci-robot
Contributor

Hi @derselbst. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@MrFreezeex
Member

Hi, thanks for the PR, this makes sense to me. However, could you revert the change you made to switch from using the host var to ansible_hostname? AFAICS those two can be different, as one represents the name in the inventory and the other is the hostname configured on the host, IIUC. Otherwise lgtm, thanks!

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 17, 2024
@derselbst
Contributor Author

Right, previously, host was expanded to inventory_hostname. This can be any name; in particular, it can differ from the true hostname. IMO, it doesn't make sense to put this arbitrary inventory_hostname into the SAN section of the certificate, which is why I believe using ansible_hostname is the correct approach here.

The compromise to keep old and new behavior, i.e.

DNS.{{ counter["dns"] }} = {{ host }}{{ increment(counter, 'dns') }}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_hostname'] }}{{ increment(counter, 'dns') }}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_fqdn'] }}{{ increment(counter, 'dns') }}

could result in duplicate entries if inventory_hostname == ansible_hostname, and I'm not sure whether that would cause issues.

Alternatively, I could completely remove the line
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_hostname'] }}{{ increment(counter, 'dns') }}

which means that it won't work when etcd_access_address is set to ansible_hostname AND inventory_hostname != ansible_hostname.

What's your suggestion?

@derselbst
Contributor Author

BTW, currently, kubespray's default behavior is to expand everything to IPs. I.e. most users will rely on the (static) IPs being listed in the SAN section for etcd to work correctly by default.

@MrFreezeex
Member

Right, previously, host was expanded to inventory_hostname. This can be any name; in particular, it can differ from the true hostname. IMO, it doesn't make sense to put this arbitrary inventory_hostname into the SAN section of the certificate, which is why I believe using ansible_hostname is the correct approach here.

FYI, I am not entirely sure the condition is ever triggered, but kubespray may populate /etc/hosts with that in some cases, via this code:

{%- if ('ansible_hostname' in hostvars[item] and item != hostvars[item]['ansible_hostname']) %} {{ hostvars[item]['ansible_hostname'] }}.{{ dns_domain }} {{ hostvars[item]['ansible_hostname'] }} {% else %} {{ item }}.{{ dns_domain }} {{ item }} {% endif %}

The compromise to keep old and new behavior, i.e.

DNS.{{ counter["dns"] }} = {{ host }}{{ increment(counter, 'dns') }}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_hostname'] }}{{ increment(counter, 'dns') }}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_fqdn'] }}{{ increment(counter, 'dns') }}

could result in duplicate entries if inventory_hostname == ansible_hostname, and I'm not sure whether that would cause issues.

Alternatively, I could completely remove the line DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_hostname'] }}{{ increment(counter, 'dns') }}

which means that it won't work when etcd_access_address is set to ansible_hostname AND inventory_hostname != ansible_hostname.

What's your suggestion?

You could either revert the change on inventory_hostname/ansible_hostname, which would not change the current behavior, or add both when the two differ, with an if condition, IMO.
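
A minimal sketch of the second option, reusing the counter/increment helpers from the existing template (illustrative only, not the exact change that was merged):

DNS.{{ counter["dns"] }} = {{ host }}{{ increment(counter, 'dns') }}
{% if 'ansible_hostname' in hostvars[host] and hostvars[host]['ansible_hostname'] != host %}
{# only add the real hostname when it differs from the inventory name, to avoid duplicate SAN entries #}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['ansible_hostname'] }}{{ increment(counter, 'dns') }}
{% endif %}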

@derselbst
Contributor Author

FYI, I am not entirely sure the condition is ever triggered, but kubespray may populate /etc/hosts with that in some cases, via this code:

The code you have linked here will always end up writing the ansible_hostname to /etc/hosts, no matter which condition evaluates to true or false:

  1. 'ansible_hostname' in hostvars[item] - I can hardly think of a case where ansible was unable to discover the host's name, so I consider this to always be true
  2. item != hostvars[item]['ansible_hostname']
    a. If it evaluates to true, it writes the ansible_hostname
    b. If it evaluates to false, it still writes the ansible_hostname, because that is identical to what item is set to.

My assumption in 1. will only fail if fact gathering has been disabled. Then, using inventory_hostname will truly differ from using ansible_hostname, as the latter is undefined. If that's your argument, please give me a final "you insist", and I will add extra guards requiring these ansible_* variables to be defined and ansible_hostname to be different from host, as you suggested. (IMO, that would make the logic unnecessarily complex and silently result in an incompletely populated etcd certificate when fact gathering has been disabled for whatever reason.)

@MrFreezeex
Member

MrFreezeex commented Jul 17, 2024

Ok, so yes, indeed the condition in the code that populates /etc/hosts does not look very useful, as we gather facts before executing preinstall. Also, the apiserver cert is already using ansible_hostname, so this sounds good to go as you are suggesting, thanks for digging into this 👍.

Two things though:

  • Could you highlight this switch/cleanup in the changelog, just in case someone uses the inventory hostname somehow?
  • Could you update the SANs with ansible_hostname + ansible_fqdn for etcd deployed through kubeadm, so that both deployment methods behave somewhat similarly:
    serverCertSANs:
    {% for san in etcd_cert_alt_names %}
    - "{{ san }}"
    {% endfor %}
    {% for san in etcd_cert_alt_ips %}
    - "{{ san }}"
    {% endfor %}
    peerCertSANs:
    {% for san in etcd_cert_alt_names %}
    - "{{ san }}"
    {% endfor %}
    {% for san in etcd_cert_alt_ips %}
    - "{{ san }}"
    {% endfor %}

Thanks again!

@derselbst derselbst changed the title Fix etcd certificate to include host's FQDN as SAN Fix etcd certificate to include etcd_access_address as SAN Jul 18, 2024
@derselbst derselbst changed the title Fix etcd certificate to include etcd_access_address as SAN Fix etcd not starting up when using a custom access address Jul 18, 2024
@derselbst
Contributor Author

Could you update the SANs with ansible_hostname + ansible_fqdn of etcd deployed through kubeadm

I wasn't aware that this deployment method existed. I looked into it. It is currently not necessary to update the SANs here, because etcd_access_addresses and etcd_address are not propagated when this deployment method is used. Because of this, the generated kubeadm config will look like this:

kind: ClusterConfiguration
clusterName: cluster.local
etcd:
  local:
    imageRepository: "quay.io/coreos"
    imageTag: "v3.5.10"
    dataDir: "/var/lib/etcd"
    extraArgs:
      metrics: basic
      election-timeout: "5000"
      heartbeat-interval: "250"
      auto-compaction-retention: "8"
      snapshot-count: "10000"
    serverCertSANs:
      - etcd.kube-system.svc.cluster.local
      - etcd.kube-system.svc
      - etcd.kube-system
      - etcd
    peerCertSANs:
      - etcd.kube-system.svc.cluster.local
      - etcd.kube-system.svc
      - etcd.kube-system
      - etcd

And the generated static pod manifest for the etcd pod will contain hardcoded IPs. Thus, there is currently no way to override any bind addresses, hence it is not required to adjust the SANs for this deployment method, as it currently does not behave "similarly" to the host deployment method.

Yet, I find the kubeadm deployment method interesting. I'll probably adopt it for my next deployment and when doing so, I'll have to make the two mentioned settings propagate correctly. So it's likely that I'll file a follow-up PR in a few weeks or so.

Could you highlight this switch/cleanup in the changelog just in case someone use inventory hostname somehow?

During the above exercise I found that a cleaner approach is to include the etcd_access_address of the groups[etcd] member nodes in that SAN section, rather than hardcoding ansible_fqdn and friends.

I have restored the previous line DNS.{{ counter["dns"] }} = {{ host }}{{ increment(counter, 'dns') }} - yet I still don't think this is necessary.

@@ -25,6 +25,9 @@ authorityKeyIdentifier=keyid:always,issuer
[alt_names]
DNS.1 = localhost
{% for host in groups['etcd'] %}
{# The address which etcd uses to access its members must be included in the SAN, otherwise etcd will fail with a TLS error upon startup. #}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['etcd_access_address'] }}{{ increment(counter, 'dns') }}
Member

@MrFreezeex MrFreezeex Jul 18, 2024


This sounds like a good idea, but this var is an IP by default, so this won't work in the default case. Maybe there should be a check similar to not (hostvars[host]['etcd_access_address'] | ansible.utils.ipaddr)? WDYT?

Contributor Author


Yeah, I saw failing tests complaining about it not being defined as well. I've added a guard.
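
For readers following along, a guard along those lines could look roughly like this, using the ansible.utils.ipaddr filter suggested above (a sketch under those assumptions, not necessarily the exact merged template):

{% if 'etcd_access_address' in hostvars[host] and not (hostvars[host]['etcd_access_address'] | ansible.utils.ipaddr) %}
{# sketch: only emit a DNS SAN when etcd_access_address is defined and is a name rather than an IP already covered by the IP SANs #}
DNS.{{ counter["dns"] }} = {{ hostvars[host]['etcd_access_address'] }}{{ increment(counter, 'dns') }}
{% endif %}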

Member

@MrFreezeex MrFreezeex left a comment


Thanks!
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jul 18, 2024
Contributor

@cyclinder cyclinder left a comment


thanks!

/lgtm

@yankay
Member

yankay commented Jul 26, 2024

Thanks @derselbst
/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: cyclinder, derselbst, MrFreezeex, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jul 26, 2024
@k8s-ci-robot k8s-ci-robot merged commit 242edd1 into kubernetes-sigs:master Jul 26, 2024
39 checks passed
@derselbst derselbst deleted the etcd-cert-fqdn branch July 26, 2024 15:33