
[BUG] new node can't join the cluster if time is not synced #2860

Closed
bk201 opened this issue Oct 3, 2022 · 8 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release priority/0 Must be fixed in this release reproduce/always Reproducible 100% of the time severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact)

@bk201
Member

bk201 commented Oct 3, 2022

Describe the bug

A new node can't join an existing cluster if its time is not in sync.

To Reproduce
Steps to reproduce the behavior:

  1. Prepare a new node.
  2. Configure the BIOS date and time to a point in the future, then install Harvester with time synchronization against an NTP server.
  3. The node can't join the cluster; the rke2 log shows:
Sep 05 07:59:00 <***> rke2[2657]: time="2022-09-05T07:59:00Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: CA cert validation failed: Get \"https://127.0.0.1:9345/cacerts\": x509: certificate has expired or is not yet valid: current time 2022-09-05T07:59:00Z is before 2022-09-05T08:16:15Z"
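The error message itself quantifies the skew: the cert's notBefore timestamp is later than the node's clock. A minimal sketch of extracting both timestamps and computing the difference, assuming GNU `date` (the parsing is illustrative only; the log text is copied from above):

```shell
# Extract the two timestamps from the x509 error and compute the clock skew.
log='CA cert validation failed: x509: certificate has expired or is not yet valid: current time 2022-09-05T07:59:00Z is before 2022-09-05T08:16:15Z'
now=$(echo "$log" | grep -oE 'current time [0-9TZ:-]+' | awk '{print $3}')
not_before=$(echo "$log" | grep -oE 'before [0-9TZ:-]+' | awk '{print $2}')
# Difference between the cert's notBefore and the node's current time:
skew=$(( $(date -u -d "$not_before" +%s) - $(date -u -d "$now" +%s) ))
echo "clock skew: ${skew}s"
```

Here the node's clock is 1035 s (17m15s) behind the CA cert's validity window, which is why validation fails.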

Expected behavior

The node can join.

Support bundle

Environment

  • Harvester ISO version: v1.0.3.
  • Underlying Infrastructure (e.g. Baremetal with Dell PowerEdge R630):

Additional context
Bringing the clock into sync takes time; one option is to have rancherd wait for time synchronization before bootstrapping.
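One hedged sketch of such a wait, assuming a systemd host. The generic `wait_for` helper and the 180-second timeout are illustrative choices, not rancherd's actual implementation:

```shell
# Poll a command until it succeeds or a deadline passes.
wait_for() {  # usage: wait_for <timeout_seconds> <command...>
  deadline=$(( $(date +%s) + $1 )); shift
  until "$@"; do
    [ "$(date +%s)" -ge "$deadline" ] && return 1
    sleep 1
  done
}

# On a systemd host, one could gate startup on kernel time sync like this:
#   wait_for 180 sh -c 'timedatectl show -p NTPSynchronized --value | grep -qx yes'
```

The caller then decides whether a timeout is fatal or whether to proceed anyway (the behavior Test 2 below exercises for unreachable NTP).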

@bk201 bk201 added kind/bug Issues that are defects reported by users or that we know have reached a real release priority/0 Must be fixed in this release severity/2 Function working but has a major issue w/o workaround (a major incident with significant impact) reproduce/always Reproducible 100% of the time labels Oct 3, 2022
@tjjh89017
Contributor

tjjh89017 commented Oct 6, 2022

Workaround

Adjust the BIOS time to the correct value before installing Harvester on the first node.

Reproduce steps

Manual install

  1. Install Harvester on a bare-metal server.
  2. Don't boot into Harvester yet.
  3. Enter the BIOS and shift the time into the future (tested with the next month and the next year).
  4. Boot into Harvester.
  5. Run journalctl -u rke2-server.
  6. The log shows the CA cert problem.
  7. Log in to the first node (create-mode node) and check that the creation dates of the files in /var/lib/rancher/rke2/server/tls/* are in the future.

Auto install (PXE install)

  1. Enter the BIOS and shift the time into the future (tested with the next month and the next year).
  2. Execute the PXE install.
  3. Boot into Harvester.
  4. Run journalctl -u rke2-server.
  5. The log shows the CA cert problem.
  6. Log in to the first node (create-mode node) and check that the creation dates of the files in /var/lib/rancher/rke2/server/tls/* are in the future.
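The cert-date check in the last step can be scripted: `openssl x509` prints a certificate's validity window directly. A minimal sketch using a throwaway self-signed cert (a real check would point at the files under /var/lib/rancher/rke2/server/tls/ instead):

```shell
# Generate a throwaway self-signed cert, then read its validity window the
# same way you would inspect the rke2 server certs mentioned above.
tmp=$(mktemp -d)
openssl req -x509 -newkey rsa:2048 -nodes -subj "/CN=demo" -days 1 \
  -keyout "$tmp/key.pem" -out "$tmp/cert.pem" 2>/dev/null
validity=$(openssl x509 -in "$tmp/cert.pem" -noout -startdate -enddate)
echo "$validity"   # notBefore=... and notAfter=... lines
rm -rf "$tmp"
```

If the notBefore date is in the future relative to a correctly synced clock, the node hit this bug.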

@tjjh89017
Contributor

Possible solution:

We should start the rancherd/rke2-server/rke2-agent services after systemd-time-wait-sync.service,
and enable systemd-time-wait-sync.service.

Override the systemd service config with After=<original target/service> systemd-time-wait-sync.service,
and run systemctl enable systemd-time-wait-sync.service.

Note

This was only tested with a correct and properly configured NTP server.
It has not been tested without NTP, or with a slow or unreachable NTP server.
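In systemd, list-type settings such as After= in a drop-in are merged with the unit file rather than replacing it, so the override proposed above can be a small drop-in. A sketch for rke2-server (the drop-in path and unit name match the `override.conf` shown in the status outputs later in this issue):

```ini
# /etc/systemd/system/rke2-server.service.d/override.conf
# After= in a drop-in is additive, so the unit's original ordering is kept.
[Unit]
After=systemd-time-wait-sync.service
```

Followed by `systemctl daemon-reload` and `systemctl enable systemd-time-wait-sync.service`. Without enabling the wait service, the After= ordering alone has no effect.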

@harvesterhci-io-github-bot
Collaborator

harvesterhci-io-github-bot commented Oct 25, 2022

Pre Ready-For-Testing Checklist

  • If NOT labeled: not-require/test-plan Has the e2e test plan been merged? Have QAs agreed on the automation test case? If only test case skeleton w/o implementation, have you created an implementation issue?
    • The automation skeleton PR is at:
    • The automation test case PR is at:

  • ~~If the fix introduces code for backward compatibility, has a separate issue been filed with the label release/obsolete-compatibility?~~
    • The compatibility issue is filed at:

@harvesterhci-io-github-bot
Collaborator

Automation e2e test issue: harvester/tests#576

@tjjh89017
Contributor

tjjh89017 commented Oct 25, 2022

Test step

It's not easy to test this issue in a VM because we cannot control the RTC (hardware clock) of a VM.

Test 1 (correct NTP)

rancherd and rke2 should wait for time sync.

  1. Prepare two or more bare-metal machines.
  2. Enter the BIOS of both machines and shift the time into the future (e.g. next year).
  3. Install Harvester with a proper NTP server configured.
  4. Enter a root shell and check the system date with the date command.
  5. Check that rancherd and rke2-server have started.
  6. Check that the 2nd node can join the cluster.

Test 2 (cannot connect to NTP server or without NTP)

If the NTP server is unreachable or there is some other NTP problem, skip waiting for NTP and start rancherd and rke2 anyway.

  1. Prepare one bare-metal machine.
  2. Enter the BIOS and shift the time into the future (e.g. next year).
  3. Install Harvester in an airgapped environment without NTP.
  4. Boot into Harvester and wait for 3 minutes.
  5. Enter a root shell and check the systemd-time-wait-sync.service status (systemctl status systemd-time-wait-sync.service).
  6. The status should not be "Active: active (exited)"; it should be failed or another error.
  7. Check that rancherd and rke2-server are working (systemctl status rancherd.service rke2-server.service).
  8. rancherd should have finished bootstrap.
  9. rke2-server or rke2-agent should be "active (running)".
  10. The 2nd node may not be able to join the cluster because the time is out of sync. (Expected)
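Steps 5-9 can be scripted by parsing the Active: field of `systemctl status` output. A sketch over captured status text; the sample lines mirror the outputs reported below, and on a live node you would feed it `systemctl status <unit>` output instead:

```shell
# Extract the unit state ("active", "failed", ...) from `systemctl status` text.
check_active() { printf '%s\n' "$1" | sed -n 's/^ *Active: \([a-z]*\).*/\1/p'; }

# Sample lines mirroring the statuses reported later in this issue:
wait_sync='     Active: failed (Result: timeout) since Fri 2024-01-05 08:10:01 UTC; 13min ago'
rke2='     Active: active (running) since Fri 2024-01-05 08:12:53 UTC; 1h 1min ago'

check_active "$wait_sync"   # failed
check_active "$rke2"        # active
```

This matches the expected outcome: the time-sync wait fails in the airgapped case, but rke2-server still comes up.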

@TachunLin TachunLin self-assigned this Jan 3, 2023
@TachunLin

Test 1 result (correct NTP): PASS

  • The second node can join the cluster when the BIOS time is not synced.

  • The third node can join the cluster when the BIOS time is not synced.

  • Current date 2023/01/04 GMT+8

  • Date in node 1 bios settings 2023/01/05

  • Date in node 2 bios settings 2023/01/06

  • Date in node 3 bios settings 2023/01/07

  • Check the system date on the Harvester node:

rancher@node1:~> date
Wed 04 Jan 2023 11:15:08 AM UTC
rancher@node1:~> sudo systemctl status rancherd
● rancherd.service - Rancher Bootstrap
     Loaded: loaded (/lib/systemd/system/rancherd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/rancherd.service.d
             └─override.conf
     Active: inactive (dead) since Wed 2023-01-04 11:02:54 UTC; 14min ago
       Docs: https://github.com/rancher/rancherd
    Process: 2048 ExecStart=/usr/bin/rancherd bootstrap (code=exited, status=0/SUCCESS)
   Main PID: 2048 (code=exited, status=0/SUCCESS)

Jan 04 11:02:53 node1 rancherd[2048]: time="2023-01-04T11:02:53Z" level=info msg="[stdout]: deployment \"system-upgrade-controller\" successfully rolled out"
Jan 04 11:02:53 node1 rancherd[2048]: time="2023-01-04T11:02:53Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20230104-105959-app>
Jan 04 11:02:53 node1 rancherd[2048]: time="2023-01-04T11:02:53Z" level=info msg="Running command: /usr/bin/rancherd [retry /var/lib/rancher/rke2/bin/kubectl -n cattle-system wait --for=con>
Jan 04 11:02:54 node1 rancherd[2048]: time="2023-01-04T11:02:54Z" level=info msg="[stdout]: plan.upgrade.cattle.io/system-agent-upgrader condition met"
Jan 04 11:02:54 node1 rancherd[2048]: time="2023-01-04T11:02:54Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20230104-105959-app>
Jan 04 11:02:54 node1 rancherd[2048]: time="2023-01-04T11:02:54Z" level=info msg="Running command: /usr/bin/rancherd [retry /var/lib/rancher/rke2/bin/kubectl -n fleet-local wait --for=condi>
Jan 04 11:02:54 node1 rancherd[2048]: time="2023-01-04T11:02:54Z" level=info msg="[stdout]: cluster.provisioning.cattle.io/local condition met"
Jan 04 11:02:54 node1 rancherd[2048]: time="2023-01-04T11:02:54Z" level=info msg="Successfully Bootstrapped Rancher (v2.6.9/v1.24.7+rke2r1)"
Jan 04 11:02:54 node1 systemd[1]: rancherd.service: Succeeded.
Jan 04 11:02:54 node1 systemd[1]: Finished Rancher Bootstrap.
rancher@node1:~> sudo systemctl status rke2-server
● rke2-server.service - Rancher Kubernetes Engine v2 (server)
     Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/rke2-server.service.d
             └─override.conf
     Active: active (running) since Wed 2023-01-04 11:02:50 UTC; 24min ago
       Docs: https://github.com/rancher/rke2#readme
    Process: 13180 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
    Process: 13182 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
    Process: 13184 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
    Process: 13186 ExecStartPre=/usr/sbin/harv-update-rke2-server-url server (code=exited, status=0/SUCCESS)
   Main PID: 13188 (rke2)
      Tasks: 944

@TachunLin

TachunLin commented Jan 4, 2023

Test 2 (cannot connect to NTP server or without NTP):

  • The systemd-time-wait-sync.service (Wait Until Kernel Time Synchronized) fails with a timeout:
node1:~ # systemctl status systemd-time-wait-sync.service
● systemd-time-wait-sync.service - Wait Until Kernel Time Synchronized
     Loaded: loaded (/usr/lib/systemd/system/systemd-time-wait-sync.service; enabled; vendor preset: disabled)
    Drop-In: /etc/systemd/system/systemd-time-wait-sync.service.d
             └─override.conf
     Active: failed (Result: timeout) since Fri 2024-01-05 08:10:01 UTC; 13min ago
       Docs: man:systemd-time-wait-sync.service(8)
   Main PID: 1254 (code=exited, status=0/SUCCESS)

Jan 05 08:07:01 node1 systemd-time-wait-sync[1254]: adjtime state 5 status 40 time Fri 2024-01-05 08:07:01.417680 UTC
Jan 05 08:10:01 node1 systemd[1]: systemd-time-wait-sync.service: start operation timed out. Terminating.
Jan 05 08:10:01 node1 systemd-time-wait-sync[1254]: Exit without adjtimex synchronized.
Jan 05 08:10:01 node1 systemd[1]: systemd-time-wait-sync.service: Failed with result 'timeout'.
Jan 05 08:10:01 node1 systemd[1]: Failed to start Wait Until Kernel Time Synchronized.

  • The rancherd.service finished bootstrap:
node1:~ #  systemctl status rancherd.service
● rancherd.service - Rancher Bootstrap
     Loaded: loaded (/lib/systemd/system/rancherd.service; enabled; vendor preset: enabled)
    Drop-In: /etc/systemd/system/rancherd.service.d
             └─override.conf
     Active: inactive (dead) since Fri 2024-01-05 08:12:56 UTC; 1h 0min ago
       Docs: https://github.com/rancher/rancherd
    Process: 2232 ExecStart=/usr/bin/rancherd bootstrap (code=exited, status=0/SUCCESS)
   Main PID: 2232 (code=exited, status=0/SUCCESS)

Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="[stdout]: deployment \"system-upgrade-controller\" successfully rolled out"
Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20240105-081001-app>
Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="Running command: /usr/bin/rancherd [retry /var/lib/rancher/rke2/bin/kubectl -n cattle-system wait --for=con>
Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="[stdout]: plan.upgrade.cattle.io/system-agent-upgrader condition met"
Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="No image provided, creating empty working directory /var/lib/rancher/rancherd/plan/work/20240105-081001-app>
Jan 05 08:12:55 node1 rancherd[2232]: time="2024-01-05T08:12:55Z" level=info msg="Running command: /usr/bin/rancherd [retry /var/lib/rancher/rke2/bin/kubectl -n fleet-local wait --for=condi>
Jan 05 08:12:56 node1 rancherd[2232]: time="2024-01-05T08:12:56Z" level=info msg="[stdout]: cluster.provisioning.cattle.io/local condition met"
Jan 05 08:12:56 node1 rancherd[2232]: time="2024-01-05T08:12:56Z" level=info msg="Successfully Bootstrapped Rancher (v2.6.9/v1.24.7+rke2r1)"
Jan 05 08:12:56 node1 systemd[1]: rancherd.service: Succeeded.
Jan 05 08:12:56 node1 systemd[1]: Finished Rancher Bootstrap.
  • The rke2-server status is active (running):

    node1:~ # systemctl status rke2-server.service
    ● rke2-server.service - Rancher Kubernetes Engine v2 (server)
         Loaded: loaded (/usr/local/lib/systemd/system/rke2-server.service; enabled; vendor preset: disabled)
        Drop-In: /etc/systemd/system/rke2-server.service.d
                 └─override.conf
         Active: active (running) since Fri 2024-01-05 08:12:53 UTC; 1h 1min ago
           Docs: https://github.com/rancher/rke2#readme
        Process: 13086 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service (code=exited, status=0/SUCCESS)
        Process: 13089 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
        Process: 13090 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
        Process: 13091 ExecStartPre=/usr/sbin/harv-update-rke2-server-url server (code=exited, status=0/SUCCESS)
       Main PID: 13093 (rke2)
          Tasks: 966
    
  • The 2nd node can't join the cluster because the time is out of sync, as expected.

@TachunLin

Verified fixed on master-a310138d-head (22/12/27). Closing this issue.

Result

Verified with the machine BIOS time out of sync:

  1. The new node can join the cluster if a correct NTP server is set.

  2. The new node can't join the cluster if no NTP server is available (airgapped).

Test Information

  • Test Environment: 3-node Harvester cluster on bare-metal machines
  • Harvester version: master-a310138d-head (22/12/27)

Verify Steps

#2860 (comment)
