Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ssh runner times out #3334

Open
LucienBrule opened this issue Sep 6, 2022 · 6 comments
Open

[BUG] ssh runner times out #3334

LucienBrule opened this issue Sep 6, 2022 · 6 comments
Labels
nested-virt To identify issues that are related to nested virtualization

Comments

@LucienBrule
Copy link

General information

  • OS: Linux
  • Hypervisor: KVM under ESXI
  • Did you run crc setup before starting it (Yes/No)? Yes
  • Running CRC on: VM

CRC version

CRC version: 2.8.0+9afa0b95
OpenShift version: 4.11.1
Podman version: 4.1.1

CRC status

[developer@devbox]~% crc status --log-level debug
DEBU CRC version: 2.8.0+9afa0b95
DEBU OpenShift version: 4.11.1
DEBU Podman version: 4.1.1
DEBU Running 'crc status'
DEBU Checking file: /home/developer/.crc/machines/crc/.crc-exist
DEBU Checking file: /home/developer/.crc/machines/crc/.crc-exist
DEBU Found binary path at /home/developer/.crc/bin/crc-driver-libvirt
DEBU Launching plugin server for driver libvirt
DEBU Plugin server listening at address 127.0.0.1:36001
DEBU () Calling .GetVersion
DEBU Using API Version 1
DEBU () Calling .SetConfigRaw
DEBU () Calling .GetMachineName
DEBU (crc) Calling .GetBundleName
DEBU (crc) Calling .GetState
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="Getting current state..."
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="Fetching VM..."
DEBU (crc) Calling .GetIP
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="GetIP called for crc"
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="Getting current state..."
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="IP address: 192.168.130.11"
DEBU (crc) Calling .GetIP
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="GetIP called for crc"
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="Getting current state..."
DEBU (crc) DBG | time="2022-09-06T09:50:38-04:00" level=debug msg="IP address: 192.168.130.11"
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1
DEBU Using ssh private keys: [/home/developer/.crc/machines/crc/id_ecdsa /home/developer/.crc/cache/crc_libvirt_4.11.1_amd64/id_ecdsa_crc]
DEBU SSH command results: err: <nil>, output: 32737570816 13514244096 /sysroot
DEBU authentication operator is still progressing, Reason: OAuthServerDeployment_NewGeneration
DEBU Making call to close driver server
DEBU (crc) Calling .Close
DEBU Successfully made call to close driver server
DEBU Making call to close connection to plugin binary
DEBU (crc) DBG | time="2022-09-06T09:50:40-04:00" level=debug msg="Closing plugin on server side"
CRC VM:          Running
OpenShift:       Starting (v4.11.1)
Podman:
Disk Usage:      13.51GB of 32.74GB (Inside the CRC VM)
Cache Usage:     34.13GB
Cache Directory: /home/developer/.crc/cache

CRC config

[developer@devbox]~% crc config view
- consent-telemetry                     : yes
- cpus                                  : 8
- enable-shared-dirs                    : false
- kubeadmin-password                    : developer
- nameserver                            : 1.1.1.1
- pull-secret-file                      : /home/developer/pull-secret.txt

Host Operating System

NAME="Fedora Linux"
VERSION="36 (Server Edition)"
ID=fedora
VERSION_ID=36
VERSION_CODENAME=""
PLATFORM_ID="platform:f36"
PRETTY_NAME="Fedora Linux 36 (Server Edition)"
ANSI_COLOR="0;38;2;60;110;180"
LOGO=fedora-logo-icon
CPE_NAME="cpe:/o:fedoraproject:fedora:36"
HOME_URL="https://fedoraproject.org/"
DOCUMENTATION_URL="https://docs.fedoraproject.org/en-US/fedora/f36/system-administrators-guide/"
SUPPORT_URL="https://ask.fedoraproject.org/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"
REDHAT_BUGZILLA_PRODUCT="Fedora"
REDHAT_BUGZILLA_PRODUCT_VERSION=36
REDHAT_SUPPORT_PRODUCT="Fedora"
REDHAT_SUPPORT_PRODUCT_VERSION=36
PRIVACY_POLICY_URL="https://fedoraproject.org/wiki/Legal:PrivacyPolicy"
VARIANT="Server Edition"
VARIANT_ID=server

kernel: 5.18.19-200.fc36.x86_64

Steps to reproduce

  1. Install ESXI 7
  2. Create Fedora 36 VM from ESXI console with iso
  3. Install operating system
  4. Get shell into os
  5. install crc
  6. run crc setup
  7. run crc start

Expected

crc start doesn't time out.

Actual

crc start hangs when attempting to get ssh access to the VM.
The timeout is only 300 seconds as per https://github.com/code-ready/crc/blob/main/pkg/crc/machine/start.go#L335

Solution

I pulled the code and made the timeout 10000 seconds. After waiting about 15 minutes, it came up and finished installing successfully.

Etc

So I monitored the crc vm with cockpit, when booting it hangs for about 10 minutes at 'Booting RHEL FCOS' (the first line before it goes into console spam mode/systemd). Of interest, other VMs run under KVM don't have this long boot time problem, including the stock FCOS image. So my fix of increasing the timeout works for now but perhaps there's something that can be done about the boot time.

@LucienBrule LucienBrule added kind/bug Something isn't working status/need triage labels Sep 6, 2022
@gbraad
Copy link
Contributor

gbraad commented Sep 7, 2022

Nested virtualization setups aren't supported.
This is most likely the reason why the functions times out.
We can not reproduce these issues.
In the docs this is suggested to be run on a native hypervisor.

Note: @code-ready/crc-team we could add a timeout override that will just ignore these values until the user decides to break out.

@gbraad gbraad added nested-virt To identify issues that are related to nested virtualization and removed kind/bug Something isn't working status/need triage labels Sep 7, 2022
@gbraad
Copy link
Contributor

gbraad commented Sep 8, 2022

We have discussed to ignore or extend the ssh-timeout, but this is not a common issue.
We would like to understand more why the boot is taking longer in your case.

@praveenkumar
Copy link
Member

@LucienBrule Can you share sudo journalctl -b from the vm. You can use https://github.com/code-ready/crc/wiki/Debugging-guide to access the VM.

@polatsinan
Copy link

I am having more or less the same problem during starting crc.
Start command:

crc start -c 6 -m 11500 -p pull-secret.txt

Error:

INFO Checking if running as non-root              
INFO Checking if running inside WSL2              
INFO Checking if crc-admin-helper executable is cached 
INFO Checking for obsolete admin-helper executable 
INFO Checking if running on a supported CPU architecture 
INFO Checking minimum RAM requirements            
INFO Checking if crc executable symlink exists    
INFO Checking if Virtualization is enabled        
INFO Checking if KVM is enabled                   
INFO Checking if libvirt is installed             
INFO Checking if user is part of libvirt group    
INFO Checking if active user/process is currently part of the libvirt group 
INFO Checking if libvirt daemon is running        
INFO Checking if a supported libvirt version is installed 
INFO Checking if crc-driver-libvirt is installed  
INFO Checking crc daemon systemd socket units     
INFO Checking if systemd-networkd is running      
INFO Checking if NetworkManager is installed      
INFO Checking if NetworkManager service is running 
INFO Checking if /etc/NetworkManager/conf.d/crc-nm-dnsmasq.conf exists 
INFO Checking if /etc/NetworkManager/dnsmasq.d/crc.conf exists 
INFO Checking if libvirt 'crc' network is available 
INFO Checking if libvirt 'crc' network is active  
INFO Loading bundle: crc_libvirt_4.11.1_amd64...  
INFO Creating CRC VM for openshift 4.11.1...      
INFO Generating new SSH key pair...               
INFO Generating new password for the kubeadmin user 
INFO Starting CRC VM for openshift 4.11.1...      
INFO CRC instance is running with IP 192.168.130.11 
INFO CRC VM is running                            
INFO Updating authorized keys...                  
INFO Configuring shared directories               
INFO Check internal and public DNS query...       
INFO Check DNS query from host...                 
INFO Verifying validity of the kubelet certificates... 
INFO Starting kubelet service                     
INFO Waiting for kube-apiserver availability... [takes around 2min] 
Temporary error: ssh command error:
command : timeout 5s oc get cm --context admin --cluster crc --kubeconfig /opt/kubeconfig
err     : Process exited with status 124
 (x13)

General information

  • OS: Linux
  • Hypervisor: KVM
  • Did you run crc setup before starting it (Yes/No)? Yes
  • Running CRC on: VM

CRC version

CRC version: 2.8.0+217b3bd
OpenShift version: 4.11.1
Podman version: 4.1.1

CRC status

DEBU CRC version: 2.8.0+217b3bd                   
DEBU OpenShift version: 4.11.1                    
DEBU Podman version: 4.1.1                        
DEBU Running 'crc status'                         
DEBU Checking file: /home/student/.crc/machines/crc/.crc-exist 
DEBU Checking file: /home/student/.crc/machines/crc/.crc-exist 
DEBU Found binary path at /home/student/.crc/bin/crc-driver-libvirt 
DEBU Launching plugin server for driver libvirt   
DEBU Plugin server listening at address 127.0.0.1:42519 
DEBU () Calling .GetVersion                       
DEBU Using API Version 1                          
DEBU () Calling .SetConfigRaw                     
DEBU () Calling .GetMachineName                   
DEBU (crc) Calling .GetBundleName                 
DEBU (crc) Calling .GetState                      
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="Getting current state..." 
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="Fetching VM..." 
DEBU (crc) Calling .GetIP                         
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="GetIP called for crc" 
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="Getting current state..." 
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="IP address: 192.168.130.11" 
DEBU (crc) Calling .GetIP                         
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="GetIP called for crc" 
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="Getting current state..." 
DEBU Running SSH command: df -B1 --output=size,used,target /sysroot | tail -1 
DEBU (crc) DBG | time="2022-09-10T08:40:07+02:00" level=debug msg="IP address: 192.168.130.11" 
DEBU Using ssh private keys: [/home/student/.crc/machines/crc/id_ecdsa /home/student/.crc/cache/crc_libvirt_4.11.1_amd64/id_ecdsa_crc] 
DEBU SSH command results: err: <nil>, output: 32737570816 12010016768 /sysroot 
DEBU cannot get OpenShift status: stat /home/student/.crc/machines/crc/kubeconfig: no such file or directory 
DEBU Making call to close driver server           
DEBU (crc) Calling .Close                         
DEBU (crc) DBG | time="2022-09-10T08:40:33+02:00" level=debug msg="Closing plugin on server side" 
DEBU Successfully made call to close driver server 
DEBU Making call to close connection to plugin binary 
CRC VM:          Running
OpenShift:       Unreachable (v4.11.1)
Podman:          
Disk Usage:      12.01GB of 32.74GB (Inside the CRC VM)
Cache Usage:     17.04GB
Cache Directory: /home/student/.crc/cache

CRC config

- consent-telemetry                     : no

Host Operating System

NAME="Red Hat Enterprise Linux"
VERSION="8.6 (Ootpa)"
ID="rhel"
ID_LIKE="fedora"
VERSION_ID="8.6"
PLATFORM_ID="platform:el8"
PRETTY_NAME="Red Hat Enterprise Linux 8.6 (Ootpa)"
ANSI_COLOR="0;31"
CPE_NAME="cpe:/o:redhat:enterprise_linux:8::baseos"
HOME_URL="https://www.redhat.com/"
DOCUMENTATION_URL="https://access.redhat.com/documentation/red_hat_enterprise_linux/8/"
BUG_REPORT_URL="https://bugzilla.redhat.com/"

REDHAT_BUGZILLA_PRODUCT="Red Hat Enterprise Linux 8"
REDHAT_BUGZILLA_PRODUCT_VERSION=8.6
REDHAT_SUPPORT_PRODUCT="Red Hat Enterprise Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="8.6"

@LucienBrule
Copy link
Author

@praveenkumar journalctl output

Also here's the output of oc adm must-gather
must-gather.zip


Take 1

asciicast

Highlights

Time Event
0:10 Starting CRC VM for openshift 4.11.1...
0:24 Waiting for machine to come up 1/60
3:20 CRC instance is running with IP
3:20 Waiting until ssh is available
3:20 ssh retry loop (1)
12:50 ssh retry loop (533)
12:53 Ctrl C
13:10 run nmap
13:30 curl the host
14:00 ssh with incorrect keys
14:24 ssh with correct keys
14:50 journalctl

From the recording above, the crc start command got stuck waiting for the machine to come up under libvirt (ending at at 3:20). It got to retry attempt 60/60, called GetConfigRaw, re checked if the machine was up and in that check it was up.

So now the instance is assumed to be up because it has an ip address assigned to it. It then attempts to ssh into the machine, entering a retry loop.

I exited the crc start command at 12:53, and eventually terminated the machine.

There's no recording of this, but here's what I did:

$ sudo virsh shutdown crc

because crc stop hung too.

Take 2

asciicast

Highlights

Time Event
00:08 crc start --log-level-debug
00:10 calls .Start
00:17 SSH retry loop attempt 0
00:30 Booting RHEL CoreOS
06:08 Linux boot sequence
06:27 Welcome to RHEL ... Initramfs!
06:47 core login prompt
08:12 Starting openshift instance..
09:00 ssh into crc vm
09:46 journalctl -f on crc vm
18:13 Error Cluster is not ready: cluster operators...
18:14 Update kubeconfig
18:15 display creds, ending command
18:57 Login as kubeadmin (success)
21:14 oc debug node/crc
25:13 access openshift console (watch me fail at curl)

Work Around

My current work around is:

  1. Used patched binary that allows longer timeouts
  2. Start the machine the first time (when its created)
  3. Kill the machine with virsh
  4. Start the machine a second time
  5. Either get success or ignore the ssh timeout error
  6. use credentials to login
  7. Wait about 10 minutes
  8. Carry on as usual (access console/ deploy / etc)

@gbraad I think you can get a sense of where the boot is hanging from the asciinema.

I understand nested virtualization is not a supported use case, however it does appear to work when the timeouts are increased.

It would be cool if the timeouts could be set via config though.

@praveenkumar
Copy link
Member

@LucienBrule Thank you for providing all the details, we will discuss it in coming week and try to put a plan in place.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
nested-virt To identify issues that are related to nested virtualization
Projects
None yet
Development

No branches or pull requests

4 participants