Vagrant halt hangs on Debian VMs #10737

Open
termonio opened this issue Mar 17, 2019 · 5 comments

Comments

@termonio

termonio commented Mar 17, 2019

Background

  • vagrant 2.2.4

  • vagrant-libvirt 0.0.45

  • two Debian 9.8 VMs (one custom build, one provided by Vagrant Cloud) running on a KVM host

I use the following Vagrantfile

Vagrant.configure("2") do |config|
  config.vm.box = "generic/debian9"
  config.vm.provider :libvirt do |libvirt|
    libvirt.uri = "qemu+ssh://<user>@<serverip>/system"
  end
  config.ssh.proxy_command = "ssh -l <user> <serverip> nc %h %p"
  config.ssh.insert_key = false
end

Expected behaviour

vagrant halt shuts down the VM and the command terminates.

Actual behaviour

Spinning up the machines (vagrant up --provider=libvirt) and connecting via ssh (vagrant ssh) works as expected.

vagrant halt does shut down the VM (as seen for instance in VirtualMachineManager) but the command hangs and eventually needs a ctrl-c. vagrant global-status shows the instance still as running.

vagrant halt --debug ends with

/usr/bin/lsb_release -i 2>/dev/null | grep -qi 'debian' && exit
fi
if test -r /etc/issue; then
cat /etc/issue | grep -qi 'debian' && exit
fi
exit 1
 (sudo=false)
DEBUG ssh: stderr: 41e57d38-b4f7-4e46-9c38-13873d338b86-vagrant-ssh
DEBUG ssh: Exit status: 0
 INFO guest: Detected: debian!
DEBUG guest: Searching for cap: halt
DEBUG guest: Checking in: debian
DEBUG guest: Checking in: linux
DEBUG guest: Found cap: halt in linux
 INFO guest: Execute capability: halt [#<Vagrant::Machine: default (VagrantPlugins::ProviderLibvirt::Provider)>] (debian)
DEBUG ssh: Re-using SSH connection.
 INFO ssh: Execute: shutdown -h now (sudo=true)
DEBUG ssh: stderr: mesg: ttyname failed: Inappropriate ioctl for device

DEBUG ssh: stderr: 41e57d38-b4f7-4e46-9c38-13873d338b86-vagrant-ssh
DEBUG ssh: Sending SSH keep-alive...
DEBUG ssh: Sending SSH keep-alive...
DEBUG ssh: Sending SSH keep-alive...
DEBUG ssh: Sending SSH keep-alive...
DEBUG ssh: Sending SSH keep-alive...

(Note: I also provided this information as a comment on a related issue).

Observations and guesses

From what I understand after looking into the Vagrant source and the debug log, Vagrant identifies the machine as Debian, resolves the halt capability to the generic Linux implementation, and sends shutdown -h now with sudo via ssh to the VM. However, this ssh command seems to hang in channel.wait in the source code snippet listed below.

./embedded/gems/2.2.4/gems/vagrant-2.2.4/plugins/communicators/winssh/communicator.rb:

begin
  keep_alive = nil

  if @machine.config.ssh.keep_alive
    # Begin sending keep-alive packets while we wait for the script
    # to complete. This avoids connections closing on long-running
    # scripts.
    keep_alive = Thread.new do
      loop do
        sleep 5
        @logger.debug("Sending SSH keep-alive...")
        connection.send_global_request("keep-alive@openssh.com")
      end
    end
  end

  # Wait for the channel to complete
  begin
    channel.wait
  rescue Errno::ECONNRESET, IOError
    @logger.info(
      "SSH connection unexpected closed. Assuming reboot or something.")
    exit_status = 0
    pty = false
  rescue Net::SSH::ChannelOpenFailed
    raise Vagrant::Errors::SSHChannelOpenFail
  rescue Net::SSH::Disconnect
    raise Vagrant::Errors::SSHDisconnected
  end
ensure
  # Kill the keep-alive thread
  keep_alive.kill if keep_alive
end

None of the rescue clauses is ever executed, and Vagrant logs "Sending SSH keep-alive..." until terminated by ctrl-c.
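One way to keep a wait like channel.wait from blocking forever is to put a deadline around it. The following is a minimal sketch of that idea using Ruby's standard Timeout module. It is not what Vagrant does, and wait_with_deadline is a hypothetical helper name introduced only for illustration.

```ruby
require "timeout"

# Hypothetical helper (not part of Vagrant or net-ssh).
# Runs the given block and gives up after `seconds`.
# Returns [:completed, value] if the block finished in time,
# or [:timed_out, nil] if the deadline passed first.
def wait_with_deadline(seconds)
  [:completed, Timeout.timeout(seconds) { yield }]
rescue Timeout::Error
  [:timed_out, nil]
end
```

In the snippet above, the block would be the call to channel.wait, and a :timed_out result could be treated like the existing ECONNRESET case. Whether that is safe for the communicator is an open question, since Timeout interrupts the block at an arbitrary point.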

Even though the machine has shut down successfully, Vagrant does not seem to notice that the ssh connection is closed. There are other issues that go in a similar direction (e.g. here). One recommendation is to ensure that libpam-systemd is installed and that UsePAM yes is present in /etc/ssh/sshd_config. My machines meet this recommendation. Moreover, according to the detailed discussion about ssh connections not being terminated on restart / reboot, openssh (1:7.2p2-6) includes a fix that terminates ssh sessions cleanly if systemd doesn't do this itself.

To verify that ssh connections are indeed cleanly terminated, I connected to the running VM (private IP 192.168.121.29) via ssh in two ways:

  • Connection 1 (this corresponds to the ssh connection details provided by vagrant ssh-config)
ssh -o ProxyCommand="ssh -l <user> <serverip> nc -q0  %h %p" \
	-l vagrant -i ~/.vagrant.d/insecure_private_key 192.168.121.29
  • Connection 2 (uses the -J option, available in recent ssh implementations, to specify a jump host)
ssh -J  <user>@<serverip> \
	-l vagrant -i ~/.vagrant.d/insecure_private_key 192.168.121.29

After a shutdown, connection 2 immediately reports a broken pipe. In contrast, connection 1 hangs (and requires a ~. to reset).

I tried various netcat options in ssh.proxy_command (-q0, -N, -q0 -N). These options should make netcat quit after receiving EOF. I can't provide ssh -J ... via proxy_command (or, to be precise, I haven't found a way of doing so). So it seems that the problem could be related to the underlying Ruby SSH library. A workaround would be much appreciated, as this affects important use cases (given the market share of Debian).
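Since ssh -J is, as far as I know, a shortcut for a proxy command built on ssh's own stdio forwarding (-W host:port), one untested idea is to use -W in place of netcat. The sketch below only shows how such a setting could look in the Vagrantfile. jump_proxy_command is a hypothetical helper name, <user> and <serverip> are the placeholders from the Vagrantfile above, and nothing in this thread confirms that it fixes the hang.

```ruby
# Hypothetical helper (not part of Vagrant): builds a proxy command that
# uses ssh's stdio forwarding (-W) instead of netcat on the jump host.
# %h and %p are left in place for the ssh client to substitute.
def jump_proxy_command(user, host)
  "ssh -l #{user} -W %h:%p #{host}"
end

# Only evaluated when loaded by Vagrant (i.e. inside a Vagrantfile).
if defined?(Vagrant)
  Vagrant.configure("2") do |config|
    # Assumption: untested replacement for the nc-based proxy command.
    config.ssh.proxy_command = jump_proxy_command("<user>", "<serverip>")
  end
end
```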

@briancain
Member

Hey there @termonio - I haven't tried libvirt (which we do not officially support on this repo), but this is not an issue at all with the virtualbox and vmware providers. I tried these debian boxes and they were all halted successfully:

  • bento/debian-9.4
  • debian/stretch64
  • generic/debian9

I'm guessing it's possible that it has something to do with the libvirt provider, since the debug message seems to be coming from that plugin:

INFO guest: Execute capability: halt [#<Vagrant::Machine: default (VagrantPlugins::ProviderLibvirt::Provider)>] (debian)

@termonio
Author

It seems to me that the problem surfaces because, in my setup, the hypervisor and Vagrant are not running on the same machine and communication goes through proxied ssh. To reach the VM, the ssh connection uses the hypervisor machine as a jump host, via an ssh proxy command that uses netcat. This type of setup (like connection 1 above, even with the -q0 option) does not detect when the VM reboots or halts. This appears to be the reason why the exception block (on Errno::ECONNRESET, IOError) in the snippet above is never executed. The vagrant-libvirt line you are referring to just calls the :halt capability of the VM in Vagrant. If there is an ssh setup via a jump host that detects closed connections reliably, it would considerably broaden the range of use cases for Vagrant.
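The difference between the two cases can be illustrated without ssh at all. The following is a minimal sketch with plain Ruby pipes, not Vagrant or net-ssh code, and probe is a hypothetical helper name. A reader sees EOF only if the other end is actually closed. If an intermediary keeps its end open, the reader just sees silence, which is indistinguishable from a slow command.

```ruby
# Hypothetical helper: checks what a reader can observe within `timeout` seconds.
# Returns :eof    if the writing end has been closed,
#         :data   if bytes are available,
#         :silent if nothing happened (writer still open, nothing sent).
def probe(reader, timeout)
  ready = IO.select([reader], nil, nil, timeout)
  return :silent if ready.nil?

  chunk = reader.read_nonblock(1024, exception: false)
  return :eof if chunk.nil?
  return :silent if chunk == :wait_readable

  :data
end
```

In this picture, connection 2 behaves like a pipe whose writer gets closed when the VM goes away, while connection 1 behaves like a pipe whose writer stays open, so a blocking wait never returns.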

@DanHam

DanHam commented May 1, 2019

@termonio Please see my comment HERE for a possible fix.

I think this is the same issue as seen in #6207

@termonio
Author

termonio commented May 2, 2019

@DanHam: I did ensure that my machines have libpam-systemd installed and that UsePAM yes is present in /etc/ssh/sshd_config. Unfortunately, this does not fix the issue in the setup I described.

@smbambling

@terceiro I'm seeing this same behavior on CentOS 7 VMs as well. I'm also using proxycommand to start VMs on a remote libvirt server. Did you ever find a solution to this?
