Support live migration of VMs with attached volumes #12694

Closed
benoitjpnet opened this issue Dec 15, 2023 · 13 comments · Fixed by #13823

@benoitjpnet

I have the following cluster:

root@mc10:~# lxc cluster ls
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| NAME |            URL            |      ROLES      | ARCHITECTURE | FAILURE DOMAIN | DESCRIPTION | STATE  |      MESSAGE      |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc10 | https://192.168.1.10:8443 | database-leader | x86_64       | default        |             | ONLINE | Fully operational |
|      |                           | database        |              |                |             |        |                   |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc11 | https://192.168.1.11:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
| mc12 | https://192.168.1.12:8443 | database        | x86_64       | default        |             | ONLINE | Fully operational |
+------+---------------------------+-----------------+--------------+----------------+-------------+--------+-------------------+
root@mc10:~# 

I start one VM:

lxc launch ubuntu:22.04 v1 --vm --target mc10

I move it:

root@mc10:~# lxc exec v1 -- uptime
 11:45:17 up 0 min,  0 users,  load average: 0.59, 0.13, 0.04
root@mc10:~# 

root@mc10:~# lxc move v1 --target mc11
Error: Instance move to destination failed: Error transferring instance data: Failed migration on target: Failed getting migration target filesystem connection: websocket: bad handshake
root@mc10:~# 
@roosterfish
Contributor

Hi @benoitjpnet, it looks like live migration isn't yet enabled on your cluster. You can confirm by checking the LXD daemon's error logs using journalctl -u snap.lxd.daemon.
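
(For reference, a quick way to inspect those logs on each cluster member might look like this; the time window and the grep filter are just an example:)

journalctl -u snap.lxd.daemon --since "1 hour ago" | grep -i error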

@roosterfish roosterfish added the Incomplete Waiting on more information from reporter label Jan 2, 2024
@benoitjpnet
Author

The only error I see is:

Jan 02 08:44:47 mc10 lxd.daemon[2134]: time="2024-01-02T08:44:47Z" level=error msg="Failed migration on target" clusterMoveSourceName=builder err="Failed getting migration target filesystem connection: websocket: bad handshake" instance=builder live=true project=default push=false

A more explicit error message here would help.

Thank you though; I re-read the documentation and had missed this step:

Set migration.stateful to true on the instance.
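
(For reference, that setting can be applied with something like the command below; v1 is the instance name from the example above, and the instance may need a restart for the change to take effect.)

lxc config set v1 migration.stateful true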

Then I run lxc move v1 --target mc10, but it gets stuck. I guess that is not related to MicroCloud though.

@roosterfish
Contributor

Can you check the logs on both ends (source and target host)? One of them should indicate that migration has to be enabled in the config.

@benoitjpnet
Author

Concerning the stuck part:

Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=builder_var_lib_laminar driver=disk err="Stateful migration unsupported" instance=builder project=default
Jan 02 08:50:50 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=builder instanceType=virtual-machine project=default
Jan 02 08:50:51 mc10 lxd.daemon[2134]: time="2024-01-02T08:50:51Z" level=warning msg="Failed reading from state connection" err="read tcp 192.168.1.10:57884->192.168.1.11:8443: use of closed network connection" instance=builder instanceType=virtual-machine project=default

I use Ceph RBD + CephFS and it seems CephFS is not supported for live migration :(

@benoitjpnet
Author

Can you check the logs on both ends (source and target host)? One of them should indicate that migration has to be enabled in the config.

I was not able to find such logs/messages.

@roosterfish
Contributor

I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?

@tomponline this looks to be an error on the LXD side when migrating VMs that have a CephFS volume attached. Should we block migration of VMs with attached volumes? At least the error from QEMU below suggests that this is not supported.
Is that the reason why the DiskVMVirtiofsdStart function returns Stateful migration unsupported?

On the source host you can see the following log messages:

Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:21 m2 lxd.daemon[7034]: time="2024-01-02T13:11:21Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
...
Jan 02 13:11:50 m2 lxd.daemon[7034]: time="2024-01-02T13:11:50Z" level=error msg="Failed migration on source" clusterMoveSourceName=v1 err="Failed starting state transfer to target: Migration is disabled when VirtFS export path 'NULL' is mounted in the guest using mount_tag 'lxd_vol'" instance=v1 live=true project=default push=false

On the target side:

Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for device, using 9p as a fallback" device=vol driver=disk err="Stateful migration unsupported" instance=v1 project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Unable to use virtio-fs for config drive, using 9p as a fallback" err="Stateful migration unsupported" instance=v1 instanceType=virtual-machine project=default
Jan 02 13:11:50 m1 lxd.daemon[4537]: time="2024-01-02T13:11:50Z" level=warning msg="Failed reading from state connection" err="read tcp 10.171.103.8:38154->10.171.103.138:8443: use of closed network connection" instance=v1 instanceType=virtual-machine project=default

@benoitjpnet
Author

I was able to reproduce the warnings including the hanging migration. I guess you have added a new CephFS storage pool to the MicroCloud cluster and attached one of its volumes to the v1 instance which you are trying to migrate?

Correct.
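
(For anyone reproducing this, a minimal sketch of the setup being described; the pool name cephfs, the CephFS filesystem name lxd_cephfs, the volume name vol, and the mount path are assumptions:)

lxc storage create cephfs cephfs source=lxd_cephfs
lxc storage volume create cephfs vol
lxc storage volume attach cephfs vol v1 vol /mnt/vol
lxc move v1 --target mc11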

@tomponline tomponline transferred this issue from canonical/microcloud Jan 3, 2024
@tomponline tomponline changed the title Live migration of VM is not working Live migration of VM with attached volume is not working Jan 3, 2024
@tomponline
Member

Thanks @roosterfish @benoitjpnet I have moved this to LXD for triaging.

@benoitjpnet can you confirm that live migration works if there is no volume attached?

@tomponline tomponline added Incomplete Waiting on more information from reporter and removed Incomplete Waiting on more information from reporter labels Jan 3, 2024
@benoitjpnet
Author

Yes it works.

root@mc10:~# lxc launch ubuntu:22.04 v1 --vm --target mc10 -d root,size=10GiB -d root,size.state=4GiB -c limits.memory=4GiB -c limits.cpu=4 -c migration.stateful=true
Creating v1
Starting v1
root@mc10:~# lxc exec v1 -- uptime
 13:10:21 up 0 min,  0 users,  load average: 0.74, 0.19, 0.06
root@mc10:~# lxc move v1 --target mc11
root@mc10:~# lxc exec v1 -- uptime
 13:10:47 up 0 min,  0 users,  load average: 0.49, 0.17, 0.06
root@mc10:~# 

@tomponline tomponline removed the Incomplete Waiting on more information from reporter label Jan 3, 2024
@tomponline
Member

@MusicDin please can you evaluate what happens when trying to migrate (both live and non-live modes) a VM with custom volumes attached (filesystem and block types) and identify what does and doesn't work.

I suspect we will need quite a bit of work to add support for live-migrating custom block volumes in remote storage, and that live migrating of VMs with custom local volumes isn't going to work either.

So we are likely going to need to land an improvement to detect incompatible scenarios and return a clear error message, and then potentially add a work item for a future roadmap to improve migration support of custom volumes.
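
(For reference on the live vs. non-live evaluation mentioned above, the two modes might be exercised like this, using the host and instance names from the earlier examples; a stopped instance is moved cold, while a running instance with migration.stateful=true is moved live:)

# non-live (cold) migration
lxc stop v1
lxc move v1 --target mc11
lxc start v1

# live migration of the running instance
lxc move v1 --target mc11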

@tomponline tomponline added the Bug Confirmed to be a bug label Jan 8, 2024
@tomponline
Member

#12733 improves the error the user sees in this situation.

@tomponline tomponline added Feature New feature, not a bug and removed Bug Confirmed to be a bug labels Feb 21, 2024
@tomponline tomponline added this to the later milestone Feb 21, 2024
@tomponline tomponline changed the title Live migration of VM with attached volume is not working Support live migration of VMs with attached volumes Feb 21, 2024
@tomponline
Member

Seems relevant: lxc/incus#686

@tomponline
Member

Hi @boltmark, as you're working on some migration work related to #13695, I thought it would also be a good opportunity for you to take a look at fixing this issue, considering lxc/incus#686.
