Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem: QEMUVM killed before shutdown command #698

Merged
merged 3 commits into from
Sep 13, 2024
Merged

Conversation

olethanh
Copy link
Collaborator

@olethanh olethanh commented Sep 12, 2024

When running the controller as a systemd service (the normal usecase in prod) and stopping the service
the QEMU process was always killed BEFORE the shutdown command could be sent
so the VM could not properly clean up

As an additional symptom this error appeared and confused dev and user

Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,187 | ERROR | Task exception was never retrieved
Sep 05 12:24:03 testing-hetzner python3[2468548]: future: <Task finished name='Task-3' coro=<QemuVM.stop() done, defined at /opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py:149> exception=QMPCapabilitiesError()>
Sep 05 12:24:03 testing-hetzner python3[2468548]: Traceback (most recent call last):
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 151, in stop
Sep 05 12:24:03 testing-hetzner python3[2468548]:     self.send_shutdown_message()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 141, in send_shutdown_message
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client = self._get_qmpclient()
Sep 05 12:24:03 testing-hetzner python3[2468548]:              ^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 136, in _get_qmpclient
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client.connect()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 162, in connect
Sep 05 12:24:03 testing-hetzner python3[2468548]:     return self.__negotiate_capabilities()
Sep 05 12:24:03 testing-hetzner python3[2468548]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 88, in __negotiate_capabilities
Sep 05 12:24:03 testing-hetzner python3[2468548]:     raise QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: qmp.QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,285 | WARNING | Process terminated with 0

Solution: Use mixed kill mode in Systemd, which will at first only send
the term signal to the main process, and give the VM time to properly
cleanup and shutdown.

Note that some time the "shutdown" error is not acted upon so stoping
the process is still necessary. It seems to happend when the boot is not
completed yet.

So a fallback kill is done after a timeout.

Self proofreading checklist

  • The new code clear, easy to read and well commented.
  • New code does not duplicate the functions of builtin or popular libraries.
  • [na ] New classes and functions contain docstrings explaining what they provide.
  • [na] All new code is covered by relevant tests.
  • Documentation has been updated regarding these changes.

Changes

see Solution section

How to test

To test this you need to have the controller run inside systemd.
This PR modify the systemd unit, easiest way is to copy the file packaging/aleph-vm/etc/systemd/system/aleph-vm-controller@.service inside /etc/systemd/system/aleph-vm-controller@.service and run sudo systemcl deamon-reload

Then start a QEMU instance (confidential or not), then stop it (via the client or via systemd)

Note

this PR superseed #694

When running the controller as a systemd service (the normal usecase in prod) and stopping the service
the QEMU process was always killed BEFORE the shutdown command could be sent
so the VM could not properly clean up

As an additional symptom this error appeared and confused dev and user
```
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,187 | ERROR | Task exception was never retrieved
Sep 05 12:24:03 testing-hetzner python3[2468548]: future: <Task finished name='Task-3' coro=<QemuVM.stop() done, defined at /opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py:149> exception=QMPCapabilitiesError()>
Sep 05 12:24:03 testing-hetzner python3[2468548]: Traceback (most recent call last):
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 151, in stop
Sep 05 12:24:03 testing-hetzner python3[2468548]:     self.send_shutdown_message()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 141, in send_shutdown_message
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client = self._get_qmpclient()
Sep 05 12:24:03 testing-hetzner python3[2468548]:              ^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 136, in _get_qmpclient
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client.connect()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 162, in connect
Sep 05 12:24:03 testing-hetzner python3[2468548]:     return self.__negotiate_capabilities()
Sep 05 12:24:03 testing-hetzner python3[2468548]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 88, in __negotiate_capabilities
Sep 05 12:24:03 testing-hetzner python3[2468548]:     raise QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: qmp.QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,285 | WARNING | Process terminated with 0
```

Solution: Use mixed kill mode in Systemd, which will at first only send
the term signal to the main process, and give the VM time to properly
cleanup and shutdown.

Note that some time the "shutdown" error is not acted upon so stoping
the process is still necessary. It seems to happend when the boot is not
completed yet.

So a fallback kill is done after a timeout.
@olethanh olethanh requested review from hoh and nesitor and removed request for hoh September 12, 2024 14:41
Copy link

codecov bot commented Sep 12, 2024

Codecov Report

Attention: Patch coverage is 0% with 1 line in your changes missing coverage. Please review.

Project coverage is 62.19%. Comparing base (7ee7d04) to head (4c171d0).
Report is 1 commits behind head on main.

Files with missing lines Patch % Lines
src/aleph/vm/controllers/__main__.py 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #698      +/-   ##
==========================================
- Coverage   62.20%   62.19%   -0.02%     
==========================================
  Files          69       69              
  Lines        6073     6074       +1     
  Branches      641      641              
==========================================
  Hits         3778     3778              
- Misses       2143     2144       +1     
  Partials      152      152              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@olethanh olethanh requested a review from hoh September 13, 2024 09:14
Co-authored-by: Hugo Herter <git@hugoherter.com>
@hoh hoh merged commit eb2863d into main Sep 13, 2024
20 of 22 checks passed
@hoh hoh deleted the ol-qemu-clean-shutdown branch September 13, 2024 16:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants