Problem: QEMUVM killed before shutdown command #698
Merged
When running the controller as a systemd service (the normal use case in production) and stopping the service, the QEMU process was always killed BEFORE the shutdown command could be sent, so the VM could not clean up properly.

As an additional symptom, this error appeared and confused both developers and users:

```
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,187 | ERROR | Task exception was never retrieved
Sep 05 12:24:03 testing-hetzner python3[2468548]: future: <Task finished name='Task-3' coro=<QemuVM.stop() done, defined at /opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py:149> exception=QMPCapabilitiesError()>
Sep 05 12:24:03 testing-hetzner python3[2468548]: Traceback (most recent call last):
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 151, in stop
Sep 05 12:24:03 testing-hetzner python3[2468548]:     self.send_shutdown_message()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 141, in send_shutdown_message
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client = self._get_qmpclient()
Sep 05 12:24:03 testing-hetzner python3[2468548]:              ^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/aleph/vm/hypervisors/qemu/qemuvm.py", line 136, in _get_qmpclient
Sep 05 12:24:03 testing-hetzner python3[2468548]:     client.connect()
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 162, in connect
Sep 05 12:24:03 testing-hetzner python3[2468548]:     return self.__negotiate_capabilities()
Sep 05 12:24:03 testing-hetzner python3[2468548]:            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Sep 05 12:24:03 testing-hetzner python3[2468548]:   File "/opt/aleph-vm/qmp.py", line 88, in __negotiate_capabilities
Sep 05 12:24:03 testing-hetzner python3[2468548]:     raise QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: qmp.QMPCapabilitiesError
Sep 05 12:24:03 testing-hetzner python3[2468548]: 2024-09-05 12:24:03,285 | WARNING | Process terminated with 0
```

Solution: Use the mixed kill mode in systemd, which at first sends the TERM signal only to the main process, giving the VM time to clean up and shut down properly.

Note that sometimes the "shutdown" command is not acted upon, so stopping the process is still necessary. This seems to happen when the boot has not completed yet. A fallback kill is therefore done after a timeout.
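The stop sequence described here (request a shutdown, wait, fall back to a kill after a timeout) can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `stop_gracefully`, `request_shutdown`, `demo` and the timeout values are hypothetical names and numbers, and a sleeping Python subprocess stands in for a QEMU process that ignores the shutdown request.

```python
import asyncio
import sys


async def stop_gracefully(proc, request_shutdown, timeout=10.0):
    """Ask the guest to shut down, then kill the process if it is still alive.

    Returns True if the process exited on its own within `timeout` seconds,
    False if the fallback kill was needed.
    """
    try:
        await request_shutdown()
    except Exception as error:
        # The request itself may fail, for example while the guest is still booting.
        print(f"shutdown request failed: {error!r}")

    try:
        await asyncio.wait_for(proc.wait(), timeout=timeout)
        return True
    except asyncio.TimeoutError:
        # Fallback: the guest did not act on the request in time.
        proc.kill()
        await proc.wait()
        return False


async def demo():
    # Stand-in for a VM that never reacts to the shutdown request.
    proc = await asyncio.create_subprocess_exec(
        sys.executable, "-c", "import time; time.sleep(60)"
    )

    async def request_shutdown():
        # Hypothetical placeholder: a real controller would send the
        # shutdown command to the VM here.
        pass

    return await stop_gracefully(proc, request_shutdown, timeout=0.5)
```

Running `asyncio.run(demo())` exercises the fallback path, since the stand-in process never exits by itself. The sketch only works under systemd if the service manager leaves the child process alive long enough, which is what the mixed kill mode is for.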
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #698 +/- ##
==========================================
- Coverage   62.20%   62.19%   -0.02%
==========================================
  Files          69       69
  Lines        6073     6074       +1
  Branches      641      641
==========================================
  Hits         3778     3778
- Misses       2143     2144       +1
  Partials      152      152
hoh reviewed on Sep 13, 2024
packaging/aleph-vm/etc/systemd/system/aleph-vm-controller@.service (outdated)
hoh reviewed on Sep 13, 2024
Co-authored-by: Hugo Herter <git@hugoherter.com>
hoh approved these changes on Sep 13, 2024
Self proofreading checklist
Changes
see Solution section
How to test
To test this, the controller needs to run inside systemd.
This PR modifies the systemd unit. The easiest way is to copy the file
`packaging/aleph-vm/etc/systemd/system/aleph-vm-controller@.service`
to `/etc/systemd/system/aleph-vm-controller@.service`
and run `sudo systemctl daemon-reload`.
Then start a QEMU instance (confidential or not), then stop it (via the client or via systemd).
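Before starting an instance, it can help to confirm that the copied unit actually carries the new setting. The helper below is a rough sketch under stated assumptions: `SAMPLE_UNIT` is a hypothetical excerpt (only `KillMode=mixed` comes from this PR; the description and the `TimeoutStopSec` value are made up for illustration), and `service_settings` and `uses_mixed_kill_mode` are illustrative names. It relies on Python's `configparser`, which reads the INI-like syntax of simple unit files but is not a full systemd parser.

```python
import configparser

# Hypothetical excerpt of a unit file. Only KillMode=mixed is taken from the
# PR description; the other lines are assumptions for the example.
SAMPLE_UNIT = """\
[Unit]
Description=Example VM controller (hypothetical)

[Service]
KillMode=mixed
TimeoutStopSec=30
"""


def service_settings(unit_text):
    """Return the key/value pairs of the [Service] section as a dict."""
    # strict=False tolerates repeated keys; interpolation=None keeps
    # systemd specifiers such as %i untouched.
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.optionxform = str  # keep the case of systemd key names
    parser.read_string(unit_text)
    return dict(parser["Service"])


def uses_mixed_kill_mode(unit_text):
    """Tell whether the unit asks systemd for the mixed kill mode."""
    return service_settings(unit_text).get("KillMode") == "mixed"
```

To check the installed unit, read the text of `/etc/systemd/system/aleph-vm-controller@.service` and pass it to `uses_mixed_kill_mode`.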
Note
This PR supersedes #694.