Fix QEMU hanging silently on failed boot
Ticket: JIRA-344

Problem:
When QEMU failed to boot the disk image provided by the user, the QEMU process, and hence the controller, would hang indefinitely without showing an error message.
For example, we have had cases of users supplying an ext4 image meant for Firecracker instead of a QEMU disk image (this was facilitated by an oversight in the TypeScript SDK).

Analysis:

1. The boot process was not part of the logs or the process output (even inside the server), which is part of what made it hard to debug.
2. QEMU tries to boot via the network even when that is useless.
3. After all boot methods fail, the QEMU process, and thus the controller, keeps running indefinitely.

Solution:
Change the QEMU options:
-nographic makes it output the boot process on the standard output (and thus in the logs)
-boot order=c boots only from the first hard drive (not sure if this actually works)
-boot reboot-timeout=1 makes it reboot if it fails to boot, but since we have -no-reboot the process just stops (the default is -1, no reboot)
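
As a rough illustration of how these options fit together, here is a minimal sketch in Python. The helper names `boot_failure_args` and `build_command` are hypothetical and are not part of the aleph-vm code. Only the flags themselves come from this commit, and the rest of the real command line (drives, QMP socket, memory, and so on) is deliberately left to the caller.

```python
def boot_failure_args() -> list[str]:
    """Return the QEMU flags this commit relies on to stop on a failed boot.

    Hypothetical helper for illustration only; the real code builds one
    larger argument list inside `start()`.
    """
    return [
        # Send the serial console to the process stdout so it ends up in the logs
        "-serial",
        "stdio",
        # Disable the graphical display; per the commit, the boot output
        # is not shown on stdout without it
        "-nographic",
        # Exit instead of rebooting the guest
        "-no-reboot",
        # Boot from the first hard drive only, and request a reboot shortly
        # after a failed boot, which -no-reboot turns into an exit
        "-boot",
        "order=c,reboot-timeout=1",
    ]


def build_command(qemu_path: str, other_args: list[str]) -> list[str]:
    """Assemble a full argv: binary, caller-supplied args, then the boot flags."""
    return [qemu_path, *other_args, *boot_failure_args()]
```

With this sketch, `build_command("qemu-system-x86_64", [...])` returns a list that could be handed to a subprocess call. Whether `order=c` really suppresses the network boot attempt is left open here, as in the commit message.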
olethanh committed Jan 6, 2025
1 parent bcdf0c0 commit df98ec6
Showing 2 changed files with 16 additions and 3 deletions.
7 changes: 7 additions & 0 deletions src/aleph/vm/hypervisors/qemu/qemuvm.py
@@ -102,6 +102,13 @@ async def start(
# Tell to put the output to std fd, so we can include them in the log
"-serial",
"stdio",
# nographics. Seems redundant with -serial stdio but without it the boot process is not displayed on stdout
"-nographic",
# Boot
# order=c only first hard drive
# reboot-timeout in combination with -no-reboot, makes it so qemu stop if there is no bootable device
"-boot",
"order=c,reboot-timeout=1",
# Uncomment for debug
# "-serial", "telnet:localhost:4321,server,nowait",
# "-snapshot", # Do not save anything to disk
12 changes: 9 additions & 3 deletions src/aleph/vm/hypervisors/qemu_confidential/qemuvm.py
@@ -87,12 +87,18 @@ async def start(
"-qmp",
f"unix:{self.qmp_socket_path},server,nowait",
# Tell to put the output to std fd, so we can include them in the log
"-nographic",
"-serial",
"stdio",
"--no-reboot", # Rebooting from inside the VM shuts down the machine
"-S",
# nographics. Seems redundant with -serial stdio but without it the boot process is not displayed on stdout
"-nographic",
# Boot
# order=c only first hard drive
# reboot-timeout in combination with -no-reboot, makes it so qemu stop if there is no bootable device
"-boot",
"order=c,reboot-timeout=1",
# Confidential options
# Do not start CPU at startup, we will start it via QMP after injecting the secret
"-S",
"-object",
f"sev-guest,id=sev0,policy={self.sev_policy},cbitpos={sev_info.c_bit_position},"
f"reduced-phys-bits={sev_info.phys_addr_reduction},"