No Qubes/VMs starting - libxenlight failed to create new-domain in 4.2.1 #9150

Closed
scallyob opened this issue Apr 24, 2024 · 32 comments
Labels
affects-4.2 This issue affects Qubes OS 4.2. C: Xen diagnosed Technical diagnosis has been performed (see issue comments). P: major Priority: major. Between "default" and "critical" in severity. R: declined Resolution: While a legitimate bug or proposal, it has been decided that no action will be taken. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists.

Comments

@scallyob

scallyob commented Apr 24, 2024

WARNING: This issue only affects very old AMD CPUs that are no longer supported by AMD. Some workarounds listed below have severe security consequences; do not apply them unless you really understand all the implications!

Qubes OS release

4.2.1

Brief summary

I did a full update on April 23 and rebooted April 24. First reboot since April 1.
Now no Qubes/VMs will start.

Steps to reproduce

  1. Update dom0 and all VMs.
  2. Reboot.

Expected behavior

VMs set to autostart start up

Actual behavior

"libxenlight failed to create new-domain" pop-ups for sys-net, sys-firewall, etc.

qvm-ls - shows all VMs halted

/var/log/libvirt/libxl/libxl-driver.log shows:

libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 1: Stubdom 2 for 1 startup: startup timed out
libxl: libxl_create.c:1975: domcreate_devmodel_started: Domain 1:device model did not start -9

This repeats many times with the Domain # changing

(There are also errors related to a PCI device, but these are present for previous boots as well and are not new. The errors above did not appear in the log until today.)
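
For anyone checking whether their own log shows the same failure, the timeout message quoted above can be picked out mechanically. This is only a minimal Python sketch: the regular expression assumes the exact message wording from this report, and the function name `timed_out_domains` is made up for illustration (it is not part of libxl or Qubes).

```python
import re

# Matches the stubdomain timeout message exactly as quoted in this report.
# Other libxl versions may word the message differently (assumption).
STUBDOM_TIMEOUT = re.compile(
    r"stubdom_xswait_cb: Domain (\d+): Stubdom (\d+) for \d+ startup: "
    r"startup timed out"
)


def timed_out_domains(log_text):
    """Return (domain_id, stubdom_id) pairs, one per stubdomain startup timeout."""
    return [
        (int(match.group(1)), int(match.group(2)))
        for match in STUBDOM_TIMEOUT.finditer(log_text)
    ]
```

To use it, read /var/log/libvirt/libxl/libxl-driver.log in dom0 and pass the text to `timed_out_domains`. One pair per failed start attempt would be consistent with the repeated errors described here.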

@scallyob scallyob added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug Type: bug report. A problem or defect resulting in unintended behavior in something that exists. labels Apr 24, 2024
@krystian-hebel

krystian-hebel commented Apr 24, 2024

Good thing I refreshed the list of issues, I was about to report the same. I can confirm that VMs were restarted right after the upgrade and they worked until the following boot.

I also checked that after removing all network controllers from sys-net and changing its type to PVH all VMs can be started (obviously without networking), so this seems to be a problem with HVM.

This happened on KGPE-D16 with Opteron 6282 SE, with ASUS firmware 3001 (i.e. no coreboot).

@scallyob
Author

KGPE-D16 with Opteron 6282 SE

Oh no, I'm on same hardware!

@scallyob
Author

Kernel downgrade in dom0 and sys-net did not produce different results.

@marmarek
Member

What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

@scallyob
Author

There's a lot in there, not sure what to look for. Don't see errors or warnings.
(talked about this here: https://forum.qubes-os.org/t/no-qubes-vms-boot-after-latest-updates/26033/14)

@marmarek
Member

Errors can be buried quite deep there... look for anything after starting qemu.

@scallyob
Author

I see it start qemu, with a bunch of options, each time it tries to boot. Nothing obvious to me after that that seems like a problem.

@krystian-hebel

What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

For me it ends with:

[image: end of guest-sys-net-dm.log, not preserved]

If the full log may contain something useful, I can try to get it tomorrow; without networking it will require some finesse.

@marmarek
Member

Indeed nothing obvious there... But one worrying thing is the timing: the "Rescanning PCI Frontend" messages are on stubdomain cleanup, and based on timestamps it's pretty close to starting qemu. AFAIR the startup timeout is 10s, but usually the stubdomain startup takes below 1s. This is a pretty old system; it may be that recent workarounds for speculative-execution bugs made it significantly slower.

Is it with current-testing (in dom0) enabled or not? Best to identify which update specifically broke it. dnf history may help, but my guess is the Xen package. There has also been a Xen update in current-testing since yesterday; maybe this one will help?

@krystian-hebel

Is it with current-testing (in dom0) enabled or not?

No, just default ones.

I also think this may be caused by Xen; that would explain why the initial restart of VMs succeeded and only after a full reboot they won't come up. I will check different versions later.

@scallyob
Author

Is it with current-testing (in dom0) enabled or not?

No.

Tried downgrading to: xen-hvm-stubdom-linux-4.2.9-1.fc37.x86_64.rpm AND xen-hvm-stubdom-linux-full-4.2.9-1.fc37.x86_64.rpm

No change.

Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:

The operation would result in removing the following protected packages: qubes-core-dom0

So not sure how to proceed.

@marmarek
Member

Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:

The operation would result in removing the following protected packages: qubes-core-dom0

So not sure how to proceed.

The xen package needs to match the exact version of python3-xen, so you need to downgrade that one too.

@scallyob
Author

scallyob commented Apr 24, 2024

Ok, that allowed the downgrade, now everything is booting.

CORRECTION: everything seems to boot EXCEPT for sys-usb

start failed: Timed out during operation: cannot acquire state change lock (held by monitor=shutdown-event-20) [Fixed: caused by switching from HVM to PV]

@andrewdavidwong andrewdavidwong added C: Xen P: major Priority: major. Between "default" and "critical" in severity. needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. affects-4.2 This issue affects Qubes OS 4.2. and removed P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. labels Apr 25, 2024
@krystian-hebel

The problem is still present in 4.17.4-1 from current-testing.

@krystian-hebel

krystian-hebel commented Apr 26, 2024

All VMs start when booting with spec-ctrl=no-ibpb-entry, which makes me believe this is a performance issue that may have been present for some time, but hidden until XSA-455.

I've noticed the same problem (at least I think it's the same, but I didn't do as much testing as on the KGPE) on an HP t630 with a slightly newer CPU. In that case, sys-usb was failing, which didn't allow me to log in, and I didn't have a PS/2 keyboard at hand. spec-ctrl=no-ibpb-entry helped in that case as well. (The KGPE doesn't have sys-usb because the OS is installed on a USB drive.)

@marmarek is it possible to relax the startup timeout to see if it helps?

@marmarek
Member

The timeout is hardcoded in libxl (look for LIBXL_STUBDOM_START_TIMEOUT), to 30s. If 30s is not enough to start, I doubt relaxing it will result in a working system (even if it will start, the stubdomain will likely be too slow for sys-net/sys-usb to work at all...)

@marmarek
Member

So, I'm afraid there is not much hope for this old-ish system... The only way to make the system kinda-usable has a tradeoff with security here, by disabling the mitigation for PV domains (which should mean just stubdomains, make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv. It does mean that a stubdomain will be able to mount the attack, potentially leaking memory of any other VM (so isolation of sys-net/sys-usb and any other HVM becomes weaker). If that is not an acceptable risk, blame AMD for making a buggy CPU, and replace it with something newer...

@andyhhp

andyhhp commented Apr 26, 2024

Yeah sorry... you need https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf and the update from Feb this year in order to have an AMD CPU not needing this mitigation for safety

@scallyob
Author

by disabling the mitigation for PV domains (which should mean just stubdomains, make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv

  1. So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)
  2. For someone who doesn't know how to do this, how hard is this to do?

Thanks for looking into this, despite the disappointing conclusion.

@RA-Kooi

RA-Kooi commented Apr 26, 2024

  1. So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)

Yes. By not updating Xen you remain exposed to vulnerabilities in Xen itself as well as to this CPU bug.

  2. For someone who doesn't know how to do this, how hard is this to do?

Depending on how Xen is started it could differ, but chances are there's a file called xen.cfg in /boot. In this file you will find something like this:

[xen]
options=

Simply append spec-ctrl=ibpb-entry=no-pv to the end of the options line.

@andrewdavidwong andrewdavidwong added diagnosed Technical diagnosis has been performed (see issue comments). R: declined Resolution: While a legitimate bug or proposal, it has been decided that no action will be taken. and removed needs diagnosis Requires technical diagnosis from developer. Replace with "diagnosed" or remove if otherwise closed. labels Apr 26, 2024

This issue has been closed as "declined." This means that the issue describes a legitimate bug (in the case of bug reports) or proposal (in the case of enhancements and tasks), and it is actionable, at least in principle. Nonetheless, it has been decided that no action will be taken on this issue. Here are some examples of reasons why an issue may be declined:

  • No solution can be found.
  • The proposed action is not possible.
  • The proposed action would weaken security to an unacceptable degree.
  • The proposed action would be too costly (in time, money, or other resources) relative to the benefits it would provide.
  • The proposed action would make some things better while making other things worse, and the trade-off is not worthwhile.

These are just general examples. If the specific reason for this particular issue being declined has not already been provided, please feel free to leave a comment below asking for an explanation.

We respect the time and effort you have taken to file this issue, and we understand that this outcome may be unsatisfying. Please accept our sincere apologies and know that we greatly value your participation and membership in the Qubes community.

If anyone reading this believes that this issue was closed in error or that the resolution of "declined" is not accurate, please leave a comment below saying so, and the Qubes team will review this issue again. For more information, see How issues get closed.

@github-actions github-actions bot closed this as not planned Apr 26, 2024
@scallyob
Author

scallyob commented May 4, 2024

Depending on how Xen is started it could differ, but chances are there's a file called xen.cfg in /boot. In this file you will find something like this:

[xen]
options=

Simply append spec-ctrl=ibpb-entry=no-pv to the end of the options line.

I do not have xen.cfg in /boot (or any of its subdirectories)
I have xen-4.17.3.config, which appears to not be the same thing and says "do not edit". Any other suggestions on how to accomplish this?

@Tonux599

Tonux599 commented May 5, 2024

@scallyob you can add

GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT spec-ctrl=ibpb-entry=no-pv"

to the end of /etc/default/grub and update with grub2-mkconfig -o /boot/grub2/grub.cfg and then reboot.
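
The edit to /etc/default/grub described above could be sketched as a small, repeatable text transformation. This is only an illustration: the helper name `add_xen_option` is hypothetical, it works on the file's text rather than touching the disk, and it assumes the option string and variable name exactly as quoted in this thread. Applying it still carries the security trade-off discussed earlier.

```python
# Option string taken verbatim from this thread.
XEN_OPTION = "spec-ctrl=ibpb-entry=no-pv"


def add_xen_option(grub_defaults, option=XEN_OPTION):
    """Return the text of /etc/default/grub with `option` appended to the
    Xen command line via GRUB_CMDLINE_XEN_DEFAULT.

    The text is returned unchanged when the option is already present,
    so running the helper twice does not add a duplicate line.
    """
    if option in grub_defaults:
        return grub_defaults
    line = f'GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT {option}"'
    # Make sure the new assignment starts on its own line.
    if grub_defaults and not grub_defaults.endswith("\n"):
        grub_defaults += "\n"
    return grub_defaults + line + "\n"
```

After writing the returned text back to /etc/default/grub, the configuration still has to be regenerated with grub2-mkconfig and the machine rebooted, as described in the comment.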

This is a really unfortunate situation as the KGPE-D16 is the most powerful, binary blob free (when used with Libreboot, old Coreboot, or Dasharo) system that supports Qubes.

I understand that AMD is unlikely to push any microcode updates for these CPUs to aid fixing this, so I would appreciate it if we could further seek any possible resolutions.

@marmarek could we potentially seek increasing LIBXL_STUBDOM_START_TIMEOUT beyond 30 seconds to see if performance is not degraded once it starts? Qubes with PCI assignments have always been slow to start but it's generally once per boot.

Failing that, would it be best to have PCI Qubes use PV with spec-ctrl=ibpb-entry=no-pv, or use HVM with spec-ctrl=ibpb-entry=no-hvm?
edit: sorry I made a presumption here. HVM PCI Qubes still fail with spec-ctrl=ibpb-entry=no-hvm.

Until #4318 is completed this system is all we got if you want Qubes on blob free firmware.

@andyhhp

andyhhp commented May 8, 2024

IBPB (Indirect Branch Prediction Barrier) is the thing AMD retrofitted in microcode for Spectre-v2 defences.

It is very expensive. Sadly it's also the only protection against Branch Type Confusion (BTC, marketed as Retbleed), and Speculative Return Stack Overflow (SRSO, marketed as Inception).

Even in CPUs newer than the Spectre discovery, it's still expensive, and that's with all the pipeline improvements that the CPU vendors could bear to put in.

The performance will be degraded for the lifetime of the VMs. IBPB is issued on every entry into Xen, so that's every interrupt/vmexit (HVM guests) or every syscall/pagefault/etc (PV guests).

@thedeadliestcatch

This is affecting Intel too, on a recent NUC system FYI running a 13th gen CPU.

@andyhhp

andyhhp commented May 15, 2024

This is affecting Intel too, on a recent NUC system FYI running a 13th gen CPU.

This specific issue really isn't affecting Intel systems. If you're seeing similar symptoms, it will be a different cause. Please open a new bug.

@Tonux599

Tonux599 commented Oct 9, 2024

@marmarek could we potentially seek increasing LIBXL_STUBDOM_START_TIMEOUT beyond 30 seconds to see if performance is not degraded once it starts? Qubes with PCI assignments have always been slow to start but it's generally once per boot.

@marmarek could this still be tried? There is still a community that values blob-free firmware and would like to use Qubes. Even if increasing this timeout results in a very slow system, it should be up to the user whether they want to accept that trade-off.

@marmarek
Member

marmarek commented Oct 9, 2024

No, I don't see any sense in that. If something that normally takes about 1s doesn't complete in 30s, it doesn't sound like a usable system at all. It means pretty much everything will be 3000% slower. On the other hand, increasing the timeout will also affect users of otherwise perfectly usable systems, as in case of some errors they will need to wait longer.

@Tonux599

Tonux599 commented Oct 9, 2024

No, I don't see any sense in that. If something that normally takes about 1s doesn't complete in 30s, it doesn't sound like a usable system at all. It means pretty much everything will be 3000% slower. On the other hand, increasing the timeout will also affect users of otherwise perfectly usable systems, as in case of some errors they will need to wait longer.

Thank you for your response.

@arhabd

arhabd commented Oct 9, 2024

sad day for kgpe-d16

@krystian-hebel

Some numbers from my HP t630:

  • Booted with qubes.skip_autostart to avoid starting sys-net and sys-usb simultaneously, as it always fails.
  • Manually starting sys-net afterwards in qube manager usually succeeds within 30s timeout.
  • Starting Firefox and letting it fully render Fedora start page (counted from clicking on Firefox icon until "Latest Council Video" thumbnail) on already started sys-net (HVM) on otherwise idle machine takes ~20 minutes. Out of that, rendering takes at least 15 minutes. For comparison, on non-Qubes laptop loading and rendering takes ~5 seconds.
  • Doing the same in personal (PVH) is "much faster", it takes roughly 65 seconds. I haven't checked how much of that is waiting for sys-firewall/sys-net to provide data.

So while increasing LIBXL_STUBDOM_START_TIMEOUT could help those VMs get up in time, they wouldn't be useful (maybe except for some CPU-bound benchmarks, but that's not what HVM is used for).
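
To put the measurements above in proportion, here is a small worked calculation in Python. The timings are the approximate figures from the list. The baseline comes from a different, non-Qubes machine, so the resulting factors are rough indications rather than a controlled benchmark. The names are chosen for illustration only.

```python
def slowdown(measured_s, baseline_s):
    """Return how many times longer `measured_s` is than `baseline_s`."""
    return measured_s / baseline_s


# Approximate timings reported above, in seconds.
BASELINE_S = 5      # non-Qubes laptop: load and render the page
HVM_S = 20 * 60     # sys-net (HVM): about 20 minutes
PVH_S = 65          # personal (PVH): roughly 65 seconds

hvm_factor = slowdown(HVM_S, BASELINE_S)  # 240.0
pvh_factor = slowdown(PVH_S, BASELINE_S)  # 13.0
```

On these figures the HVM qube is about 240 times slower than the baseline and the PVH qube about 13 times slower, which is why a longer startup timeout alone would not make the HVM qubes practical.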

@andyhhp

andyhhp commented Oct 28, 2024

Sorry, but this is the cost of keeping your VM's secrets secret on buggy hardware.

spec-ctrl=no-ibpb-entry will make it work again, but with the consequence that you're disabling the protection for Branch Type Confusion (CVE-2022-23825) and Speculative Return Stack Overflow (CVE-2023-20569).

If you've risk assessed, and decided this is acceptable, then fine. But if you care about protecting against these attacks, then the IBPB is needed.
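
When assessing that risk, it helps to confirm which spec-ctrl settings the running hypervisor was actually booted with. The sketch below is an assumption-laden illustration: it expects the text printed by `xl info` in dom0, with the Xen command line on a line labelled `xen_commandline`, and the function name `spec_ctrl_settings` is hypothetical. It only extracts the option values and does not interpret what they mean.

```python
def spec_ctrl_settings(xl_info_output):
    """Return the value of every spec-ctrl= option on the Xen command line.

    `xl_info_output` is the text printed by `xl info`; the command line is
    assumed to be on the line whose key is `xen_commandline`.
    """
    prefix = "spec-ctrl="
    for line in xl_info_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "xen_commandline":
            return [
                token[len(prefix):]
                for token in value.split()
                if token.startswith(prefix)
            ]
    return []
```

An empty list means no spec-ctrl option was found on the command line, so Xen's default mitigations would apply. A result such as `["no-ibpb-entry"]` or `["ibpb-entry=no-pv"]` means one of the workarounds from this thread is active.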
