No Qubes/VMs starting - libxenlight failed to create new-domain in 4.2.1 #9150

scallyob · 2024-04-24T15:32:50Z

WARNING: This issue only affects very old AMD CPUs that are no longer supported by AMD. Some workarounds listed below have severe security consequences; do not apply them unless you really understand all the implications!

Qubes OS release

4.2.1

Brief summary

I did a full update on April 23 and rebooted April 24. First reboot since April 1.
Now no Qubes/VMs will start.

Steps to reproduce

Update dom0 and all VMs.
Reboot.

Expected behavior

VMs set to autostart start up

Actual behavior

"libxenlight failed to create new-domain" pop ups for sys-net, sys-firewall, etc

qvm-ls - shows all VMs halted

/var/log/libvirt/libxl/libxl-driver.log shows:

libxl: libxl_dm.c:2857:stubdom_xswait_cb: Domain 1: Stubdom 2 for 1 startup: startup timed out
libxl: libxl_create.c:1975: domcreate_devmodel_started: Domain 1:device model did not start -9

This repeats many times with the Domain # changing

(there are also errors related to PCI device, but these are present for previous boots as well and are not new. The errors above do not appear until today in the log.)
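To see how often the stubdomain timeout above occurs, and for which domain IDs, the log can be filtered. This is a minimal sketch: the function names are made up for illustration, and the match strings are taken from the two log lines quoted above.

```shell
# Count stubdomain startup timeouts in a libxl driver log.
# Usage: count_stubdom_timeouts /var/log/libvirt/libxl/libxl-driver.log
count_stubdom_timeouts() {
    # grep -c prints the number of matching lines; it exits non-zero when
    # there are none, so "|| true" keeps the function usable under "set -e".
    grep -c 'startup timed out' "$1" || true
}

# List the distinct domain IDs that hit the timeout, one per line.
timed_out_domains() {
    grep 'startup timed out' "$1" \
        | sed -n 's/.*Domain \([0-9][0-9]*\): Stubdom.*/\1/p' \
        | sort -n -u
}
```

Running both against the log from a working boot and from a failing boot shows whether the timeouts are new.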

krystian-hebel · 2024-04-24T15:46:35Z

Good thing I refreshed the list of issues; I was about to report the same. I can confirm that the VMs were restarted right after the upgrade and worked until the following boot.

I also checked that after removing all network controllers from sys-net and changing its type to PVH all VMs can be started (obviously without networking), so this seems to be a problem with HVM.

This happened on KGPE-D16 with Opteron 6282 SE, with ASUS firmware 3001 (i.e. no coreboot).

scallyob · 2024-04-24T16:14:34Z

> KGPE-D16 with Opteron 6282 SE

Oh no, I'm on same hardware!

scallyob · 2024-04-24T17:52:49Z

Kernel downgrade in dom0 and sys-net did not produce different results.

marmarek · 2024-04-24T18:37:37Z

What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

scallyob · 2024-04-24T18:41:30Z

There's a lot in there, not sure what to look for. Don't see errors or warnings.
(talked about this here: https://forum.qubes-os.org/t/no-qubes-vms-boot-after-latest-updates/26033/14)

marmarek · 2024-04-24T18:58:22Z

Errors can be buried quite deep there... look for anything after starting qemu.

scallyob · 2024-04-24T19:07:11Z

I see it start qemu, with a bunch of options, each time it tries to boot. Nothing obvious to me after that that seems like a problem.

krystian-hebel · 2024-04-24T20:21:18Z

> What can you see in /var/log/xen/console/guest-sys-net-dm.log ?

For me it ends with:

If full log may contain something useful I may try to get it tomorrow, without network it will require some finesse.

marmarek · 2024-04-24T20:56:32Z

Indeed nothing obvious there... But one worrying thing is the timing: the "Rescanning PCI Frontend" messages are on stubdomain cleanup, and based on timestamps it's pretty close to starting qemu. AFAIR the startup timeout is 10s, but usually the stubdomain startup takes below 1s. This is a pretty old system; it may be that the recent workarounds for speculative-execution bugs made it significantly slower.

Is it with current-testing (in dom0) enabled or not? Best to identify which update specifically broke it. dnf history may help, but my guess is Xen package. There is also Xen update in current-testing since yesterday, maybe this one will help?

krystian-hebel · 2024-04-24T21:10:33Z

> Is it with current-testing (in dom0) enabled or not?

No, just default ones.

I also think this may be caused by Xen, that would explain why initial restart of VMs succeeded and only after full reboot they won't come up, will check different versions later.

scallyob · 2024-04-24T21:29:35Z

> Is it with current-testing (in dom0) enabled or not?

No.

Tried downgrading to: xen-hvm-stubdom-linux-4.2.9-1.fc37.x86_64.rpm AND xen-hvm-stubdom-linux-full-4.2.9-1.fc37.x86_64.rpm

No change.

Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:

The operation would result in removing the following protected packages: qubes-core-dom0

So not sure how to proceed.

marmarek · 2024-04-24T21:40:30Z

> Tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me this error:
>
> The operation would result in removing the following protected packages: qubes-core-dom0
>
> So not sure how to proceed.

The xen package needs to match the exact version of python3-xen, so you need to downgrade that one too.
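As a rough sketch of what that means in practice, the helper below prints the full package set pinned to a single version, python3-xen included. The function name is hypothetical, the package names are the ones mentioned in this thread, and the final dnf call assumes the older packages are still available to dnf (for example as downloaded RPM files or from a repository that still carries them).

```shell
# Print the Xen package set pinned to one version, including python3-xen,
# so that all of them are downgraded together.
xen_downgrade_set() {
    version="$1"
    for pkg in xen xen-hypervisor xen-libs xen-licenses xen-runtime python3-xen; do
        printf '%s-%s\n' "$pkg" "$version"
    done
}

# Review the list first:
#   xen_downgrade_set 4.17.3-4
# Then, in dom0 (assumption: dnf can find these exact versions):
#   sudo dnf downgrade $(xen_downgrade_set 4.17.3-4)
```

Keep in mind that staying on the older Xen leaves known Xen vulnerabilities unpatched, so this is only a stopgap.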

scallyob · 2024-04-24T22:25:57Z

Ok, that allowed the downgrade, now everything is booting.

~~CORRECTION: everything seems to boot EXCEPT for sys-usb~~

~~start failed: Timed out during operation: cannot acquire state change lock (held by monitor=shutdown-event-20)~~ [Fixed: caused by switching from HVM to PV]

krystian-hebel · 2024-04-25T13:04:54Z

The problem is still present in 4.17.4-1 from current-testing.

krystian-hebel · 2024-04-26T11:24:23Z

All VMs start when booting with spec-ctrl=no-ibpb-entry, which makes me believe this is a performance issue that may have been present for some time, but hidden until XSA-455.

I've noticed the same problem (at least I think it's the same, but I didn't do as much testing as on the KGPE) on an HP t630 with a slightly newer CPU. In that case, sys-usb was failing, which didn't allow me to log in, and I didn't have a PS/2 keyboard at hand. spec-ctrl=no-ibpb-entry helped in that case as well. (The KGPE doesn't have sys-usb because the OS is installed on a USB drive.)

@marmarek is it possible to relax the startup timeout to see if it helps?

marmarek · 2024-04-26T11:56:51Z

The timeout is hardcoded in libxl (look for LIBXL_STUBDOM_START_TIMEOUT), to 30s. If 30s is not enough to start, I doubt relaxing it will result in a working system (even if it will start, the stubdomain will likely be too slow for sys-net/sys-usb to work at all...)

marmarek · 2024-04-26T12:21:34Z

So, I'm afraid there is not much hope for this old-ish system... The only way to make the system kinda-usable has a tradeoff with security here: disabling the mitigation for PV domains (which should mean just stubdomains; make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv. It does mean that a stubdomain will be able to mount the attack, potentially leaking memory of any other VM (so isolation of sys-net/sys-usb and any other HVM becomes weaker). If that is not an acceptable risk, blame AMD for making a buggy CPU, and replace it with something newer...

andyhhp · 2024-04-26T12:48:06Z

Yeah sorry... you need https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf and the update from Feb this year in order to have an AMD CPU not needing this mitigation for safety

scallyob · 2024-04-26T17:03:19Z

> by disabling the mitigation for PV domains (which should mean just stubdomains, make sure you don't have any really untrusted PV qubes) with spec-ctrl=ibpb-entry=no-pv

So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)
For someone who doesn't know how to do this, how hard is this to do?

Thanks for looking into this, despite the disappointing conclusion.

RA-Kooi · 2024-04-26T22:41:46Z

> So this would be a better medium-term solution than excluding xen from updates? (Short-term that is what I am doing: excluding xen updates. Long-term I guess I need to look into buying a new computer.)

Yes, by not updating Xen you will remain exposed to vulnerabilities in Xen itself as well as to this CPU bug.

> For someone who doesn't know how to do this, how hard is this to do?

Depending on how Xen is started it could differ, but chances are there's a file called xen.cfg in /boot. In this file you will find something like this:

[xen]
options=

Simply append spec-ctrl=ibpb-entry=no-pv to the end of the options line.

github-actions · 2024-04-26T22:57:57Z

This issue has been closed as "declined." This means that the issue describes a legitimate bug (in the case of bug reports) or proposal (in the case of enhancements and tasks), and it is actionable, at least in principle. Nonetheless, it has been decided that no action will be taken on this issue. Here are some examples of reasons why an issue may be declined:

No solution can be found.
The proposed action is not possible.
The proposed action would weaken security to an unacceptable degree.
The proposed action would be too costly (in time, money, or other resources) relative to the benefits it would provide.
The proposed action would make some things better while making other things worse, and the trade-off is not worthwhile.

These are just general examples. If the specific reason for this particular issue being declined has not already been provided, please feel free to leave a comment below asking for an explanation.

We respect the time and effort you have taken to file this issue, and we understand that this outcome may be unsatisfying. Please accept our sincere apologies and know that we greatly value your participation and membership in the Qubes community.

If anyone reading this believes that this issue was closed in error or that the resolution of "declined" is not accurate, please leave a comment below saying so, and the Qubes team will review this issue again. For more information, see How issues get closed.

scallyob · 2024-05-04T15:59:55Z

> Depending on how Xen is started it could differ, but chances are there's a file called xen.cfg in /boot. In this file you will find something like this:
> [xen]
> options=
> Simply append spec-ctrl=ibpb-entry=no-pv to the end of the options line.

I do not have xen.cfg in /boot (or any of its subdirectories)
I have xen-4.17.3.config, which appears to not be the same thing and says "do not edit". Any other suggestions on how to accomplish this?

Tonux599 · 2024-05-05T22:48:30Z

@scallyob you can add

GRUB_CMDLINE_XEN_DEFAULT="$GRUB_CMDLINE_XEN_DEFAULT spec-ctrl=ibpb-entry=no-pv"

to the end of /etc/default/grub and update with grub2-mkconfig -o /boot/grub2/grub.cfg and then reboot.
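After the reboot it is worth checking that Xen actually booted with the option. The sketch below uses a hypothetical helper, `xen_cmdline_has`, that looks for an exact, whitespace-delimited option on a command-line string; the commented lines assume that `xl info` in dom0 prints a `xen_commandline` line.

```shell
# Return success if the given option appears as a whole word on a Xen command line.
xen_cmdline_has() {
    cmdline="$1"
    option="$2"
    # Pad with spaces so the option only matches as a complete word.
    case " $cmdline " in
        *" $option "*) return 0 ;;
        *) return 1 ;;
    esac
}

# In dom0, after the reboot:
#   running=$(xl info | sed -n 's/^xen_commandline *: *//p')
#   xen_cmdline_has "$running" "spec-ctrl=ibpb-entry=no-pv" && echo "option active"
```

If the option is missing, the regenerated grub.cfg is probably not the one the firmware actually loads.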

This is a really unfortunate situation as the KGPE-D16 is the most powerful, binary blob free (when used with Libreboot, old Coreboot, or Dasharo) system that supports Qubes.

I understand that AMD is unlikely to push any microcode updates for these CPUs to help fix this, so I would appreciate it if we could look further for possible resolutions.

@marmarek could we potentially seek increasing LIBXL_STUBDOM_START_TIMEOUT beyond 30 seconds to see if performance is not degraded once it starts? Qubes with PCI assignments have always been slow to start but it's generally once per boot.

~~Failing that, would it be best to have PCI Qubes use PV with spec-ctrl=ibpb-entry=no-pv, or use HVM with spec-ctrl=ibpb-entry=no-hvm?~~
edit: sorry I made a presumption here. HVM PCI Qubes still fail with spec-ctrl=ibpb-entry=no-hvm.

Until #4318 is completed this system is all we got if you want Qubes on blob free firmware.

andyhhp · 2024-05-08T21:43:44Z

IBPB (Indirect Branch Prediction Barrier) is the thing AMD retrofitted in microcode for Spectre-v2 defences.

It is very expensive. Sadly it's also the only protection against Branch Type Confusion (BTC, marketed as Retbleed), and Speculative Return Stack Overflow (SRSO, marketed as Inception).

Even in CPUs newer than the Spectre discovery, it's still expensive, and that's with all the pipeline improvements that the CPU vendors could bear to put in.

The performance will be degraded for the lifetime of the VMs. IBPB is issued on every entry into Xen, so that's every interrupt/vmexit (HVM guests) or every syscall/pagefault/etc (PV guests).

thedeadliestcatch · 2024-05-15T19:15:07Z

This is affecting Intel too, on a recent NUC system FYI running a 13th gen CPU.

andyhhp · 2024-05-15T20:54:04Z

> This is affecting Intel too, on a recent NUC system FYI running a 13th gen CPU.

This specific issue really isn't affecting Intel systems. If you're seeing similar symptoms, it will be a different cause. Please open a new bug.

Tonux599 · 2024-10-09T20:00:19Z

> @marmarek could we potentially seek increasing LIBXL_STUBDOM_START_TIMEOUT beyond 30 seconds to see if performance is not degraded once it starts? Qubes with PCI assignments have always been slow to start but it's generally once per boot.

@marmarek could this still be tried? There is still a community that values blob-free firmware and would like to use Qubes. Even if increasing this timeout results in a very slow system, it should be up to the user whether they want to accept that trade-off.

marmarek · 2024-10-09T20:56:49Z

No, I don't see any sense in that. If something that normally takes about 1s doesn't complete in 30s, it doesn't sound like usable system at all. It means pretty much everything will be 3000% slower. On the other hand, increasing the timeout will affect also users of otherwise perfectly usable system, as in case of some errors they will need to wait longer.

Tonux599 · 2024-10-09T21:01:23Z

> No, I don't see any sense in that. If something that normally takes about 1s doesn't complete in 30s, it doesn't sound like usable system at all. It means pretty much everything will be 3000% slower. On the other hand, increasing the timeout will affect also users of otherwise perfectly usable system, as in case of some errors they will need to wait longer.

Thank you for your response.

arhabd · 2024-10-09T22:17:36Z

sad day for kgpe-d16

krystian-hebel · 2024-10-28T11:38:11Z

Some numbers from my HP t630:

Booted with qubes.skip_autostart to avoid starting sys-net and sys-usb simultaneously, as it always fails.
Manually starting sys-net afterwards in qube manager usually succeeds within 30s timeout.
Starting Firefox and letting it fully render Fedora start page (counted from clicking on Firefox icon until "Latest Council Video" thumbnail) on already started sys-net (HVM) on otherwise idle machine takes ~20 minutes. Out of that, rendering takes at least 15 minutes. For comparison, on non-Qubes laptop loading and rendering takes ~5 seconds.
Doing the same in personal (PVH) is "much faster", it takes roughly 65 seconds. I haven't checked how much of that is waiting for sys-firewall/sys-net to provide data.

So while increasing LIBXL_STUBDOM_START_TIMEOUT could help those VMs get up in time, they wouldn't be useful (except maybe for some CPU-bound benchmarks, but that's not what HVM is used for).
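To put the timings above in proportion, here is a quick calculation of the slowdown factors. It is only a rough ratio: the 5-second baseline comes from a different (non-Qubes) laptop, and the helper name is made up for illustration.

```shell
# Slowdown factor = observed seconds / baseline seconds (integer division).
slowdown() {
    echo $(( $1 / $2 ))
}

slowdown $((20 * 60)) 5   # sys-net (HVM): ~20 minutes vs ~5 seconds -> 240
slowdown 65 5             # personal (PVH): ~65 seconds vs ~5 seconds -> 13
```

So the HVM case is on the order of hundreds of times slower than the baseline, while PVH is roughly an order of magnitude slower.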

andyhhp · 2024-10-28T15:50:09Z

Sorry, but this is the cost of keeping your VM's secrets secret on buggy hardware.

spec-ctrl=no-ibpb-entry will make it work again, but with the consequence that you're disabling the protection for Branch Type Confusion (CVE-2022-23825) and Speculative Return Stack Overflow (CVE-2023-20569).

If you've risk assessed, and decided this is acceptable, then fine. But if you care about protecting against these attacks, then the IBPB is needed.

scallyob added the labels "P: default" and "T: bug" on Apr 24, 2024

github-actions bot closed this as not planned on Apr 26, 2024

This was referenced May 17, 2024

Test TrenchBoot support on AMD hardware with TPM 2.0 and TPM 1.2 with legacy boot mode TrenchBoot/trenchboot-issues#23

Closed

TrenchBoot Secure Kernel Loader (SKL) improvements for AMD server CPUs with multiple nodes TrenchBoot/trenchboot-issues#20

Closed

docelic mentioned this issue Jun 2, 2024

Some error during install results in no Qubes/VMs starting when type is HVM #9281

Closed
