No Qubes/VMs starting - libxenlight failed to create new-domain in 4.2.1 #9150
Comments
Good thing I refreshed the list of issues; I was about to report the same. I can confirm that VMs were restarted right after the upgrade and they worked until the following boot. I also checked that after removing all network controllers from This happened on KGPE-D16 with Opteron 6282 SE, with ASUS firmware 3001 (i.e. no coreboot).
Oh no, I'm on the same hardware!
Downgrading the kernel in dom0 and sys-net did not change anything.
What can you see in
There's a lot in there; I'm not sure what to look for. I don't see any errors or warnings.
Errors can be buried quite deep there... look for anything after starting qemu.
I see it start qemu, with a bunch of options, each time it tries to boot. Nothing after that looks like an obvious problem to me.
Indeed, nothing obvious there... But one worrying thing is the timing: the "Rescanning PCI Frontend" messages come from stubdomain cleanup, and based on the timestamps this is pretty close to starting qemu. AFAIR the startup timeout is 10s, but stubdomain startup usually takes less than 1s. This is a pretty old system; it may be that a recent workaround for speculative-execution bugs made it significantly slower. Is this with current-testing (in dom0) enabled or not? It would be best to identify which update specifically broke it. dnf history may help, but my guess is the Xen package. There has also been a Xen update in current-testing since yesterday; maybe that one will help?
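The timing argument above can be checked by hand: take the timestamp at which qemu is started and the timestamp of the first "Rescanning PCI Frontend" message, then compare the gap with the startup timeout. The following is a minimal Python sketch of that comparison. It assumes both timestamps have already been extracted as ISO 8601 strings, and it takes the 10 s figure from the comment above, which is recalled rather than confirmed, so treat it as an assumption. The function name is hypothetical, and the sketch does not parse any real Xen or libvirt log format.

```python
from datetime import datetime

# Assumption: ~10 s stubdomain startup timeout, as recalled in the comment above.
ASSUMED_STARTUP_TIMEOUT_S = 10.0


def startup_margin(qemu_start: str, cleanup_start: str,
                   timeout_s: float = ASSUMED_STARTUP_TIMEOUT_S) -> tuple[float, bool]:
    """Return (elapsed seconds, whether the gap reached the timeout).

    Both arguments are ISO 8601 timestamps such as "2024-01-01T10:00:00.500".
    Pulling these strings out of real log lines is left to the caller.
    """
    start = datetime.fromisoformat(qemu_start)
    end = datetime.fromisoformat(cleanup_start)
    elapsed = (end - start).total_seconds()
    return elapsed, elapsed >= timeout_s
```

For example, `startup_margin("2024-01-01T10:00:00", "2024-01-01T10:00:10.500")` reports a 10.5 s gap that reaches the assumed timeout. A gap well under 1 s would instead point away from a timeout as the cause.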
No, just the default ones. I also think this may be caused by Xen. That would explain why the initial restart of VMs succeeded and only after a full reboot do they fail to come up. I will check different versions later.
No. I tried downgrading to xen-hvm-stubdom-linux-4.2.9-1.fc37.x86_64.rpm and xen-hvm-stubdom-linux-full-4.2.9-1.fc37.x86_64.rpm. No change. I also tried downgrading xen, xen-hypervisor, xen-libs, xen-licenses and xen-runtime to 4.17.3-4, but it gave me an error:
So I'm not sure how to proceed.
The xen package needs to match the exact version of python3-xen, so you need to downgrade that one too.
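To see why the partial downgrade was refused, it helps to compare the version-release part of two package strings directly. Per the comment above, xen and python3-xen must match exactly. The following is a minimal Python sketch. It assumes package strings in the usual RPM `name-version-release.arch` form, like the file names quoted earlier in the thread, and it ignores epochs. The function names are hypothetical, and this is not how dnf itself resolves dependencies.

```python
def parse_nvra(package: str) -> tuple[str, str, str]:
    """Split an RPM-style "name-version-release.arch" string.

    Assumes the string ends in ".arch" and that neither version nor
    release contains a hyphen (the usual RPM convention). Epochs are
    not handled.
    """
    nvr, _arch = package.rsplit(".", 1)       # drop ".x86_64" etc.
    name, version, release = nvr.rsplit("-", 2)  # name may itself contain hyphens
    return name, version, release


def same_version_release(pkg_a: str, pkg_b: str) -> bool:
    """True if both packages carry exactly the same version and release."""
    return parse_nvra(pkg_a)[1:] == parse_nvra(pkg_b)[1:]
```

With the versions mentioned in this thread, `same_version_release("xen-4.17.3-4.fc37.x86_64", "python3-xen-4.17.4-1.fc37.x86_64")` is `False`. That mismatch is exactly the kind of situation in which downgrading only the xen packages cannot succeed.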
Ok, that allowed the downgrade; now everything is booting.
The problem is still present in 4.17.4-1 from current-testing.
All VMs start when booting with I've noticed the same problem (at least I think it's the same, but I didn't do as much testing as on the KGPE) on an HP t630 with a slightly newer CPU. In that case, @marmarek, is it possible to relax the startup timeout to see if that helps?
The timeout is hardcoded in libxl (look for
So, I'm afraid there is not much hope for this old-ish system... The only way to make the system kinda usable involves a security tradeoff: disabling the mitigation for PV domains (which should mean just stubdomains; make sure you don't have any really untrusted PV qubes) with
Yeah, sorry... you need https://www.amd.com/content/dam/amd/en/documents/corporate/cr/speculative-return-stack-overflow-whitepaper.pdf and the update from February this year for an AMD CPU to be safe without this mitigation.
Thanks for looking into this, despite the disappointing conclusion.
Yes, by not updating Xen you will be vulnerable to vulnerabilities in Xen itself as well as to this CPU bug.
Depending on how Xen is started it could differ, but chances are there's a file called
[xen]
options=
Simply append
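The "append to the options= line" step can be sketched programmatically. The following Python function is a minimal sketch. It assumes an INI-like file with a `[xen]` section containing an `options=` line, as in the snippet above. The real file path and section name may differ on a given install. The option string is left as a parameter because the thread does not preserve it. The function name and the `example_flag=1` value in the usage line are hypothetical.

```python
def append_xen_option(cfg_text: str, option: str, section: str = "xen") -> str:
    """Return cfg_text with `option` appended to the options= line of [section].

    Only the first options= line inside the named section is changed;
    all other lines (comments, other sections) are returned unchanged.
    Raises ValueError if the section has no options= line.
    """
    out = []
    in_section = False
    done = False
    for line in cfg_text.splitlines(keepends=True):
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Entering a new section header such as "[xen]" or "[global]".
            in_section = stripped[1:-1].strip() == section
        elif (in_section and not done and "=" in stripped
              and stripped.split("=", 1)[0].strip() == "options"):
            content = line.rstrip("\r\n")
            newline = line[len(content):]  # keep the original line ending
            content = content.rstrip()
            # No separating space if the value is currently empty.
            sep = "" if content.endswith("=") else " "
            line = content + sep + option + newline
            done = True
        out.append(line)
    if not done:
        raise ValueError(f"no options= line found in [{section}]")
    return "".join(out)
```

For instance, `append_xen_option("[xen]\noptions=console=vga\n", "example_flag=1")` yields `"[xen]\noptions=console=vga example_flag=1\n"`. The result should be written to a new file and reviewed before it replaces the original.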
This issue has been closed as "declined." This means that the issue describes a legitimate bug (in the case of bug reports) or proposal (in the case of enhancements and tasks), and it is actionable, at least in principle. Nonetheless, it has been decided that no action will be taken on this issue.
If the specific reason for this particular issue being declined has not already been provided, please feel free to leave a comment below asking for an explanation. We respect the time and effort you have taken to file this issue, and we understand that this outcome may be unsatisfying. Please accept our sincere apologies and know that we greatly value your participation and membership in the Qubes community. If anyone reading this believes that this issue was closed in error or that the resolution of "declined" is not accurate, please leave a comment below saying so, and the Qubes team will review this issue again. For more information, see How issues get closed.
I do not have xen.cfg in /boot (or any of its subdirectories).
@scallyob you can add
to the end of
This is a really unfortunate situation, as the KGPE-D16 is the most powerful binary-blob-free system (when used with Libreboot, old Coreboot, or Dasharo) that supports Qubes. I understand that AMD is unlikely to push any microcode updates for these CPUs to help fix this, so I would appreciate it if we could keep looking for possible resolutions. @marmarek could we potentially seek increasing
Until #4318 is completed, this system is all we've got if you want Qubes on blob-free firmware.
IBPB (Indirect Branch Prediction Barrier) is what AMD retrofitted in microcode for Spectre-v2 defences. It is very expensive. Sadly, it's also the only protection against Branch Type Confusion (BTC, marketed as Retbleed) and Speculative Return Stack Overflow (SRSO, marketed as Inception). Even on CPUs designed after Spectre was discovered, it's still expensive, and that's with all the pipeline improvements the CPU vendors could bear to put in. Performance will be degraded for the lifetime of the VMs. IBPB is issued on every entry into Xen, so that's every interrupt/vmexit (HVM guests) or every syscall/pagefault/etc. (PV guests).
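The cost of a per-entry barrier can be estimated with simple arithmetic. If every entry into Xen pays a fixed extra cost, the fraction of CPU time lost is the entry rate multiplied by that cost and divided by the clock rate. The following Python sketch encodes only that linear model. The entry rate, cycle count and clock speed in the usage line are round placeholders, not measurements from this thread or from any particular CPU. The function name is hypothetical.

```python
def barrier_overhead(entries_per_second: float,
                     cycles_per_barrier: float,
                     cpu_hz: float) -> float:
    """Fraction of one CPU's time spent executing a per-entry barrier.

    Linear model: overhead = rate * cost / clock. It ignores second-order
    effects such as running with freshly flushed branch predictors after
    each barrier, so real slowdowns can be larger than this estimate.
    A result >= 1.0 means the barrier alone would saturate the CPU.
    """
    if cpu_hz <= 0:
        raise ValueError("cpu_hz must be positive")
    return entries_per_second * cycles_per_barrier / cpu_hz
```

With placeholder values, `barrier_overhead(100_000, 10_000, 2_000_000_000)` returns `0.5`, meaning half of the CPU's time goes to the barrier. This illustrates why workloads with very frequent entries into Xen, such as PV stubdomains issuing syscalls and taking page faults, are hit hardest.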
FYI, this is affecting Intel too, on a recent NUC system running a 13th-gen CPU.
This specific issue really isn't affecting Intel systems. If you're seeing similar symptoms, they have a different cause. Please open a new bug.
@marmarek could this still be tried? There is still a community that values blob-free firmware and would like to use Qubes. Even if increasing this timeout results in a very slow system, it should be up to the user whether to accept that trade-off.
No, I don't see any sense in that. If something that normally takes about 1s doesn't complete in 30s, it doesn't sound like a usable system at all. It means pretty much everything will be about 30 times slower. On the other hand, increasing the timeout will also affect users of otherwise perfectly usable systems, since in case of some errors they will need to wait longer.
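The arithmetic behind the 1 s versus 30 s comparison works as follows. A task that normally takes $t_0$ but instead takes $t_1$ is $t_1/t_0$ times as slow. That corresponds to a percentage increase in run time of $(t_1/t_0 - 1)\times 100\%$:

```latex
\frac{t_1}{t_0} = \frac{30\,\mathrm{s}}{1\,\mathrm{s}} = 30,
\qquad
\left(\frac{t_1}{t_0} - 1\right) \times 100\% = 2900\%
```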
Thank you for your response.
sad day for kgpe-d16
Some numbers from my HP t630:
So while increasing
Sorry, but this is the cost of keeping your VMs' secrets secret on buggy hardware.
If you've done a risk assessment and decided this is acceptable, then fine. But if you care about protecting against these attacks, then the IBPB is needed.
WARNING: This issue only affects very old AMD CPUs that are no longer supported by AMD. Some workarounds listed below have severe security consequences; do not apply them unless you really understand all the implications!
Qubes OS release
4.2.1
Brief summary
I did a full update on April 23 and rebooted on April 24, the first reboot since April 1.
Now no Qubes/VMs will start.
Steps to reproduce
Expected behavior
VMs set to autostart start up
Actual behavior
"libxenlight failed to create new-domain" pop ups for sys-net, sys-firewall, etc
qvm-ls shows all VMs halted
/var/log/libvirt/libxl/libxl-driver.log shows:
This repeats many times with the Domain # changing
(There are also errors related to the PCI device, but these are present for previous boots as well and are not new. The errors above did not appear in the log before today.)