KDE frozen on first boot after install / OpenGL causes unrelated applications to crash in dom0? #1680

Closed
edwintorok opened this issue Jan 23, 2016 · 15 comments
Labels: C: kernel, help wanted, T: bug

Comments

@edwintorok

I installed Qubes 3.1 RC2 on an AMD system (see below for hcl output) using legacy boot (I haven't tried installing with UEFI) and the 4.1.13-8.pvops.qubes.x86_64 kernel, with KDE+Xfce4 on LVM on an SSD (not encrypted).
On first boot I chose the defaults (create the default VMs, do NOT create usbvm), and logged in to KDE.
As soon as I logged in there was a black screen with just one KDE button on lower left and the entire GUI frozen. I could barely move the mouse (the pointer lagged by 10-30 s), and clicking on the KDE button did nothing. After several minutes the 'Desktop' button appeared in the upper right corner, but the system was still unusable.
The CPU fan was quite audible, so I guess it was using the CPU heavily.

I tried some keyboard shortcuts but nothing worked (Ctrl-Alt-F1, Ctrl-Alt-F2, Ctrl-Alt-Del, Ctrl-Alt-Backspace, Alt-PrintScreen-s). Usually I would've SSH-ed in from another machine, but I hadn't had a chance to set that up yet (this was still first boot).
I rebooted using the physical reset button, and this time logged in to Xfce.
Everything worked fine here, so I logged out, and logged back in to KDE, which again worked fine.
Back to XFCE again, and 'Qubes VM Manager' didn't want to start, tried it several times from dom0 console too.

From 1st boot: /var/log/messages
Are there any other relevant logfiles I could provide for this?

Qubes release 3.1 (R3.1)

Brand:      ASUSTeK COMPUTER INC.
Model:      M5A99FX PRO R2.0
BIOS:       2501

Xen:        4.6.0
Kernel:     4.1.13-8

RAM:        12186 Mb

CPU:
  AMD FX(tm)-8350 Eight-Core Processor           
Chipset:
  Advanced Micro Devices, Inc. [AMD/ATI] RD890 PCI to PCI bridge (external gfx0 port B) [1002:5a14] (rev 02)
VGA:

  Advanced Micro Devices, Inc. [AMD/ATI] RV730 PRO [Radeon HD 4650] [1002:9498] (prog-if 00 [VGA controller])

Net:
  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller (rev 09)

SCSI:
  WDC WD7501AALS-0 Rev: 0K05
  OCZ-VERTEX2      Rev: 1.37
  SanDisk SDSSDXPS Rev: 00RL
  WDC WD7501AALS-0 Rev: 0K05

HVM:        Active
I/O MMU:    Active
TPM:        Device not found

Qubes HCL Files are copied to: 'dom0'
    Qubes-HCL-ASUSTeK_COMPUTER_INC.-M5A99FX_PRO_R2.0-20160123-180117.yml    - HCL Info
@marmarek
Member

This doesn't look good:

Jan 22 23:55:59 dom0 kernel: [  312.741783] systemd[10818]: segfault at 5ec6 ip 0000000000005ec6 sp 00007ffe10b52368 error 14 in systemd[557233f05000+10d000]
Jan 22 23:55:59 dom0 kernel: [  312.742478] BUG: Bad rss-counter state mm:ffff880005e3db00 idx:1 val:30
Jan 22 23:55:59 dom0 kernel: systemd[10818]: segfault at 5ec6 ip 0000000000005ec6 sp 00007ffe10b52368 error 14 in systemd[557233f05000+10d000]
Jan 22 23:55:59 dom0 kernel: BUG: Bad rss-counter state mm:ffff880005e3db00 idx:1 val:30
Jan 22 23:55:59 dom0 kernel: [  312.747785] BUG: Bad rss-counter state mm:ffff880005e3b800 idx:1 val:4
Jan 22 23:55:59 dom0 kernel: BUG: Bad rss-counter state mm:ffff880005e3b800 idx:1 val:4
Jan 22 23:55:59 dom0 systemd: /usr/lib/systemd/system-generators/systemd-rc-local-generator terminated by signal SEGV.

Not sure what the cause is: it may be a kernel bug, but it may also be a Xen bug or even a hardware (memory?) problem.
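
For anyone who wants to check their own log for the same symptom, here is a minimal Python sketch that pulls the distinct 'Bad rss-counter' reports out of syslog-style lines. The function name and the `SAMPLE` list are only illustrative; the sample lines are copied from the excerpt above. Because that log repeats each kernel message with and without the uptime prefix, identical (mm, idx, val) reports are collapsed into one.

```python
import re

# Matches kernel reports such as:
#   "BUG: Bad rss-counter state mm:ffff880005e3db00 idx:1 val:30"
RSS_BUG = re.compile(
    r"BUG: Bad rss-counter state mm:(?P<mm>[0-9a-f]+) idx:(?P<idx>\d+) val:(?P<val>-?\d+)"
)


def distinct_rss_bugs(lines):
    """Return distinct (mm, idx, val) reports in order of first appearance."""
    reports = []
    for line in lines:
        match = RSS_BUG.search(line)
        if match is None:
            continue  # not an rss-counter report
        report = (match["mm"], int(match["idx"]), int(match["val"]))
        if report not in reports:
            reports.append(report)
    return reports


# Lines copied from the /var/log/messages excerpt quoted in this thread.
SAMPLE = [
    "Jan 22 23:55:59 dom0 kernel: [  312.742478] BUG: Bad rss-counter state mm:ffff880005e3db00 idx:1 val:30",
    "Jan 22 23:55:59 dom0 kernel: BUG: Bad rss-counter state mm:ffff880005e3db00 idx:1 val:30",
    "Jan 22 23:55:59 dom0 kernel: [  312.747785] BUG: Bad rss-counter state mm:ffff880005e3b800 idx:1 val:4",
    "Jan 22 23:55:59 dom0 kernel: BUG: Bad rss-counter state mm:ffff880005e3b800 idx:1 val:4",
    "Jan 22 23:55:59 dom0 systemd: /usr/lib/systemd/system-generators/systemd-rc-local-generator terminated by signal SEGV.",
]

for mm, idx, val in distinct_rss_bugs(SAMPLE):
    print(f"mm={mm} idx={idx} val={val}")
```

Collapsing identical tuples also hides a report that genuinely recurs with the same values, so this is a rough indicator rather than an exact count.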

> As soon as I logged in there was a black screen with just one KDE button on lower left and the entire GUI frozen.

Disabling or enabling compositing may (or may not) help; the shortcut is Alt+Shift+F12 by default. In that system state, it will probably take some patience before the switch actually happens...

> Back to XFCE again, and 'Qubes VM Manager' didn't want to start, tried it several times from dom0 console too.

qubes-manager has code to prevent it from running in multiple instances. This probably means you already have one running somewhere (maybe hung or something). If you kill that instance and still have the problem, you will probably get an error message on the console when starting the process.
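
To illustrate why a hung instance blocks a new one, here is a minimal Python sketch of one common single-instance pattern, an exclusive lock on a file. This is an assumption for illustration only and is not taken from qubes-manager's source; the function name and lock path are hypothetical.

```python
import fcntl
import os
import tempfile


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if it is taken."""
    handle = open(lock_path, "w")
    try:
        # LOCK_NB makes flock fail immediately instead of waiting for the holder.
        fcntl.flock(handle, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        handle.close()
        return None
    return handle


# Hypothetical lock location, chosen only for this demonstration.
LOCK_PATH = os.path.join(tempfile.gettempdir(), "example-manager.lock")

first = acquire_single_instance_lock(LOCK_PATH)
second = acquire_single_instance_lock(LOCK_PATH)
print("first instance got the lock:", first is not None)
print("second instance got the lock:", second is not None)
if first is not None:
    first.close()  # closing the file releases the lock
```

With this pattern the lock lives as long as the holding process keeps the file open, so a hung first instance keeps every later start from proceeding until it is killed.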

> Qubes HCL Files are copied to: 'dom0'
>     Qubes-HCL-ASUSTeK_COMPUTER_INC.-M5A99FX_PRO_R2.0-20160123-180117.yml    - HCL Info

That file would be useful.

It's better to write to the qubes-users mailing list about problems related to specific hardware: there are more people there, and somebody might have had a similar problem.

@edwintorok
Author

Qubes-HCL-ASUSTeK_COMPUTER_INC.-M5A99FX_PRO_R2.0-20160123-180117.yml

Interestingly I don't have any 'bad rss counter' entries this afternoon in my dmesg, but when I installed it last night it happened quite a lot.

@edwintorok
Author

Logging in to KDE triggers a series of these 'BUG: Bad rss-counter' errors, and pretty much everything segfaults eventually, until the system reboots on its own. Even sudo ls -l crashed with a message about bash memory corruption in libc: https://gist.github.com/edwintorok/179fcb8090f49d8a0163.

I wouldn't be surprised if this is related to OpenGL. AFAIK Xfce doesn't use it and KDE does, so I launched glxgears under Xfce and ran watch 'dmesg -T|tail'. The crashes started happening soon enough: https://gist.github.com/edwintorok/bee143ef33428084e450.
I stopped glxgears and the crashes stopped happening. So does OpenGL (or its kernel part) corrupt other applications' memory or kernel memory when run under Xen? Could the IOMMU be used to somehow limit/diagnose that?
(FWIW OpenGL works fine on debian jessie or debian jessie+backports; the only similar trouble I had with this video card were some GPU hangs long ago with older versions of the drivers, but those would always result in a quite lengthy GPU hang message in dmesg).
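
To see which unrelated programs are being hit while glxgears runs, the segfault lines in the kernel log can be grouped by process name. The sketch below is a rough Python helper under the assumption that the lines follow the usual kernel format "name[pid]: segfault at ..."; the function name is illustrative and the sample lines are copied from the log quoted earlier in this thread.

```python
import re

# Kernel segfault reports look like:
#   "systemd[10818]: segfault at 5ec6 ip ... error 14 in systemd[...]"
SEGFAULT = re.compile(
    r"(?P<comm>[^\s\[\]]+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
)


def crashed_processes(lines):
    """Map each process name to the number of distinct PIDs that segfaulted."""
    pids_by_name = {}
    for line in lines:
        match = SEGFAULT.search(line)
        if match:
            pids_by_name.setdefault(match["comm"], set()).add(int(match["pid"]))
    return {name: len(pids) for name, pids in pids_by_name.items()}


# The same segfault appears twice in the quoted log (with and without the
# uptime prefix); counting distinct PIDs avoids double counting it.
SAMPLE = [
    "Jan 22 23:55:59 dom0 kernel: [  312.741783] systemd[10818]: segfault at 5ec6 ip 0000000000005ec6 sp 00007ffe10b52368 error 14 in systemd[557233f05000+10d000]",
    "Jan 22 23:55:59 dom0 kernel: systemd[10818]: segfault at 5ec6 ip 0000000000005ec6 sp 00007ffe10b52368 error 14 in systemd[557233f05000+10d000]",
]

print(crashed_processes(SAMPLE))
```

If many different process names show up only while glxgears is running, that would support the idea that the corruption is not specific to any one application.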

@edwintorok changed the title from "KDE frozen on first boot after install" to "KDE frozen on first boot after install / OpenGL causes unrelated applications to crash in dom0?" Jan 23, 2016
@edwintorok
Author

I wasn't able to reproduce this with Debian + Xen 4.6 and glxgears.
The kernel on Debian is newer than Qubes's but Mesa is a minor version older (10.3.2 vs 10.3.3).
Also, on Debian all I've done is run it under Xen as dom0; I didn't start any other VMs (Qubes would always start a sys-net and firewall VM).

$ uname -a
Linux debian 4.3.0-0.bpo.1-amd64 #1 SMP Debian 4.3.3-7~bpo8+1 (2016-01-19) x86_64 GNU/Linux
$ glxinfo|grep OpenGL.*string
OpenGL vendor string: X.Org
OpenGL renderer string: Gallium 0.4 on AMD RV730
OpenGL core profile version string: 3.3 (Core Profile) Mesa 10.3.2
OpenGL core profile shading language version string: 3.30
OpenGL version string: 3.0 Mesa 10.3.2
OpenGL shading language version string: 1.30
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 10.3.2
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.0
$ sudo xl info
host                   : debian
release                : 4.3.0-0.bpo.1-amd64
version                : #1 SMP Debian 4.3.3-7~bpo8+1 (2016-01-19)
machine                : x86_64
nr_cpus                : 8
max_cpu_id             : 7
nr_nodes               : 1
cores_per_socket       : 4
threads_per_core       : 2
cpu_mhz                : 4013
hw_caps                : 178bf3ff:2fd3fbff:00000000:00001700:36983203:00000000:01ebbfff:00000008
virt_caps              : hvm hvm_directio
total_memory           : 12186
free_memory            : 153
sharing_freed_memory   : 0
sharing_used_memory    : 0
outstanding_claims     : 0
free_cpus              : 0
xen_major              : 4
xen_minor              : 6
xen_extra              : .0
xen_version            : 4.6.0
xen_caps               : xen-3.0-x86_64 xen-3.0-x86_32p hvm-3.0-x86_32 hvm-3.0-x86_32p hvm-3.0-x86_64 
xen_scheduler          : credit
xen_pagesize           : 4096
platform_params        : virt_start=0xffff800000000000
xen_changeset          : 
xen_commandline        : placeholder
cc_compiler            : gcc (Debian 5.2.1-23) 5.2.1 20151028
cc_compile_by          : waldi
cc_compile_domain      : debian.org
cc_compile_date        : Sun Nov  1 20:52:41 UTC 2015
xend_config_format     : 4

@andrewdavidwong
Member

I'm assuming this issue has been resolved based on the lack of recent activity. If not, please feel free to re-open it.

@Bufil

Bufil commented May 31, 2016

I have the same issue.
dom0 kernel: BUG: Bad rss-counter state...

@andrewdavidwong
Member

Ok, re-opening since more than one person has this issue. @marmarek, how should this one be labeled?

@marmarek added the C: kernel and T: bug labels Jun 1, 2016
@marmarek
Member

marmarek commented Jun 1, 2016

@edwintorok @Bufil there is a pre-R3.2 test image in #1807 (comment); it has a newer dom0 kernel and X server (+drivers). It is also possible to install kernel 4.4.x in R3.1 from the qubes-dom0-unstable repository.
If you could check whether any of those solves the problem, that would be great.

@Bufil

Bufil commented Jun 4, 2016

Seems to be an AMD-Xen-kernel combination bug.
With the same GPU and an Intel mainboard this does not happen.
Kernel 4.4.x changes nothing.

@edwintorok
Author

@marmarek: good news! I installed kernel 4.4.10-9 on Qubes R3.1 and it didn't corrupt memory anymore for me:

  • Installed Qubes R3.1 with Xfce (AFAICT it installed with legacy BIOS boot)
  • Kernel 4.1.13-9 shows the memory corruption bug in dmesg and segfaults various applications within a few minutes of launching glxgears: dmesg from kernel 4.1.13-9
  • Kernel 4.4.10-9 was able to run glxgears for 1.5 hours without any corruption messages in dmesg:
    dmesg from kernel 4.4.10-9
  • To double-check I rebooted to 4.1.13-9; within about 2 minutes dmesg showed the corruption message and applications segfaulted
  • Rebooted again to 4.4.10-9, and ran glxgears for 10 minutes: all OK!

I can't say that the corruption bug has definitely been fixed (maybe it is just harder to reproduce; @Bufil said above that it is still an issue), but I wasn't able to reproduce it anymore.

Here is also a diff of dmesg 4.1.13 and 4.4.10: diff of dmesg 4.1.13-9 vs 4.4.10-9
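
For anyone who wants to produce a similar comparison, here is a minimal Python sketch of one way to diff two dmesg captures. It is not necessarily how the diff above was made. It strips the "[ 312.741783]" uptime prefix first so that lines differing only in timing do not show up as changes; the function names and file labels are illustrative.

```python
import difflib
import re

# Leading uptime stamp printed by dmesg, e.g. "[  312.741783] ".
UPTIME_PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s*")


def strip_uptime(text):
    """Split dmesg output into lines with the uptime prefix removed."""
    return [UPTIME_PREFIX.sub("", line) for line in text.splitlines()]


def diff_dmesg(old_text, new_text, old_name="dmesg-old", new_name="dmesg-new"):
    """Return unified-diff lines between two dmesg captures."""
    return list(
        difflib.unified_diff(
            strip_uptime(old_text),
            strip_uptime(new_text),
            fromfile=old_name,
            tofile=new_name,
            lineterm="",
        )
    )
```

Typical use would be to save `dmesg` output from each kernel to a file, read both files as text, and print the result of `diff_dmesg(old, new)` line by line.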

If I want to test your Qubes-DVD-x86_64-20160518.iso should I perform a clean install, or can I use qubes-dom0-update to get the new X server you were referring to?

@Bufil

Bufil commented Jun 20, 2016

I made a clean install of Qubes-R3.2-rc1.
After a few minutes, same problem.
"BUG: Bad rss-counter state ..."

@marmarek
Member

> If I want to test your Qubes-DVD-x86_64-20160518.iso should I perform a clean install, or can I use qubes-dom0-update to get the new X server you were referring to?

Since R3.2-rc1 is out, better try this one.

@edwintorok
Author

@marmarek I just tried R3.2-rc1, and got the same results as Bufil: the 'BUG: Bad rss-counter' message is back, but it is much harder to reproduce:

I installed R3.2-rc1 (selected KDE+Xfce), and just as I completed the first boot setup it crashed, and I saw the 'BUG' message on my console. I had to hard reboot, because I was not able to log in on any of the consoles or even use Ctrl+Alt+Delete. Here are the logs [*]

After the reboot I tried my usual glxgears test under Xfce and nothing crashed; there was no 'BUG' message. I even logged in to Plasma and repeated the glxgears test: no crash, no BUG message.

[*]: I had to chroot into Qubes to use journalctl to retrieve the logs, since they are now binary; good old /var/log/messages is empty.
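
As a possible alternative to the chroot, journalctl can read journal files from another directory via its `--directory` option. The Python sketch below only builds and runs that command. It assumes the installed system's root filesystem is mounted at some path (the `/mnt/qubes-root` in the usage note is a made-up example) and that the journal is stored persistently under `var/log/journal`; the function names are illustrative.

```python
import subprocess
from pathlib import Path


def journalctl_command(mounted_root):
    """Build a journalctl command that reads the journal of a mounted system."""
    # Assumption: the journal is persistent, i.e. stored under var/log/journal.
    journal_dir = Path(mounted_root) / "var" / "log" / "journal"
    return ["journalctl", "--directory", str(journal_dir), "--no-pager"]


def read_offline_journal(mounted_root):
    """Run journalctl against the mounted system and return its output lines."""
    result = subprocess.run(
        journalctl_command(mounted_root),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()
```

For example, `read_offline_journal("/mnt/qubes-root")` would return the log lines, which can then be searched for 'Bad rss-counter'. Whether the rescue system's journalctl can read the installed system's journal files is not something I have verified.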

@andrewdavidwong added the help wanted label Dec 23, 2016
@andrewdavidwong added this to the Far in the future milestone Dec 23, 2016
@andrewdavidwong
Member

This bug report has seen no activity in a very long time, and it is not assigned to any current release milestone. It looks like it was left open by mistake, so I'm closing it now. However, if anyone is still affected by this bug on a currently-supported release, please leave a comment, and we'll be happy to reopen this. Thank you.

@andrewdavidwong removed this from the Release TBD milestone Jul 10, 2023
@edwintorok
Author

I no longer have the hardware that I used when opening this bug report, but it wasn't left open by mistake: I simply haven't found a version of Qubes that works on my (desktop AMD) hardware (it does seem to work on an Intel laptop though).
I'll try again with the latest Qubes 4.2RC1, but that runs into a different bug (machine hard shutdown during boot if I enable IOMMU in the BIOS, and without IOMMU I don't get a working 'sys-net' by default). I'll try to capture early boot messages and open a new issue: the hardware, Xen and Qubes have all changed a lot since I initially opened this bug, so it is very likely an entirely new bug.

@DemiMarie reopened this Aug 6, 2023