Latest GRUB update breaks booting #148

Open
robertm98 opened this issue Jun 25, 2024 · 18 comments
Comments

@robertm98

This is a different bug compared to what is described in #147

When the latest updates are applied and the server is then rebooted, GRUB does not start and appears to be stuck in a busy loop displaying the following message:
"error: ../../grub-core/commands/efi/tpm.c:150:unknown TPM error"

Secure Boot is disabled, and there were no previous problems.

Steps to reproduce:

Download and install OL 9.4 x86_64
OK for first boot.
Apply updates
Reboot and GRUB will then fail to load with the above error message.

As a cross check a fresh install was done and grub updates were excluded with
exclude=grub*
in the /etc/dnf/dnf.conf file.

The non-grub updates were installed and the server rebooted OK.
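
The cross-check above can be sketched as a small shell helper. This is only an illustration: `add_dnf_exclude` is a hypothetical name, and it assumes the configuration file has a single [main] section (the stock layout), so that a line appended at the end still belongs to [main].

```shell
# Add an "exclude=" line to a dnf configuration file unless one already exists.
#   $1 = path to the dnf configuration file
#   $2 = package pattern to exclude, e.g. 'grub*'
# Assumption: the file has only a [main] section, so appending is enough.
add_dnf_exclude() {
    local conf="$1" pattern="$2"
    if grep -q '^exclude=' "$conf"; then
        echo "exclude already set in $conf; edit it by hand" >&2
        return 1
    fi
    # Make sure the file ends with a newline before appending.
    if [ -n "$(tail -c 1 "$conf")" ]; then
        echo >> "$conf"
    fi
    printf 'exclude=%s\n' "$pattern" >> "$conf"
}
```

As root this would be called as `add_dnf_exclude /etc/dnf/dnf.conf 'grub*'`. For a one-off run, `dnf upgrade --exclude='grub*'` should have the same effect without touching the file.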

@aburmash

Hello!
Thanks for the report. In fact, the last update issued for the linked issue has zero code changes, though it MIGHT have regenerated the grub config for you; maybe that is what triggers the issue.
Are you seeing any errors other than the unknown TPM error?
Are you using a BTRFS filesystem and/or BTRFS snapshots?

@aburmash

Never mind, reproduced it. We are going to pull this update and issue a proper one shortly.

@robertm98
Author

Thank you.
For info the filesystem is XFS.
A minor change is that the LVM volume group was renamed from "ol" to "olb" so as not to clash with the volume group name of the previous installation on the original drive when I copy files across. I wondered if this could be relevant given the questions about the filesystem, but from your last reply probably not.
The installation is on a separate SATA drive and all other drives are disconnected.

@aburmash

@robertm98 once again, thank you very much! I see that it is not related to filesystems, just a broken grub config.

@m45733r

m45733r commented Jun 25, 2024

Same issue here. Is there any way to fix the broken grub / grub.cfg from within the UEFI interactive shell?

@robertm98
Author

The only way I think this could be repaired is to do a recovery boot from the installation media, chroot to /mnt/sysroot (I think), and then possibly use dnf to roll back, or edit the config.
@aburmash Would it be possible to get the details of the errors in the config and what needs to be done to make things good, please? That is, what needs editing and what then needs to be run to apply the config changes.

@aburmash

@robertm98 @m45733r I will provide recovery instructions for the UEFI shell shortly.

@aburmash

aburmash commented Jun 25, 2024

@m45733r

  1. If you have already installed the bad update but have not rebooted yet:
    grub2-mkconfig > /boot/grub2/grub.cfg OR
    grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
  2. If you can only work from the UEFI shell:
  • Identify which FS is your ESP partition.
    To do that, check all displayed partitions one by one; the ESP is usually FS0.
      FS0: Alias(s):HD0a1b:;BLK1:
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(1,GPT,3AF7074E-C0BB-400D-8FC7-E9EC738AA53F,0x800,0x32000)
     BLK0: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)
     BLK2: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(2,GPT,14BE7023-6C02-4573-8891-9F639B9D936A,0x32800,0x400000)
     BLK3: Alias(s):
          PciRoot(0x0)/Pci(0x4,0x0)/Scsi(0x0,0x1)/HD(3,GPT,E700F071-90A5-40BB-8132-52AF688193B7,0x432800,0x5900800)
fs0:
ls

If you see the EFI dir, you are where you need to be.

cd EFI/redhat
rm grub.cfg
grubx64.efi

You will be dropped to the grub command line. Run
ls
to display the list of available disks. There you need to find the disk that has the /boot dir, or identify the /boot partition.
Run
ls <disk>/
on each one to see which it is, for example:
ls (hd0,gpt2)/
When you have found /boot, you will see something like:

grub> ls (hd0,gpt2)/
./ ../ efi/ grub2/ loader/ vmlinuz-5.14.0-427.16.1.el9_4.x86_64
System.map-5.14.0-427.16.1.el9_4.x86_64 config-5.14.0-427.16.1.el9_4.x86_64
.vmlinuz-5.14.0-427.16.1.el9_4.x86_64.hmac
symvers-5.14.0-427.16.1.el9_4.x86_64.gz
initramfs-5.14.0-427.16.1.el9_4.x86_64.img
vmlinuz-5.15.0-206.153.7.el9uek.x86_64
System.map-5.15.0-206.153.7.el9uek.x86_64 config-5.15.0-206.153.7.el9uek.x86_64
.vmlinuz-5.15.0-206.153.7.el9uek.x86_64.hmac
symvers-5.15.0-206.153.7.el9uek.x86_64.gz
initramfs-5.15.0-206.153.7.el9uek.x86_64.img
initramfs-0-rescue-36703c3cdc50ff74e863e867384f6a8a.img
vmlinuz-0-rescue-36703c3cdc50ff74e863e867384f6a8a
initramfs-5.15.0-206.153.7.el9uek.x86_64kdump.img 

Now you need to check the boot info for your kernel:
ls (hd0,gpt2)/loader/entries/

grub> ls (hd0,gpt2)/loader/entries/
./ ../ 8c622b7d13354f7fbe5eee50d3f340bd-5.14.0-427.16.1.el9_4.x86_64.conf
8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf
36703c3cdc50ff74e863e867384f6a8a-0-rescue.conf

cat (hd0,gpt2)/loader/entries/8c622b7d13354f7fbe5eee50d3f340bd-5.15.0-206.153.7.el9uek.x86_64.conf
You will see something like:

title Oracle Linux Server (5.15.0-206.153.7.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4
version 5.15.0-206.153.7.el9uek.x86_64
linux /vmlinuz-5.15.0-206.153.7.el9uek.x86_64
initrd /initramfs-5.15.0-206.153.7.el9uek.x86_64.img $tuned_initrd
options root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
grub_users $grub_users
grub_arg --unrestricted
grub_class ol

Now, still at the grub command line, run:

linux (hd0,gpt2)/vmlinuz-5.15.0-206.153.7.el9uek.x86_64 root=/dev/mapper/ocivolume-root ro crashkernel=1G-4G:192M,4G-64G:256M,64G-:512M LANG=en_US.UTF-8 console=tty0 console=ttyS0,115200 rd.luks=0 rd.md=0 rd.dm=0 rd.lvm.vg=ocivolume rd.lvm.lv=ocivolume/root rd.net.timeout.dhcp=10 rd.net.timeout.carrier=5 netroot=iscsi:169.254.0.2:::1:iqn.2015-02.oracle.boot:uefi rd.iscsi.param=node.session.timeo.replacement_timeout=6000 net.ifnames=1 nvme_core.shutdown_timeout=10 ipmi_si.tryacpi=0 ipmi_si.trydmi=0 libiscsi.debug_libiscsi_eh=1 loglevel=4 crash_kexec_post_notifiers
initrd (hd0,gpt2)/initramfs-5.15.0-206.153.7.el9uek.x86_64.img
boot

where:
kernel = kernel from the config
options for the kernel = options from the config
initrd = initrd from the config
IMPORTANT: when copying and pasting, VERIFY that the linux string is a single line. If the buffer contains newlines or carriage returns, the options after them will NOT be applied.
So once you have copied the full linux string, paste it into a file to verify that it is a single line.
Do not forget that paths are relative to the partition that holds /boot (or to the /boot partition itself).
If your /boot is on the root partition, you will need to find the disk with the root partition, and your paths will be something like
(lvm/volume-root)/boot/

When the system has booted, run:
grub2-mkconfig > /boot/grub2/grub.cfg
grub2-mkconfig > /boot/efi/EFI/redhat/grub.cfg
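
The mapping described above (kernel, options and initrd taken from the loader entry) can be sketched as a small shell helper. This is only an illustration, assuming you have a readable copy of the entry file, for example on another machine or in a rescue shell. `bls_to_grub_cmds` is a hypothetical name, not part of grub or grubby. It prints the linux command as one line, which is the property the IMPORTANT note above asks you to verify.

```shell
# Print the grub commands for one loader entry file.
#   $1 = path to a /boot/loader/entries/*.conf file
#   $2 = grub device that holds /boot, e.g. '(hd0,gpt2)'
bls_to_grub_cmds() {
    local conf="$1" dev="$2" kernel initrd options
    # Each entry line is "<key> <value>"; keep everything after the key.
    kernel=$(sed -n 's/^linux[[:space:]]\{1,\}//p' "$conf" | head -n 1)
    # Keep only the first initrd path; extras such as $tuned_initrd are dropped.
    initrd=$(sed -n 's/^initrd[[:space:]]\{1,\}//p' "$conf" | head -n 1 | awk '{print $1}')
    options=$(sed -n 's/^options[[:space:]]\{1,\}//p' "$conf" | head -n 1)
    # Refuse to print a half-built command set.
    [ -n "$kernel" ] && [ -n "$initrd" ] || return 1
    printf 'linux %s%s %s\n' "$dev" "$kernel" "$options"
    printf 'initrd %s%s\n' "$dev" "$initrd"
    printf 'boot\n'
}
```

Calling `bls_to_grub_cmds <entry>.conf '(hd0,gpt2)'` prints three lines to type at the grub prompt. Paths are prefixed with the device exactly as given, so if /boot lives on the root filesystem the device argument would need to include the /boot prefix, as in the (lvm/volume-root)/boot example above.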

@aburmash

@robertm98 the problem is that on OL9 the grub2 config was switched to a parent config in /boot/efi/EFI/redhat/grub.cfg, which in turn loads the proper config, /boot/grub2/grub.cfg.

For CERTAIN contents of /boot/efi/EFI/redhat/grub.cfg, the fix that was applied for leapp in-place upgrades, instead of correctly updating the configs (or not touching them), writes /boot/efi/EFI/redhat/grub.cfg into /boot/grub2/grub.cfg, so the parent config ends up loading a copy of itself and the system chainloops.
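
One rough way to spot this state from a running or rescue system is to test whether /boot/grub2/grub.cfg looks like a parent stub rather than a full generated config. The sketch below is a heuristic under stated assumptions, not an official check: it assumes a stub hands over with a `configfile` command and that full grub2-mkconfig output carries `### BEGIN /etc/grub.d/...` section markers. `looks_like_stub_cfg` is a hypothetical name.

```shell
# Return 0 if the file looks like a small parent stub, 1 if it looks like a
# full generated config, 2 if it cannot be read. Heuristic only.
#   $1 = path to a grub.cfg file
looks_like_stub_cfg() {
    local cfg="$1"
    [ -r "$cfg" ] || return 2
    # Assumption: generated configs contain "### BEGIN /etc/grub.d/..." markers.
    if grep -q '^### BEGIN /etc/grub.d/' "$cfg"; then
        return 1
    fi
    # Assumption: a stub passes control on with a "configfile" command.
    grep -q '^[[:space:]]*configfile[[:space:]]' "$cfg"
}
```

If `looks_like_stub_cfg /boot/grub2/grub.cfg` succeeds, the file is probably a copy of the parent config and is worth regenerating before the next reboot.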

@m45733r

m45733r commented Jun 25, 2024

Thanks for the instructions. Some remarks from my experience:
Running grubx64.efi after grub.cfg was deleted did not automatically put me into the grub cmdline; it got stuck and I needed to power-cycle the machine.
ls (hd0,gpt1) only shows "Filesystem is fat" or "Filesystem is xfs", not the actual contents.
However, ls (hd0,gpt2)/loader/entries would only succeed on the right disk and list its contents, and show not found on all others.

boot was successful, but after login + grub2-mkconfig + reboot it would return to grub cmdline again :/
Reading your latest comment I tried mkconfig to /boot/efi/EFI/redhat/grub.cfg and it seems to work now!

@aburmash

aburmash commented Jun 25, 2024

ls (hd0,gpt1)

Yeah, you need a trailing slash to display the contents:
ls (hd0,gpt1)/

boot was successful, but after login + grub2-mkconfig + reboot it would return to grub cmdline again :/

OH! yes, that is because /boot/efi/EFI/redhat/grub.cfg was removed from UEFI shell during recovery.
I've updated my post to reflect that.

@robertm98
Author

Thank you.

@m45733r

m45733r commented Jun 25, 2024

I'm not sure if this is related to the original issue, but the only thing that is a bit weird now is what grubby shows:

[root@ol9-machine ~]# grubby --default-kernel
/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# grubby --default-index
3
[root@ol9-machine ~]# grubby --info DEFAULT
index=3
kernel="/boot/vmlinuz-5.15.0-207.156.6.el9uek.x86_64"
args="ro rd.lvm.lv=ol/root rhgb quiet crashkernel=1G-64G:448M,64G-:512M $tuned_params"
root="/dev/mapper/ol-root"
initrd="/boot/initramfs-5.15.0-207.156.6.el9uek.x86_64.img $tuned_initrd"
title="Oracle Linux Server (5.15.0-207.156.6.el9uek.x86_64 with Unbreakable Enterprise Kernel) 9.4"
id="bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64"

And yet, when I reboot, it automatically selects index 0, with a kernel that is no longer present in /boot.
So the system is usable but wouldn't survive an automated reboot. See the attached screenshot.

[root@ol9-machine ~]# uname -r
5.15.0-207.156.6.el9uek.x86_64
[root@ol9-machine ~]# dnf list installed | grep kernel
kernel.x86_64                         5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-core.x86_64                    5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules.x86_64                 5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-modules-core.x86_64            5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools.x86_64                   5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-tools-libs.x86_64              5.14.0-427.22.1.el9_4               @ol9_baseos_latest
kernel-uek.x86_64                     5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-core.x86_64                5.15.0-207.156.6.el9uek             @ol9_UEKR7
kernel-uek-modules.x86_64             5.15.0-207.156.6.el9uek             @ol9_UEKR7

Any help appreciated.

image

@aburmash

Can you please show the output of:
for x in $(find /boot |grep grubenv); do echo $x; cat $x; done

cat /boot/efi/EFI/redhat/grub.cfg |grep grubenv
cat /boot/grub2/grub.cfg |grep grubenv
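
One thing this output lets you check is whether the saved_entry recorded in grubenv still has a matching loader entry file. A minimal sketch, assuming saved_entry holds an entry id of the form <machine-id>-<kernel version> (it can also hold an index or a title, which this sketch does not handle); `check_saved_entry` is a hypothetical name.

```shell
# Report whether the saved_entry from a grubenv file has a matching
# loader entry file.
#   $1 = grubenv file, $2 = loader entries directory
check_saved_entry() {
    local grubenv="$1" entries="$2" saved
    saved=$(sed -n 's/^saved_entry=//p' "$grubenv" | head -n 1)
    if [ -z "$saved" ]; then
        echo "no saved_entry in $grubenv"
        return 2
    fi
    # Assumption: the entry file is named "<saved_entry>.conf".
    if [ -f "$entries/$saved.conf" ]; then
        echo "ok: $saved"
    else
        echo "missing: $saved"
        return 1
    fi
}
```

For example, `check_saved_entry /boot/grub2/grubenv /boot/loader/entries`. An "ok" result only means the default entry exists; it does not rule out other entries in the directory confusing the menu order.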

@m45733r

m45733r commented Jun 25, 2024

Sure, here you go:

/boot/grub2/grubenv
# GRUB Environment Block
# WARNING: Do not edit this file by tools other than grub-editenv!!!
saved_entry=bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64
boot_success=1
boot_indeterminate=0
##################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################################

/boot/efi/EFI/redhat/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

/boot/grub2/grub.cfg

if [ -f ${config_directory}/grubenv ]; then
  load_env -f ${config_directory}/grubenv
elif [ -s $prefix/grubenv ]; then
# The kernelopts variable should be defined in the grubenv file. But to ensure that menu
# without a grubenv file, define a fallback kernelopts variable if this has not been set.
# The kernelopts variable in the grubenv file can be modified using the grubby tool or by
# the kernelopts variable in the grubenv file and the fallback kernelopts variable.

@aburmash

OK, everything above looks correct.
Now please show the output of:
ls /boot/loader/entries/

It seems you have some redundant entries there.

@m45733r

m45733r commented Jun 25, 2024

[root@ol9-machine grub2]# ls -al /boot/loader/entries/
total 28
drwx------. 2 root root 4096 Jun 25 13:34 .
drwxr-xr-x. 3 root root   21 Oct 17  2022 ..
-rw-r--r--. 1 root root  440 May 22 13:59 495620e0609f491080cb4e769e86283d-0-rescue.conf
-rw-r--r--. 1 root root  381 May 22 13:59 495620e0609f491080cb4e769e86283d-5.14.0-284.30.1.el9_2.x86_64.conf
-rw-r--r--. 1 root root  428 May 22 13:59 495620e0609f491080cb4e769e86283d-5.15.0-200.131.27.el9uek.x86_64.conf
-rw-r--r--. 1 root root  405 May 22 13:59 bda9a182a36740ada28baaa218d5c09d-0-rescue.conf
-rw-r--r--. 1 root root  381 Jun 25 10:18 bda9a182a36740ada28baaa218d5c09d-5.14.0-427.22.1.el9_4.x86_64.conf
-rw-r--r--. 1 root root  424 Jun 25 10:19 bda9a182a36740ada28baaa218d5c09d-5.15.0-207.156.6.el9uek.x86_64.conf

Oh, here's the problem. Sorry for bothering you, but thanks for pointing me in the right direction. It looks like some script or person regenerated the machine-id a few weeks ago...
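
The mismatch visible in the listing above (entries prefixed with an old machine-id next to entries prefixed with the current one) can be found with a short helper. This is a sketch that only lists candidates and deletes nothing; `list_foreign_bls_entries` is a hypothetical name, and it assumes entry files are named <machine-id>-<version>.conf as in the listing.

```shell
# List loader entry files whose name does not start with the current
# machine-id.
#   $1 = machine-id file, $2 = loader entries directory
list_foreign_bls_entries() {
    local id_file="$1" entries="$2" id f name
    id=$(head -n 1 "$id_file" | tr -d '[:space:]')
    [ -n "$id" ] || return 2
    for f in "$entries"/*.conf; do
        [ -e "$f" ] || continue          # directory has no .conf files
        name=$(basename "$f")
        case "$name" in
            "$id"-*) ;;                  # belongs to the current machine-id
            *) printf '%s\n' "$name" ;;  # left over from another machine-id
        esac
    done
}
```

For example, `list_foreign_bls_entries /etc/machine-id /boot/loader/entries`. Review the list before removing anything, since it also includes rescue entries created under the old id.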

@aburmash

For everyone tracking this issue:
A grub2 update that does NOT contain the scriptlet bug and, at the same time, resolves the issue for people who installed the broken package but did not reboot, has been published to the public repositories:

version is 2.06-80.0.3.el9_4
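
To check whether a machine already has at least this version before rebooting, a comparison along these lines may help. It is a sketch that relies on GNU `sort -V`, which approximates but is not identical to rpm's own version comparison; `version_at_least` is a hypothetical name.

```shell
# Return 0 if version $1 is the same as or newer than version $2.
# Assumption: GNU "sort -V" ordering is close enough to rpm's ordering
# for versions of the same package and release stream.
version_at_least() {
    local have="$1" want="$2" lowest
    lowest=$(printf '%s\n%s\n' "$have" "$want" | sort -V | head -n 1)
    [ "$lowest" = "$want" ]
}
```

A possible use, assuming the grub2-common package is installed and carries the same version as the other grub2 packages: `version_at_least "$(rpm -q --qf '%{VERSION}-%{RELEASE}' grub2-common)" '2.06-80.0.3.el9_4' && echo fixed`.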
