add kernel coredump and analysis on sonic kernel #3276

Closed
wants to merge 10 commits into from

@sun-siyuan sun-siyuan commented Aug 2, 2019

Summary: add kdump package to j2 template and config after

This PR adds kdump to capture kernel crash dumps for further analysis with the crash tool.

This PR contains two parts:

1. Install the kdump tool chain in the host environment.
2. Configure kdump both at boot time (via grub.cfg) and at the system level.
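
As a rough illustration of the second part, the boot-time side of kdump comes down to passing a `crashkernel=` reservation on the kernel command line. The sketch below is not this PR's actual grub.cfg change, which is not shown in the conversation; `with_crashkernel` is a hypothetical helper, and the 256M default is taken from the size discussed later in this thread.

```python
def with_crashkernel(cmdline: str, size: str = "256M") -> str:
    """Return cmdline with exactly one crashkernel=<size> argument.

    Hypothetical helper for illustration only; it does not edit grub.cfg.
    """
    # Drop any existing crashkernel= argument so repeated calls are idempotent.
    args = [a for a in cmdline.split() if not a.startswith("crashkernel=")]
    args.append(f"crashkernel={size}")
    return " ".join(args)
```

Replacing an existing argument, rather than appending a second one, keeps the resulting command line unambiguous when the size is changed later.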

Tests done:

Tested the build process: built sonic-broadcom.bin and sonic-aboot-broadcom.swi.

-rwxr-xr-x 1 sun sun 562086493 Jul 28 21:28 sonic-aboot-broadcom.swi
-rw-r--r-- 1 sun sun 215882 Jul 28 21:28 sonic-aboot-broadcom.swi.log
-rwxr-xr-x 1 sun sun 569867542 Jul 28 00:52 sonic-broadcom.bin
-rw-r--r-- 1 sun sun 285585 Jul 28 00:52 sonic-broadcom.bin.log

Tested the image: loaded sonic-broadcom.bin onto a switch.


Signed-off-by: siyuan sun <siyuan.sun@alibaba-inc.com>

@msftclas

msftclas commented Aug 2, 2019

CLA assistant check
All CLA requirements met.

@lguohan
Collaborator

lguohan commented Aug 2, 2019

can you describe the test you have done?

@sun-siyuan
Author

sun-siyuan commented Aug 2, 2019 via email

Author

@sun-siyuan sun-siyuan left a comment

> can you describe the test you have done?

I just added a test section to the commit comments.

files/image_config/platform/rc.local (outdated review thread, resolved)
@pavel-shirshov
Contributor

Can you please share console output after you issued commands

echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger

Thanks
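
Before issuing these commands it can help to confirm that a crash kernel is actually loaded, since otherwise the triggered crash would not be captured. A minimal sketch, assuming the standard Linux `/proc/cmdline` and `/sys/kernel/kexec_crash_loaded` interfaces; the function names are hypothetical and not part of this PR.

```python
from pathlib import Path


def kdump_armed(cmdline: str, crash_loaded: str) -> bool:
    """True if memory is reserved for a crash kernel and one is loaded."""
    has_reservation = any(
        arg.startswith("crashkernel=") for arg in cmdline.split()
    )
    # kexec_crash_loaded reads "1" when a crash kernel has been loaded.
    return has_reservation and crash_loaded.strip() == "1"


def check_host() -> bool:
    """Read the live values from the running system (Linux only)."""
    cmdline = Path("/proc/cmdline").read_text()
    crash_loaded = Path("/sys/kernel/kexec_crash_loaded").read_text()
    return kdump_armed(cmdline, crash_loaded)
```

Only `check_host()` touches the system; `kdump_armed()` works on plain strings, so it can be exercised without crashing anything.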

@sun-siyuan
Author

> Can you please share console output after you issued commands
>
> echo 1 > /proc/sys/kernel/sysrq
> echo c > /proc/sysrq-trigger
>
> Thanks

Sure, please see below:

[1568872.292681] sysrq: SysRq : Trigger a crash
[1568872.360135] BUG: unable to handle kernel NULL pointer dereference at (null)
[1568872.361626] IP: [] sysrq_handle_crash+0x12/0x20
[1568872.383586] PGD 8000000154beb067 [1568872.384199] PUD 154b30067
PMD 0 [1568872.384888]
[1568872.385263] Oops: 0002 [#1] SMP
[1568872.385880] Modules linked in: fuse veth dummy iptable_raw xt_limit xt_tcpudp xt_conntrack bridge stp llc nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo bf_tun(O) crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon virtio_console evdev ip6table_filter serio_raw button ip6_tables iptable_filter iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack pcspkr iptable_mangle ip_tables x_tables autofs4 loop ext4 crc16 jbd2 crc32c_generic fscrypto ecb mbcache nls_utf8 nls_cp437 nls_ascii vfat fat overlay squashfs ata_generic virtio_blk virtio_net crc32c_intel aesni_intel aes_x86_64 glue_helper lrw gf128mul ablk_helper cryptd psmouse ata_piix ehci_pci uhci_hcd libata ehci_hcd usbcore usb_common virtio_pci scsi_mod virtio_ring virtio i2c_piix4 floppy
[1568872.400602] CPU: 0 PID: 30349 Comm: bash Tainted: G O 4.9.0-8-2-amd64 #1 Debian 4.9.110-3+deb9u6
[1568872.402364] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[1568872.411005] task: ffff97f79a67c000 task.stack: ffffb02508d3c000
[1568872.412108] RIP: 0010:[] [] sysrq_handle_crash+0x12/0x20
[1568872.413698] RSP: 0018:ffffb02508d3fe78 EFLAGS: 00010282
[1568872.414682] RAX: ffffffffba8287f0 RBX: 0000000000000063 RCX: 0000000000000000
[1568872.415993] RDX: 0000000000000000 RSI: ffff97f87fc10648 RDI: 0000000000000063
[1568872.417315] RBP: ffffffffbb0bf2e0 R08: 0000000000000001 R09: 0000000000009868
[1568872.418630] R10: 0000000000000001 R11: 0000000000000001 R12: 0000000000000004
[1568872.419951] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[1568872.421274] FS: 00007f30474ee700(0000) GS:ffff97f87fc00000(0000) knlGS:0000000000000000
[1568872.422753] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[1568872.423824] CR2: 0000000000000000 CR3: 000000014fc6c000 CR4: 0000000000160670
[1568872.425148] Stack:
[1568872.425571] ffffffffba828f37 0000000000000002 fffffffffffffffb ffffb02508d3ff08
[1568872.427071] 0000000001d1f008 ffffffffba82937b ffff97f86fad03c0 ffffffffba677b00
[1568872.428574] 0000000000000002 ffff97f874433200 ffffffffba608490 ffff97f874433200
[1568872.433765] Call Trace:
[1568872.434268] [] ? __handle_sysrq+0xf7/0x150
[1568872.435332] [] ? write_sysrq_trigger+0x2b/0x30
[1568872.450405] [] ? proc_reg_write+0x40/0x70
[1568872.451539] [] ? vfs_write+0xb0/0x190
[1568872.452532] [] ? SyS_write+0x52/0xc0
[1568872.453722] [] ? do_syscall_64+0x8d/0xf0
[1568872.457141] [] ? entry_SYSCALL_64_after_swapgs+0x58/0xc6
[1568872.458409] Code: 41 5c 41 5d 41 5e 41 5f e9 bc 1f cf ff 66 2e 0f 1f 84 00 00 00 00 00 66 90 0f 1f 44 00 00 c7 05 89 94 a8 00 01 00 00 00 0f ae f8 04 25 00 00 00 00 01 c3 0f 1f 44 00 00 0f 1f 44 00 00 53 8d
[1568872.464225] RIP [] sysrq_handle_crash+0x12/0x20
[1568872.465409] RSP
[1568872.466083] CR2: 0000000000000000
[ 0.000000] do_IRQ: 0.113 No irq handler for vector
[ 4.321878] kdump-tools[323]: Starting kdump-tools: running makedumpfile -c -d 31 /proc/vmcore /var/crash/201908022135/dump-incomplete.
Copying data : [100.0 %]
[ 22.021250] kdump-tools[323]: The kernel version is not supported.
[ 22.023335] kdump-tools[323]: The makedumpfile operation may be incomplete.
[ 22.025515] kdump-tools[323]: The dumpfile is saved to /var/crash/201908022135/dump-incomplete.
[ 22.032202] kdump-tools[323]: makedumpfile Completed.
[ 22.036659] kdump-tools[323]: kdump-tools: saved vmcore in /var/crash/201908022135.
[ 22.432118] kdump-tools[323]: running makedumpfile --dump-dmesg /proc/vmcore /var/crash/201908022135/dmesg.201908022135.
[ 22.436076] kdump-tools[323]: The kernel version is not supported.
[ 22.440348] kdump-tools[323]: The makedumpfile operation may be incomplete.
[ 22.443098] kdump-tools[323]: The dmesg log is saved to /var/crash/201908022135/dmesg.201908022135.
[ 22.445600] kdump-tools[323]: makedumpfile Completed.
[ 22.447797] kdump-tools[323]: kdump-tools: saved dmesg content in /var/crash/201908022135.
[ 22.498068] kdump-tools[323]: Fri, 02 Aug 2019 21:35:39 +0000
[ 22.511355] kdump-tools[323]: logname: no login name
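
The kdump-tools lines at the end of this output are the part that confirms the dump was captured. Below is a small sketch for pulling that information out of a console log; `summarize_kdump_log` is a hypothetical helper, and its patterns are based only on the message wording seen in the log above.

```python
import re
from typing import Iterable, Optional

# Matches e.g. "kdump-tools: saved vmcore in /var/crash/201908022135."
VMCORE_RE = re.compile(r"kdump-tools: saved vmcore in (\S+?)\.?\s*$")
UNSUPPORTED = "The kernel version is not supported."


def summarize_kdump_log(lines: Iterable[str]) -> dict:
    """Collect the vmcore directory and count makedumpfile version warnings."""
    vmcore_dir: Optional[str] = None
    warnings = 0
    for line in lines:
        match = VMCORE_RE.search(line)
        if match:
            vmcore_dir = match.group(1)
        if UNSUPPORTED in line:
            warnings += 1
    return {"vmcore_dir": vmcore_dir, "unsupported_kernel_warnings": warnings}
```

A non-zero warning count, as in this log, suggests checking that the dump opens in the crash tool before relying on it.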

@pavel-shirshov
Contributor

@sun-siyuan
Thank you for the output.
But did the system reboot itself after the crash?

Thanks

@sun-siyuan
Author

sun-siyuan commented Aug 6, 2019 via email

Contributor

@pavel-shirshov pavel-shirshov left a comment

Thank you for the patch.
This patch looks good to me except for two things:

  1. 256M of memory would be reserved for kdump.
  2. Reboot would take longer in case of a kernel crash.

We probably need to make this feature a build option in sonic-buildimage.
But if @lguohan is OK with it, we can keep it as it is. I'm ready to approve it.
Thanks

@lguohan
Collaborator

lguohan commented Aug 7, 2019

how do we decide to reserve 256M?

https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide

I also agree with pavel, we should make it optional.

Contributor

@pavel-shirshov pavel-shirshov left a comment

Can you please make this feature optional with default no?
See comments from @lguohan

@sun-siyuan
Author

> how do we decide to reserve 256M?
>
> https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/kernel_administration_guide/kernel_crash_dump_guide
>
> I also agree with pavel, we should make it optional.

We were using 768M per the Red Hat recommendation, but considered that a waste of running memory, so we tested with 256M, which works fine.

I just updated the PR to make it optional.
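
To put the two sizes side by side, the following sketch converts a `crashkernel`-style size string to bytes and expresses it as a fraction of total memory. `parse_size` and `reserved_fraction` are hypothetical helpers, only the K/M/G suffixes are handled, and the 4 GiB total in the usage note is an assumed figure, not a number from this thread.

```python
_UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(size: str) -> int:
    """Convert a size such as '256M' to bytes (plain integers are bytes)."""
    size = size.strip().upper()
    if size and size[-1] in _UNITS:
        return int(size[:-1]) * _UNITS[size[-1]]
    return int(size)


def reserved_fraction(size: str, total_bytes: int) -> float:
    """Fraction of total memory that the reservation would take."""
    return parse_size(size) / total_bytes
```

For example, on an assumed 4 GiB system, `reserved_fraction("256M", 4 * 1024**3)` is 1/16 of memory, while 768M would be three times that.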

@sun-siyuan
Author

> Can you please make this feature optional with default no?
> See comments from @lguohan

Please check the new diff, which makes this optional and defaults to NO.

@sun-siyuan
Author

@pavel-shirshov please let me know whether you are ok with the new changes

@pavel-shirshov
Contributor

retest this please

@lguohan
Collaborator

lguohan commented Aug 15, 2019

besides the build option, I think we should have a command line to enable/disable this kernel crash dump feature.

like

config kdump enable --size=265M
config kdump disable
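
A minimal sketch of how the proposed command line could be parsed, using Python's standard `argparse`. This only illustrates the suggested interface: no such `config kdump` command exists in this PR, `build_parser` is a hypothetical name, and the 256M default is an assumption taken from the reservation size discussed above.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser for the suggested 'config kdump enable|disable' commands."""
    parser = argparse.ArgumentParser(prog="config")
    groups = parser.add_subparsers(dest="group", required=True)

    kdump = groups.add_parser("kdump", help="kernel crash dump settings")
    actions = kdump.add_subparsers(dest="action", required=True)

    enable = actions.add_parser("enable", help="reserve memory and enable kdump")
    # Assumed default; the thread settled on 256M for the build-time value.
    enable.add_argument("--size", default="256M", help="crashkernel size")

    actions.add_parser("disable", help="disable kdump")
    return parser
```

With this parser, `build_parser().parse_args(["kdump", "enable", "--size=265M"])` yields a namespace whose `action` is `"enable"` and whose `size` is `"265M"`; applying the setting to the bootloader is left out of the sketch.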

@sun-siyuan sun-siyuan force-pushed the kdump branch 2 times, most recently from c831d57 to 8d723c5 Compare August 20, 2019 00:47
@lguohan
Collaborator

lguohan commented Aug 29, 2019

is there any plan to update the PR based on the feedback?

@sun-siyuan
Author

sun-siyuan commented Sep 3, 2019 via email

@sun-siyuan sun-siyuan changed the title from "apply kdump supported package and config on fsroot" to "enable kernel coredump and analysis on sonic kernel" Sep 9, 2019
@sun-siyuan sun-siyuan changed the title from "enable kernel coredump and analysis on sonic kernel" to "add kernel coredump and analysis on sonic kernel" Sep 9, 2019
onie-mk-demo.sh (outdated review thread, resolved)
Collaborator

@qiluo-msft qiluo-msft left a comment

As comments

@sun-siyuan
Author

The vsimage build failed with a "no space left on device" error; please fix and trigger a retest.

WARNING: No swap limit support
WARNING: bridge-nf-call-iptables is disabled
WARNING: bridge-nf-call-ip6tables is disabled
Sending build context to Docker daemon 80.3MB

Step 1/18 : FROM debian:stretch
stretch: Pulling from library/debian
092586df9206: Pulling fs layer
092586df9206: Download complete
092586df9206: Pull complete
Digest: sha256:5fb93ce7a427b7c1c2374d5c29d68a159de7d5e781deeda422f8d51a1a9b6480
Status: Downloaded newer image for debian:stretch
---> cb15ecf641ad
Step 2/18 : RUN find /usr/share/doc -depth \( -type f -o -type l \) ! -name copyright | xargs rm || true
open /var/lib/docker/vfs/dir/3659c51f468f582034afd8032abac00a1480ab66ebeeaf2bfeef2030bc8e739a/usr/sbin/cppw: no space left on device
[ FAIL LOG END ] [ target/docker-base-stretch.gz ]
slave.mk:526: recipe for target 'target/docker-base-stretch.gz' failed
make: *** [target/docker-base-stretch.gz] Error 1
Makefile.work:191: recipe for target 'target/sonic-vs.img.gz' failed
make[1]: *** [target/sonic-vs.img.gz] Error 2
make[1]: Leaving directory '/data/johnar/workspace/vs/buildimage-vs-image-pr@2'
Makefile:6: recipe for target 'target/sonic-vs.img.gz' failed
make: *** [target/sonic-vs.img.gz] Error 2

onie-mk-demo.sh (outdated review thread, resolved)
sun-siyuan and others added 8 commits October 24, 2019 18:09
Summary: add kdump package to j2 template and config after

Test Plan: test with new image, no issue observed

Reviewers: P604087

Subscribers: P604087

Differential Revision: https://aone.alibaba-inc.com/code/D891921
correct crashkernel size to 256M, error introduced by cherry-pick
fix SONIC_ENABLE_KDUMP
@sun-siyuan
Author

The test failed with the error below; a retest is needed.

Setting status of 451664d to FAILURE with url https://sonic-jenkins.westus2.cloudapp.azure.com/job/broadcom/job/buildimage-brcm-all-pr/1278/ and message: 'Build finished. No test results found.'
Using context: broadcom
hudson.remoting.ProxyException: org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
Perhaps you forgot to surround the code with a step that provides this, such as: node, dockerNode
at org.jenkinsci.plugins.workflow.steps.StepDescriptor.checkContextAvailability(StepDescriptor.java:266)
at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:263)
Caused: hudson.remoting.ProxyException: org.codehaus.groovy.runtime.InvokerInvocationException: org.jenkinsci.plugins.workflow.steps.MissingContextVariableException: Required context class hudson.FilePath is missing
Perhaps you forgot to surround the code with a step that provides this, such as: node, dockerNode
at org.jenkinsci.plugins.workflow.cps.CpsStepContext.replay(CpsStepContext.java:492)
at org.jenkinsci.plugins.workflow.cps.DSL.invokeStep(DSL.java:317)
at org.jenkinsci.plugins.workflow.cps.DSL.invokeDescribable(DSL.java:417)
at org.jenkinsci.plugins.workflow.cps.DSL.invokeMethod(DSL.java:182)

@sun-siyuan
Author

retest please

@sun-siyuan
Author

please retest

@sun-siyuan
Author

retest please

@sun-siyuan
Author

retest all please

@sun-siyuan
Author

retest this please

@lguohan
Collaborator

lguohan commented Mar 20, 2020

core dump added by broadcom

@lguohan lguohan closed this Mar 20, 2020
mssonicbld added a commit that referenced this pull request Jun 6, 2024
…atically (#19223)

#### Why I did it
src/sonic-utilities
```
* 735891cc - (HEAD -> 202311, origin/202311) [Mellanpx] Update SDK Sniffer default folder (#3276) (19 hours ago) [Dror Prital]
```
#### How I did it
#### How to verify it
#### Description for the changelog
7 participants