Configure oom-killer to panic when system is out of memory #2988
@SuvarnaMeenakshi: The title was truncated and the word "memory" was added to the start of the description. GitHub truncates titles to a suggested max length when a PR is created. In this case, since only the word "memory" was cut off, I suggest adding it back. You can click the "Edit" button to the right of the title and modify the title. When editing, you can make the title longer than the recommended max length (however, if the title is super-long, consider rewording it to make it more concise). You can also edit your description to remove "…memory" from the beginning.
retest vs please
Please check Joe's comments
Thank you, updated as per comment.
…f memory (#2988) - What I did: Currently, when the system is under memory pressure, the OOM killer kicks in and kills a rogue process. Killing that process can leave the device unhealthy, leading to blackholing of traffic. To avoid this, configure the kernel to panic on OOM, which causes the device to reboot and come back up healthy. - How I did it: Added the sysctl variable panic_on_oom and set its value to 2. Setting it to 2 ensures that an OOM condition always triggers a kernel panic.
…commits (#18576)
* c96a2f84 - Revert "[acl] Add IN_PORTS qualifier for L3 table (#3078)" (#3092) (6 days ago) [Neetha John]
* 80e0b57d - [Copp]Refactor coppmgr tests (#3093) (8 days ago) [Sudharsan Dhamal Gopalarathnam]
* a4647299 - [portsorch] process only updated APP_DB fields when port is already created (#3025) (10 days ago) [Stepan Blyshchak]
* 91bacca5 - [buffermgrd] Move switch-statement outside of if-statement in BufferMgr::doTask (#3055) (2 weeks ago) [Amir]
* 04912ad0 - [bulker] add support for neighbor bulking (#2768) (2 weeks ago) [Nikola Dancejic]
* 9d4a3add - [acl] Add IN_PORTS qualifier for L3 table (#3078) (2 weeks ago) [Neetha John]
* a13e081f - [Mellanox] Fix inconsistence in the shared headroom pool initialization (#3057) (3 weeks ago) [Stephen Sun]
* ff2b2b85 - Add basic fabric link monitoring counters and states handling. (#2988) (3 weeks ago) [jfeng-arista]
* 0c620910 - Add port flap count and last flap timestamp to APPL_DB (#3052) (3 weeks ago) [Prince George]
* e9931f31 - [EVPN] Skip EVPN routes with invalid VNI or router mac field (#3073) (3 weeks ago) [Lior Avramov]
* 600d5e80 - Set HOST_TX_READY_NOTIFY attribute only after query capabilities(#3070) (3 weeks ago) [noaOrMlnx]
…memory
- What I did
Currently, when the system is under memory pressure, the OOM killer kicks in and kills a rogue process. Killing that process can leave the device unhealthy, leading to blackholing of traffic.
To avoid this, configure the kernel to panic on OOM, which causes the device to reboot and come back up healthy.
- How I did it
Added the sysctl variable panic_on_oom and set its value to 2.
Setting it to 2 ensures that an OOM condition always triggers a kernel panic.
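To make the effect of the value concrete, here is a minimal sketch that reads the live setting and describes it. The helper name describe_panic_on_oom is hypothetical, and the descriptions paraphrase the kernel's sysctl documentation rather than anything in this change.

```shell
#!/bin/sh
# Hypothetical helper: map a vm.panic_on_oom value to its documented behaviour.
describe_panic_on_oom() {
    case "$1" in
        0) echo "kill a process (no panic)" ;;
        1) echo "panic, except for cpuset/mempolicy-constrained OOM" ;;
        2) echo "always panic" ;;
        *) echo "unknown value: $1" ;;
    esac
}

# Report the running kernel's setting when procfs exposes it.
if [ -r /proc/sys/vm/panic_on_oom ]; then
    current=$(cat /proc/sys/vm/panic_on_oom)
    echo "vm.panic_on_oom=$current: $(describe_panic_on_oom "$current")"
fi

# To apply the value at runtime (requires root), one could run:
#   sudo sysctl -w vm.panic_on_oom=2
```

On a device carrying this change, the report should show the value 2.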
- How to verify it
Add a forwarding rule for rsyslogd to /etc/rsyslog.d/99-default.conf in the form: *.* @<server-ip>:514
The server IP can be a dummy address; it is only needed to check whether the device sends the packets.
sudo tcpdump -i eth0 -nv port 514
This captures packets on eth0 to check whether rsyslogd sends out the panic message.
To manually trigger OOM, use:
sudo chmod 777 /proc/sysrq-trigger
sudo echo f > /proc/sysrq-trigger
Output on screen:
admin@str-s6000-acs-9:~$ sudo echo f > /proc/sysrq-trigger
[ 910.064760] SysRq : Manual OOM execution
[ 910.113950] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
[ 910.113950]
[ 910.229545] CPU: 1 PID: 243 Comm: kworker/1:2 Tainted: G C O 3.16.0-6-amd64 #1 Debian 3.16.57-2
[ 910.343096] Hardware name: Dell Inc S6000-ACS/S6000 CPU, BIOS 4.6.5 10/12/2015
[ 910.429527] Workqueue: events moom_callback
[ 910.479643] 0000000000000000 ffffffff81534db1 ffffffff8172be08 ffff880233967da0
[ 910.568560] ffffffff81533408 0000000000000010 ffff880233967db0 ffff880233967d48
[ 910.657480] ffff8802171705d0 ffffffff817304e3 0000000000000007 0000000000000006
[ 910.746400] Call Trace:
[ 910.775637] [] ? dump_stack+0x5d/0x78
[ 910.839178] [] ? panic+0xc6/0x21d
[ 910.898565] [] ? check_panic_on_oom+0x54/0x60
[ 910.970425] [] ? out_of_memory+0x192/0x4f0
[ 911.039174] [] ? __switch_to+0x14a/0x610
[ 911.105835] [] ? process_one_work+0x14c/0x470
[ 911.177701] [] ? worker_thread+0x6b/0x540
[ 911.245400] [] ? __schedule+0x284/0x740
[ 911.311026] [] ? rescuer_thread+0x2d0/0x2d0
[ 911.380811] [] ? kthread+0xd1/0xf0
[ 911.441234] [] ? do_exit+0x847/0xac0
[ 911.503739] [] ? kthread_create_on_node+0x180/0x180
[ 911.581844] [] ? ret_from_fork+0x58/0xa0
[ 911.648505] [] ? kthread_create_on_node+0x180/0x180
[ 911.726619] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
[ 912.826436] Rebooting in 10 seconds..
[ 922.788324] ACPI MEMORY or I/O RESET_REG.
TCPDUMP output:
Message from syslogd@str-s6000-acs-9 at Jun 10 23:00:23 ...
kernel:[ 678.496435] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
Message from syslogd@str-s6000-acs-9 at Jun 10 23:00:23 ...
kernel:[ 678.496435]
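As an alternative to the chmod 777 step used for the manual trigger, the write can be done by a privileged tee so the file mode of /proc/sysrq-trigger stays unchanged. This is a sketch, not the method used in this PR; the function name and the SUDO and SYSRQ_TRIGGER variables are illustrative assumptions. The block only defines the function and does not call it.

```shell
#!/bin/sh
# Illustrative variables (not from the PR); both can be overridden for a dry run.
SYSRQ_TRIGGER=${SYSRQ_TRIGGER:-/proc/sysrq-trigger}
SUDO=${SUDO-sudo}

# Hypothetical helper: write the 'f' SysRq command to invoke the OOM handler.
# tee runs under sudo, so the write itself is privileged and no chmod is needed.
trigger_manual_oom() {
    printf 'f\n' | $SUDO tee "$SYSRQ_TRIGGER" > /dev/null
}
```

Calling trigger_manual_oom on a device with panic_on_oom set to 2 is expected to panic and reboot it, so run it only on a test system.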
- Description for the changelog
- A picture of a cute animal (not mandatory but encouraged)