Configure oom-killer to panic when system is out of memory #2988

Merged
merged 1 commit into from
Jun 11, 2019

SuvarnaMeenakshi
Contributor

…memory

- What I did
Currently, when the system is under memory pressure, the OOM killer kicks in and kills a process. Killing a process can leave the device unhealthy, which leads to traffic being blackholed.
To avoid this, configure the kernel to panic on OOM, which causes the device to reboot and come back up healthy.

- How I did it
Added the sysctl variable vm.panic_on_oom and set its value to 2.
With a value of 2, the kernel always panics on an OOM condition instead of letting the OOM killer kill a process.
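
As a rough illustration of the setting described above, the sketch below writes a sysctl drop-in containing vm.panic_on_oom = 2. The helper name and the drop-in file name are hypothetical and are not taken from this PR; the actual change may place the setting in a different file.

```shell
#!/bin/sh
# Sketch only: the helper name and drop-in file name are hypothetical,
# not the file touched by this PR.

# Write a sysctl drop-in that turns every OOM condition into a kernel panic.
# $1: directory to write into (defaults to /etc/sysctl.d, which needs root).
write_panic_on_oom_conf() {
    conf_dir="${1:-/etc/sysctl.d}"
    conf_file="$conf_dir/90-panic-on-oom.conf"
    # 0 = kill a task, 1 = panic except for cpuset/mempolicy-constrained OOMs,
    # 2 = always panic
    printf 'vm.panic_on_oom = 2\n' > "$conf_file" || return 1
    # Print the path that was written so callers can log it.
    printf '%s\n' "$conf_file"
}

# On a device, as root:
#   write_panic_on_oom_conf          # persist the setting across reboots
#   sysctl -w vm.panic_on_oom=2      # apply it to the running kernel
#   sysctl vm.panic_on_oom           # read the current value back
```

Passing a scratch directory as the first argument allows a dry run without touching /etc. The reboot after the panic depends on the separate kernel.panic timeout, which this sketch does not set.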

- How to verify it

  1. Configure a remote syslog server for rsyslogd by adding *.* @<server-ip>:514 to /etc/rsyslog.d/99-default.conf.
    The server IP can be a dummy address; it is only needed to check that the device sends the packets.

  2. Start capturing packets on eth0 to check whether rsyslogd sends out the panic message:
    sudo tcpdump -i eth0 -nv port 514

  3. To manually trigger OOM, use:
    sudo chmod 777 /proc/sysrq-trigger
    sudo echo f > /proc/sysrq-trigger

  4. Output on screen:
    admin@str-s6000-acs-9:~$ sudo echo f > /proc/sysrq-trigger
    [ 910.064760] SysRq : Manual OOM execution
    [ 910.113950] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled
    [ 910.113950]
    [ 910.229545] CPU: 1 PID: 243 Comm: kworker/1:2 Tainted: G C O 3.16.0-6-amd64 #1 Debian 3.16.57-2
    [ 910.343096] Hardware name: Dell Inc S6000-ACS/S6000 CPU, BIOS 4.6.5 10/12/2015
    [ 910.429527] Workqueue: events moom_callback
    [ 910.479643] 0000000000000000 ffffffff81534db1 ffffffff8172be08 ffff880233967da0
    [ 910.568560] ffffffff81533408 0000000000000010 ffff880233967db0 ffff880233967d48
    [ 910.657480] ffff8802171705d0 ffffffff817304e3 0000000000000007 0000000000000006
    [ 910.746400] Call Trace:
    [ 910.775637] [] ? dump_stack+0x5d/0x78
    [ 910.839178] [] ? panic+0xc6/0x21d
    [ 910.898565] [] ? check_panic_on_oom+0x54/0x60
    [ 910.970425] [] ? out_of_memory+0x192/0x4f0
    [ 911.039174] [] ? __switch_to+0x14a/0x610
    [ 911.105835] [] ? process_one_work+0x14c/0x470
    [ 911.177701] [] ? worker_thread+0x6b/0x540
    [ 911.245400] [] ? __schedule+0x284/0x740
    [ 911.311026] [] ? rescuer_thread+0x2d0/0x2d0
    [ 911.380811] [] ? kthread+0xd1/0xf0
    [ 911.441234] [] ? do_exit+0x847/0xac0
    [ 911.503739] [] ? kthread_create_on_node+0x180/0x180
    [ 911.581844] [] ? ret_from_fork+0x58/0xa0
    [ 911.648505] [] ? kthread_create_on_node+0x180/0x180
    [ 911.726619] Kernel Offset: 0x0 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffff9fffffff)
    [ 912.826436] Rebooting in 10 seconds..
    [ 922.788324] ACPI MEMORY or I/O RESET_REG.
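
The manual trigger in the steps above can be wrapped in a small guard so that the OOM is only forced once the sysctl is confirmed to be 2. This is a minimal sketch with hypothetical helper names. Running it as root avoids the chmod 777 step, which is only needed because in `sudo echo f > file` the redirection is performed by the unprivileged calling shell; `echo f | sudo tee /proc/sysrq-trigger` is another way around that.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical. Both helpers take an optional
# path so they can be dry-run against ordinary files; the defaults are the
# real procfs entries.

# Succeeds only if the panic_on_oom value is 2 (compulsory panic).
panic_on_oom_is_compulsory() {
    sysctl_file="${1:-/proc/sys/vm/panic_on_oom}"
    [ "$(cat "$sysctl_file" 2>/dev/null)" = "2" ]
}

# Writes 'f' to the SysRq trigger, which asks the kernel to run the OOM killer.
trigger_manual_oom() {
    trigger_file="${1:-/proc/sysrq-trigger}"
    printf 'f\n' > "$trigger_file"
}

# On a test device, as root (this panics and reboots the box):
#   panic_on_oom_is_compulsory && trigger_manual_oom
```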

TCPDUMP output:
Message from syslogd@str-s6000-acs-9 at Jun 10 23:00:23 ...
kernel:[ 678.496435] Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled

Message from syslogd@str-s6000-acs-9 at Jun 10 23:00:23 ...
kernel:[ 678.496435]
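
To check a saved capture for the forwarded panic message rather than reading it by eye, something like the following could be used. It is a sketch under assumptions: the helper name and capture file are hypothetical, and it assumes the capture was saved as text (for example with tcpdump's -A option) so that the message appears verbatim.

```shell
#!/bin/sh
# Sketch only: saw_oom_panic and the capture file are hypothetical; the file
# is assumed to be a text capture, e.g. saved tcpdump -A output.

PANIC_MSG='Kernel panic - not syncing: Out of memory: compulsory panic_on_oom is enabled'

# Succeeds if the capture file contains the panic message as a fixed string.
saw_oom_panic() {
    capture_file="$1"
    grep -qF -- "$PANIC_MSG" "$capture_file"
}

# On the machine doing the capture:
#   sudo tcpdump -i eth0 -n -l -A port 514 > capture.txt   # stop with Ctrl-C after the reboot
#   saw_oom_panic capture.txt && echo "panic message was forwarded"
```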

- Description for the changelog

- A picture of a cute animal (not mandatory but encouraged)

Contributor

@jleveque jleveque left a comment


@SuvarnaMeenakshi: The title was truncated and the word memory was added to the start of the description. GitHub truncates titles to a suggested max length when a PR is created. In this case, since only the word memory was cut off, I suggest adding it back. You can click the "Edit" button to the right of the title and modify the title. When editing, you can make the title longer than the recommended max length (however, if the title is super-long, consider rewording it to make it more concise). You can also edit your description to remove …memory from the beginning.

@lguohan
Collaborator

lguohan commented Jun 11, 2019

retest vs please

Contributor

@pavel-shirshov pavel-shirshov left a comment


Please check Joe's comments

@SuvarnaMeenakshi SuvarnaMeenakshi changed the title from "Configure kernel oom-killer to panic when the system is truly out of …" to "Configure oom-killer to panic when system is out of memory" on Jun 11, 2019
@SuvarnaMeenakshi
Contributor Author

> @SuvarnaMeenakshi: The title was truncated and the word memory was added to the start of the description. GitHub truncates titles to a suggested max length when a PR is created. In this case, since only the word memory was cut off, I suggest adding it back. You can click the "Edit" button to the right of the title and modify the title. When editing, you can make the title longer than the recommended max length (however, if the title is super-long, consider rewording it to make it more concise). You can also edit your description to remove …memory from the beginning.

Thank you, Updated as per comment.

@lguohan lguohan merged commit 0f665bd into sonic-net:master Jun 11, 2019
lguohan pushed a commit that referenced this pull request Jun 12, 2019
…f memory (#2988)

yxieca pushed a commit that referenced this pull request Jun 13, 2019
…f memory (#2988)

dgsudharsan added a commit to dgsudharsan/sonic-buildimage that referenced this pull request Apr 5, 2024
…commits

* c96a2f84 - Revert "[acl] Add IN_PORTS qualifier for L3 table (sonic-net#3078)" (sonic-net#3092) (6 days ago) [Neetha John]
* 80e0b57d - [Copp]Refactor coppmgr tests (sonic-net#3093) (8 days ago) [Sudharsan Dhamal Gopalarathnam]
* a4647299 - [portsorch] process only updated APP_DB fields when port is already   created (sonic-net#3025) (10 days ago) [Stepan Blyshchak]
* 91bacca5 - [buffermgrd] Move switch-statement outside of if-statement in BufferMgr::doTask (sonic-net#3055) (2 weeks ago) [Amir]
* 04912ad0 - [bulker] add support for neighbor bulking (sonic-net#2768) (2 weeks ago) [Nikola Dancejic]
* 9d4a3add - [acl] Add IN_PORTS qualifier for L3 table (sonic-net#3078) (2 weeks ago) [Neetha John]
* a13e081f - [Mellanox] Fix inconsistence in the shared headroom pool initialization (sonic-net#3057) (3 weeks ago) [Stephen Sun]
* ff2b2b85 - Add basic fabric link monitoring counters and states handling. (sonic-net#2988) (3 weeks ago) [jfeng-arista]
* 0c620910 - Add port flap count and last flap timestamp to APPL_DB (sonic-net#3052) (3 weeks ago) [Prince George]
* e9931f31 - [EVPN] Skip EVPN routes with invalid VNI or router mac field (sonic-net#3073) (3 weeks ago) [Lior Avramov]
* 600d5e80 - Set HOST_TX_READY_NOTIFY attribute only after query capabilities(sonic-net#3070) (3 weeks ago) [noaOrMlnx]
liat-grozovik pushed a commit that referenced this pull request Apr 7, 2024
…commits (#18576)
