Compile jobs running out of memory on WCOSS-Dell #1072

Closed
kgerheiser opened this issue Mar 1, 2022 · 1 comment
Labels
bug Something isn't working

Comments

@kgerheiser
Contributor

Description

While integrating the WW3 CMake build into the UFS, @JessicaMeixner-NOAA hit an error on WCOSS-Dell: the compile job ran out of memory.

I think it's related to the issue discussed in #1032 (comment).

@junwang-noaa noted that the debug build worked. I think that's because the optimizations the compiler performs in a non-debug build use extra memory. It could be that we're already close to the limit, and adding new features/files pushes us over it.

To Reproduce:

See #1032

Output

Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff85dc254d>] oom_kill_process+0x2cd/0x490
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff85e416cc>] mem_cgroup_oom_synchronize+0x55c/0x590
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff85e40b30>] ? mem_cgroup_charge_common+0xc0/0xc0
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff85dc2e34>] pagefault_out_of_memory+0x14/0x90
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff8637cb15>] mm_fault_error+0x6a/0x157
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff863908d1>] __do_page_fault+0x491/0x500
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff86390975>] do_page_fault+0x35/0x90
Feb 28 17:18:35 v71e7 kernel: WARNING  [<ffffffff8638c778>] page_fault+0x28/0x30
Feb 28 17:18:35 v71e7 kernel: INFO  Task in /lsf/venus/job.81808185.6967.1646067774 killed as a result of limit of /lsf/venus/job.81808185.6967.1646067774
Feb 28 17:18:35 v71e7 kernel: INFO  memory: usage 8388608kB, limit 8388608kB, failcnt 11324
Feb 28 17:18:35 v71e7 kernel: INFO  memory+swap: usage 8388608kB, limit 8388608kB, failcnt 0
Feb 28 17:18:35 v71e7 kernel: INFO  kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
Feb 28 17:18:35 v71e7 kernel: INFO  Memory cgroup stats for /lsf/venus/job.81808185.6967.1646067774: cache:0KB rss:8388608KB rss_huge:63488KB mapped_file:0KB swap:0KB inactive_anon:233692KB active_anon:8134948KB inactive_file:0KB active_file:0KB unevictable:0KB
Feb 28 17:18:35 v71e7 kernel: INFO  [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Feb 28 17:18:35 v71e7 kernel: INFO  [ 6967] 20941  6967     9889     2332      23        0             0 res
Feb 28 17:18:35 v71e7 kernel: INFO  [ 6968] 20941  6968     2430      358      10        0             0 1646067772.8180
Feb 28 17:18:35 v71e7 kernel: INFO  [ 6972] 20941  6972     2433      382      10        0             0 1646067772.8180
Feb 28 17:18:35 v71e7 kernel: INFO  [ 6975] 20941  6975     2548      530      10        0             0 compile.sh
Feb 28 17:18:35 v71e7 kernel: INFO  [ 7172] 20941  7172     2440      417      11        0             0 bash
Feb 28 17:18:35 v71e7 kernel: INFO  [ 8798] 20941  8798     1210      270       8        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [ 8801] 20941  8801     1276      345       7        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [ 9024] 20941  9024     1703      809       9        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [21865] 20941 21865     2438      355      11        0             0 sh
Feb 28 17:18:35 v71e7 kernel: INFO  [21866] 20941 21866     2470      431      10        0             0 mpiifort
Feb 28 17:18:35 v71e7 kernel: INFO  [21872] 20941 21872    52386      795      23        0             0 ifort
Feb 28 17:18:35 v71e7 kernel: INFO  [21884] 20941 21884  1682984  1670503    3283        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23172] 20941 23172     1210      261       8        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [23174] 20941 23174     2471      354      10        0             0 sh
Feb 28 17:18:35 v71e7 kernel: INFO  [23175] 20941 23175     2471      434      10        0             0 mpiifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23181] 20941 23181    36003      791      22        0             0 ifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23194] 20941 23194   145217   132546     283        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23267] 20941 23267     1210      260       7        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [23274] 20941 23274     2471      355      10        0             0 sh
Feb 28 17:18:35 v71e7 kernel: INFO  [23276] 20941 23276     2471      435      10        0             0 mpiifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23287] 20941 23287    36002     1302      23        0             0 ifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23310] 20941 23310   105309    92452     205        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23331] 20941 23331     1210      260       8        0             0 make
Feb 28 17:18:35 v71e7 kernel: INFO  [23333] 20941 23333     2471      355      11        0             0 sh
Feb 28 17:18:35 v71e7 kernel: INFO  [23334] 20941 23334     2471      434      11        0             0 mpiifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23340] 20941 23340    36003      792      24        0             0 ifort
Feb 28 17:18:35 v71e7 kernel: INFO  [23353] 20941 23353   226882   212725     441        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: ERR  Memory cgroup out of memory: Kill process 21884 (fortcom) score 798 or sacrifice child
Feb 28 17:18:35 v71e7 kernel: ERR  Killed process 21884 (fortcom), UID 20941, total-vm:6731936kB, anon-rss:6662272kB, file-rss:19740kB, shmem-rss:0kB
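
For anyone reading a dump like this: the `rss` column in the task table is in pages, not kB. Here is a minimal sketch that sums it per process name, assuming 4 kB pages (typical on x86_64, but not stated in the log). The names `PAGE_KB`, `ROW`, `SAMPLE`, and `rss_by_name` are made up for illustration, and the sample rows are copied from the output above.

```python
import re
from collections import defaultdict

PAGE_KB = 4  # assumption: 4 kB pages; the log itself does not state the page size

# Matches task-table rows such as:
# "... INFO  [21884] 20941 21884  1682984  1670503    3283   0   0 fortcom"
# Stack-trace lines ("[<ffffffff...>]") and the "[ pid ]" header do not match.
ROW = re.compile(
    r"\[\s*(?P<pid>\d+)\]\s+\d+\s+\d+\s+(?P<total_vm>\d+)\s+(?P<rss>\d+)"
    r"\s+\d+\s+\d+\s+-?\d+\s+(?P<name>\S+)\s*$"
)

# Rows copied from the kernel output in this issue.
SAMPLE = """\
Feb 28 17:18:35 v71e7 kernel: INFO  [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
Feb 28 17:18:35 v71e7 kernel: INFO  [21872] 20941 21872    52386      795      23        0             0 ifort
Feb 28 17:18:35 v71e7 kernel: INFO  [21884] 20941 21884  1682984  1670503    3283        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23194] 20941 23194   145217   132546     283        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23310] 20941 23310   105309    92452     205        0             0 fortcom
Feb 28 17:18:35 v71e7 kernel: INFO  [23353] 20941 23353   226882   212725     441        0             0 fortcom
"""


def rss_by_name(log_text):
    """Sum resident set size (in kB) per process name from an OOM task table."""
    totals = defaultdict(int)
    for line in log_text.splitlines():
        match = ROW.search(line)
        if match:
            totals[match.group("name")] += int(match.group("rss")) * PAGE_KB
    return dict(totals)


totals = rss_by_name(SAMPLE)
for name, kb in sorted(totals.items(), key=lambda item: -item[1]):
    print(f"{name:10s} {kb:>10d} kB")
```

Under that assumption, PID 21884 alone holds 1670503 × 4 = 6682012 kB, which matches the anon-rss plus file-rss reported in the "Killed process" line (6662272 + 19740). The four `fortcom` processes together come to about 8432904 kB against the 8388608 kB limit. Per-task rss and the cgroup charge are accounted differently, so the two figures need not match exactly. That reads as one very large compilation unit taking most of the 8 GB while three smaller ones run in parallel beside it.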
@kgerheiser added the bug label Mar 1, 2022
@kgerheiser
Contributor Author

The fix in #1032 has resolved this.
