
A CI Test vector #265

Closed
wants to merge 1 commit into from

Conversation

jjhursey
Copy link
Member

No description provided.

@jjhursey
Copy link
Member Author

bot:retest

15 similar comments
@rhc54
Copy link
Contributor

rhc54 commented Dec 18, 2019

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

I've hit a few snags with the CI - some of them infrastructure issues, others bugs in PRRTE. I made a note of the issues in the link below:

I'll return to this after break.

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey jjhursey force-pushed the ci-test branch 2 times, most recently from ec5e5ec to 0375228 on January 6, 2020 20:20
@jjhursey
Copy link
Member Author

jjhursey commented Jan 6, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Jan 6, 2020

Notes on PRRTE issues encountered so far:

  • Infrastructure for PRRTE CI Tests pmix-tests#49 (comment)
  • Using prte --hostfile (tested with a hostfile containing 3 nodes with 20 cores each)
    • The launch node is not being used as part of the allocation (behaving as if --nolocal were the default)
    • The launch node has 1 detected slot, whereas the compute nodes have 20 (the hostfile did not specify slots, just a list of hosts)
    • If the hostfile does not exist, prte hangs while trying to abort itself after printing "FORCE TERMINATE ORDERED AT base/ras_base_allocate.c:354 - error Unknown error(1)"
    • If there are not enough slots (prun -np 50 hostname with only 41 slots available), it prints the "There are not enough slots available" message but then hangs.
  • Using a hostfile with 10 nodes (20 cores each)
    • -npernode 2 will hang the prun, and subsequent pruns will not launch
    • Launching across 5 or more nodes hangs.

I'm looking into these, but currently this is what is blocking CI testing. The build looks OK.

@rhc54 FYI, these are some of the PRRTE issues I've encountered so far. They should be reproducible in the virtual cluster environment if you want to try to reproduce them.

@jjhursey
Copy link
Member Author

jjhursey commented Jan 8, 2020

bot:ibm:retest

1 similar comment
@jjhursey
Copy link
Member Author

jjhursey commented Jan 8, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Jan 27, 2020

Hmm, there still seems to be a race in the hello world test program. If I run it across 7 nodes, it passes most of the time, but sometimes it hangs after displaying the output.
I'm digging into it to see if I can pinpoint what's going on.

Last output to the console from the prted on the launch node:

[c712f6n01:155903] [[63621,0],0] state:base:cleanup_node on proc [[63621,2],34]
[c712f6n01:155903] [[63621,0],0] ACTIVATE JOB [63621,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:771
[c712f6n01:155903] [[63621,0],0] ACTIVATING JOB [63621,2] STATE NORMALLY TERMINATED PRI 4
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm:check_job_complete on job [63621,2]
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n02
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n03
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n04
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n05
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n06
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n07
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm:check_job_completed state is terminated - activating notify
[c712f6n01:155903] [[63621,0],0] ACTIVATE JOB [63621,2] STATE NOTIFY COMPLETED AT state_dvm.c:598
c712f6n02
[c712f6n01:155903] [[63621,0],0] ACTIVATING JOB [63621,2] STATE NOTIFY COMPLETED PRI 4

So it looks like the prted did the right thing, but maybe prun is missing the signal? (I'm still looking)

@jjhursey
Copy link
Member Author

It took some digging but I think I figured out the problem.

Symptom:

  • Fast-running jobs will sometimes hang in prun because prun never receives the termination event (PMIX_ERR_JOB_TERMINATED)
  • prun calls PMIx_Spawn and then immediately calls PMIx_Register_event_handler for the PMIX_ERR_JOB_TERMINATED event on the job it just started. Any delay between these two calls exposes the race.

Reproducer:

  • Add a sleep(1) between the PMIx_Spawn and PMIx_Register_event_handler calls in prun. It will hang 100% of the time.
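To make the window concrete, here is a minimal sketch of that ordering as hypothetical tool code (not the actual prun source; error handling is elided and the registration is not scoped to the spawned job the way prun's really is):

```c
#include <string.h>
#include <unistd.h>
#include <pmix_tool.h>

/* Hypothetical handler matching the pmix_notification_fn_t signature. */
static void terminated_cb(size_t evhdlr_registration_id, pmix_status_t status,
                          const pmix_proc_t *source,
                          pmix_info_t info[], size_t ninfo,
                          pmix_info_t results[], size_t nresults,
                          pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
{
    /* ... wake whatever is waiting for job completion ... */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

/* Registration callback - result ignored for brevity in this sketch. */
static void reg_cb(pmix_status_t status, size_t refid, void *cbdata)
{
    (void)status; (void)refid; (void)cbdata;
}

int main(void)
{
    pmix_proc_t myproc;
    pmix_app_t *app;
    char child[PMIX_MAX_NSLEN + 1];
    pmix_status_t rc, code = PMIX_ERR_JOB_TERMINATED;

    PMIx_tool_init(&myproc, NULL, 0);

    /* Spawn a fast-running job (error checks elided). */
    PMIX_APP_CREATE(app, 1);
    app->cmd = strdup("hostname");
    PMIX_ARGV_APPEND(rc, app->argv, "hostname");
    app->maxprocs = 1;
    PMIx_Spawn(NULL, 0, app, 1, child);

    sleep(1);   /* injected delay: the job completes, the server caches the
                   notification, and the hotel evicts it before we register */

    PMIx_Register_event_handler(&code, 1, NULL, 0, terminated_cb, reg_cb, NULL);

    /* ... wait for terminated_cb; with the eviction bug it never arrives ... */
    PMIx_tool_finalize();
    return 0;
}
```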

Analysis:

  • The prte daemon on the head node will cache the PMIX_ERR_JOB_TERMINATED event and try to deliver it to all of the listeners when the job completes. At that point prun has not yet registered for the event (it's sleeping), so the event is cached in a call to notify_event_cache, which puts it in the pmix_globals.notifications hotel.
  • When an event is added to the hotel it also has a timeout associated with it. When the timer fires, the event is evicted from the hotel. The intent is to keep events around for a few minutes and then clean them out.
    • pmix_globals.event_eviction_time here defaults to 120 seconds
  • Watching the daemon process under gdb, you can see it inject the event into the cache and then, less than a second later, call local_eviction_callback (the eviction callback on the event) to kick it out of the hotel.
  • When the prun process does finally register for the event, it has already been evicted.
  • Note that longer-running jobs will not show this problem because prun will have registered for the event by the time the processes terminate - thus the cache is never needed.

Root cause:

  • The pmix_globals.event_eviction_time here defaults to 120 seconds
  • That value is passed to pmix_hotel_init here
  • However, pmix_hotel_init interprets this as 120 usec, not 120 sec, here

Solutions:

  1. Multiply the 120 by 1000000 so that pmix_hotel_init does the right thing with the value it receives. This sets the upper bound for this value to ~4294 sec.
  2. Alternatively, interpret the eviction_timeout as seconds and set the microseconds field to 0.

My inclination is to go with (1) since that gives us a maximum of ~71 minutes that we hold on to events, which should be ok for the near term. Additionally, I could change the type from uint32_t to uint64_t which would give plenty of room. In openpmix/openpmix#1608 I applied such a fix for discussion.
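To make the units mismatch concrete, here is a small standalone illustration (hotel_init_sketch is a made-up stand-in for the timeout setup, not the real pmix_hotel_init code): the caller intends seconds, the callee treats the value as microseconds, and option (1) converts at the call site, with a wider type avoiding the uint32_t cap.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Illustrative stand-in: the incoming value is treated as MICROSECONDS,
 * matching the behavior described above. */
static void hotel_init_sketch(struct timeval *eviction, uint64_t timeout_usec)
{
    eviction->tv_sec  = (time_t)(timeout_usec / 1000000);
    eviction->tv_usec = (suseconds_t)(timeout_usec % 1000000);
}

int main(void)
{
    struct timeval tv;
    uint64_t event_eviction_time = 120;            /* intended as seconds */

    /* The bug: passing the raw value yields a 120 usec eviction timer. */
    hotel_init_sketch(&tv, event_eviction_time);
    printf("raw pass-through: %ld s %ld usec\n", (long)tv.tv_sec, (long)tv.tv_usec);

    /* Option (1): convert seconds -> microseconds at the call site.  With a
     * uint32_t argument this caps out near ~4294 s; uint64_t removes the cap. */
    hotel_init_sketch(&tv, event_eviction_time * 1000000ULL);
    printf("converted:        %ld s %ld usec\n", (long)tv.tv_sec, (long)tv.tv_usec);

    return 0;
}
```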

@jjhursey
Copy link
Member Author

While working on the bug discussed above, I ran into an odd race condition on the prun tool side. If I run prun --map-by ppr:5:node /bin/true in a loop of 400 (or more) iterations, every once in a great while it will hang with the stack below. It only happened once, at iteration 360 of 400. After killing the prun for that iteration, the rest ran fine.

Stack trace of the prun process receiving the event signal.

shell$ gstack 87590
Thread 4 (Thread 0x3fffb14cf190 (LWP 87591)):
#0  0x00003fffb18e9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fffb213b18c in epoll_dispatch (base=0x10023afaad0, tv=<optimized out>) at epoll.c:462
#2  0x00003fffb212c180 in event_base_loop (base=0x10023afaad0, flags=<optimized out>) at event.c:1947
#3  0x00003fffb244a0f0 in progress_engine (obj=0x10023af9830) at runtime/prrte_progress_threads.c:106
#4  0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fffb06bf190 (LWP 87592)):
#0  0x00003fffb19c2da8 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00003fffb19bf310 in pthread_cond_broadcast@@GLIBC_2.17 () from /lib64/libpthread.so.0
#2  0x00000000100055a8 in evhandler (evhdlr_registration_id=1, status=-145, source=0x3fffa802da04, info=0x3fffa8000fe0, ninfo=6, results=0x0, nresults=0, cbfunc=0x3fffb21de830 <progress_local_event_hdlr>, cbdata=0x3fffa802d930) at prun.c:534
#3  0x00003fffb21e0c24 in pmix_invoke_local_event_hdlr (chain=0x3fffa802d930) at event/pmix_event_notification.c:790
#4  0x00003fffb229e14c in pmix_tool_notify_recv (peer=0x10023b57530, hdr=0x3fffa802d8f4, buf=0x3fffb06be3f0, cbdata=0x0) at tool/pmix_tool.c:173
#5  0x00003fffb2338354 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x3fffa802d820) at base/ptl_base_sendrecv.c:784
#6  0x00003fffb212b918 in event_process_active_single_queue (base=0x10023b24070, activeq=0x10023b24640, max_to_process=2147483647, endtime=0x0) at event.c:1646
#7  0x00003fffb212c404 in event_process_active (base=0x10023b24070) at event.c:1738
#8  event_base_loop (base=0x10023b24070, flags=<optimized out>) at event.c:1961
#9  0x00003fffb229a5fc in progress_engine (obj=0x10023b23f40) at runtime/pmix_progress_threads.c:232
#10 0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#11 0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fffa7c4f190 (LWP 87593)):
#0  0x00003fffb18da988 in select () from /lib64/libc.so.6
#1  0x00003fffb2339c44 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fffb263bec0 (LWP 87590)):
#0  0x00003fffb19be7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x000000001000bb18 in prun (argc=4, argv=0x3fffc9d63558) at prun.c:1400
#2  0x0000000010002ae0 in main (argc=4, argv=0x3fffc9d63558) at main.c:13

It is difficult to reproduce, but I might look into it a bit more this week.

@rhc54
Copy link
Contributor

rhc54 commented Jan 28, 2020

Impressive work chasing this down! I'm actually seeing a problem that is likely related to the eviction time issue, so this will help a lot. See my comment on your proposed fix.

@jjhursey
Copy link
Member Author

I tracked down the tool hang noted above. The fix is in #319

@jjhursey
Copy link
Member Author

bot:ibm:retest

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey jjhursey force-pushed the ci-test branch 2 times, most recently from 34c0cf1 to 8f6fb99 on January 30, 2020 15:38
@jjhursey
Copy link
Member Author

I'm working on the IOF issue. A few things I've learned/fixed so far:

  • The prun process is sending a 0-size buffer via the stdin channel to the local prte - not always, but often. With a bit of printf debugging we can see it in the trace below:
09:09:46 --------------------- Execution (hostname): 2
09:09:46 3298cfee28d3
09:09:46 [3298cfee28d3:00129] [[43677,0],0] iof:hnp pushing stdin for process [[43677,4],0]: size 0
09:09:46 [3298cfee28d3:00129] [[43677,0],0] write:output setting up to write 0 bytes to stdin for [[43677,4],0] on fd 34
09:09:46 [3298cfee28d3:00129] [[43677,0],0] write:output adding write event (fd 34)
  • Most of the time prte just processes it normally, but every now and again it gets into just the right state to segv. I think I have tracked the segv to this line of code. I'm still investigating.
  • On the prun side it will emit an error like:
[warn] Epoll MOD(1) on fd 15 failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor

if it is sending stdin up to the prte while in PMIx_tool_finalize, as would be the case if we are streaming data (stdbuf -i0 prun -np 1 sleep 10 < /dev/random). At the point of finalize in the tool, we should detach the file descriptor connecting the tool and the RM so that a poll on it will not fail when the connection is torn down. I'll look into that as well.
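For the epoll warning specifically, the detach idea would look roughly like the sketch below, written against plain libevent (which PRRTE/PMIx use underneath) with hypothetical structures rather than the actual IOF code: delete the pending event before closing the descriptor, so the event loop never issues an epoll operation on a dead fd.

```c
#include <unistd.h>
#include <event2/event.h>

/* Hypothetical per-connection state for the tool<->RM socket. */
struct conn {
    int fd;
    struct event *wev;   /* pending write event for queued stdin data */
};

/* No-op write callback so the sketch is self-contained. */
static void on_write(evutil_socket_t fd, short what, void *arg)
{
    (void)fd; (void)what; (void)arg;
}

/* Called during finalize: detach the event BEFORE closing the descriptor so
 * libevent does not attempt an epoll MOD/DEL on an already-closed fd. */
static void conn_shutdown(struct conn *c)
{
    if (NULL != c->wev) {
        event_del(c->wev);      /* remove from the event base first */
        event_free(c->wev);
        c->wev = NULL;
    }
    if (c->fd >= 0) {
        close(c->fd);           /* only now is it safe to close */
        c->fd = -1;
    }
}

int main(void)
{
    struct event_base *base = event_base_new();
    int fds[2];
    pipe(fds);

    struct conn c = { .fd = fds[1], .wev = NULL };
    c.wev = event_new(base, c.fd, EV_WRITE | EV_PERSIST, on_write, NULL);
    event_add(c.wev, NULL);

    conn_shutdown(&c);          /* safe: event removed before the fd is closed */

    close(fds[0]);
    event_base_free(base);
    return 0;
}
```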

In summary - I'm still digging, but have some footprints to follow :)

@jjhursey
Copy link
Member Author

I figured out the prted segv issue and it is a race condition.

prte can get caught in a race between delivering stdin and the process terminating, which can cause the segv we see in CI. Here is a trace:

10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0 - debug (valid) (valid)
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (hnp_close) closing connection to process [[41634,18],0]
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0 - Local send
10:31:49 [75fd37ef5719:00131] *** Process received signal ***

You can see that thread A is processing push_stdin while thread B is processing the PRRTE_PROC_STATE_IOF_COMPLETE state and calls prrte_iof.close(proc, PRRTE_IOF_STDALL) (which points to hnp_close). Thread B sets proct->stdinev to NULL without checking for ongoing activity on that file descriptor. Thread A can go through all of the sanity checks that protect it inside push_stdin and still have proct->stdinev set to NULL by thread B at just the wrong time, causing a segv when it tries to access a member of that structure - namely proct->stdinev->wev.

I'm tinkering with some solution options at the moment.
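One possible direction, sketched below with hypothetical stand-in structures (this is not the actual PRRTE fix): have hnp_close take ownership of stdinev under a lock, and have push_stdin re-check the pointer under that same lock, so the NULL check and the dereference can no longer be split by the other thread.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-ins for the IOF structures involved in the race. */
struct write_event { int fd; };
struct iof_proc {
    pthread_mutex_t lock;
    struct write_event *stdinev;   /* NULLed by the close path */
};

/* Thread B path (hnp_close): take ownership of stdinev under the lock. */
static void close_stdin(struct iof_proc *proct)
{
    pthread_mutex_lock(&proct->lock);
    struct write_event *wev = proct->stdinev;
    proct->stdinev = NULL;
    pthread_mutex_unlock(&proct->lock);
    if (NULL != wev) {
        close(wev->fd);            /* release outside the lock */
        free(wev);
    }
}

/* Thread A path (push_stdin): re-check and use stdinev under the same lock. */
static int push_stdin(struct iof_proc *proct, const void *data, size_t len)
{
    int pushed = 0;
    pthread_mutex_lock(&proct->lock);
    if (NULL != proct->stdinev && len > 0) {
        /* The real code would queue 'data' and arm the write event;
         * a direct write keeps this sketch self-contained. */
        pushed = (write(proct->stdinev->fd, data, len) >= 0);
    }
    pthread_mutex_unlock(&proct->lock);
    return pushed;
}

int main(void)
{
    int fds[2];
    pipe(fds);

    struct iof_proc proct = { PTHREAD_MUTEX_INITIALIZER, NULL };
    proct.stdinev = malloc(sizeof(*proct.stdinev));
    proct.stdinev->fd = fds[1];

    push_stdin(&proct, "hi\n", 3);   /* delivered */
    close_stdin(&proct);             /* thread B path */
    push_stdin(&proct, "hi\n", 3);   /* safely refused: stdinev is NULL */

    close(fds[0]);
    return 0;
}
```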

@jjhursey
Copy link
Member Author

jjhursey commented Feb 3, 2020

bot:ibm:retest

2 similar comments
@jjhursey
Copy link
Member Author

jjhursey commented Feb 3, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:3:test
bot:ibm:scale:17:test

1 similar comment
@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:3:test
bot:ibm:scale:17:test

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:pgi:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:4:test
bot:ibm:scale:17:test

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:pgi:retest
bot:ibm:xl:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

The PGI failure is known. I'm going to disable that build for now. Otherwise, we should be good here.

  CCLD     libpmi2.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmi.la] Error 1
make[2]: *** Waiting for unfinished jobs....
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmix.la] Error 1
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmi2.la] Error 1
make[2]: Leaving directory `/workspace/openpmix-src/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/workspace/openpmix-src/src'
make: *** [all-recursive] Error 1

@rhc54
Copy link
Contributor

rhc54 commented Feb 4, 2020

Feel free to turn it "on" whenever you judge it ready - much thanks!

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

IBM CI is now globally active on PRRTE (GNU and XL compilers). No special label required.
The normal re-test commands will work. There are some additional scale-related parameters that I'll document in the wiki (you can see some of them in this PR).

Closing this PR since we don't need it anymore.

@jjhursey jjhursey closed this Feb 4, 2020
@jjhursey jjhursey deleted the ci-test branch February 4, 2020 19:29