
A CI Test vector #265

Closed
wants to merge 1 commit into from

Conversation

jjhursey
Copy link
Member

No description provided.

@jjhursey
Copy link
Member Author

bot:retest

15 similar comments
@rhc54
Copy link
Contributor

rhc54 commented Dec 18, 2019

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey
Copy link
Member Author

I've hit a few snags with the CI - some of them infrastructure issues, others bugs in PRRTE. I made a note of the issues in the link below:

I'll return to this after break.

@jjhursey
Copy link
Member Author

bot:retest

@jjhursey jjhursey force-pushed the ci-test branch 2 times, most recently from ec5e5ec to 0375228 on January 6, 2020 20:20
@jjhursey
Copy link
Member Author

jjhursey commented Jan 6, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Jan 6, 2020

Notes on PRRTE issues encountered so far:

  • Infrastructure for PRRTE CI Tests pmix-tests#49 (comment)
  • Using prte --hostfile (tested with a hostfile containing 3 nodes with 20 cores each)
    • The launch node is not being used as part of the allocation (behaving as if --nolocal were the default)
    • The launch node has 1 detected slot, whereas the compute nodes have 20 (the hostfile did not specify slots, just a list of hosts)
    • If the hostfile does not exist, prte hangs while trying to abort itself after printing "FORCE TERMINATE ORDERED AT base/ras_base_allocate.c:354 - error Unknown error(1)"
    • If there are not enough slots (prun -np 50 hostname with only 41 slots available), it prints the "There are not enough slots available" message but then hangs.
  • Using a hostfile with 10 nodes (20 cores each)
    • -npernode 2 will hang the prun, and subsequent pruns will not launch
    • Launching across 5 or more nodes hangs.

I'm looking into these, but currently this is what is blocking CI testing. The build looks OK.

@rhc54 FYI, these are some of the PRRTE issues I've encountered so far. They should be reproducible in the virtual cluster environment if you want to try to reproduce them.

@jjhursey
Copy link
Member Author

jjhursey commented Jan 8, 2020

bot:ibm:retest

1 similar comment
@jjhursey
Copy link
Member Author

jjhursey commented Jan 8, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Jan 27, 2020

Hmm, there still seems to be a race in the hello world test program. If I run it across 7 nodes, it passes most of the time, but sometimes it hangs after displaying the output.
I'm digging into it to see if I can pinpoint what's going on.

Last output to the console from the prted on the launch node:

[c712f6n01:155903] [[63621,0],0] state:base:cleanup_node on proc [[63621,2],34]
[c712f6n01:155903] [[63621,0],0] ACTIVATE JOB [63621,2] STATE NORMALLY TERMINATED AT base/state_base_fns.c:771
[c712f6n01:155903] [[63621,0],0] ACTIVATING JOB [63621,2] STATE NORMALLY TERMINATED PRI 4
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm:check_job_complete on job [63621,2]
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n02
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n03
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n04
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n05
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n06
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm releasing procs from node c712f6n07
c712f6n01
[c712f6n01:155903] [[63621,0],0] state:dvm:check_job_completed state is terminated - activating notify
[c712f6n01:155903] [[63621,0],0] ACTIVATE JOB [63621,2] STATE NOTIFY COMPLETED AT state_dvm.c:598
c712f6n02
[c712f6n01:155903] [[63621,0],0] ACTIVATING JOB [63621,2] STATE NOTIFY COMPLETED PRI 4

So it looks like the prted did the right thing, but maybe prun is missing the signal? (I'm still looking)

@jjhursey
Copy link
Member Author

It took some digging but I think I figured out the problem.

Symptom:

  • Fast-running jobs will sometimes hang in prun because prun never receives the termination event (PMIX_ERR_JOB_TERMINATED)
  • prun calls PMIx_Spawn and then immediately calls PMIx_Register_event_handler for the PMIX_ERR_JOB_TERMINATED event on the job it just started. Any delay between these two calls exposes the race.

Reproducer:

  • Add a sleep(1) between the PMIx_Spawn and PMIx_Register_event_handler calls in prun. It will hang 100% of the time.
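To make the window concrete, here is a minimal sketch of that ordering as hypothetical tool code (not the actual prun source; error handling is elided and the registration is not scoped to the spawned job the way prun's really is):

```c
#include <string.h>
#include <unistd.h>
#include <pmix_tool.h>

/* Hypothetical handler matching the pmix_notification_fn_t signature. */
static void terminated_cb(size_t evhdlr_registration_id, pmix_status_t status,
                          const pmix_proc_t *source,
                          pmix_info_t info[], size_t ninfo,
                          pmix_info_t results[], size_t nresults,
                          pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
{
    /* ... wake whatever is waiting for job completion ... */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

/* Registration callback - result ignored for brevity in this sketch. */
static void reg_cb(pmix_status_t status, size_t refid, void *cbdata)
{
    (void)status; (void)refid; (void)cbdata;
}

int main(void)
{
    pmix_proc_t myproc;
    pmix_app_t *app;
    char child[PMIX_MAX_NSLEN + 1];
    pmix_status_t rc, code = PMIX_ERR_JOB_TERMINATED;

    PMIx_tool_init(&myproc, NULL, 0);

    /* Spawn a fast-running job (error checks elided). */
    PMIX_APP_CREATE(app, 1);
    app->cmd = strdup("hostname");
    PMIX_ARGV_APPEND(rc, app->argv, "hostname");
    app->maxprocs = 1;
    PMIx_Spawn(NULL, 0, app, 1, child);

    sleep(1);   /* injected delay: the job completes, the server caches the
                   notification, and the hotel evicts it before we register */

    PMIx_Register_event_handler(&code, 1, NULL, 0, terminated_cb, reg_cb, NULL);

    /* ... wait for terminated_cb; with the eviction bug it never arrives ... */
    PMIx_tool_finalize();
    return 0;
}
```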

Analysis:

  • The prte daemon on the head node will cache the PMIX_ERR_JOB_TERMINATED event and try to deliver it to all of the listeners when the job completes. At that point prun has not yet registered for the event (it's sleeping), so the event is cached in a call to notify_event_cache, which puts it in the pmix_globals.notifications hotel.
  • When an event is added to the hotel it also has a timeout associated with it. When the timer fires, the event is evicted from the hotel. The intent is to keep events around for a few minutes and then clean them out.
    • pmix_globals.event_eviction_time here defaults to 120 seconds
  • Watching the daemon process under gdb, you can see it inject the event into the cache and then, less than a second later, call local_eviction_callback (the eviction callback on the event) to kick it out of the hotel.
  • When the prun process does finally register for the event, it has already been evicted.
  • Note that longer-running jobs will not show this problem because prun will have registered for the event by the time the processes terminate - thus the cache is never needed.

Root cause:

  • The pmix_globals.event_eviction_time here defaults to 120 seconds
  • That value is passed to pmix_hotel_init here
  • However, pmix_hotel_init interprets this as 120 usec, not 120 sec, here

Solutions:

  1. Multiply the 120 by 1000000 so that pmix_hotel_init does the right thing with the value it receives. This sets the upper bound for this value to ~4294 sec.
  2. Alternatively, interpret the eviction_timeout as seconds and set the microseconds field to 0.

My inclination is to go with (1) since that gives us a maximum of ~71 minutes that we hold on to events, which should be ok for the near term. Additionally, I could change the type from uint32_t to uint64_t which would give plenty of room. In openpmix/openpmix#1608 I applied such a fix for discussion.
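To make the units mismatch concrete, here is a small standalone illustration (hotel_init_sketch is a made-up stand-in for the timeout setup, not the real pmix_hotel_init code): the caller intends seconds, the callee treats the value as microseconds, and option (1) converts at the call site, with a wider type avoiding the uint32_t cap.

```c
#include <stdint.h>
#include <stdio.h>
#include <sys/time.h>

/* Illustrative stand-in: the incoming value is treated as MICROSECONDS,
 * matching the behavior described above. */
static void hotel_init_sketch(struct timeval *eviction, uint64_t timeout_usec)
{
    eviction->tv_sec  = (time_t)(timeout_usec / 1000000);
    eviction->tv_usec = (suseconds_t)(timeout_usec % 1000000);
}

int main(void)
{
    struct timeval tv;
    uint64_t event_eviction_time = 120;            /* intended as seconds */

    /* The bug: passing the raw value yields a 120 usec eviction timer. */
    hotel_init_sketch(&tv, event_eviction_time);
    printf("raw pass-through: %ld s %ld usec\n", (long)tv.tv_sec, (long)tv.tv_usec);

    /* Option (1): convert seconds -> microseconds at the call site.  With a
     * uint32_t argument this caps out near ~4294 s; uint64_t removes the cap. */
    hotel_init_sketch(&tv, event_eviction_time * 1000000ULL);
    printf("converted:        %ld s %ld usec\n", (long)tv.tv_sec, (long)tv.tv_usec);

    return 0;
}
```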

@jjhursey
Copy link
Member Author

While working on the bug discussed above, I ran into an odd race condition on the prun tool side. If I run prun --map-by ppr:5:node /bin/true in a loop of 400 (or more) iterations, every once in a great while it will hang with the stack below. It only happened once, at iteration 360 of 400. After killing the prun for that iteration, the rest ran fine.

Stack trace of the prun process receiving the event signal.

shell$ gstack 87590
Thread 4 (Thread 0x3fffb14cf190 (LWP 87591)):
#0  0x00003fffb18e9178 in epoll_wait () from /lib64/libc.so.6
#1  0x00003fffb213b18c in epoll_dispatch (base=0x10023afaad0, tv=<optimized out>) at epoll.c:462
#2  0x00003fffb212c180 in event_base_loop (base=0x10023afaad0, flags=<optimized out>) at event.c:1947
#3  0x00003fffb244a0f0 in progress_engine (obj=0x10023af9830) at runtime/prrte_progress_threads.c:106
#4  0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#5  0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 3 (Thread 0x3fffb06bf190 (LWP 87592)):
#0  0x00003fffb19c2da8 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00003fffb19bf310 in pthread_cond_broadcast@@GLIBC_2.17 () from /lib64/libpthread.so.0
#2  0x00000000100055a8 in evhandler (evhdlr_registration_id=1, status=-145, source=0x3fffa802da04, info=0x3fffa8000fe0, ninfo=6, results=0x0, nresults=0, cbfunc=0x3fffb21de830 <progress_local_event_hdlr>, cbdata=0x3fffa802d930) at prun.c:534
#3  0x00003fffb21e0c24 in pmix_invoke_local_event_hdlr (chain=0x3fffa802d930) at event/pmix_event_notification.c:790
#4  0x00003fffb229e14c in pmix_tool_notify_recv (peer=0x10023b57530, hdr=0x3fffa802d8f4, buf=0x3fffb06be3f0, cbdata=0x0) at tool/pmix_tool.c:173
#5  0x00003fffb2338354 in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x3fffa802d820) at base/ptl_base_sendrecv.c:784
#6  0x00003fffb212b918 in event_process_active_single_queue (base=0x10023b24070, activeq=0x10023b24640, max_to_process=2147483647, endtime=0x0) at event.c:1646
#7  0x00003fffb212c404 in event_process_active (base=0x10023b24070) at event.c:1738
#8  event_base_loop (base=0x10023b24070, flags=<optimized out>) at event.c:1961
#9  0x00003fffb229a5fc in progress_engine (obj=0x10023b23f40) at runtime/pmix_progress_threads.c:232
#10 0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#11 0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x3fffa7c4f190 (LWP 87593)):
#0  0x00003fffb18da988 in select () from /lib64/libc.so.6
#1  0x00003fffb2339c44 in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00003fffb19b8b94 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fffb18e85f4 in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x3fffb263bec0 (LWP 87590)):
#0  0x00003fffb19be7fc in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x000000001000bb18 in prun (argc=4, argv=0x3fffc9d63558) at prun.c:1400
#2  0x0000000010002ae0 in main (argc=4, argv=0x3fffc9d63558) at main.c:13

It is difficult to reproduce, but I might look into it a bit more this week.

@rhc54
Copy link
Contributor

rhc54 commented Jan 28, 2020

Impressive work chasing this down! I'm actually seeing a problem that is likely related to the eviction time issue, so this will help a lot. See my comment on your proposed fix.

@jjhursey
Copy link
Member Author

I tracked down the tool hang noted above. The fix is in #319

@jjhursey
Copy link
Member Author

bot:ibm:retest

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey jjhursey force-pushed the ci-test branch 2 times, most recently from 34c0cf1 to 8f6fb99 on January 30, 2020 15:38
@jjhursey
Copy link
Member Author

I'm working on the IOF issue. A few things I've learned/fixed so far:

  • The prun process is sending a 0-size buffer via the stdin channel to the local prte - not always, but often. With a bit of printf debugging we can see it in the trace below:
09:09:46 --------------------- Execution (hostname): 2
09:09:46 3298cfee28d3
09:09:46 [3298cfee28d3:00129] [[43677,0],0] iof:hnp pushing stdin for process [[43677,4],0]: size 0
09:09:46 [3298cfee28d3:00129] [[43677,0],0] write:output setting up to write 0 bytes to stdin for [[43677,4],0] on fd 34
09:09:46 [3298cfee28d3:00129] [[43677,0],0] write:output adding write event (fd 34)
  • Most of the time prte just processes it normally, but every now and again it gets into just the right state to segv. I think I have tracked the segv to this line of code. I'm still investigating.
  • On the prun side it will emit an error like:
[warn] Epoll MOD(1) on fd 15 failed. Old events were 6; read change was 0 (none); write change was 2 (del); close change was 0 (none): Bad file descriptor

if it is sending stdin up to the prte while in PMIx_tool_finalize, as would be the case if we are streaming data (stdbuf -i0 prun -np 1 sleep 10 < /dev/random). At the point of finalize in the tool, we should detach the file descriptor connecting the tool and the RM so that a poll on it will not fail when the connection is torn down. I'll look into that as well.
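For the epoll warning specifically, the detach idea would look roughly like the sketch below, written against plain libevent (which PRRTE/PMIx use underneath) with hypothetical structures rather than the actual IOF code: delete the pending event before closing the descriptor, so the event loop never issues an epoll operation on a dead fd.

```c
#include <unistd.h>
#include <event2/event.h>

/* Hypothetical per-connection state for the tool<->RM socket. */
struct conn {
    int fd;
    struct event *wev;   /* pending write event for queued stdin data */
};

/* No-op write callback so the sketch is self-contained. */
static void on_write(evutil_socket_t fd, short what, void *arg)
{
    (void)fd; (void)what; (void)arg;
}

/* Called during finalize: detach the event BEFORE closing the descriptor so
 * libevent does not attempt an epoll MOD/DEL on an already-closed fd. */
static void conn_shutdown(struct conn *c)
{
    if (NULL != c->wev) {
        event_del(c->wev);      /* remove from the event base first */
        event_free(c->wev);
        c->wev = NULL;
    }
    if (c->fd >= 0) {
        close(c->fd);           /* only now is it safe to close */
        c->fd = -1;
    }
}

int main(void)
{
    struct event_base *base = event_base_new();
    int fds[2];
    pipe(fds);

    struct conn c = { .fd = fds[1], .wev = NULL };
    c.wev = event_new(base, c.fd, EV_WRITE | EV_PERSIST, on_write, NULL);
    event_add(c.wev, NULL);

    conn_shutdown(&c);          /* safe: event removed before the fd is closed */

    close(fds[0]);
    event_base_free(base);
    return 0;
}
```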

In summary - I'm still digging, but have some footprints to follow :)

@jjhursey
Copy link
Member Author

I figured out the prted segv issue and it is a race condition.

prte can get caught in a race between delivering stdin and the process terminating, which can cause the segv we see in CI. Here is a trace:

10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0 - debug (valid) (valid)
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (hnp_close) closing connection to process [[41634,18],0]
10:31:49 [75fd37ef5719:00131] [[41634,0],0] iof:hnp (push_stdin) pushing stdin for process [[41634,18],0]: size 0 - Local send
10:31:49 [75fd37ef5719:00131] *** Process received signal ***

You can see that thread A is processing push_stdin while thread B is processing the PRRTE_PROC_STATE_IOF_COMPLETE state and calls prrte_iof.close(proc, PRRTE_IOF_STDALL) (which points to hnp_close). Thread B sets proct->stdinev to NULL without checking for ongoing activity on that file descriptor. Thread A can go through all of the sanity checks that protect it inside push_stdin and still have proct->stdinev set to NULL by thread B at just the wrong time, causing a segv when it tries to access a member of that structure - namely proct->stdinev->wev.

I'm tinkering with some solution options at the moment.
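One possible direction, sketched below with hypothetical stand-in structures (this is not the actual PRRTE fix): have hnp_close take ownership of stdinev under a lock, and have push_stdin re-check the pointer under that same lock, so the NULL check and the dereference can no longer be split by the other thread.

```c
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

/* Hypothetical stand-ins for the IOF structures involved in the race. */
struct write_event { int fd; };
struct iof_proc {
    pthread_mutex_t lock;
    struct write_event *stdinev;   /* NULLed by the close path */
};

/* Thread B path (hnp_close): take ownership of stdinev under the lock. */
static void close_stdin(struct iof_proc *proct)
{
    pthread_mutex_lock(&proct->lock);
    struct write_event *wev = proct->stdinev;
    proct->stdinev = NULL;
    pthread_mutex_unlock(&proct->lock);
    if (NULL != wev) {
        close(wev->fd);            /* release outside the lock */
        free(wev);
    }
}

/* Thread A path (push_stdin): re-check and use stdinev under the same lock. */
static int push_stdin(struct iof_proc *proct, const void *data, size_t len)
{
    int pushed = 0;
    pthread_mutex_lock(&proct->lock);
    if (NULL != proct->stdinev && len > 0) {
        /* The real code would queue 'data' and arm the write event;
         * a direct write keeps this sketch self-contained. */
        pushed = (write(proct->stdinev->fd, data, len) >= 0);
    }
    pthread_mutex_unlock(&proct->lock);
    return pushed;
}

int main(void)
{
    int fds[2];
    pipe(fds);

    struct iof_proc proct = { PTHREAD_MUTEX_INITIALIZER, NULL };
    proct.stdinev = malloc(sizeof(*proct.stdinev));
    proct.stdinev->fd = fds[1];

    push_stdin(&proct, "hi\n", 3);   /* delivered */
    close_stdin(&proct);             /* thread B path */
    push_stdin(&proct, "hi\n", 3);   /* safely refused: stdinev is NULL */

    close(fds[0]);
    return 0;
}
```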

@jjhursey
Copy link
Member Author

jjhursey commented Feb 3, 2020

bot:ibm:retest

2 similar comments
@jjhursey
Copy link
Member Author

jjhursey commented Feb 3, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:3:test
bot:ibm:scale:17:test

1 similar comment
@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:3:test
bot:ibm:scale:17:test

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:pgi:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:scale:retest
bot:ibm:nodes:4:test
bot:ibm:scale:17:test

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:pgi:retest
bot:ibm:xl:retest

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

The PGI failure is known. I'm going to disable that build for now. Otherwise, we should be good here.

  CCLD     libpmi2.la
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmi.la] Error 1
make[2]: *** Waiting for unfinished jobs....
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmix.la] Error 1
pgcc-Error-Unknown switch: -pthread
make[2]: *** [libpmi2.la] Error 1
make[2]: Leaving directory `/workspace/openpmix-src/src'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/workspace/openpmix-src/src'
make: *** [all-recursive] Error 1

@rhc54
Copy link
Contributor

rhc54 commented Feb 4, 2020

Feel free to turn it "on" whenever you judge it ready - much thanks!

@jjhursey
Copy link
Member Author

jjhursey commented Feb 4, 2020

IBM CI is now globally active on PRRTE (GNU and XL compilers). No special label required.
The normal re-test commands will work. There are some additional scale-related parameters that I'll document in the wiki (you can see some of them in this PR).

Closing this PR since we don't need it anymore.

@jjhursey jjhursey closed this Feb 4, 2020
@jjhursey jjhursey deleted the ci-test branch February 4, 2020 19:29