Infrastructure for PRRTE CI Tests #49

Merged: 1 commit, Feb 4, 2020


jjhursey
Member

  • Define a few CI tests for PRRTE
  • See README.md for how to drop in a new test.

@jjhursey jjhursey added the WIP Work In Progress (Do not merge) label Dec 20, 2019
@jjhursey
Member Author

@rhc These are the tests I've started with. I have a bit more to do in this PR before it is ready to merge. I'll return to it once I'm back from break.

I did notice a couple of failures with the current PRRTE master that need some looking into that can be recreated with these test cases.

  1. In the hello_world example, I try to access PMIX_LOCAL_RANK from a process on a non-launcher node, and PMIx returns "not found". I commented it out in the current state of the test, but I think we should add it back in once it is fixed in PRRTE.
  2. The hello_world example often hangs in the CI environment, though not on every run. I'm not sure why at the moment. Because of this I am not going to enable it for PRRTE yet; I'll leave it as is until I can investigate when I get back.
  3. The cycle example usually fails after about 10 iterations with the prted crashing. I'm not sure what's happening there.

I'll pick up on this when I get back from break.
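
For readers who want to reproduce the third failure above, the cycle pattern can be sketched as a loop that runs one command repeatedly and stops at the first failure, reporting the iteration. This is a minimal sketch, not the `run.sh` from this PR: `run_cycle` is a hypothetical helper name, and the banner format only loosely mirrors the CI log.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): run a command a fixed number
# of times and stop at the first non-zero exit status.
run_cycle() {
    iterations=$1
    shift
    i=1
    while [ "$i" -le "$iterations" ]; do
        echo "--------------------- Execution ($i)"
        if ! "$@"; then
            # Report which iteration failed so it can be matched to the log.
            echo "FAILED at iteration $i" >&2
            return 1
        fi
        i=$((i + 1))
    done
    echo "PASSED $iterations iterations"
}
```

Assuming a `prte` server is already running, something like `run_cycle 100 prun -np 1 hostname` would exercise the same path as the cycle test.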

@jjhursey
Member Author

jjhursey commented Jan 8, 2020

PR openpmix/prrte#296 fixes the PMIX_LOCAL_RANK. I'll re-push the hello world with this value in it.

@jjhursey
Member Author

jjhursey commented Jan 9, 2020

CI is passing now for both of these tests. I'll work on finishing up the test case infrastructure and add a few more tests to the mix. Then I think this is ready to go.

@jjhursey jjhursey added the PRRTE_CI Trigger PRRTE CI on this PR label Jan 22, 2020
@jjhursey
Member Author

bot:ibm:retest

@jjhursey
Member Author

bot:ibm:retest

@jjhursey
Member Author

bot:ibm:retest

@jjhursey
Member Author

bot:ibm:retest

@jjhursey jjhursey removed the WIP Work In Progress (Do not merge) label Jan 28, 2020
@jjhursey
Member Author

bot:ibm:retest

In the prior CI test the prte daemon crashed after emitting these warnings a few times:

--------------------- Execution (hostname): 16
[warn] Epoll ADD(4) on fd 34 failed. Old events were 0; read change was 0 (none); write change was 1 (add); close change was 0 (none): Bad file descriptor
[331e97292fa3:00130] PRRTE ERROR: Bad parameter in file base/iof_base_output.c at line 267

Then this stack trace:

--------------------- Execution (hostname): 84
[331e97292fa3:00130] *** Process received signal ***
[331e97292fa3:00130] Signal: Segmentation fault (11)
[331e97292fa3:00130] Signal code: Address not mapped (1)
[331e97292fa3:00130] Failing at address: 0x30
[331e97292fa3:00130] [ 0] [0x3fffb4890478]
[331e97292fa3:00130] [ 1] /workspace/exports/pmix/lib/pmix/mca_bfrops_v4.so(mca_bfrops_v4_component+0x108)[0x3fffb3d601c0]
[331e97292fa3:00130] [ 2] /workspace/exports/prrte/lib/libprrte.so.2(pmix_server_stdin_fn+0x70)[0x3fffb47930f0]
[331e97292fa3:00130] [ 3] /workspace/exports/pmix/lib/libpmix.so.0(+0xd11d4)[0x3fffb46011d4]
[331e97292fa3:00130] [ 4] 331e97292fa3
/workspace/exports/pmix/lib/libpmix.so.0(+0xb3244)[0x3fffb45e3244]
[331e97292fa3:00130] [ 5] /workspace/exports/pmix/lib/libpmix.so.0(+0xb3b60)[0x3fffb45e3b60]
[331e97292fa3:00130] [ 6] /workspace/exports/pmix/lib/libpmix.so.0(pmix_ptl_base_process_msg+0x3a4)[0x3fffb46b3a7c]
[331e97292fa3:00130] [ 7] /opt/hpc/local/libevent/lib/libevent-2.1.so.6(+0x2b918)[0x3fffb449b918]
[331e97292fa3:00130] [ 8] /opt/hpc/local/libevent/lib/libevent-2.1.so.6(event_base_loop+0x504)[0x3fffb449c404]
[331e97292fa3:00130] [ 9] /workspace/exports/pmix/lib/libpmix.so.0(+0xe5d50)[0x3fffb4615d50]
[331e97292fa3:00130] [10] /lib64/libpthread.so.0(+0x8cd4)[0x3fffb4278cd4]
[331e97292fa3:00130] [11] /lib64/libc.so.6(clone+0xe4)[0x3fffb41a7e94]
[331e97292fa3:00130] *** End of error message ***
[331e97292fa3:00645] PMIX ERROR: UNPACK-PAST-END in file common/pmix_iof.c at line 1116
./run.sh: line 43:   130 Segmentation fault      (core dumped) prte --hostfile $CI_HOSTFILE

@jjhursey
Member Author

Same failure in CI. I'll try to investigate why prte is falling over.

@jjhursey
Member Author

I got a core from the failure; the backtrace is below. CI used an optimized build of PRRTE, so the core is more challenging to use, and I am not able to reproduce the failure in a debug build. I suspect that the bad FD error that precedes the crash is causing the stdio write to access bad memory.

(gdb) thread apply all bt

Thread 4 (Thread 0x3fffadaaf1b0 (LWP 135)):
#0  0x00003fffaf3a9f28 in select () from /lib64/libc.so.6
#1  0x00003fffadab9a34 in listen_thread () from /workspace/exports/prrte/lib/prrte/mca_oob_tcp.so
#2  0x00003fffaf488cd4 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fffaf3b7e94 in clone () from /lib64/libc.so.6

Thread 3 (Thread 0x3fffae32f1b0 (LWP 134)):
#0  0x00003fffaf3a9f28 in select () from /lib64/libc.so.6
#1  0x00003fffaf8c536c in listen_thread (obj=0x0) at base/ptl_base_listener.c:214
#2  0x00003fffaf488cd4 in start_thread () from /lib64/libpthread.so.0
#3  0x00003fffaf3b7e94 in clone () from /lib64/libc.so.6

Thread 2 (Thread 0x3fffafb069a0 (LWP 131)):
#0  0x00003fffaf48e92c in pthread_cond_wait@@GLIBC_2.17 () from /lib64/libpthread.so.0
#1  0x00003fffafa56f0c in prrte_state_base_track_procs () from /workspace/exports/prrte/lib/libprrte.so.2
#2  0x00003fffaf6ab918 in event_process_active_single_queue (base=0x10023d01eb0, activeq=0x10023d02580, max_to_process=2147483647, endtime=0x0)
    at event.c:1646
#3  0x00003fffaf6ac404 in event_process_active (base=0x10023d01eb0) at event.c:1738
#4  event_base_loop (base=0x10023d01eb0, flags=<optimized out>) at event.c:1961
#5  0x0000000010002728 in main ()

Thread 1 (Thread 0x3fffaebff1b0 (LWP 133)):
#0  0x00003fffad0941ac in push_stdin () from /workspace/exports/prrte/lib/prrte/mca_iof_hnp.so
#1  0x00003fffaf9a30f0 in pmix_server_stdin_fn () from /workspace/exports/prrte/lib/libprrte.so.2
#2  0x00003fffaf8111d4 in pmix_server_iofstdin (peer=0x3fffa031e4f0, buf=0x3fffaebfe410, cbfunc=0x3fffaf7dfb34 <op_cbfunc>, 
    cbdata=0x3fffa007df10) at server/pmix_server_ops.c:3853
#3  0x00003fffaf7f3244 in server_switchyard (peer=0x3fffa031e4f0, tag=104, buf=0x3fffaebfe410) at server/pmix_server.c:3704
#4  0x00003fffaf7f3b60 in pmix_server_message_handler (pr=0x3fffa031e4f0, hdr=0x3fffa031ecb4, buf=0x3fffaebfe410, cbdata=0x0)
    at server/pmix_server.c:3750
#5  0x00003fffaf8c3a7c in pmix_ptl_base_process_msg (fd=-1, flags=4, cbdata=0x3fffa031ebe0) at base/ptl_base_sendrecv.c:784
#6  0x00003fffaf6ab918 in event_process_active_single_queue (base=0x10023d8d540, activeq=0x10023d415a0, max_to_process=2147483647, endtime=0x0)
    at event.c:1646
#7  0x00003fffaf6ac404 in event_process_active (base=0x10023d8d540) at event.c:1738
#8  event_base_loop (base=0x10023d8d540, flags=<optimized out>) at event.c:1961
#9  0x00003fffaf825d50 in progress_engine (obj=0x10023d8d4f0) at runtime/pmix_progress_threads.c:232
#10 0x00003fffaf488cd4 in start_thread () from /lib64/libpthread.so.0
#11 0x00003fffaf3b7e94 in clone () from /lib64/libc.so.6

I'm still digging.

@jjhursey
Member Author

bot:ibm:retest

@jjhursey
Member Author

bot:ibm:retest

@rhc54
Contributor

rhc54 commented Jan 29, 2020

How are you using IOF? For some reason, it appears you have asked PMIx to forward the stdin from some process - not clear what program you are running.

@jjhursey
Member Author

It's the cycle test, so it's just running prun -np 1 hostname over and over again. So the core is from prte forwarding stdin from prun, even though there shouldn't be any stdin to forward. Unless Jenkins is somehow injecting into the stream... It's strange.
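
One way to probe the stdin hypothesis is to run the command with stdin explicitly attached to /dev/null, so that nothing from the CI harness can reach it. This is only a sketch of that idea and not something this PR does; `run_without_stdin` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): run a command with stdin
# redirected from /dev/null so the caller's stdin is never forwarded.
run_without_stdin() {
    "$@" < /dev/null
}
```

For example, `run_without_stdin prun -np 1 hostname` in the cycle loop would show whether the crash still occurs when prun has no input to forward.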

I just confirmed that the --enable-debug build passes even under CI (report here). The non-debug build, however, fails routinely. Outside of CI I have not been able to reproduce the failure manually. I'll keep trying to narrow it down.

bot:ibm:retest (back to a non-debug build)

@jjhursey
Member Author

bot:ibm:retest

@jjhursey jjhursey marked this pull request as ready for review January 31, 2020 22:49
@jjhursey
Member Author

jjhursey commented Feb 4, 2020

bot:ibm:retest

 * `hello_world` test which runs `hostname` and a Hello World PMIx client
 * `cycle` which tests running a bunch of jobs against the same
   PRRTE server set including a single node `hostname` and an
   init/finalize PMIx application across all slots.

Signed-off-by: Joshua Hursey <jhursey@us.ibm.com>
@jjhursey jjhursey merged commit 82d4f34 into openpmix:master Feb 4, 2020
@jjhursey jjhursey deleted the prrte-tests branch February 4, 2020 20:30
Labels
PRRTE_CI Trigger PRRTE CI on this PR