cfe/SCH deadlocks on exit on Linux #701

excaliburtb · 2020-05-13T18:10:33Z

Using modules:
95f34d2 cfe
c2bcebbc4d7e60a41b604e9acfc8af3c60b8536a osal
37ee8eb2d7ce006dc1570b920ae75a7ac5f89d27 psp

There seems to be a deadlock on exit involving the timers used by SCH.

See the stack trace:

Thread 2 (Thread 0xef3ffb40 (LWP 19797)):
#0  0xf7766430 in __kernel_vsyscall ()
#1  0xf773e436 in __pause_nocancel () from /lib/libpthread.so.0
#2  0xf7734995 in __pthread_mutex_lock_full () from /lib/libpthread.so.0
#3  0x0807bbbe in OS_BinSemGive_Impl (sem_id=4) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-binsem.c:250
#4  0x0807558c in OS_BinSemGive (sem_id=262148) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-binsem.c:187
#5  0xf7750628 in SCH_MinorFrameCallback (TimerId=589826) at /home/tbrain/cert_testbed/apps/sch_g/fsw/src/sch_custom.c:442
#6  0x0807b3a8 in OS_Timer_NoArgCallback (objid=589826, arg=0xf77503fe <SCH_MinorFrameCallback>) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-time.c:227
#7  0x0807b072 in OS_TimeBase_CallbackThread (timebase_id=524290) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-timebase.c:526
#8  0x0807df44 in OS_TimeBasePthreadEntry (arg=0x80002) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-timebase.c:305
#9  0xf7736bbc in start_thread () from /lib/libpthread.so.0
#10 0xf76550de in clone () from /lib/libc.so.6

Thread 1 (Thread 0xf7555700 (LWP 19780)):
#0  0xf7766430 in __kernel_vsyscall ()
#1  0xf773497f in __pthread_mutex_lock_full () from /lib/libpthread.so.0
#2  0x0807dc46 in OS_TimeBaseLock_Impl (local_id=2) at /home/tbrain/cert_testbed/osal/src/os/posix/src/os-impl-timebase.c:108
#3  0x0807b63a in OS_TimerDelete (timer_id=589826) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-time.c:422
#4  0x08075ab8 in OS_CleanUpObject (object_id=589826, arg=0xffc049e8) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-common.c:263
#5  0x08078877 in OS_ForEachObject (creator_id=0, callback_ptr=0x8075a1c <OS_CleanUpObject>, callback_arg=0xffc049e8) at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-idmap.c:1015
#6  0x08075b0a in OS_DeleteAllObjects () at /home/tbrain/cert_testbed/osal/src/os/shared/src/osapi-common.c:299
#7  0x08074ebe in OS_Application_Run () at /home/tbrain/cert_testbed/psp/fsw/pc-linux/src/cfe_psp_start.c:458
#8  0x080801d1 in main (argc=1, argv=0xffc04b64) at /home/tbrain/cert_testbed/osal/src/bsp/pc-linux/src/bsp_start.c:198

skliper · 2020-05-13T18:24:02Z

Is this just an order thing? Shouldn't applications get deleted before the timers?

EDIT: I see what you were saying now. The callback needs to get unregistered.

jphickey · 2020-05-13T19:41:40Z

Is this reproducible, or is it a race condition during shutdown? If a thread is canceled while it is holding a lock, this type of thing can happen. That's the risk with any sort of forced-exit situation, which is why it's preferable to get tasks to shut themselves down rather than forcibly delete them.

excaliburtb · 2020-05-13T20:01:57Z

Right. As far as I can tell, Ctrl-C'ing the process immediately kills the apps, which prevents them from doing any clean shutdown, which means cfe needs to do the cleanup. However, this behavior hasn't been a problem for the SCH code base for many versions of cfe. The question is: what changed? What should the app do? What should cfe/osal/psp do?

excaliburtb · 2020-05-13T20:02:58Z

This is an intermittent problem, but it occurs often enough that it isn't rare.

jphickey · 2020-05-13T20:43:06Z

"However, this behavior hasn't been a problem for the SCH code base for many versions of cfe"

Is this to say you are finding this more frequently occurring in the latest baseline vs. older baselines?

If I'm interpreting correctly, you are running the latest bleeding-edge baseline, which would have changed the CTRL+C handling so that it is treated as an exception and therefore flows through the ER log/processor reset sequence. This will still do a forced delete of all tasks, but it may change the timing of when that occurs, and maybe the order of operations? But that would have changed only in the most recent baseline.

excaliburtb · 2020-05-13T20:46:16Z

As far as I know, it never occurred in the older baselines. And yes, I am working with the bleeding-edge master branches (see the initial comment for hashes).

jphickey · 2020-05-18T11:22:41Z

I am looking into this one, but I am unable to replicate the issue, as I'm not sure what version/config of SCH is used here. However, it could simply be that OS_ForEachObject, which drives the cleanup operations, finds the tasks and semaphores before the timers.

jphickey · 2020-05-18T12:41:57Z

@excaliburtb Is the backtrace posted in the initial summary showing every thread that still existed in the process or just the ones that were "stuck"?

In particular, I'm wondering about the task that runs SCH_AppMain, which is not shown above. This would normally be inside a pthread_cond_wait() call, but it may have been woken up by the SIGINT. My hypothesis is that it got woken up but was deleted before it could release the lock.

skliper · 2020-06-05T13:40:04Z

Resolved by nasa/osal#470

skliper added the bug label May 16, 2020

jphickey self-assigned this May 18, 2020

This was referenced May 18, 2020

Fix #293, Expand API for object queries nasa/osal#469

Merged

Binary Semaphore locked after thread cancellation nasa/osal#470

Closed

Order of operations on OS_DeleteAllObjects nasa/osal#471

Closed

Fix #470, Binary sem task delete issues nasa/osal#472

Merged

skliper added this to the 6.8.0 milestone May 19, 2020

skliper closed this as completed Jun 5, 2020

skliper mentioned this issue May 20, 2022

Stopping an APP that has a locked mutex using CFE_ES_StopAppCmd BUG #2107

Open
