CFE intermittently showing invalid message ID errors #1073
Comments
I am starting to dig deeper into this - don't have any more info just yet - but was wondering if perhaps other users have noticed this too? Given the intermittent nature and that it takes hours to reproduce this, it might be a while before I can track it down sufficiently. Seems like it could be a race condition where the buffer is perhaps reused before it was freed, or the routing in SB delivered it to the wrong place somehow.
Might be worth sending a message to the mailing list. I'll post on the internal MS Teams channel.
I am pretty sure it is something fairly recently added. As I said, I cannot pinpoint exactly when (yet), but I do frequently leave CFE running in one of the many terminal windows buried on my desktop - and this didn't happen before. So only users of bleeding-edge dev builds might see this, not official release users. For users that do have a build of the latest code and the ability to let it run for an extended period - please try it and comment if you see this too. I am wondering if it might be affected by multi-core systems or something like that.
Ran overnight (SIMULATION=native on Ubuntu 18.04 VM) w/ no occurrences yet. Just the flywheel toggle.
I'll do a debug build with the
I also ran two instances overnight (in VM) - one was a vanilla/unmodified "main" build and the other had debug instrumentation added. Since last night the unmodified build has accumulated 3 errors:
Of course the build with debug instrumentation added has accumulated none yet :-/ ...
Typical. I updated the HK messages to "1 Hz" in sch_lab and then made that "1 Hz" actually 100 Hz via the time service timer config; no errors yet, but hopefully I can trigger one... All we send is HK messages and the time-at-tone message (that I can think of), and it happens in a burst. I haven't reviewed priorities in a while...
OK - I was able to catch several of these incidents and get core files of the system state when it happened, along with one where I had also started the perf log. With another day of runtime, both of my builds eventually accumulated some errors. These core files confirm that SCH_LAB was actively running and generating the SEND_HK requests, while the CFE_TIME "tone task" was also actively running to generate the CFE_TIME_TONE_CMD_MID (0x1810) message. Ultimately it looks like the same buffer was issued to both messages at the same time. After doing some inspection of CFE SB, it appears to be missing locks here. In particular, getting the buffer from the pool used to be done while holding the SB global lock prior to the recent SB refactor, but in the current code it is not. I also see other areas of SB that are accessing globals and therefore should be locked, too... will need to do a scrub of this code and make sure it has all the locks it needs.
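To illustrate the kind of guard being described, here is a minimal sketch of taking the buffer descriptor from the pool only while the SB global lock is held, so two concurrent senders (e.g. SCH_LAB's SEND_HK request and the TIME tone task) cannot be handed the same descriptor. CFE_SB_LockSharedData/CFE_SB_UnlockSharedData are the existing SB-internal lock helpers; SB_AllocBufferDescriptor is a hypothetical stand-in for the real pool-allocation routine, whose exact name and signature differ between cFE versions. This is illustrative only, not the actual patch.

```c
#include "cfe.h"         /* cFE public API */
#include "cfe_sb_priv.h" /* SB internal lock helpers */

/* Hypothetical stand-in for the real pool-allocation routine */
extern CFE_SB_BufferD_t *SB_AllocBufferDescriptor(CFE_SB_MsgId_t MsgId, size_t Size);

/*
 * Sketch of the missing guard: allocate the descriptor while the SB global
 * lock is held, so the same pool entry can never be issued to two
 * concurrent senders.  Illustrative only, not the actual fix.
 */
CFE_SB_BufferD_t *SB_GetDescriptorLocked(CFE_SB_MsgId_t MsgId, size_t Size)
{
    CFE_SB_BufferD_t *BufDscPtr;

    CFE_SB_LockSharedData(__func__, __LINE__);

    /* Pool bookkeeping is shared global state: touch it only while locked */
    BufDscPtr = SB_AllocBufferDescriptor(MsgId, Size);

    CFE_SB_UnlockSharedData(__func__, __LINE__);

    /* Caller handles a NULL return (no buffers available) outside the lock */
    return BufDscPtr;
}
```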
Significant refactor of many SB API calls to address inconsistencies with respect to locking and unlocking of global data structures. First this updates the definition of CFE_SB_PipeId_t to use the CFE_ES_ResourceID_t base type, and a new ID range. Notably this prevents direct access to the CFE_SB.PipeTbl global, forcing code to go through the proper lookup routine, which should only be done while locked.
All API implementations follow the same general pattern:
- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed), do final update, and unlock again
- Send all events
All error counters should be updated at the end, while still locked. All event processing is deferred to the end of each function, after all other processing is done.
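A skeleton of that lock/unlock ordering applied to a generic SB API call is sketched below. Only the ordering is the point; the helpers (SB_PipeEntryIsValid, SB_MarkPipeEntryDeleted, SB_CallOutToOsal, SB_IncrementErrorCounter) and the event ID are hypothetical placeholders, not names from the actual refactor.

```c
#include "cfe.h"         /* cFE public API */
#include "cfe_sb_priv.h" /* SB internal lock helpers */

/* Hypothetical internal helpers standing in for the real pipe-table code */
extern bool  SB_PipeEntryIsValid(CFE_SB_PipeId_t PipeId);
extern void  SB_MarkPipeEntryDeleted(CFE_SB_PipeId_t PipeId);
extern int32 SB_CallOutToOsal(CFE_SB_PipeId_t PipeId);
extern void  SB_IncrementErrorCounter(void);

#define SB_EXAMPLE_ERR_EID 99 /* hypothetical event ID */

/* Skeleton of the lock/unlock pattern described in the commit message above */
int32 SB_ExampleApiCall(CFE_SB_PipeId_t PipeId)
{
    int32 Status       = CFE_SUCCESS;
    bool  SendErrEvent = false;

    /* 1. Initial checks/queries while unlocked (argument validation, etc.) */

    /* 2. Lock SB global before touching the pipe table / routing info */
    CFE_SB_LockSharedData(__func__, __LINE__);

    if (!SB_PipeEntryIsValid(PipeId))
    {
        Status       = CFE_SB_BAD_ARGUMENT;
        SendErrEvent = true;
    }
    else
    {
        SB_MarkPipeEntryDeleted(PipeId);
    }

    /* 3. Unlock before invoking other subsystems (e.g. OSAL queue ops) */
    CFE_SB_UnlockSharedData(__func__, __LINE__);

    if (Status == CFE_SUCCESS)
    {
        Status = SB_CallOutToOsal(PipeId);
    }

    /* 4. Re-lock (if needed) for final bookkeeping; error counters are
     *    updated at the end, while still locked */
    CFE_SB_LockSharedData(__func__, __LINE__);
    if (Status != CFE_SUCCESS)
    {
        SB_IncrementErrorCounter();
    }
    CFE_SB_UnlockSharedData(__func__, __LINE__);

    /* 5. Send all events last, after all other processing is done */
    if (SendErrEvent)
    {
        CFE_EVS_SendEvent(SB_EXAMPLE_ERR_EID, CFE_EVS_EventType_ERROR,
                          "Example call failed, bad pipe ID");
    }

    return Status;
}
```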
Describe the bug
When leaving CFE running for long periods of time, I am seeing occasional errors related to bad message IDs/Commands appear.
For instance:
EVS Port1 66/1/CFE_EVS 5: Invalid command packet, Message ID = 0x00001810
I was actually able to catch one of these in a core dump, and I can confirm that SBBufPtr->Msg looks like a perfectly well-formed message with MsgID of 0x1810. This particular example was caught at cFE/fsw/cfe-core/src/evs/cfe_evs_task.c, line 378 (commit 56397a3).
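For context, the error event comes from the usual command-pipe dispatch pattern: the task compares the incoming message ID against the ones it expects on that pipe and raises an error event otherwise. A simplified sketch of that pattern is below; it is not a copy of cfe_evs_task.c, and the MID values and event ID are hypothetical placeholders.

```c
#include "cfe.h"

/* Hypothetical values/IDs for illustration only */
#define EVS_EXAMPLE_CMD_MID_VALUE     0x1801
#define EVS_EXAMPLE_SEND_HK_MID_VALUE 0x1809
#define EVS_EXAMPLE_INVALID_MID_EID   5

/*
 * Simplified sketch of a core-app command-pipe dispatch: any message ID not
 * expected on this pipe falls through to the default branch, which is what
 * produces the "Invalid command packet, Message ID = ..." event shown above.
 */
void EVS_ExampleProcessCommandPacket(const CFE_SB_Buffer_t *SBBufPtr)
{
    CFE_SB_MsgId_t MsgId = CFE_SB_INVALID_MSG_ID;

    CFE_MSG_GetMsgId(&SBBufPtr->Msg, &MsgId);

    switch (CFE_SB_MsgIdToValue(MsgId))
    {
        case EVS_EXAMPLE_CMD_MID_VALUE:
            /* dispatch ground commands by command code */
            break;

        case EVS_EXAMPLE_SEND_HK_MID_VALUE:
            /* build and send the housekeeping telemetry packet */
            break;

        default:
            /* Unexpected MID on this pipe -> the event seen in this report */
            CFE_EVS_SendEvent(EVS_EXAMPLE_INVALID_MID_EID, CFE_EVS_EventType_ERROR,
                              "Invalid command packet, Message ID = 0x%08lX",
                              (unsigned long)CFE_SB_MsgIdToValue(MsgId));
            break;
    }
}
```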
What is particularly mysterious is that 0x1810 is CFE_TIME_TONE_CMD_MID... This MID should be delivered only to CFE_TIME - EVS does not subscribe to it - yet for some reason it appears to have been delivered to EVS.
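Routing in SB is driven purely by subscriptions, so only a pipe that explicitly subscribed to the tone MID should ever receive it. A rough sketch of how such a subscription is made is below; the pipe name, depth, function name, and the locally defined MID value are illustrative, not taken from the actual CFE_TIME source.

```c
#include "cfe.h"

/* Default numeric value of CFE_TIME_TONE_CMD_MID in the shipped config,
 * used here as a plain value so the example is version-agnostic */
#define EXAMPLE_TIME_TONE_MID_VALUE 0x1810

/*
 * Roughly how an app ends up receiving a given MID: it creates a pipe and
 * subscribes that pipe to the message IDs it wants.  Only pipes with a
 * matching subscription should ever be handed a buffer for that MID, which
 * is why the tone command showing up on an EVS pipe points at an SB-internal
 * problem rather than a subscription mistake.
 */
int32 TIME_SubscribeToneExample(void)
{
    CFE_SB_PipeId_t TonePipe;
    int32           Status;

    Status = CFE_SB_CreatePipe(&TonePipe, 16, "TONE_EXAMPLE_PIPE");
    if (Status == CFE_SUCCESS)
    {
        /* Only this pipe (and any other explicit subscribers) should ever
         * see the 0x1810 tone command */
        Status = CFE_SB_Subscribe(CFE_SB_ValueToMsgId(EXAMPLE_TIME_TONE_MID_VALUE),
                                  TonePipe);
    }

    return Status;
}
```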
To Reproduce
Build CFE in the default debug config (SIMULATION=native, etc.). Start up the CFE core and just let it run. It may take several hours for the first error to appear. CFE appears to continue running normally afterwards, though.
Expected behavior
EVS (or other apps!) should not randomly see messages it didn't subscribe to.
System observed on:
Ubuntu 20.04 (native)
Additional context
Not sure when this started because everything seems fine when debugging CFE for shorter time periods. But over the last couple weeks I've noticed these messages randomly appearing in terminal windows where I've started CFE and forgotten about it (overnight or longer).
Although I cite EVS here (because it's the one I actually caught in a core file), this phenomenon occurs in other apps too (SB, TIME, etc.), all randomly getting a delivered message ID they didn't subscribe to.
Reporter Info
Joseph Hickey, Vantage Systems, Inc.