CFE intermittently showing invalid message ID errors #1073
Comments
I am starting to dig deeper into this - don't have any more info just yet - but was wondering if perhaps other users have noticed this too? Given the intermittent nature and that it takes hours to reproduce this, it might be a while before I can track it down sufficiently. Seems like it could be a race condition where the buffer is perhaps reused before it was freed, or the routing in SB delivered it to the wrong place somehow.
Might be worth sending a message to the mailing list. I'll post on the internal MS Teams channel.
I am pretty sure it is something fairly recently added. As I said, I cannot pinpoint exactly when (yet), but I do frequently leave CFE running in one of the many terminal windows buried on my desktop - and this didn't happen before. So only users of bleeding-edge dev builds might see this, not official release users. For users that do have a build of the latest code and the ability to let it run for an extended period - please try it and comment if you see this too. I am wondering if it might be affected by multi-core systems or something like that.
Ran overnight (SIMULATION=native on Ubuntu 18.04 VM) w/ no occurrences yet. Just the flywheel toggle.
I'll do a debug build with the
I also ran two instances overnight (in VM) - one was a vanilla/unmodified "main" build and the other had debug instrumentation added. Since last night the unmodified build has accumulated 3 errors:
Of course the build with debug instrumentation added has accumulated none yet :-/ ...
Typical. I updated the HK messages to "1 Hz" in sch_lab and then made that "1 Hz" actually 100 Hz via the time service timer config; no errors yet, but hopefully I can trigger one... All we send is HK messages and the time-at-tone message (that I can think of), and it happens in a burst. I haven't reviewed priorities in a while...
OK - I was able to catch several of these incidents and get core files of the system state when it happened, along with one where I had also started the perf log. With another day of runtime, both of my builds eventually accumulated some errors. These core files confirm that SCH_LAB was actively running and generating the SEND_HK requests, while the CFE_TIME "tone task" was also actively running to generate the CFE_TIME_TONE_CMD_MID (0x1810) message. Ultimately it looks like the same buffer was issued to both messages at the same time. After doing some inspection of CFE SB, it appears to be missing locks here. In particular, getting the buffer from the pool used to be done while holding the SB global lock prior to the recent SB refactor, but in the current code it is not. I also see other areas of SB that are accessing globals and therefore should be locked, too... will need to do a scrub of this code and make sure it has all the locks it needs.
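To illustrate the kind of guard being described, here is a minimal sketch of taking the buffer descriptor from the pool only while the SB global lock is held, so two concurrent senders (e.g. SCH_LAB's SEND_HK request and the TIME tone task) cannot be handed the same descriptor. CFE_SB_LockSharedData/CFE_SB_UnlockSharedData are the existing SB-internal lock helpers; SB_AllocBufferDescriptor is a hypothetical stand-in for the real pool-allocation routine, whose exact name and signature differ between cFE versions. This is illustrative only, not the actual patch.

```c
#include "cfe.h"         /* cFE public API */
#include "cfe_sb_priv.h" /* SB internal lock helpers */

/* Hypothetical stand-in for the real pool-allocation routine */
extern CFE_SB_BufferD_t *SB_AllocBufferDescriptor(CFE_SB_MsgId_t MsgId, size_t Size);

/*
 * Sketch of the missing guard: allocate the descriptor while the SB global
 * lock is held, so the same pool entry can never be issued to two
 * concurrent senders.  Illustrative only, not the actual fix.
 */
CFE_SB_BufferD_t *SB_GetDescriptorLocked(CFE_SB_MsgId_t MsgId, size_t Size)
{
    CFE_SB_BufferD_t *BufDscPtr;

    CFE_SB_LockSharedData(__func__, __LINE__);

    /* Pool bookkeeping is shared global state: touch it only while locked */
    BufDscPtr = SB_AllocBufferDescriptor(MsgId, Size);

    CFE_SB_UnlockSharedData(__func__, __LINE__);

    /* Caller handles a NULL return (no buffers available) outside the lock */
    return BufDscPtr;
}
```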
Significant refactor of many SB API calls to address inconsistencies with respect to locking and unlocking of global data structures. First this updates the definition of CFE_SB_PipeId_t to use the CFE_ES_ResourceID_t base type, and a new ID range. Notably this prevents direct access to the CFE_SB.PipeTbl global, forcing code to go through the proper lookup routine, which should only be done while locked.
All API implementations follow the same general pattern:
- Initial checks/queries while unlocked
- Lock SB global
- Lookups and/or modifications to the pipe table/routing info
- Unlock SB global
- Invoke other subsystems (e.g. OSAL)
- Re-lock SB global (if needed), do final update, and unlock again
- Send all events
All error counters should be updated at the end, while still locked. All event processing is deferred to the end of each function, after all other processing is done.
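A skeleton of that lock/unlock ordering applied to a generic SB API call is sketched below. Only the ordering is the point; the helpers (SB_PipeEntryIsValid, SB_MarkPipeEntryDeleted, SB_CallOutToOsal, SB_IncrementErrorCounter) and the event ID are hypothetical placeholders, not names from the actual refactor.

```c
#include "cfe.h"         /* cFE public API */
#include "cfe_sb_priv.h" /* SB internal lock helpers */

/* Hypothetical internal helpers standing in for the real pipe-table code */
extern bool  SB_PipeEntryIsValid(CFE_SB_PipeId_t PipeId);
extern void  SB_MarkPipeEntryDeleted(CFE_SB_PipeId_t PipeId);
extern int32 SB_CallOutToOsal(CFE_SB_PipeId_t PipeId);
extern void  SB_IncrementErrorCounter(void);

#define SB_EXAMPLE_ERR_EID 99 /* hypothetical event ID */

/* Skeleton of the lock/unlock pattern described in the commit message above */
int32 SB_ExampleApiCall(CFE_SB_PipeId_t PipeId)
{
    int32 Status       = CFE_SUCCESS;
    bool  SendErrEvent = false;

    /* 1. Initial checks/queries while unlocked (argument validation, etc.) */

    /* 2. Lock SB global before touching the pipe table / routing info */
    CFE_SB_LockSharedData(__func__, __LINE__);

    if (!SB_PipeEntryIsValid(PipeId))
    {
        Status       = CFE_SB_BAD_ARGUMENT;
        SendErrEvent = true;
    }
    else
    {
        SB_MarkPipeEntryDeleted(PipeId);
    }

    /* 3. Unlock before invoking other subsystems (e.g. OSAL queue ops) */
    CFE_SB_UnlockSharedData(__func__, __LINE__);

    if (Status == CFE_SUCCESS)
    {
        Status = SB_CallOutToOsal(PipeId);
    }

    /* 4. Re-lock (if needed) for final bookkeeping; error counters are
     *    updated at the end, while still locked */
    CFE_SB_LockSharedData(__func__, __LINE__);
    if (Status != CFE_SUCCESS)
    {
        SB_IncrementErrorCounter();
    }
    CFE_SB_UnlockSharedData(__func__, __LINE__);

    /* 5. Send all events last, after all other processing is done */
    if (SendErrEvent)
    {
        CFE_EVS_SendEvent(SB_EXAMPLE_ERR_EID, CFE_EVS_EventType_ERROR,
                          "Example call failed, bad pipe ID");
    }

    return Status;
}
```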
Describe the bug
When leaving CFE running for long periods of time, I am seeing occasional errors related to bad message IDs/Commands appear.
For instance:
EVS Port1 66/1/CFE_EVS 5: Invalid command packet, Message ID = 0x00001810
I was actually able to catch one of these in a core dump, and I can confirm that SBBufPtr->Msg looks like a perfectly well-formed message with MsgID of 0x1810. This particular example was caught at cFE/fsw/cfe-core/src/evs/cfe_evs_task.c, line 378 (commit 56397a3).
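For context, the error event comes from the usual command-pipe dispatch pattern: the task compares the incoming message ID against the ones it expects on that pipe and raises an error event otherwise. A simplified sketch of that pattern is below; it is not a copy of cfe_evs_task.c, and the MID values and event ID are hypothetical placeholders.

```c
#include "cfe.h"

/* Hypothetical values/IDs for illustration only */
#define EVS_EXAMPLE_CMD_MID_VALUE     0x1801
#define EVS_EXAMPLE_SEND_HK_MID_VALUE 0x1809
#define EVS_EXAMPLE_INVALID_MID_EID   5

/*
 * Simplified sketch of a core-app command-pipe dispatch: any message ID not
 * expected on this pipe falls through to the default branch, which is what
 * produces the "Invalid command packet, Message ID = ..." event shown above.
 */
void EVS_ExampleProcessCommandPacket(const CFE_SB_Buffer_t *SBBufPtr)
{
    CFE_SB_MsgId_t MsgId = CFE_SB_INVALID_MSG_ID;

    CFE_MSG_GetMsgId(&SBBufPtr->Msg, &MsgId);

    switch (CFE_SB_MsgIdToValue(MsgId))
    {
        case EVS_EXAMPLE_CMD_MID_VALUE:
            /* dispatch ground commands by command code */
            break;

        case EVS_EXAMPLE_SEND_HK_MID_VALUE:
            /* build and send the housekeeping telemetry packet */
            break;

        default:
            /* Unexpected MID on this pipe -> the event seen in this report */
            CFE_EVS_SendEvent(EVS_EXAMPLE_INVALID_MID_EID, CFE_EVS_EventType_ERROR,
                              "Invalid command packet, Message ID = 0x%08lX",
                              (unsigned long)CFE_SB_MsgIdToValue(MsgId));
            break;
    }
}
```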
What is particularly mysterious is that 0x1810 is CFE_TIME_TONE_CMD_MID... This MID should be delivered only to CFE_TIME - EVS does not subscribe to it - yet for some reason it appears to have been delivered to EVS.
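Routing in SB is driven purely by subscriptions, so only a pipe that explicitly subscribed to the tone MID should ever receive it. A rough sketch of how such a subscription is made is below; the pipe name, depth, function name, and the locally defined MID value are illustrative, not taken from the actual CFE_TIME source.

```c
#include "cfe.h"

/* Default numeric value of CFE_TIME_TONE_CMD_MID in the shipped config,
 * used here as a plain value so the example is version-agnostic */
#define EXAMPLE_TIME_TONE_MID_VALUE 0x1810

/*
 * Roughly how an app ends up receiving a given MID: it creates a pipe and
 * subscribes that pipe to the message IDs it wants.  Only pipes with a
 * matching subscription should ever be handed a buffer for that MID, which
 * is why the tone command showing up on an EVS pipe points at an SB-internal
 * problem rather than a subscription mistake.
 */
int32 TIME_SubscribeToneExample(void)
{
    CFE_SB_PipeId_t TonePipe;
    int32           Status;

    Status = CFE_SB_CreatePipe(&TonePipe, 16, "TONE_EXAMPLE_PIPE");
    if (Status == CFE_SUCCESS)
    {
        /* Only this pipe (and any other explicit subscribers) should ever
         * see the 0x1810 tone command */
        Status = CFE_SB_Subscribe(CFE_SB_ValueToMsgId(EXAMPLE_TIME_TONE_MID_VALUE),
                                  TonePipe);
    }

    return Status;
}
```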
To Reproduce
Build CFE in the default debug config (SIMULATION=native, etc.). Start up the CFE core and just let it run. It may take several hours for the first error to appear. CFE appears to continue running normally afterwards, though.
Expected behavior
EVS (or other apps!) should not randomly see messages it didn't subscribe to.
System observed on:
Ubuntu 20.04 (native)
Additional context
Not sure when this started because everything seems fine when debugging CFE for shorter time periods. But over the last couple weeks I've noticed these messages randomly appearing in terminal windows where I've started CFE and forgotten about it (overnight or longer).
Although I cite EVS here (because it's the one I actually caught in a core file), this phenomenon occurs in other apps too (SB, TIME, etc.), all randomly getting a delivered message ID they didn't subscribe to.
Reporter Info
Joseph Hickey, Vantage Systems, Inc.