
Improve CFE_TIME_GetReference error handling #1544

Closed
skliper opened this issue May 17, 2021 · 3 comments · Fixed by #1593 or #1619

@skliper
Contributor

skliper commented May 17, 2021

Is your feature request related to a problem? Please describe.
Hard-coding the retry count to 4 is a bit "magic": it is not clear how much margin this provides, there is no statistical visibility into how many conflicts actually occur, there is no error reporting, and the calling routine has no way to take action or track failures. It is also not clear whether these values must always be kept in sync, what happens if they are not, or what is actually intended here.

Describe the solution you'd like
Clarify the design, and track/monitor/report performance and errors. If there are uses where correctness is critical, it may be necessary to deconflict (protect the query from concurrent updates). Consider providing an API that is slower but always correct, versus one that is faster but possibly invalid, if that trade-off is really needed.

Describe alternatives you've considered
None

Additional context
Code review

Requester Info
Jacob Hageman - NASA/GSFC

@jphickey
Contributor

The intent of this loop is to be able to read the "ReferenceState" structure in a lock-less manner. This avoids the possibility of a task switch which could happen if a lock/mutex was used here. It also allows multiple tasks to read the structure at the same time.

The "VersionCounter" within the structure is incremented every time the structure is updated by the tone task or a command. The code checks that, after copying all the data out, the "VersionCounter" within the global data structure still matches the expected value. If it does not match, the structure was updated during the read and the copied data might not be consistent, so it reads it again.

The "RetryCount" simply puts a reasonable limit on the number of times this will happen.

It is important to note that the possibility of a concurrent update is already very low - with the buffering done now, it's quite unlikely it will happen even once. So in reality the number of retries just has to be more than 1 to handle this remote possibility, but if it retries more than once and the version counter still does not match, then something is likely broken.

The possibility is very remote because there are multiple copies of the state structure stored in the global, for persistence across updates. In order to get to the point of overwriting the existing entry, it would require CFE_TIME_REFERENCE_BUF_DEPTH (4) updates to occur between the time the caller started the call to CFE_TIME_GetReference and the time it finished reading the data (a fairly short bit of code). For tasks which change this structure, the worst case is that the "CFE_TIME_Local1HzStateMachine" may do up to 3 updates to the global structure in rapid succession. So it would require this to occur, in addition to some other command and/or the CFE_TIME_ToneUpdate to run, at this exact time, to get to the point of needing a retry. For this unlikely combination to happen multiple times in a row would be excessively unlikely, to the point that if it does happen, it is almost certainly broken.

So to summarize:

  • The likelihood of needing a single retry is already low, as it implies multiple background updates occurred exactly at the time this task is attempting to read the structure.
  • But the likelihood is not zero, it is theoretically possible, hence why this "retry" exists.
  • So at least 2 attempts should be made, just to cover this possibility.
  • The current "hard-coded" limit of 4 is simply double this theoretical minimum of 2 attempts - if it is ever reached, then something is assuredly broken.
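The lock-less versioned-read pattern described above can be sketched as follows. This is a minimal illustration, not the actual cFE code: the struct fields, function names, and retry macro are assumptions, and a real implementation would also need volatile accesses or memory barriers to be safe against compiler reordering.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified reference state (illustrative names only). */
typedef struct
{
    uint32_t VersionCounter; /* bumped by the tone task / commands on every update */
    uint32_t Seconds;
    uint32_t Subseconds;
} TimeRefState;

#define GETREF_RETRY_LIMIT 4 /* mirrors the hard-coded limit discussed above */

static TimeRefState GlobalRef; /* the shared state that writers update */

/* Lock-less read: snapshot the version counter, copy the data, then verify
 * the counter did not change mid-copy. Returns false only if every retry
 * is exhausted, which per the discussion above implies something is broken. */
bool GetReferenceSketch(TimeRefState *Out)
{
    uint32_t Retry;
    uint32_t StartVersion;

    for (Retry = 0; Retry < GETREF_RETRY_LIMIT; ++Retry)
    {
        StartVersion = GlobalRef.VersionCounter;

        memcpy(Out, &GlobalRef, sizeof(*Out));

        if (GlobalRef.VersionCounter == StartVersion)
        {
            return true; /* consistent snapshot obtained */
        }
        /* A writer bumped the counter mid-copy; the copy may be torn, retry. */
    }

    memset(Out, 0, sizeof(*Out)); /* all-zeros fallback on failure */
    return false;
}
```

The advantage of this scheme over a mutex is that readers never block writers (or each other); the cost is that a reader must be prepared to retry when a writer slips in mid-copy.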

@skliper
Contributor Author

skliper commented May 18, 2021

Sounds like any time it takes more than 2 attempts there should be some indication to the user that there is abnormal behavior, and if all 4 attempts are exhausted there should be an indication that the system is broken/unhealthy.

@jphickey
Contributor

Yes, the main issue with this code is that a failure is not really reported at all. If the retry limit is reached then at least a syslog entry is warranted; consider also an error event, as I would consider this a significant error (something is fundamentally broken).

jphickey added a commit to jphickey/cFE that referenced this issue Jun 1, 2021
Use one of the unused time state bits to indicate if an error has
occurred where CFE_TIME_GetReference was not able to get a consistent
copy of the reference state data.

In a functional system this should never occur - there should be at
most one retry, which only happens in the event there was a burst
of updates (more than 4) concurrently with reading the structure.

The previous implementation did not report or handle the condition
at all; this change at least sets a TLM status bit and returns a
reference struct filled with all zeros.
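The telemetry-bit approach from the commit message can be sketched as a latched status flag. The bit value, flag word, and function names below are assumptions for illustration, not the actual cFE definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignment: the commit uses one of the unused time
 * state bits; the real cFE bit position and telemetry word differ. */
#define TIME_FLAG_GET_REF_ERR 0x0400u

static uint16_t ClockStateFlags; /* stand-in for the TLM clock-state word */

/* Latch the error bit when the retry limit in GetReference is exhausted. */
void SetGetRefErrorFlag(void)
{
    ClockStateFlags |= TIME_FLAG_GET_REF_ERR;
}

/* Ground tools (or onboard monitors) can then observe the latched bit. */
bool GetRefErrorFlagSet(void)
{
    return (ClockStateFlags & TIME_FLAG_GET_REF_ERR) != 0;
}
```

Latching (rather than clearing on the next successful read) makes an intermittent failure visible in telemetry even if the system appears to recover.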
astrogeco added a commit that referenced this issue Jun 10, 2021