
Improve CFE_TIME_GetReference error handling #1544

Closed
skliper opened this issue May 17, 2021 · 3 comments · Fixed by #1593 or #1619

@skliper
Contributor

skliper commented May 17, 2021

Is your feature request related to a problem? Please describe.
Hard-coding the retry count to 4 is a bit "magic": it is not clear how much margin this provides, there is no statistical visibility into how many conflicts actually occur, there is no error reporting, and the calling routine has no way to take action or track failures. It is also not clear whether these values must always be kept in sync, what happens if they are not, or what is actually intended here.

Describe the solution you'd like
Clarify the design, and track/monitor/report performance and errors. If there are uses where correctness is critical, it may be necessary to deconflict (protect the query from concurrent updates). Consider providing an API that is slower but always correct, versus one that is faster but possibly invalid, if that trade-off is really needed.

Describe alternatives you've considered
None

Additional context
Code review

Requester Info
Jacob Hageman - NASA/GSFC

@jphickey
Contributor

The intent of this loop is to be able to read the "ReferenceState" structure in a lock-less manner. This avoids the possibility of a task switch which could happen if a lock/mutex was used here. It also allows multiple tasks to read the structure at the same time.

The "VersionCounter" within the structure is incremented every time the structure is updated by the tone task or a command. The code checks that, after copying all the data out, the "VersionCounter" within the global data structure still matches the expected value. If it does not match, the structure was updated during the read and the copied data might not be consistent, so it reads it again.

The "RetryCount" simply puts a reasonable limit on the number of times this will happen.

It is important to note that the possibility of a concurrent update is already very low - with the buffering done now, it's quite unlikely it will happen even once. So in reality the number of retries just has to be more than 1 to handle this remote possibility, but if it retries more than once and the version counter still does not match, then something is likely broken.

The possibility is very remote because there are multiple copies of the state structure stored in the global, for persistence across updates. In order to get to the point of overwriting the existing entry, it would require CFE_TIME_REFERENCE_BUF_DEPTH (4) updates to occur between the time the caller started the call to CFE_TIME_GetReference and the time it finished reading the data (a fairly short bit of code). For tasks which change this structure, the worst case is that the "CFE_TIME_Local1HzStateMachine" may do up to 3 updates to the global structure in rapid succession. So it would require this to occur, in addition to some other command and/or the CFE_TIME_ToneUpdate to run, at this exact time, to get to the point of needing a retry. For this unlikely combination to happen multiple times in a row would be excessively unlikely, to the point that if it does happen, it is almost certainly broken.

So to summarize:

  • The likelihood of needing a single retry is already low, as it implies multiple background updates occurred exactly at the time this task is attempting to read the structure.
  • But the likelihood is not zero, it is theoretically possible, hence why this "retry" exists.
  • So at least 2 attempts should be made, just to cover this possibility.
  • The current "hard-coded" limit of 4 is simply double this theoretical minimum of 2 attempts - if it is ever reached, then something is assuredly broken.
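The lock-less versioned-read pattern described above can be sketched as follows. This is a minimal illustration, not the actual cFE code: the struct fields, function names, and retry macro are assumptions, and a real implementation would also need volatile accesses or memory barriers to be safe against compiler reordering.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified reference state (illustrative names only). */
typedef struct
{
    uint32_t VersionCounter; /* bumped by the tone task / commands on every update */
    uint32_t Seconds;
    uint32_t Subseconds;
} TimeRefState;

#define GETREF_RETRY_LIMIT 4 /* mirrors the hard-coded limit discussed above */

static TimeRefState GlobalRef; /* the shared state that writers update */

/* Lock-less read: snapshot the version counter, copy the data, then verify
 * the counter did not change mid-copy. Returns false only if every retry
 * is exhausted, which per the discussion above implies something is broken. */
bool GetReferenceSketch(TimeRefState *Out)
{
    uint32_t Retry;
    uint32_t StartVersion;

    for (Retry = 0; Retry < GETREF_RETRY_LIMIT; ++Retry)
    {
        StartVersion = GlobalRef.VersionCounter;

        memcpy(Out, &GlobalRef, sizeof(*Out));

        if (GlobalRef.VersionCounter == StartVersion)
        {
            return true; /* consistent snapshot obtained */
        }
        /* A writer bumped the counter mid-copy; the copy may be torn, retry. */
    }

    memset(Out, 0, sizeof(*Out)); /* all-zeros fallback on failure */
    return false;
}
```

The advantage of this scheme over a mutex is that readers never block writers (or each other); the cost is that a reader must be prepared to retry when a writer slips in mid-copy.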

@skliper
Contributor Author

skliper commented May 18, 2021

Sounds like any time it takes more than 2 attempts there should be some indication to the user that there is abnormal behavior, and if all 4 attempts are exhausted there should be an indication that the system is broken/unhealthy.

@jphickey
Contributor

Yes, the main issue with this code is that a failure is not really reported at all. If the retry limit is reached then at least a syslog entry is warranted; consider also an error event, as I would consider this a significant error (something is fundamentally broken).

jphickey added a commit to jphickey/cFE that referenced this issue Jun 1, 2021
Use one of the unused time state bits to indicate if an error has
occurred where CFE_TIME_GetReference was not able to get a consistent
copy of the reference state data.

In a functional system this should never occur - there should be at
most one retry, which only happens in the event there was a burst
of updates (more than 4) concurrently with reading the structure.

The previous implementation did not report or handle the condition
at all; this change at least sets a TLM status bit and returns a
reference struct filled with all zeros.
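The telemetry-bit approach from the commit message can be sketched as a latched status flag. The bit value, flag word, and function names below are assumptions for illustration, not the actual cFE definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit assignment: the commit uses one of the unused time
 * state bits; the real cFE bit position and telemetry word differ. */
#define TIME_FLAG_GET_REF_ERR 0x0400u

static uint16_t ClockStateFlags; /* stand-in for the TLM clock-state word */

/* Latch the error bit when the retry limit in GetReference is exhausted. */
void SetGetRefErrorFlag(void)
{
    ClockStateFlags |= TIME_FLAG_GET_REF_ERR;
}

/* Ground tools (or onboard monitors) can then observe the latched bit. */
bool GetRefErrorFlagSet(void)
{
    return (ClockStateFlags & TIME_FLAG_GET_REF_ERR) != 0;
}
```

Latching (rather than clearing on the next successful read) makes an intermittent failure visible in telemetry even if the system appears to recover.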
astrogeco added a commit that referenced this issue Jun 10, 2021