Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race Condition in CFE_AppCreate() function #279

Closed
skliper opened this issue Sep 30, 2019 · 23 comments
Closed

Race Condition in CFE_AppCreate() function #279

skliper opened this issue Sep 30, 2019 · 23 comments
Labels
Milestone

Comments

@skliper
Copy link
Contributor

skliper commented Sep 30, 2019

This has been split to a separate ticket from #173.

Per email from Preston Faiks on 2018-06-04, there is an actual observed race condition issue with CFE_ES_AppCreate() out in the field:

When ES is loading and starting apps, one app might fail initialization and call CFE_ES_ExitApp() If that occurs, its app state will be set to CFE_ES_APP_STATE_STOPPED.

When apps are scanned, it will be removed from the app table and that table entry set to not in use.

As ES continues to load apps, it will make use on the now unused app table entry. It will not change the app state in the entry until it has successfully loaded the app into memory.

The process of loading an app into memory can cause the task to pend on file system (or network file system) and allow other tasks to run.

As that app continues to be loaded, another app scan can occur and detect the app entry as both in use and stopped, and will unload it.

When ES finishes loading the app, it will spawn a task at an entry point which was just unloaded by the scanning task, causing it to execute from unloaded memory and crash.

I have reviewed this code again and the race condition risk described is definitely still present in the current development branch, but this isn't the only example. There are other similar race conditions that are possible regarding the use if the RecordUsed boolean field.

Having an observed failure should escalate this in priority now.

@skliper skliper added this to the 6.7.0 milestone Sep 30, 2019
@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Imported from trac issue 248. Created by jphickey on 2018-09-18T16:40:13, last modified: 2019-07-03T13:15:28

@skliper skliper self-assigned this Sep 30, 2019
@skliper skliper added the bug label Sep 30, 2019
@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2018-09-18 16:49:51:

Historical notes

At the time this bug was described, there was an existing enhancement ticket (#173) to change the overall app loading architecture. The original plan was to simply address this race condition as part of the larger overall changeset.

Further discussion on this topic revealed a few issues with that original plan:

  1. The CCB might not have time / resources to fundamentally change app loading logic as part of the next CFE release.

  2. The fundamental issue is really with respect to handing of the RecordUsed field and the fact that table entries only have two states (free / active) and probably need at least 3 states (free/reserved/active). The specific described race described here might exist elsewhere in CFE, whenever this table allocation pattern is used, and even a full restructuring of AppLoading might not fix the other cases.

It was therefore determined at the 2018-09-18 CCB that this bug needs its own solution, separate from any refactoring of the app loading logic.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2018-09-18 17:06:09:

Proposed Fixes for this issue ticket

The current thinking is to examine all cases where CFE ES manages its internal tables using a single RecordUsed boolean to track whether each entry is used or not.

We might have a problem any time this general pattern is employed:

  1. A request to allocate a new entry comes in
  2. Global lock taken
  3. Table searched for entry where RecordUsed==FALSE
  4. Table entry set RecordUsed=TRUE
  5. Global lock Released
  6. Other initialization work is done
  7. Table entry metadata updated

The immediate issue is that anything accessing the same table between 5-7 (such as the scan function in this case) it may misinterpret this to be a normal entry.

The work item here is to examine all tables in CFE which employ a RecordUsed field which is a boolean and uses this initialization pattern. For those cases which are potentially subject to a similar race condition, do the following:

  • Always ensure that a table entry is fully reset (by memset() or similar) before re-using it, to avoid the possibility of stale fields getting propagated to a new entry.
  • Evaluate changing RecordUsed flag to a RecordState marker with at least 3 states, allowing for a "reserved" state which is separately identifiable from an active record.
  • Evaluate combining the RecordUsed flag with the other existing states (i.e. the application table has several separate states, and it may be beneficial to simplify here).

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sseeger on 2018-10-11 14:42:25:

I have two proposals to fix this:

The easy way is to set StateRecord.AppState = CFE_ES_AppStateRUNNING when RecordUsed is set to TRUE. This will prevent CFE_ES_ScanAppTable() from processing it. We would then set the state back to CFE_ES_AppStateERROR if an error occurs and record is freed.

A cleaner approach is to get rid of RecordUsed. We could create another state CFE_ES_AppStateUNUSED. This fixes the problem and frees up memory.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by cdknight on 2018-10-23 14:23:19:

Related: #278

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sseeger on 2018-10-30 13:17:25:

The AppTable fix is in [changeset:90c30887e1ce6d2c75762a2096aee7ed6d0efd01]

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-02-25 15:41:41:

2018/11/13 CCB: Mentioned this ticket was ready for review. If so please update status to "work complete".

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by sseeger on 2019-02-27 08:59:04:

The current candidate for a fix is changeset:4fc775a16f8c6c2e8e049e995d3e6627aec2795c

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-03-15 16:59:32:

Needs rebase to development branch and squash of commits prior to CCB code review.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-03-25 10:27:33:

Rebased, see [changeset:ee218c7] for review.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-03-25 11:19:08:

I have reviewed the rebased commit and I think this still needs some additional cleanup.

  • My suggestion of renaming the field was mainly intended as a debugging exercise and not necessarily something to be permanent.
  • I do not think we should add the new function CFE_ES_AppRecordUsed(). The return value of TRUE in case the AppId is out of range is misleading. While this is "safer" for the case of a new allocation, for every other use case it is more dangerous and will cause the code to actually try to read the contents of the record. But everywhere this is called the range was already checked so this test is redundant.
  • I also do not like that the enum we are using to record the state of the entire record is actually in a sub-structure now. We should move this to the top level where the original boolean was.

I'll submit a commit that has my suggestions to illustrate what I'm talking about.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-03-25 16:16:58:

I have pushed commit [changeset:47014d5f] that has what I'm talking about. Note this is based on development, not the previous changeset.

In reviewing this I did notice that CFE_ES_ScanAppTable() is executed //without// locking the global ES table. This does several non-atomic operations, including read-modify-write to different elements within each record. This is probably another bug but I did not fix it here. Probably worth discussing.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-03-25 17:02:40:

Note: Submitted issue #295 about CFE_ES_ScanAppTable.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-03-27 12:40:17:

Code reviewed in CCB, 3/27/19

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-03-27 14:24:46:

Summary of some follow-on discussion:

  • It was noted in CCB review that the AppId index in some places is not explicitly range-checked
  • This is deemed "OK" for those places where the function is internally used only and not part of the public API, and it is known that all places from which it is called have already sanitized the value. (e.g. CFE_ES_SetAppState)
  • However, inspection revealed two ES function calls which are public API and do not contain the requisite range check.

Additional follow-on tickets #299 and #300 were submitted to fix these cases, and a task to perform a more thorough check of the code base prior to the next release.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-01 11:30:33:

CCB 3/27/19 - approved for merge to IC branch

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-04-01 11:37:10:

This is merged to the ic-ccb-20190327 branch for acceptance testing

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-18 17:03:23:

Last I heard this is still pending a unit test update, is that the current status?

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-04-19 08:56:26:

I updated the unit tests.

Unit test updates in commit: [changeset:d17f439]
Updated merge to ic-ccb-20190327 branch in commit: [changeset:8f205cd]

Bamboo build results: https://babelfish.arc.nasa.gov/bamboo/browse/CFS-CFSCFE141-26

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-04-24 12:22:19:

Approved for merge per CCB 4/24/2019

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-04-24 13:15:49:

Per CCB approval.

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jphickey on 2019-04-24 13:16:39:

Merged to "development" branch

@skliper
Copy link
Contributor Author

skliper commented Sep 30, 2019

Trac comment by jhageman on 2019-07-03 13:15:28:

Moved closed tickets targeting 6.6.1 to 6.7.0 end of summer release.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant