Automatic suppression of flooding events #2095

jhnphm · 2022-04-22T16:36:31Z

Is your feature request related to a problem? Please describe.
We've observed that overly verbose event messages can take down a processor, especially if the generation of events can induce the production of more events due to increased SB/SBNg traffic. Some ability to rate-limit events regardless
of filter state seems to be desirable, as the default for many apps is often unfiltered, and during an event storm it may not be possible to command an event to be filtered.

In addition, there are concerns about critical event messages being lost amongst a flood of less critical events, especially when not commandable due to being in a temporary loss-of-signal state.

There's also the desire to automatically re-enable an event if it is no longer flooding, and the proposed solution would have that characteristic. This is important if the event notifying that the filter has been activated gets lost due to flooding.
Events could still be permanently suppressed w/ the existing binary filters.

Describe the solution you'd like
A filter that is applied per-app, per event type (priority). A token counter for (app, event type) is incremented every time an event message arrives for (app, event type). Meanwhile, the token counter is decremented at a fixed rate.
If the token counter exceeds a certain threshold, incrementing stops and all incoming events for that (app, event type) are discarded until the counter is below a threshold (could be the same as the suppress threshold, then it'll
effectively rate limit after allowing a small burst).

Describe alternatives you've considered
To simplify the implementation, this could be done globally for all events instead of on a per-app basis. There would be the problem of a misbehaved app flooding out another app sending more critical event messages, but this could
be mitigated somewhat by doing this per-event type or prioritizing by event type.

A more granular solution would filter only excessively verbose events by (app event ID) but this would probably require registration of all events and enforcement. It appears that not all apps register all events w/ EVS (i.e. SBNg).

Another solution that only addresses the auto-reenabling of formerly flooding events would be to add another filter type besides binary filter that uses the above mentioned token scheme. This doesn't help too much since the filter
type is set by the app during registration, if events are registered at all.

Additional context
Initial driver was SBNg doesn't come w/ default filters for floodable events, and one of our other apps also sends off a lot of events on startup. Obvious solution is to modify these apps to setup default filters, but this doesn't address reenabling the events automatically when the transients go away, nor does it address unexpectedly flooding messages due to unforeseen edge cases.

Requester Info
John N Pham, Northrop Grumman

skliper · 2022-04-22T16:48:06Z

My preferred approach would be first off fix apps that can flood. The more autonomous filtering we add in the flight logic the more complex managing events from the ground gets (not to mention the impacts of adding complexity to the software). I'm not opposed to putting in a very simple failsafe in EVS to avoid a flood (quarantine an app for X seconds or something) but any app that can flood is broken in my mind. It's basically eating up resources in a bad way which preferably is fixed at the source vs bandaiding downstream. SBN has historically had this issue (I'd consider it critical to fix!)

jhnphm · 2022-04-22T16:57:38Z

Quarantining an app entirely for X seconds would alleviate most of our concerns w/o being too complicated. We can probably do w/o the per-event-type granularity to simplify things.

astrogeco · 2022-04-27T16:22:36Z

@jphickey can you review?

jphickey · 2022-04-27T19:57:11Z

I do tend to agree that the root cause is badly designed apps that can flood in the first place. Note that EVS already has a feature to prevent circular/recursive events where processing one event causes an event to be generated again, but this is limited to the processing of that event. If the app is actually sending the event multiple times, then EVS doesn't squelch it.

That being said, a simple rate limit shouldn't be too hard to do, but currently the only reference of time in EVS is the HK wakeup event. Which might be good enough - at each HK wakeup it could issue a "credit" for some number of events that each app is allowed to send. Each event actually sent decrements the credit by 1, and if this reaches 0, then the app is prevented from sending events until the next HK wakeup when it gets more credit. One would need to cap the max credit (for apps which might be silent for long periods) and also count any messages that got squelched due to this, so one knows there is a flooding problem.

chillfig · 2022-05-04T14:13:56Z

@jhnphm @jphickey Do you want to discuss this at today's cFS Community CCB meeting?

jhnphm · 2022-05-04T15:36:45Z

@chillfig How do I call into the CCB, and can I invite some of my colleagues?

astrogeco · 2022-05-04T15:40:06Z

@chillfig How do I call into the CCB, and can I invite some of my colleagues?

Just sent you a message on MSTeams

astrogeco · 2022-05-04T16:56:15Z

CCB:2022-05-04

If we want to limit events, maybe only do it in "debug" mode
Is it worth adding complexity to the framework to handle a "broken" app?
Some use cases: corrupted peripherals or radiation event error messages
App Design Note: don't create an event for every communication. Use a "summary event" and create a telemetry point that keeps track of instances. Events are expensive.
Idea: App can only send NUM_EVENTS per EVS cycle. Could make it configurable or maybe on an app-by-app basis. Keep counters internally

astrogeco · 2022-05-04T17:07:10Z

@tngo67 look out for a meeting invite about this

tngo67 · 2022-05-04T17:45:18Z

@astrogeco Copy that. I will pull in additional developers from my team to weigh in.

jwilmot · 2022-05-04T18:02:39Z

Filtering was the intended way to handle flooding. All applications should have registered a filter table so you can have robust behavior. I wanted this to be a requirement and was overruled. EVS could just install a default filter table for apps that don't register. That would be pretty straight forward.

skliper · 2022-05-04T19:09:32Z

@jwilmot I've typically seen event flooding when modifying filtering, which isn't completely solved by registering more default filters. Rather than relying on specific filtering that if changed can render a system inoperable, it seems more robust to fix the app such that it won't flood even if the filters are modified. The EVS limit proposed here is that failsafe for flooding edge cases, but really an app that can nominally produce more events than the system can handle should be fixed. Otherwise modifying filters should be considered a critical (could render system inoperable) command.

skliper · 2022-05-10T12:46:38Z

Also, maybe CFE_EVS_ENABLE_EVENT_TYPE_CC should really just be avoided for type DEBUG (until the apps that flood are fixed), in favor of using CFE_EVS_ENABLE_APP_EVENT_TYPE_CC for just the app of interest. That would at least narrow down the issue, unless the app of interest is the flooding app, but again... fix the app!

jwilmot · 2022-05-10T13:35:37Z

I agree that EVS should probably have a failsafe such that no more than X (configurable) can be sent per second. That would handle misconfiguration, bad apps, and operator error as @skliper has seen. To keep it simple, X could be global across all events on a given cFE instantiation. I still like the idea of having a default filter for apps that don't register that limits to the first few and then one per 128(?) as a reminder. Apps that can flood should always have filters included and be checked at design and code walkthough reviews.

skliper · 2022-05-10T14:04:46Z

@jwilmot - why not just fix the app to not flood with CFE_EVS_SendEvent calls vs adding autonomous filter complexity? Then the ground can manage filters however they want and it reduces the chance that a ground command to modify a filter could result in rendering the system inoperable. If an app sends an event in a tight loop it's bad whatever default filter is used. Events should be treated like tlm, make them useful/informative and don't send an unlimited amount from the app per cycle. Note all apps that send events register with EVS... but if they were to register a default filter for every event that will expand the memory requirements of EVS (some apps have 100+ events), and the calls load the system even if filtered. Eliminating the problem at the source avoids expanding the memory footprint and could reduce system loading while reducing the likelihood of inadvertently rendering the system inoperable.

skliper · 2022-05-10T14:20:41Z

If we are going to do a "global" EVS limit, then I like @jphickey's suggestion of "credit". Could start up with enough credit to output all the startup events (like the evs log limit), but then only add X credits per cycle. It's still definitely a "fail-safe" that shouldn't be relied on since it's indiscriminate so important messages could be dropped, or we could still allow ERROR/CRITICAL events or have different limits. This would allow for a short term 0-loss... but still keep the system operable.

…ent suppression

Fix #2095, Automatic suppression of flooding events

skliper added the enhancement label Apr 22, 2022

skliper added this to the Draco milestone Apr 22, 2022

skliper added the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Apr 22, 2022

astrogeco removed the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label May 4, 2022

jhnphm mentioned this issue May 25, 2022

Fix #2095, Automatic suppression of flooding events #2109

Closed

2 tasks

skliper added a commit to skliper/cFE that referenced this issue Jun 9, 2022

Fix nasa#2095, Format cleanup of auto event suppression

40a8779

skliper mentioned this issue Jun 9, 2022

Fix #2095, Automatic suppression of flooding events #2117

Merged

2 tasks

dmknutsen assigned semaldona Jul 6, 2022

semaldona added a commit to skliper/cFE that referenced this issue Jul 27, 2022

Fix nasa#2095, Update auto event suppression unit and functional tests

a26fc78

semaldona added a commit to skliper/cFE that referenced this issue Jul 27, 2022

Fix nasa#2095, Format cleanup of auto event suppression tests

270e17d

semaldona added a commit to skliper/cFE that referenced this issue Jul 29, 2022

Fix nasa#2095, Update platform cfg comments and unit test for auto ev…

1b5a3ab

…ent suppression

dzbaker added a commit that referenced this issue Aug 4, 2022

Merge pull request #2117 from skliper/fix2095-auto_event_suppression

946ec31

Fix #2095, Automatic suppression of flooding events

dzbaker mentioned this issue Aug 4, 2022

cFE Integration Candidate: Caelum-rc4+dev15 #2130

Merged

dzbaker closed this as completed in 6d8c972 Aug 4, 2022

dzbaker closed this as completed in #2130 Aug 4, 2022

skliper added the draco-rc3 label Aug 26, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Automatic suppression of flooding events #2095

Automatic suppression of flooding events #2095

jhnphm commented Apr 22, 2022 •

edited

Loading

skliper commented Apr 22, 2022

jhnphm commented Apr 22, 2022

astrogeco commented Apr 27, 2022

jphickey commented Apr 27, 2022

chillfig commented May 4, 2022

jhnphm commented May 4, 2022 via email •

edited

Loading

astrogeco commented May 4, 2022

astrogeco commented May 4, 2022 •

edited

Loading

astrogeco commented May 4, 2022

tngo67 commented May 4, 2022

jwilmot commented May 4, 2022

skliper commented May 4, 2022

skliper commented May 10, 2022

jwilmot commented May 10, 2022

skliper commented May 10, 2022

skliper commented May 10, 2022

Automatic suppression of flooding events #2095

Automatic suppression of flooding events #2095

Comments

jhnphm commented Apr 22, 2022 • edited Loading

skliper commented Apr 22, 2022

jhnphm commented Apr 22, 2022

astrogeco commented Apr 27, 2022

jphickey commented Apr 27, 2022

chillfig commented May 4, 2022

jhnphm commented May 4, 2022 via email • edited Loading

astrogeco commented May 4, 2022

astrogeco commented May 4, 2022 • edited Loading

astrogeco commented May 4, 2022

tngo67 commented May 4, 2022

jwilmot commented May 4, 2022

skliper commented May 4, 2022

skliper commented May 10, 2022

jwilmot commented May 10, 2022

skliper commented May 10, 2022

skliper commented May 10, 2022

jhnphm commented Apr 22, 2022 •

edited

Loading

jhnphm commented May 4, 2022 via email •

edited

Loading

astrogeco commented May 4, 2022 •

edited

Loading