Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Automatic suppression of flooding events #2095

Closed
jhnphm opened this issue Apr 22, 2022 · 16 comments · Fixed by #2117 or #2130
Closed

Automatic suppression of flooding events #2095

jhnphm opened this issue Apr 22, 2022 · 16 comments · Fixed by #2117 or #2130
Assignees
Milestone

Comments

@jhnphm
Copy link

jhnphm commented Apr 22, 2022

Is your feature request related to a problem? Please describe.
We've observed that overly verbose event messages can take down a processor, especially if the generation of events can induce the production of more events due to increased SB/SBNg traffic. Some ability to rate-limit events regardless
of filter state seems to be desirable, as the default for many apps is often unfiltered, and during an event storm it may not be possible to command an event to be filtered.

In addition, there are concerns about critical event messages being lost amongst a flood of less critical events, especially when not commandable due to being in a temporary loss-of-signal state.

There's also the desire to automatically re-enable an event if it is no longer flooding, and the proposed solution would have that characteristic. This is important if the event notifying that the filter has been activated gets lost due to flooding.
Events could still be permanently suppressed w/ the existing binary filters.

Describe the solution you'd like
A filter that is applied per-app, per event type (priority). A token counter for (app, event type) is incremented every time an event message arrives for (app, event type). Meanwhile, the token counter is decremented at a fixed rate.
If the token counter exceeds a certain threshold, incrementing stops and all incoming events for that (app, event type) are discarded until the counter is below a threshold (could be the same as the suppress threshold, then it'll
effectively rate limit after allowing a small burst).

Describe alternatives you've considered
To simplify the implementation, this could be done globally for all events instead of on a per-app basis. There would be the problem of a misbehaved app flooding out another app sending more critical event messages, but this could
be mitigated somewhat by doing this per-event type or prioritizing by event type.

A more granular solution would filter only excessively verbose events by (app event ID) but this would probably require registration of all events and enforcement. It appears that not all apps register all events w/ EVS (i.e. SBNg).

Another solution that only addresses the auto-reenabling of formerly flooding events would be to add another filter type besides binary filter that uses the above mentioned token scheme. This doesn't help too much since the filter
type is set by the app during registration, if events are registered at all.

Additional context
Initial driver was SBNg doesn't come w/ default filters for floodable events, and one of our other apps also sends off a lot of events on startup. Obvious solution is to modify these apps to setup default filters, but this doesn't address reenabling the events automatically when the transients go away, nor does it address unexpectedly flooding messages due to unforeseen edge cases.

Requester Info
John N Pham, Northrop Grumman

@skliper
Copy link
Contributor

skliper commented Apr 22, 2022

My preferred approach would be first off fix apps that can flood. The more autonomous filtering we add in the flight logic the more complex managing events from the ground gets (not to mention the impacts of adding complexity to the software). I'm not opposed to putting in a very simple failsafe in EVS to avoid a flood (quarantine an app for X seconds or something) but any app that can flood is broken in my mind. It's basically eating up resources in a bad way which preferably is fixed at the source vs bandaiding downstream. SBN has historically had this issue (I'd consider it critical to fix!)

@jhnphm
Copy link
Author

jhnphm commented Apr 22, 2022

Quarantining an app entirely for X seconds would alleviate most of our concerns w/o being too complicated. We can probably do w/o the per-event-type granularity to simplify things.

@skliper skliper added this to the Draco milestone Apr 22, 2022
@skliper skliper added the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label Apr 22, 2022
@astrogeco
Copy link
Contributor

@jphickey can you review?

@jphickey
Copy link
Contributor

I do tend to agree that the root cause is badly designed apps that can flood in the first place. Note that EVS already has a feature to prevent circular/recursive events where processing one event causes an event to be generated again, but this is limited to the processing of that event. If the app is actually sending the event multiple times, then EVS doesn't squelch it.

That being said, a simple rate limit shouldn't be too hard to do, but currently the only reference of time in EVS is the HK wakeup event. Which might be good enough - at each HK wakeup it could issue a "credit" for some number of events that each app is allowed to send. Each event actually sent decrements the credit by 1, and if this reaches 0, then the app is prevented from sending events until the next HK wakeup when it gets more credit. One would need to cap the max credit (for apps which might be silent for long periods) and also count any messages that got squelched due to this, so one knows there is a flooding problem.

@chillfig
Copy link
Contributor

chillfig commented May 4, 2022

@jhnphm @jphickey Do you want to discuss this at today's cFS Community CCB meeting?

@jhnphm
Copy link
Author

jhnphm commented May 4, 2022 via email

@astrogeco
Copy link
Contributor

@chillfig How do I call into the CCB, and can I invite some of my colleagues?

Just sent you a message on MSTeams

@astrogeco
Copy link
Contributor

astrogeco commented May 4, 2022

CCB:2022-05-04

  • If we want to limit events, maybe only do it in "debug" mode
  • Is it worth adding complexity to the framework to handle a "broken" app?
  • Some use cases: corrupted peripherals or radiation event error messages
  • App Design Note: don't create an event for every communication. Use a "summary event" and create a telemetry point that keeps track of instances. Events are expensive.
  • Idea: App can only send NUM_EVENTS per EVS cycle. Could make it configurable or maybe on an app-by-app basis. Keep counters internally

@astrogeco
Copy link
Contributor

@tngo67 look out for a meeting invite about this

@astrogeco astrogeco removed the CCB:Ready Ready for discussion at the Configuration Control Board (CCB) label May 4, 2022
@tngo67
Copy link

tngo67 commented May 4, 2022

@astrogeco Copy that. I will pull in additional developers from my team to weigh in.

@jwilmot
Copy link

jwilmot commented May 4, 2022

Filtering was the intended way to handle flooding. All applications should have registered a filter table so you can have robust behavior. I wanted this to be a requirement and was overruled. EVS could just install a default filter table for apps that don't register. That would be pretty straight forward.

@skliper
Copy link
Contributor

skliper commented May 4, 2022

@jwilmot I've typically seen event flooding when modifying filtering, which isn't completely solved by registering more default filters. Rather than relying on specific filtering that if changed can render a system inoperable, it seems more robust to fix the app such that it won't flood even if the filters are modified. The EVS limit proposed here is that failsafe for flooding edge cases, but really an app that can nominally produce more events than the system can handle should be fixed. Otherwise modifying filters should be considered a critical (could render system inoperable) command.

@skliper
Copy link
Contributor

skliper commented May 10, 2022

Also, maybe CFE_EVS_ENABLE_EVENT_TYPE_CC should really just be avoided for type DEBUG (until the apps that flood are fixed), in favor of using CFE_EVS_ENABLE_APP_EVENT_TYPE_CC for just the app of interest. That would at least narrow down the issue, unless the app of interest is the flooding app, but again... fix the app!

@jwilmot
Copy link

jwilmot commented May 10, 2022

I agree that EVS should probably have a failsafe such that no more than X (configurable) can be sent per second. That would handle misconfiguration, bad apps, and operator error as @skliper has seen. To keep it simple, X could be global across all events on a given cFE instantiation. I still like the idea of having a default filter for apps that don't register that limits to the first few and then one per 128(?) as a reminder. Apps that can flood should always have filters included and be checked at design and code walkthough reviews.

@skliper
Copy link
Contributor

skliper commented May 10, 2022

@jwilmot - why not just fix the app to not flood with CFE_EVS_SendEvent calls vs adding autonomous filter complexity? Then the ground can manage filters however they want and it reduces the chance that a ground command to modify a filter could result in rendering the system inoperable. If an app sends an event in a tight loop it's bad whatever default filter is used. Events should be treated like tlm, make them useful/informative and don't send an unlimited amount from the app per cycle. Note all apps that send events register with EVS... but if they were to register a default filter for every event that will expand the memory requirements of EVS (some apps have 100+ events), and the calls load the system even if filtered. Eliminating the problem at the source avoids expanding the memory footprint and could reduce system loading while reducing the likelihood of inadvertently rendering the system inoperable.

@skliper
Copy link
Contributor

skliper commented May 10, 2022

If we are going to do a "global" EVS limit, then I like @jphickey's suggestion of "credit". Could start up with enough credit to output all the startup events (like the evs log limit), but then only add X credits per cycle. It's still definitely a "fail-safe" that shouldn't be relied on since it's indiscriminate so important messages could be dropped, or we could still allow ERROR/CRITICAL events or have different limits. This would allow for a short term 0-loss... but still keep the system operable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment