[pkg/stanza] Windows Input Operator falls behind reading from channel #36491

Open
dpaasman00 opened this issue Nov 21, 2024 · 3 comments

@dpaasman00
Contributor

Component(s)

pkg/stanza, receiver/windowseventlog

Describe the issue you're reporting

The windowseventlog receiver has a configuration parameter max_reads which determines the maximum number of events read from the event channel in a poll interval. When more events are added to the channel in a poll interval than max_reads allows, the receiver can fall behind. In drastic situations the agent can fall behind severely, which was the case in #36472. When this happens, nothing indicates that the receiver is falling behind, so it is not obvious why the newest events aren't being read from the channel.
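To make the failure mode concrete, here is a minimal sketch of the arithmetic, assuming a constant arrival rate and exactly one read of at most max_reads events per poll interval. The function name and the numbers are illustrative only and are not taken from the receiver.

```go
package main

// backlogAfter returns how many events are still unread after the given
// number of poll intervals. It assumes a constant number of arrivals per
// interval and a reader that consumes at most maxReads events per interval.
// This is a simplified model, not the receiver's actual read loop.
func backlogAfter(arrivalsPerPoll, maxReads, polls int) int {
	backlog := 0
	for i := 0; i < polls; i++ {
		// New events land in the channel during the interval.
		backlog += arrivalsPerPoll

		// The reader drains at most maxReads of them.
		read := backlog
		if read > maxReads {
			read = maxReads
		}
		backlog -= read
	}
	return backlog
}
```

With 150 arrivals per interval and max_reads set to 100, the backlog grows by 50 events every interval, so `backlogAfter(150, 100, 60)` is 3000. With 80 arrivals per interval the backlog stays at 0.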

I'm proposing adding a mechanism for determining when the receiver is maxing out the number of events it can read from the channel. One option is logging a debug log every time the number of events returned by evtNext() is equal to max_reads in this section of code. Another is defining a monotonic cumulative sum metric that gets incremented every time this occurs, instead of a debug log.
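A rough sketch of what the check could look like, combining the debug log and a monotonic counter. Everything here is hypothetical: `saturationTracker`, `newSaturationTracker`, and `observe` are names made up for illustration, the standard library's `log/slog` and `sync/atomic` stand in for whatever logger and metrics instrument the receiver would actually use, and the caller is assumed to pass in the number of events returned by one read.

```go
package main

import (
	"log/slog"
	"sync/atomic"
)

// saturationTracker is a hypothetical helper, not part of pkg/stanza.
// It counts how often a single read returned exactly maxReads events.
type saturationTracker struct {
	maxReads  int
	saturated atomic.Int64 // only ever incremented, so it is monotonic
	logger    *slog.Logger
}

// newSaturationTracker falls back to the default slog logger when none is given.
func newSaturationTracker(maxReads int, logger *slog.Logger) *saturationTracker {
	if logger == nil {
		logger = slog.Default()
	}
	return &saturationTracker{maxReads: maxReads, logger: logger}
}

// observe records the size of one read and reports whether it hit the limit.
// A full batch suggests more events were waiting than the read could return.
func (t *saturationTracker) observe(channel string, eventsRead int) bool {
	if eventsRead < t.maxReads {
		return false
	}
	total := t.saturated.Add(1)
	t.logger.Debug("read reached max_reads; receiver may be falling behind",
		"channel", channel,
		"max_reads", t.maxReads,
		"saturated_reads_total", total,
	)
	return true
}
```

The receiver would call something like `tracker.observe(channelName, len(events))` after each read. A single full batch is not proof of a backlog, so the counter's rate over time is probably the more useful signal.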

Regardless of the mechanism, a way for the receiver to indicate it may be falling behind reading from an event log channel would go a long way in troubleshooting situations where it seems like the receiver is failing.


@djaglowski
Member

> logging a debug log every time the number of events returned by evtNext() is equal to max_reads

This seems like a simple way to warn about the issue. It might make sense to use a higher level such as info, since it may indicate a capacity problem. Warn is arguably appropriate but may be too high since the issue may resolve on its own.

> monotonic cumulative sum metric that gets incremented every time this occurs, instead of a debug log

This also seems reasonable. Another option would be a histogram where each data point describes the number of events returned by one call to evtNext().
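For the histogram option, a minimal stdlib-only sketch of the idea follows. `readSizeHistogram` and its bucket bounds are hypothetical and only stand in for a real metrics-SDK histogram instrument. Each call to `record` represents the number of events returned by one read.

```go
package main

import "sort"

// readSizeHistogram is a hypothetical stand-in for a metrics histogram.
// Bucket i counts reads whose size is greater than bounds[i-1] and at most
// bounds[i]; the final bucket counts anything above the largest bound.
type readSizeHistogram struct {
	bounds []int   // inclusive upper bounds, ascending
	counts []int64 // len(bounds)+1 entries, last one is the overflow bucket
}

func newReadSizeHistogram(bounds ...int) *readSizeHistogram {
	sorted := append([]int(nil), bounds...) // copy so the caller's slice is untouched
	sort.Ints(sorted)
	return &readSizeHistogram{
		bounds: sorted,
		counts: make([]int64, len(sorted)+1),
	}
}

// record adds one data point: the number of events returned by a single read.
func (h *readSizeHistogram) record(eventsRead int) {
	// SearchInts returns the first index whose bound is >= eventsRead,
	// or len(bounds) when eventsRead exceeds every bound.
	i := sort.SearchInts(h.bounds, eventsRead)
	h.counts[i]++
}
```

If the largest bound is set to max_reads, for example `newReadSizeHistogram(0, 25, 50, 100)` with max_reads at 100, the bucket ending at 100 fills up as reads approach the limit, which shows how close the receiver is to saturation rather than only whether it hit the limit.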


At a minimum, we should add the log, but either metric seems reasonable too. Curious what @pjanotti thinks

@pjanotti
Contributor

pjanotti commented Dec 4, 2024

I think we need both the metric and the log: the former so that alerts can be created, the latter to clarify the issue.

@dpaasman00 one thing that is not clear to me from #36472: it seems that the receiver stopped sending events from the channel, is that correct? My expectation would be that it was falling behind but kept sending events from the channel. Can you confirm which was actually the case?
