-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Stop re-using processors defined in the config #34761
Conversation
After introducing the `SafeProcessor` wrapper in elastic#34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart.
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane) |
Looks like the bug was introduced in #17655 (Apr, 2020) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for tracking this down!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good find!
One concern: this is marked as closing #34716, but that issue isn't really about processor reuse, but about the fact that when we detect processor reuse we go into a deadlock. I believe this PR addresses at least one cause of processor reuse, but if there are others we don't know about, the deadlock itself would still be there. This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor
doesn't deadlock when it is triggered.
Should we downgrade the // Run allows to run processor only when `Close` was not called prior
func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
if atomic.LoadUint32(&p.closed) == 1 {
return nil, ErrClosed
} |
Actually, I'm not sure using a log here will help. Process failure already result in just a log failure: beats/libbeat/publisher/pipeline/client.go Lines 97 to 103 in 331f792
The problem is that this log is for every event so it might have been overwhelming both the beat and the the agent log collection. Potentially the monitoring filebeat is receiving this log line for every event ingested by agent. |
/test x-pack/libbeat-goIntegTest |
My preference is to merge this and auto-close the issue. Then we can reopen the issue if we find another instance of this problem. We have no further action to take after merging this PR besides testing and observation. I think reporting this error will be hard, because using a log line will potentially log once per event unless we add a rate limit for that log line or switch to using something more performant but harder to notice like a metrics counter. |
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c)
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) # Conflicts: # filebeat/channel/runner.go # libbeat/processors/safe_processor.go
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) Co-authored-by: Denis <denis.rechkunov@elastic.co>
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) Co-authored-by: Denis <denis.rechkunov@elastic.co>
@faec this is marked as closing #34716 because it's fixing the error reported there. If you manage to reproduce this error again please feel free to re-open it. Could you elaborate more about that deadlock behaviour you're talking about? In my testing I didn't observe any deadlock in Filebeat. And the most important: let's not underestimate the danger of using closed processors. Some of the processors are using background tasks to fetch data like We cannot use closed processors. We should rather produce a fatal error than let customer's data to be corrupted. I doubt there are more occurrences of processors re-use, I checked a lot of places but if there are more – we can easily identify them now thanks to |
#34764) * Stop re-using processors defined in the config (#34761) * Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry (cherry picked from commit 5cfe62c) # Conflicts: # filebeat/channel/runner.go # libbeat/processors/safe_processor.go * Resolve conflicts, fix changelog * Add new line to changelog * Revert comment auto-formatting --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
It's a follow-up to elastic#34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
* Add test for the processor re-use issue (#34870) It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client. (cherry picked from commit 3d917c8) # Conflicts: # filebeat/channel/runner.go * Resolve conflicts --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
* Add test for the processor re-use issue (#34870) It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client. (cherry picked from commit 3d917c8) # Conflicts: # filebeat/channel/runner.go # filebeat/channel/runner_test.go * Resolve conflicts --------- Co-authored-by: Denis <denis.rechkunov@elastic.co>
Is there any expected release date? When using elastic for alerting, it generates a lot of false positives. |
@icc-garciaju not sure what alerting has to do with this issue but this fix has been released with 8.7.0 and was backported to the next 7.17. |
Is it included on 8.7.0 of the agent? Is it being backported to 8.6? I was told by elastic support it would be on the next release. |
yes.
It was but I'm not sure if there was another 8.6.x release since then. You can always see the backports in the labels of our PRs.
If the issue you had was about having these errors in logs:
Then it should be fixed in 8.7.0. |
* Stop re-using processors defined in the config After introducing the `SafeProcessor` wrapper in #34647 we started returning errors when a processor is being used after its `Close` function has been called. This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on `Close` but re-used processors somewhere. After a long investigation such code that's re-using processors was finally found. This is the change that removes re-using the processors and instantiates them on each input restart. * Fix linter issues * Add changelog entry
It's a follow-up to #34761 This test makes sure that none of the critical configuration fields are re-used between instances of the pipeline client.
What does this PR do?
After introducing the
SafeProcessor
wrapper in#34647 we started returning errors when a processor is being used after its
Close
function has been called.This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on
Close
but re-used processors somewhere.After a long investigation such code that's re-using processors was finally found.
This is the change that removes re-using the processors and instantiates them on each input restart.
Looks like the bug was introduced in #17655 (Apr, 2020)
Why is it important?
Fixes dropped events, panics and error spam in logs.
Checklist
- [] My code follows the style guidelines of this project- [ ] I have commented my code, particularly in hard-to-understand areas- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration files- [ ] I have added tests that prove my fix is effective or that my feature worksCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.How to test this PR locally
elastic-package stack up
elastic-agent
with this policy:After this change you should not see these errors anymore.
Related issues
SafeProcessor
wrapper #34647