-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Filebeat] Race condition fix in S3 input plugin #14359
[Filebeat] Race condition fix in S3 input plugin #14359
Conversation
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
1 similar comment
Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually? |
@@ -531,6 +531,8 @@ func (c *s3Context) Fail(err error) { | |||
} | |||
|
|||
func (c *s3Context) done() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fail()
calls done()
and holds the lock. That will result in a deadlock. I didn't look at the rest of the file, but wanted to call this out now. I can probably look at the rest of it tomorrow if any assistance is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. That needs to be resolved.
Considering that we just want the latest error for now, we should be fine with the following change as only the counter needs to be protected against race conditions:
func (c *s3Context) Fail(err error) {
// only care about the last error for now
// TODO: add "Typed" error to error for context
c.err = err
c.done()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we have a race in 'done' due to concurrent calls messing with the reference counter, I wonder if Fail can be triggered concurrently as well. Maybe it would be better to have Failed like this (we do run tests with race detector enabled):
func (c *s3Context) Fail(err error) {
// only care about the last error for now
// TODO: add "Typed" error to error for context
c.mux.Lock()
c.err = err
c.mux.Unlock()
c.done()
}
or:
func (c *s3Context) Fail(err error) {
c.setError(err)
c.done()
}
func (c *s3Context) SetError(err error) {
// only care about the last error for now
// TODO: add "Typed" error to error for context
c.mux.Lock()
c.err = err
c.mux.Unlock()
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have been running with the fix provided in the latest commit for about 4 days now. The ingest pipeline is processing about 1.5 million log lines every 5 minutes and so far it has been stable. Without the fix, the software kept crashing after processing a few files from S3.
@darupedk Thanks for fixing and testing this! |
* Added missing critical section to prevent race condition during high load * Updated CHANGELOG.next.asciidoc * Fix deadlock issue
…stic#14402) * Added missing critical section to prevent race condition during high load * Updated CHANGELOG.next.asciidoc * Fix deadlock issue (cherry picked from commit 8531cfb)
During high load, Filebeat may crash due to the error channel
c.errC
having been closed before the last event has been acknowledged. This is caused by a missing critical section indone()
which can result inc.refs
no longer counting as expected.