Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Filebeat] Race condition fix in S3 input plugin #14359

Merged

Conversation

darupedk
Copy link
Contributor

During high load, Filebeat may crash due to the error channel c.errC having been closed before the last event has been acknowledged. This is caused by a missing critical section in done() which can result in c.refs no longer counting as expected.

@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

1 similar comment
@elasticmachine
Copy link
Collaborator

Since this is a community submitted pull request, a Jenkins build has not been kicked off automatically. Can an Elastic organization member please verify the contents of this patch and then kick off a build manually?

@kaiyan-sheng kaiyan-sheng added Team:Integrations Label for the Integrations team bug Filebeat Filebeat needs_backport PR is waiting to be backported to other branches. labels Oct 31, 2019
@@ -531,6 +531,8 @@ func (c *s3Context) Fail(err error) {
}

func (c *s3Context) done() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fail() calls done() and holds the lock. That will result in a deadlock. I didn't look at the rest of the file, but wanted to call this out now. I can probably look at the rest of it tomorrow if any assistance is needed.

Copy link
Contributor Author

@darupedk darupedk Oct 31, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. That needs to be resolved.

Considering that we just want the latest error for now, we should be fine with the following change as only the counter needs to be protected against race conditions:

func (c *s3Context) Fail(err error) {
	// only care about the last error for now
	// TODO: add "Typed" error to error for context
	c.err = err
	c.done()
}

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we have a race in 'done' due to concurrent calls messing with the reference counter, I wonder if Fail can be triggered concurrently as well. Maybe it would be better to have Failed like this (we do run tests with race detector enabled):

func (c *s3Context) Fail(err error) {
	// only care about the last error for now
	// TODO: add "Typed" error to error for context
	c.mux.Lock()
	c.err = err
	c.mux.Unlock()
	c.done()
}

or:

func (c *s3Context) Fail(err error) {
	c.setError(err)
	c.done()
}

func (c *s3Context) SetError(err error) {
	// only care about the last error for now
	// TODO: add "Typed" error to error for context
	c.mux.Lock()
	c.err = err
	c.mux.Unlock()
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have been running with the fix provided in the latest commit for about 4 days now. The ingest pipeline is processing about 1.5 million log lines every 5 minutes and so far it has been stable. Without the fix, the software kept crashing after processing a few files from S3.

@kaiyan-sheng kaiyan-sheng merged commit 5fb4e5f into elastic:master Nov 6, 2019
@kaiyan-sheng
Copy link
Contributor

@darupedk Thanks for fixing and testing this!
@andrewkroh @urso Thanks for the review. I will do the backports.

@kaiyan-sheng kaiyan-sheng added v7.6.0 and removed needs_backport PR is waiting to be backported to other branches. labels Nov 6, 2019
kaiyan-sheng added a commit that referenced this pull request Nov 7, 2019
* Added missing critical section to prevent race condition during high load

* Updated CHANGELOG.next.asciidoc

* Fix deadlock issue

(cherry picked from commit 5fb4e5f)
kaiyan-sheng added a commit that referenced this pull request Nov 7, 2019
* Added missing critical section to prevent race condition during high load

* Updated CHANGELOG.next.asciidoc

* Fix deadlock issue

(cherry picked from commit 5fb4e5f)
jorgemarey pushed a commit to jorgemarey/beats that referenced this pull request Jun 8, 2020
* Added missing critical section to prevent race condition during high load

* Updated CHANGELOG.next.asciidoc

* Fix deadlock issue
leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
…stic#14402)

* Added missing critical section to prevent race condition during high load

* Updated CHANGELOG.next.asciidoc

* Fix deadlock issue

(cherry picked from commit 8531cfb)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Filebeat Filebeat Team:Integrations Label for the Integrations team v7.5.0 v7.6.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants