Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stop re-using processors defined in the config #34761

Merged
merged 3 commits into from
Mar 7, 2023

Conversation

rdner
Copy link
Member

@rdner rdner commented Mar 7, 2023

What does this PR do?

After introducing the SafeProcessor wrapper in
#34647 we started returning errors when a processor is being used after its Close function has been called.

This led to dropped events and error spam in logs but also confirmed that the root cause of the problem was not just a race condition on Close but re-used processors somewhere.

After a long investigation such code that's re-using processors was finally found.

This is the change that removes re-using the processors and instantiates them on each input restart.

Looks like the bug was introduced in #17655 (Apr, 2020)

Why is it important?

Fixes dropped events, panics and error spam in logs.

Checklist

- [] My code follows the style guidelines of this project
- [ ] I have commented my code, particularly in hard-to-understand areas
- [ ] I have made corresponding changes to the documentation
- [ ] I have made corresponding change to the default configuration files
- [ ] I have added tests that prove my fix is effective or that my feature works

  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

How to test this PR locally

  1. Run elastic-package stack up
  2. Run elastic-agent with this policy:
outputs:
  default:
    type: elasticsearch
    log_level: debug
    enabled: true
    hosts: ["https://127.0.0.1:9200"]
    username: "elastic"
    password: "changeme"
    allow_older_versions: true
    ssl:
      verification_mode: none
    shipper:
      enabled: true

inputs:
  - type: system/metrics
    id: unique-system-metrics-input
    data_stream.namespace: default
    use_output: default
    streams:
      - metricset: cpu
        data_stream.dataset: system.cpu
      - metricset: memory
        data_stream.dataset: system.memory
      - metricset: network
        data_stream.dataset: system.network
      - metricset: filesystem
        data_stream.dataset: system.filesystem
  1. Before this change you'd observe errors like this:
{
  "log.level": "error",
  "@timestamp": "2023-03-02T11:59:42.394Z",
  "message": "Failed to publish event: attempt to use a closed processor",
  "component": {
    "binary": "filebeat",
    "dataset": "elastic_agent.filebeat",
    "id": "filestream-monitoring",
    "type": "filestream"
  },
  "log": {
    "source": "filestream-monitoring"
  },
  "log.logger": "publisher",
  "log.origin": {
    "file.line": 102,
    "file.name": "pipeline/client.go"
  },
  "service.name": "filebeat",
  "ecs.version": "1.6.0"
}

After this change you should not see these errors anymore.

Related issues

After introducing the `SafeProcessor` wrapper in
elastic#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.
@rdner rdner added bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.7.0 Automated backport with mergify labels Mar 7, 2023
@rdner rdner self-assigned this Mar 7, 2023
@botelastic botelastic bot added needs_team Indicates that the issue/PR needs a Team:* label and removed needs_team Indicates that the issue/PR needs a Team:* label labels Mar 7, 2023
@rdner rdner marked this pull request as ready for review March 7, 2023 19:10
@rdner rdner requested a review from a team as a code owner March 7, 2023 19:10
@rdner rdner requested review from cmacknz and faec and removed request for a team March 7, 2023 19:10
@elasticmachine
Copy link
Collaborator

Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)

@elasticmachine
Copy link
Collaborator

elasticmachine commented Mar 7, 2023

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview preview

Expand to view the summary

Build stats

  • Start Time: 2023-03-07T21:34:02.432+0000

  • Duration: 31 min 43 sec

Test stats 🧪

Test Results
Failed 0
Passed 64
Skipped 1
Total 65

💚 Flaky test report

Tests succeeded.

🤖 GitHub comments

Expand to view the GitHub comments

To re-run your PR in the CI, just comment with:

  • /test : Re-trigger the build.

  • /package : Generate the packages and run the E2E tests.

  • /beats-tester : Run the installation tests with beats-tester.

  • run elasticsearch-ci/docs : Re-trigger the docs validation. (use unformatted text in the comment!)

@rdner
Copy link
Member Author

rdner commented Mar 7, 2023

Looks like the bug was introduced in #17655 (Apr, 2020)

@cmacknz cmacknz requested a review from leehinman March 7, 2023 20:05
Copy link
Member

@cmacknz cmacknz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down!

Copy link
Contributor

@faec faec left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good find!

One concern: this is marked as closing #34716, but that issue isn't really about processor reuse, but about the fact that when we detect processor reuse we go into a deadlock. I believe this PR addresses at least one cause of processor reuse, but if there are others we don't know about, the deadlock itself would still be there. This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor doesn't deadlock when it is triggered.

@cmacknz
Copy link
Member

cmacknz commented Mar 7, 2023

This fix looks good but I'd rather leave the linked issue open until we're sure that SafeProcessor doesn't deadlock when it is triggered.

Should we downgrade the ErrClosed returned by SafeProcessor to just a warning log? That would still let us find problems, without having them be stop the world errors.

// Run allows to run processor only when `Close` was not called prior
func (p *SafeProcessor) Run(event *beat.Event) (*beat.Event, error) {
	if atomic.LoadUint32(&p.closed) == 1 {
		return nil, ErrClosed
	}

@cmacknz
Copy link
Member

cmacknz commented Mar 7, 2023

Actually, I'm not sure using a log here will help. Process failure already result in just a log failure:

event, err = c.processors.Run(event)
publish = event != nil
if err != nil {
// If we introduce a dead-letter queue, this is where we should
// route the event to it.
log.Errorf("Failed to publish event: %v", err)
}

The problem is that this log is for every event so it might have been overwhelming both the beat and the the agent log collection. Potentially the monitoring filebeat is receiving this log line for every event ingested by agent.

@cmacknz
Copy link
Member

cmacknz commented Mar 7, 2023

/test x-pack/libbeat-goIntegTest

@cmacknz cmacknz added the backport-v8.6.0 Automated backport with mergify label Mar 7, 2023
@cmacknz
Copy link
Member

cmacknz commented Mar 7, 2023

My preference is to merge this and auto-close the issue. Then we can reopen the issue if we find another instance of this problem. We have no further action to take after merging this PR besides testing and observation.

I think reporting this error will be hard, because using a log line will potentially log once per event unless we add a rate limit for that log line or switch to using something more performant but harder to notice like a metrics counter.

@cmacknz cmacknz merged commit 5cfe62c into elastic:main Mar 7, 2023
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)
mergify bot pushed a commit that referenced this pull request Mar 7, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
cmacknz pushed a commit that referenced this pull request Mar 8, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
@rdner
Copy link
Member Author

rdner commented Mar 8, 2023

@faec this is marked as closing #34716 because it's fixing the error reported there. If you manage to reproduce this error again please feel free to re-open it.

Could you elaborate more about that deadlock behaviour you're talking about? In my testing I didn't observe any deadlock in Filebeat.

And the most important: let's not underestimate the danger of using closed processors. Some of the processors are using background tasks to fetch data like add_kubernetes_metadata does for example. When you close such processors they stop updating their data and would attach outdated data to all events when Run is called after Close. This would silently corrupt our customer's data and we would never know. I'm pretty sure this was already happening because of the bug addressed in this PR.

We cannot use closed processors. We should rather produce a fatal error than let customer's data to be corrupted.

I doubt there are more occurrences of processors re-use, I checked a lot of places but if there are more – we can easily identify them now thanks to SafeProcessor wrapper.

rdner added a commit that referenced this pull request Mar 8, 2023
#34764)

* Stop re-using processors defined in the config (#34761)

* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry

(cherry picked from commit 5cfe62c)

# Conflicts:
#	filebeat/channel/runner.go
#	libbeat/processors/safe_processor.go

* Resolve conflicts, fix changelog

* Add new line to changelog

* Revert comment auto-formatting

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit to rdner/beats that referenced this pull request Mar 21, 2023
It's a follow-up to elastic#34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.
rdner added a commit that referenced this pull request Mar 21, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.
mergify bot pushed a commit that referenced this pull request Mar 21, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)

# Conflicts:
#	filebeat/channel/runner.go
#	filebeat/channel/runner_test.go
mergify bot pushed a commit that referenced this pull request Mar 21, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)
mergify bot pushed a commit that referenced this pull request Mar 21, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)

# Conflicts:
#	filebeat/channel/runner.go
rdner added a commit that referenced this pull request Mar 21, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Mar 21, 2023
* Add test for the processor re-use issue (#34870)

It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)

# Conflicts:
#	filebeat/channel/runner.go

* Resolve conflicts

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
rdner added a commit that referenced this pull request Mar 21, 2023
* Add test for the processor re-use issue (#34870)

It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.

(cherry picked from commit 3d917c8)

# Conflicts:
#	filebeat/channel/runner.go
#	filebeat/channel/runner_test.go

* Resolve conflicts

---------

Co-authored-by: Denis <denis.rechkunov@elastic.co>
@icc-garciaju
Copy link

Is there any expected release date? When using elastic for alerting, it generates a lot of false positives.

@rdner
Copy link
Member Author

rdner commented Apr 17, 2023

@icc-garciaju not sure what alerting has to do with this issue but this fix has been released with 8.7.0 and was backported to the next 7.17.

@icc-garciaju
Copy link

Is it included on 8.7.0 of the agent? Is it being backported to 8.6? I was told by elastic support it would be on the next release.
As alerting is based on querying documents for logs or metrics, if the agent stops sending logs, the alerts go off.

@rdner
Copy link
Member Author

rdner commented Apr 17, 2023

@icc-garciaju

Is it included on 8.7.0 of the agent?

yes.

Is it being backported to 8.6

It was but I'm not sure if there was another 8.6.x release since then. You can always see the backports in the labels of our PRs.

if the agent stops sending logs, the alerts go off.

If the issue you had was about having these errors in logs:

Harvester crashed with: harvester panic with: close of closed channel

Then it should be fixed in 8.7.0.

chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
* Stop re-using processors defined in the config

After introducing the `SafeProcessor` wrapper in
#34647 we started returning
errors when a processor is being used after its `Close` function has
been called.

This led to dropped events and error spam in logs but also confirmed
that the root cause of the problem was not just a race condition on
`Close` but re-used processors somewhere.

After a long investigation such code that's re-using processors was
finally found.

This is the change that removes re-using the processors and
instantiates them on each input restart.

* Fix linter issues

* Add changelog entry
chrisberkhout pushed a commit that referenced this pull request Jun 1, 2023
It's a follow-up to #34761

This test makes sure that none of the critical configuration fields
are re-used between instances of the pipeline client.
@rdner rdner deleted the stop-re-using-processors branch October 24, 2024 08:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-7.17 Automated backport to the 7.17 branch with mergify backport-v8.6.0 Automated backport with mergify backport-v8.7.0 Automated backport with mergify bug Team:Elastic-Agent-Data-Plane Label for the Agent Data Plane team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Filebeat monitoring enters infinite error loop for "closed processor"
5 participants