Memory leaks for Sum metric exemplars #31683
Comments
I'm not super familiar with this connector, but following your comment it looks like the following lines should potentially be added:
To this section: opentelemetry-collector-contrib/connector/spanmetricsconnector/connector.go Lines 301 to 313 in d4458d9
This would need to be confirmed through testing, preferably with a long-running test that analyzes the collector's memory usage, to ensure the change actually resolves the leak.
We are running a slightly modified version in production (it adds exemplars only if a specific attribute is set on the span), with 150k spans per second going through this connector. Before applying the fix, it got OOM killed in half an hour or so (8 GB limit on each of 5 instances, 40 GB total). After applying the change, memory has been steady at 200 - 300 MB for days.
Would you be able to share your fix, or even post a PR?
I will create a PR soon.
Hi @crobert-1 and @tiithansen! I can confirm that the issue really exists. We use the Helm chart opentelemetry-collector-0.82.0 with values.yml
The screenshots below show our metrics between 6:00 a.m. and 8:00 a.m.; maybe they will be useful.
Crucial points:
[Screenshots: pod resources; pprof at 6:00 a.m.; pprof at 7:00 a.m.; some otel-collector metrics]
We'll try to run otel-collector without exemplar generation, and I'll be back with feedback about resource usage.
@tcaty configuring Also, generating delta temporality span metrics and then converting them to cumulative would solve the problem if either of these become available:
The delta span metrics only keep exemplars received since the last
@tiithansen are you still planning to submit a fix? If not, I can give this a go.
…lushing (#32210)

**Description:** Discard counter span metric exemplars after flushing to avoid unbounded memory growth when exemplars are enabled. This is needed because #28671 added exemplars to counter span metrics, but they are not removed after each flush interval like they are for histogram span metrics.

Note: this may change behaviour if using the undocumented `exemplars.max_per_data_point` configuration option, since exemplars would no longer be accumulated up until that count. However, I'm unclear on the value of that feature since there's no mechanism to replace old exemplars with newer ones once the maximum is reached. Maybe a follow-up enhancement is only discarding exemplars once the maximum is reached, or using a circular buffer to replace them. That could be useful for pull-based exporters like `prometheusexporter`, as retaining exemplars for longer would decrease the chance of them getting discarded before being scraped.

**Link to tracking Issue:** Closes #31683

**Testing:**
- Unit tests
- Running the collector and setting a breakpoint to verify the exemplars are being cleared in-between flushes. Before the change I could see the exemplar count continually growing

**Documentation:** Updated the documentation to mention that exemplars are added to all span metrics. Also mentioned when they are discarded
Component(s)
connector/spanmetrics
What happened?
Description
There is a memory leak when exemplars are enabled. In connector.go, histogram exemplars are reset on every export, but sum metric exemplars are not.
Expected Result
Memory usage should stay steady, in proportion to the number of metrics being generated.
Actual Result
Memory usage keeps growing until the process gets OOM killed in the k8s cluster.
Collector version
v0.95.0
Environment information
No response
OpenTelemetry Collector configuration
No response
Log output
No response
Additional context
No response