Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider removing memory_ballast extension #8343

Closed
mx-psi opened this issue Aug 31, 2023 · 11 comments
Closed

Consider removing memory_ballast extension #8343

mx-psi opened this issue Aug 31, 2023 · 11 comments
Labels
release:required-for-ga Must be resolved before GA release

Comments

@mx-psi
Copy link
Member

mx-psi commented Aug 31, 2023

The memory_ballast extension allows defining a 'heap ballast' to make the behavior of Go's garbage collector more predictable. Quoting this post,

the ballast increases the base size of the heap so that our GC triggers are delayed and the number of GC cycles over time is reduced.

This functionality was added in #45 but a few years and Go versions have passed since and there has since been discussion (e.g. see #7512 (comment) ) on the need of this extension today. In particular, tweaking GOMEMLIMIT and GOGC may be enough to tackle the issues the extension was meant to address.
It also seems like the extension may be risky in some cases and increase memory usage (this may be the cause of #7512).

The most recent issue on the matter upstream is golang/go/issues/42430.

This issue is to decide whether we should remove the memory ballast extension or not.

cc @Aneurysm9 @dmitryax

@dmitryax
Copy link
Member

dmitryax commented Sep 5, 2023

@mx-psi thank you for summarizing the issue.

I don't see any good reason to keep this extension. GOMEMLIMIT env var is much more intuitive to use. The only configuration that the user can potentially miss is size_in_percentage, but I doubt it's something we should keep this extension for.

I'd suggest we just go ahead and deprecate it with clear instructions on how to utilize GOMEMLIMIT and GOGC instead. We can link this issue in the deprecation warning for users to comment if they have any concerns.

Let's discuss it in the next SIG meeting.

@mx-psi
Copy link
Member Author

mx-psi commented Sep 5, 2023

One starting point to ensure we can do this correctly is to replace the memory ballast extension by an equivalent mechanism using GOMEMLIMIT and GOGC in the Collector Helm chart and ensure that it works correctly in that setting. We can use that experience to aid us in writing the documentation.

@MovieStoreGuy
Copy link
Contributor

From Atlassian's experience, we've not needed the ballast extension for a deployment that is strictly observing the host (doesn't expose any endpoints for services to send data, and prom is used to monitor the collector itself).

On average, our deployments will use:

  • 50Mb of memory with a hard limit 300Mb (set by docker aka cgroups)
  • 3% CPU utilisation with a hard limit of 1 full core.

What is being monitored by the collector changes from service to service but roughly 4 - 10 receivers are running and capturing data. (some are internal to us so it didn't make sense to list them out).

We have an internal service that works near similar to the collector which we had messed around with the GOGC value and we noted as a side affect that having control over this either meant high CPU usage no significant reduction in memory usage; the service was spend more time trying to perform the GC operation but not allowing the progress of the business functions. The alternative was high memory usage which eventually lead to OOMs.
However, the go version used here was 1.17 which look like things have improved from what I am reading.

To get back on topic, I think the risk of continued maintenance of the ballast extension is:

  • It is potentially hiding memory leaks in the collector by cleaning up faster
    • I had seen this when components used pointers to the pdata values which meant things were being escaped to the heap when being on the stack was adequate.
  • I think we should leverage the language runtime more here
    • There will be more public documentation / articles that we can reference with understanding how to tune the collector's heap usage.
  • Manually calling the GC via runtime.GC is a hazard due to more frequent Stop The World sweeps which could make the performance problems worse.
    • Consider the ballast called runtime.GC due to going over the memory limit but has happened after the runtime itself had performed a GC operation. Since the method can block the entire application, it means that no connections will be served until it is done and if collector is running in a scenario that it needs the extension then it more than likely experience an unexplainable brief outage.

I believe it served us well, but I see it as a bigger risk keeping it around.

@mx-psi
Copy link
Member Author

mx-psi commented Sep 6, 2023

Here is a user report for GOMEMLIMIT as a replacement for a memory ballast.

An alternative to outright removal would be to modify the current implementation to use debug.SetMemoryLimit. The relationship between ballast size and memory limit value is not clear to me however, and it can be confusing having a memory_ballast extension that does not in fact use a memory ballast.

@mx-psi
Copy link
Member Author

mx-psi commented Sep 13, 2023

Filed open-telemetry/opentelemetry-helm-charts/issues/891 to do #8343 (comment) as decided on the last Collector SIG

@hanswurscht
Copy link

Are there any plans on implementing GOMEMLIMIT for the operator/operator-chart as well?

@mx-psi
Copy link
Member Author

mx-psi commented Oct 6, 2023

@hanswurscht I think we can do something similar to the Helm chart approach, it will further help validate that GOMEMLIMIT is a valid alternative. Could you file an issue on the operator repository and link it to this one?

@mx-psi
Copy link
Member Author

mx-psi commented Nov 6, 2023

Based on the discussion on open-telemetry/opentelemetry-helm-charts#891 and this issue, I filed #8803 to deprecate the memory ballast extension.

@bboreham
Copy link
Contributor

bboreham commented Dec 4, 2023

It is potentially hiding memory leaks in the collector by cleaning up faster

This is a misunderstanding. Go heap profiles only show what survived after the last garbage-collection.

(Also heap ballast serves to slow down garbage collection)

@atoulme atoulme added the release:required-for-ga Must be resolved before GA release label Dec 18, 2023
bogdandrutu added a commit that referenced this issue Dec 19, 2023
**Description:** 

Based on user reports on
open-telemetry/opentelemetry-helm-charts/issues/891 and the discussion
on #8343, we can deprecate the memory ballast extension in favor of
using `GOMEMLIMIT`. This PR:

- Deprecates the memory ballast extension in the README
- Removes references to the memory ballast extension on docs
- Updates k8s example to use `GOMEMLIMIT` with the same approach as in
the Helm chart (80% of memory limit)
- Deprecates the memory ballast extension Go module

Once this PR is accepted,
open-telemetry/opentelemetry-helm-charts/issues/891 can move ahead with
enabling `useGOMEMLIMIT` by default on the Helm chart.

Other issues will be opened for opentelemetry.io, the Opentelemetry
Operator and other parts of the OpenTelemetry project to remove
references to this extension once the PR is merged.

No explicit timeline is given for removal of the extension.

**Link to tracking Issue:** Updates #8343

---------

Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
@mx-psi mx-psi removed the discussion-needed Community discussion needed label Feb 8, 2024
@arielvalentin
Copy link

I have an anecdote for you.

We had been struggling with trying to tune our gateway collectors for a few weeks after we started experimenting with connectors to generate metrics from our traces. Though we did identify some memory leaks, under load the collector would spike its memory usage and eventually cause the memory_limiter to start rejecting spans and metrics.

It would then get stuck in GC loops until our readiness probes failed and k8s would lay them to rest.

After replacing memory_ballast+GOMEMLIMIT with only GOMEMLIMIT we see a drastic reduction of memory of the collectors and increased throughput:

Image

mx-psi added a commit that referenced this issue Sep 4, 2024
…10671)

#### Description
This PR removes the deprecated memory ballast extension and all the
logic in place in memorylimiter and service that was using it.

#### Link to tracking issue
Related to
#8343. I
don't want to close it until the [helm
chart](open-telemetry/opentelemetry-helm-charts#1268)
is updated.

#### Testing
Unit tests

---------

Co-authored-by: Antoine Toulme <antoine@toulme.name>
Co-authored-by: Curtis Robert <crobert@splunk.com>
Co-authored-by: Pablo Baeyens <pablo.baeyens@datadoghq.com>
@mx-psi
Copy link
Member Author

mx-psi commented Sep 4, 2024

This is now done!! 🎉 🎉 Thanks to @TylerHelmuth for making the final push!

@mx-psi mx-psi closed this as completed Sep 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:required-for-ga Must be resolved before GA release
Projects
None yet
Development

No branches or pull requests

7 participants