Stricter memory limiter #8694

Open
cergxx opened this issue Oct 17, 2023 · 3 comments
cergxx commented Oct 17, 2023

Is your feature request related to a problem? Please describe.
In my collector I have a linear pipeline: receiver -> memory_limiter -> batch -> exporter, with the exporter's sending queue enabled.
The queue size is set to a large number so that data is never dropped. Instead, I rely on the memory limiter to refuse data when there is congestion on the exporter side (the queue grows, and memory usage grows along with it).
The existing memory limiter fails to do this reliably: sometimes it does not stop accepting new records fast enough, because it waits for the GC to free some memory, and the collector hits an OOM.
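For reference, the setup looks roughly like this (a sketch only; component names and numbers are illustrative, not my exact config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500
    spike_limit_mib: 700
  batch:

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      queue_size: 100000  # deliberately large so data is buffered instead of dropped

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```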

Describe the solution you'd like
I have two proposals:

  1. Add a "strict" mode to the memory limiter, so that it starts refusing incoming data as soon as memory usage goes above the limit, rather than only after triggering a GC (see the sketch after this list).
  2. Add some sort of limiter that inspects the exporter queue size and refuses data in the receiver once the queue grows over a limit. At a minimum, make the exporter queue size inspectable, so that my custom receiver/processor can do the refusing itself.
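To make proposal 1 concrete, here is a rough sketch of the kind of check I mean (this is not the existing memorylimiterprocessor code; the package/function names and the direct use of runtime.ReadMemStats are only illustrative):

```go
// strictlimiter is a sketch of proposal 1: refuse data based purely on the
// currently allocated heap, without first forcing a GC and re-checking.
package strictlimiter

import (
	"errors"
	"runtime"
)

// ErrDataRefused is returned when the heap is already above the hard limit.
var ErrDataRefused = errors.New("memory usage above limit, refusing data")

// CheckMemory refuses data as soon as the allocated heap exceeds limitMiB.
// Note: runtime.ReadMemStats briefly stops the world, so a real implementation
// would sample this in the background rather than on every batch.
func CheckMemory(limitMiB uint64) error {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.HeapAlloc > limitMiB*1024*1024 {
		return ErrDataRefused
	}
	return nil
}
```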

Describe alternatives you've considered
Proposal 1 – fork the existing memory limiter.
Proposal 2 – I could achieve the same behaviour with the following approach (sketched below):

  • Add one more processor to the pipeline, right before the exporter, that increments a counter of the current queue size.
  • Decrement this counter in the exporter when it starts processing a batch.
  • Inspect this counter in the receiver (or in one more processor placed before the batch processor) and return an error there when the queue is full.

But this solution feels too complex.
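For completeness, a sketch of that workaround (all names here are hypothetical; the gauge would have to be shared between my custom components):

```go
// queuegauge sketches the counter-based workaround: a shared estimate of the
// exporter queue depth, incremented by a processor placed right before the
// exporter, decremented when the exporter starts sending, and checked in the
// receiver to decide whether to refuse data.
package queuegauge

import (
	"errors"
	"sync/atomic"
)

// ErrQueueFull is returned when the estimated queue depth is over the limit.
var ErrQueueFull = errors.New("exporter queue estimate over limit, refusing data")

// QueueGauge tracks an approximate number of batches waiting in the exporter queue.
type QueueGauge struct {
	inFlight atomic.Int64
	limit    int64
}

func New(limit int64) *QueueGauge { return &QueueGauge{limit: limit} }

// OnEnqueue is called by the extra processor sitting right before the exporter.
func (g *QueueGauge) OnEnqueue() { g.inFlight.Add(1) }

// OnDequeue is called by the exporter when it starts processing a batch.
func (g *QueueGauge) OnDequeue() { g.inFlight.Add(-1) }

// Admit is called in the receiver (or an early processor) and refuses data
// once the estimated queue depth reaches the limit.
func (g *QueueGauge) Admit() error {
	if g.inFlight.Load() >= g.limit {
		return ErrQueueFull
	}
	return nil
}
```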
mx-psi (Member) commented Oct 19, 2023

I think #8632 would help address this

dloucasfx (Contributor) commented:

I was recently debugging a similar issue and my analysis led me to a slightly different conclusion:

  • The memory limiter is not enough to protect us from exceeding the configured heap usage limit, especially when a burst of data arrives between two of its polls. This is particularly problematic when the collector runs in a container with a memory limit.
  • The regular GC by default runs every time the heap doubles. When that is paired with the ballast, and the memory limiter's explicit GC did not run to reduce the heap, the regular GC will not run more than once.

In our case, the relevant logs look like this:

2023/10/23 15:03:49 settings.go:392: Set config to [/conf/relay.yaml]
2023/10/23 15:03:49 settings.go:445: Set ballast to 1351 MiB
2023/10/23 15:03:49 settings.go:461: Set memory limit to 3686 MiB
2023-10-23T15:03:49.923Z        info    service/telemetry.go:84 Setting up own telemetry...
......
2023-10-23T15:03:49.925Z	info	memorylimiterprocessor@v0.82.0/memorylimiter.go:102	Memory limiter configured	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "limit_mib": 3686, "spike_limit_mib": 737, "check_interval": 2}
.......
2023-10-23T15:03:51.938Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 673}
.......
2023-10-23T15:03:54.102Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1296}
........
2023-10-23T15:03:56.553Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 969}

Pod restarting:

lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2023-10-23T15:03:49Z'
          finishedAt: '2023-10-23T15:03:59Z'

You can see that the explicit GC from the memory limiter never ran, and a burst of data drove memory usage up between 15:03:56.553Z (when usage was 969 MiB) and the next memory limiter poll, which never got a chance to run.
In addition, the regular GC will not run either, because it only triggers once the 969 MiB + 1351 MiB (ballast) heap has doubled, which is more than the container limit (set to 4 GB).
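To put rough numbers on that: with the default GOGC=100 the next automatic collection triggers at about twice the heap that was live after the previous GC, so the target here is roughly 2 × (969 MiB + 1351 MiB) ≈ 4640 MiB, which is above the 4096 MiB container limit; the runtime therefore never collects on its own before the kernel OOM-kills the pod.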

My proposed solution is to get rid of the ballast and start using GOMEMLIMIT.
We would keep the memory limiter, but modify it to only be used for stopping/delaying/dropping data, since its GC functionality will no longer be needed.
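For illustration only (a sketch of the idea, not a patch): the limit can be set with the GOMEMLIMIT environment variable on the container, or programmatically via runtime/debug, which has roughly the same effect:

```go
// Sketch: replace the ballast with Go's soft memory limit, so the runtime
// itself collects more aggressively as usage approaches the cap.
// Exporting GOMEMLIMIT=3686MiB on the container is roughly equivalent to the
// call below (3686 MiB mirrors the "Set memory limit" value in the logs above).
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	prev := debug.SetMemoryLimit(3686 << 20) // the limit is given in bytes
	fmt.Printf("soft memory limit set, previous limit: %d bytes\n", prev)
}
```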

mx-psi (Member) commented Oct 24, 2023

@dloucasfx For the ballast, we are considering removal in #8343. I am interested in having end users test GOMEMLIMIT and provide feedback on the transition, so that we can make an informed decision. The official Helm chart now has a built-in option for this. If you are trying this out, please leave feedback on either #8343 or open-telemetry/opentelemetry-helm-charts/issues/891 so we can move forward with this :)
