Stricter memory limiter #8694

Open
cergxx opened this issue Oct 17, 2023 · 3 comments
cergxx commented Oct 17, 2023

Is your feature request related to a problem? Please describe.
In my collector I have a linear pipeline: receiver -> memory_limiter -> batch -> exporter, with the exporter's sending queue enabled.
The queue size is set to a large number so that data is never dropped. Instead, I rely on the memory limiter to refuse data when there is congestion on the exporter side (the queue grows, and memory usage grows along with it).
The existing memory limiter fails to do this reliably: sometimes it does not stop accepting new records fast enough, because it waits for the GC to free some memory, and the collector hits an OOM.
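For reference, the setup looks roughly like this (a sketch only; component names and numbers are illustrative, not my exact config):

```yaml
receivers:
  otlp:
    protocols:
      grpc:

processors:
  memory_limiter:
    check_interval: 1s
    limit_mib: 3500
    spike_limit_mib: 700
  batch:

exporters:
  otlp:
    endpoint: backend:4317
    sending_queue:
      enabled: true
      queue_size: 100000  # deliberately large so data is buffered instead of dropped

service:
  pipelines:
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
```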

Describe the solution you'd like
I have two proposals:

  1. Add a "strict" mode to the memory limiter, so that it starts refusing incoming data as soon as memory usage goes above the limit, rather than only after triggering a GC (see the sketch after this list).
  2. Add some sort of limiter that inspects the exporter queue size and refuses data in the receiver once the queue grows over a limit. At a minimum, make the exporter queue size inspectable, so that my custom receiver/processor can do the refusing itself.
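To make proposal 1 concrete, here is a rough sketch of the kind of check I mean (this is not the existing memorylimiterprocessor code; the package/function names and the direct use of runtime.ReadMemStats are only illustrative):

```go
// strictlimiter is a sketch of proposal 1: refuse data based purely on the
// currently allocated heap, without first forcing a GC and re-checking.
package strictlimiter

import (
	"errors"
	"runtime"
)

// ErrDataRefused is returned when the heap is already above the hard limit.
var ErrDataRefused = errors.New("memory usage above limit, refusing data")

// CheckMemory refuses data as soon as the allocated heap exceeds limitMiB.
// Note: runtime.ReadMemStats briefly stops the world, so a real implementation
// would sample this in the background rather than on every batch.
func CheckMemory(limitMiB uint64) error {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	if ms.HeapAlloc > limitMiB*1024*1024 {
		return ErrDataRefused
	}
	return nil
}
```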

Describe alternatives you've considered
Proposal 1 – fork the existing memory limiter.
Proposal 2 – I could achieve the same behaviour with the following approach (sketched below):

  • Add one more processor to the pipeline, right before the exporter, that increments a counter of the current queue size.
  • Decrement this counter in the exporter when it starts processing a batch.
  • Inspect this counter in the receiver (or in one more processor placed before the batch processor) and return an error there when the queue is full.

But this solution feels too complex.
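For completeness, a sketch of that workaround (all names here are hypothetical; the gauge would have to be shared between my custom components):

```go
// queuegauge sketches the counter-based workaround: a shared estimate of the
// exporter queue depth, incremented by a processor placed right before the
// exporter, decremented when the exporter starts sending, and checked in the
// receiver to decide whether to refuse data.
package queuegauge

import (
	"errors"
	"sync/atomic"
)

// ErrQueueFull is returned when the estimated queue depth is over the limit.
var ErrQueueFull = errors.New("exporter queue estimate over limit, refusing data")

// QueueGauge tracks an approximate number of batches waiting in the exporter queue.
type QueueGauge struct {
	inFlight atomic.Int64
	limit    int64
}

func New(limit int64) *QueueGauge { return &QueueGauge{limit: limit} }

// OnEnqueue is called by the extra processor sitting right before the exporter.
func (g *QueueGauge) OnEnqueue() { g.inFlight.Add(1) }

// OnDequeue is called by the exporter when it starts processing a batch.
func (g *QueueGauge) OnDequeue() { g.inFlight.Add(-1) }

// Admit is called in the receiver (or an early processor) and refuses data
// once the estimated queue depth reaches the limit.
func (g *QueueGauge) Admit() error {
	if g.inFlight.Load() >= g.limit {
		return ErrQueueFull
	}
	return nil
}
```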
mx-psi (Member) commented Oct 19, 2023

I think #8632 would help address this

dloucasfx (Contributor) commented:

I was recently debugging a similar issue and my analysis led me to a slightly different conclusion:

  • The memory limiter is not enough to protect us from exceeding the configured heap usage limit, especially when a burst of data arrives between two of its polls. This is particularly problematic when the collector runs in a container with a memory limit.
  • The regular GC by default runs every time the heap doubles. When that is paired with the ballast, and the memory limiter's explicit GC did not run to reduce the heap, the regular GC will not run more than once.

In our case, the relevant logs look like this:

2023/10/23 15:03:49 settings.go:392: Set config to [/conf/relay.yaml]
2023/10/23 15:03:49 settings.go:445: Set ballast to 1351 MiB
2023/10/23 15:03:49 settings.go:461: Set memory limit to 3686 MiB
2023-10-23T15:03:49.923Z        info    service/telemetry.go:84 Setting up own telemetry...
......
2023-10-23T15:03:49.925Z	info	memorylimiterprocessor@v0.82.0/memorylimiter.go:102	Memory limiter configured	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "limit_mib": 3686, "spike_limit_mib": 737, "check_interval": 2}
.......
2023-10-23T15:03:51.938Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 673}
.......
2023-10-23T15:03:54.102Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1296}
........
2023-10-23T15:03:56.553Z	debug	memorylimiterprocessor@v0.82.0/memorylimiter.go:273	Currently used memory.	{"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 969}

Pod restarting:

lastState:
        terminated:
          exitCode: 137
          reason: OOMKilled
          startedAt: '2023-10-23T15:03:49Z'
          finishedAt: '2023-10-23T15:03:59Z'

You can see that the explicit GC from the memory limiter never ran, and a burst of data drove memory usage up between 15:03:56.553Z (when usage was 969 MiB) and the next memory limiter poll, which never got a chance to run.
In addition, the regular GC will not run either, because it only triggers once the 969 MiB + 1351 MiB (ballast) heap has doubled, which is more than the container limit (set to 4 GB).
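To put rough numbers on that: with the default GOGC=100 the next automatic collection triggers at about twice the heap that was live after the previous GC, so the target here is roughly 2 × (969 MiB + 1351 MiB) ≈ 4640 MiB, which is above the 4096 MiB container limit; the runtime therefore never collects on its own before the kernel OOM-kills the pod.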

My proposed solution is to get rid of the ballast and start using GOMEMLIMIT.
We would keep the memory limiter, but modify it to only be used for stopping/delaying/dropping data, since its GC functionality will no longer be needed.
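For illustration only (a sketch of the idea, not a patch): the limit can be set with the GOMEMLIMIT environment variable on the container, or programmatically via runtime/debug, which has roughly the same effect:

```go
// Sketch: replace the ballast with Go's soft memory limit, so the runtime
// itself collects more aggressively as usage approaches the cap.
// Exporting GOMEMLIMIT=3686MiB on the container is roughly equivalent to the
// call below (3686 MiB mirrors the "Set memory limit" value in the logs above).
package main

import (
	"fmt"
	"runtime/debug"
)

func main() {
	prev := debug.SetMemoryLimit(3686 << 20) // the limit is given in bytes
	fmt.Printf("soft memory limit set, previous limit: %d bytes\n", prev)
}
```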

mx-psi (Member) commented Oct 24, 2023

@dloucasfx For the ballast, we are considering removal in #8343. I am interested in having end users test GOMEMLIMIT and provide feedback on the transition, so that we can make an informed decision. The official Helm chart now has a built-in option for this. If you are trying this out, please leave feedback on either #8343 or open-telemetry/opentelemetry-helm-charts/issues/891 so we can move forward with this :)
