Track down Memory Leak in Teraslice v2.9.2 #3945

Open
godber opened this issue Feb 4, 2025 · 3 comments
godber commented Feb 4, 2025

About a month ago we rolled Teraslice v2.9.2 out into our production environment. After about three or four weeks, Teraslice workers started OOMing due to a newly introduced but very slow memory leak.

The things that changed in the jobs were:

teraslice:v2.6.3-nodev22.9.0  ---->  teraslice:v2.9.2-nodev22.12.0
kafka:5.1.0                   ---->  kafka:5.4.0
elasticsearch:4.0.3           ---->  elasticsearch:4.0.5

We've been tracking this internally, so there's more information than what is captured in this issue. In internal testing I rolled the kafka asset back to 5.1.0, but that didn't change the situation; the leak still existed (it takes about 18-24 hours to see the leak conclusively).

Here are the relevant versions from the latest releases:

| teraslice version | base image version | terafoundation_kafka_connector version | node-rdkafka version | librdkafka version | node 22 version |
|---|---|---|---|---|---|
| v2.12.3 | node-base:22.13.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.13.0 |
| v2.12.2 | node-base:22.13.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.13.0 |
| v2.12.1 | node-base:22.13.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.13.0 |
| v2.12.0 | node-base:22.12.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.12.0 |
| v2.11.0 | node-base:22.12.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.12.0 |
| v2.10.0 | node-base:22.12.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.12.0 |
| v2.9.2 | node-base:22.12.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.12.0 |
| v2.9.1 | node-base:22.12.0 | v1.2.1 | v3.2.1 | v2.6.1 | v22.12.0 |
| v2.9.0 | node-base:22.11.0 | v1.2.0 | v3.2.0 | v2.6.0 | v22.11.0 |
| v2.8.0 | node-base:22.9.0 | v1.0.0 | v3.1.1 | v2.5.3 | v22.9.0 |
| v2.7.0 | node-base:22.9.0 | v1.0.0 | v3.1.1 | v2.5.3 | v22.9.0 |
| v2.6.4 | node-base:22.9.0 | v1.0.0 | v3.1.1 | v2.5.3 | v22.9.0 |
| v2.6.3 | node-base:22.9.0 | v1.0.0 | v3.1.1 | v2.5.3 | v22.9.0 |
godber commented Feb 4, 2025

Recall that the versions with the leak were:

  • v2.9.2 - teraslice:v2.9.2, node:v22.12.0, terafoundation_kafka_connector:v1.2.1

We are going to make two release candidates:

  • 2.12.4-rc.0 - teraslice:master, node:v22.9.0, terafoundation_kafka_connector:v1.0.0
    • leak - implies teraslice code in master is the problem
    • no leak - implies kafka or node are the problem
  • 2.12.4-rc.1 - teraslice:master, node:v22.12.0, terafoundation_kafka_connector:v1.0.0
    • leak - implies teraslice code in master or node is the problem
    • no leak - implies kafka is the problem

godber commented Feb 5, 2025

We're getting this tracked down; we see this in the logs:

(node:7) MaxListenersExceededWarning: Possible EventTarget memory leak detected. 11 abort listeners added to [AbortSignal]. MaxListeners is 10. Use events.setMaxListeners() to increase limit

This PR added the AbortController, which appears to be leaking abort listeners:
#3838
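
As a rough illustration (a sketch, not Teraslice code) of how that warning arises: if every in-flight send attaches an `abort` listener to a single long-lived `AbortSignal` and never removes it, the listeners pile up past the limit of 10 and each one keeps its closure alive, which is consistent with a slow leak:

```ts
// Sketch only: one long-lived signal shared by every send.
const shutdownController = new AbortController();
const { signal } = shutdownController;

function sendMessage(payload: unknown): Promise<unknown> {
    return new Promise((resolve, reject) => {
        const onAbort = () => reject(new Error('shutting down'));
        // Added per call but never removed when the send resolves, so the
        // listener (and everything its closure references) stays attached
        // to `signal` for the life of the process.
        signal.addEventListener('abort', onAbort);
        setTimeout(() => resolve(payload), 10); // stand-in for the real round trip
    });
}

for (let i = 0; i < 20; i++) {
    void sendMessage({ sliceId: i }); // the 11th listener triggers the warning
}
```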

godber pushed a commit that referenced this issue Feb 5, 2025
This PR makes the following changes:

- Create an `abortController` in the core client `handleSendResponse` function if `sendAbortSignal` is set to true. This is only the case when the execution-controller client sends the `worker:slice:complete` event. On a `server:shutdown` event the `abortController` aborts the wait for a server response so the client can shut down properly.
- Remove the `abortController` from the execution-controller client
- Bump `teraslice-messaging` from 1.10.3 to 1.10.4

ref: #3945
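
For reference, a minimal sketch of the pattern that commit describes (the names below are illustrative, not the actual `teraslice-messaging` API): a short-lived `AbortController` is created per send, aborted on `server:shutdown`, and the shutdown listener is detached as soon as the send settles so nothing accumulates:

```ts
import { EventEmitter } from 'node:events';

// Sketch only: per-send AbortController tied to a shutdown event.
async function sendWithAbort(
    serverEvents: EventEmitter,
    send: (signal: AbortSignal) => Promise<unknown>
): Promise<unknown> {
    const abortController = new AbortController();
    const onShutdown = () => abortController.abort();
    serverEvents.once('server:shutdown', onShutdown);
    try {
        // If the signal aborts, `send` should reject rather than wait
        // forever for a response, letting the worker shut down cleanly.
        return await send(abortController.signal);
    } finally {
        // Detach the listener whether the send resolved, rejected, or aborted,
        // so neither listeners nor controllers accumulate across slices.
        serverEvents.removeListener('server:shutdown', onShutdown);
    }
}
```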
godber commented Feb 6, 2025

This leak should be resolved in Teraslice v2.12.5.
