Memory leak with distributed tracing enabled #13990

Closed
baryluk opened this issue Apr 25, 2022 · 15 comments

@baryluk

baryluk commented Apr 25, 2022

What happened?

Adding --experimental-enable-distributed-tracing works, but causes a memory leak of about 1 GB per hour in our setup. Instead of the expected ~2 GB, memory usage reached about 12 GB in 7 hours.
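
As a rough cross-check of the figures quoted above, the average growth rate can be worked out from the baseline, the observed peak, and the elapsed time. This is only a sketch: the inputs are the approximate values from the report, and the function name is illustrative.

```python
def leak_rate_gb_per_hour(baseline_gb: float, observed_gb: float, hours: float) -> float:
    """Average memory growth above the baseline, in GB per hour."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return (observed_gb - baseline_gb) / hours


# Approximate figures from the report: ~2 GB expected, ~12 GB after 7 hours.
rate = leak_rate_gb_per_hour(baseline_gb=2.0, observed_gb=12.0, hours=7.0)
print(f"average growth: {rate:.2f} GB/hour")
```

With these inputs the average comes out somewhat above 1 GB per hour, which is consistent with the "about 1 GB per hour" estimate given the rounding in the reported numbers.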

What did you expect to happen?

Stable memory usage around 1.8-2.0 GB of RSS.

How can we reproduce it (as minimally and precisely as possible)?

Run with --experimental-enable-distributed-tracing for a few hours. It is sufficient to enable it on one member.
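
To watch for the growth while reproducing, one option is to sample the RSS of the etcd process periodically. The sketch below assumes a Linux host where the etcd process is visible under /proc (for a containerized etcd this normally means running it on the node, not inside the container). The function names, the sampling interval, and the sample count are all illustrative choices, not anything taken from the report.

```python
import re
import time
from pathlib import Path


def parse_vm_rss_kib(status_text: str) -> int:
    """Extract the VmRSS value (in KiB) from the text of /proc/<pid>/status."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("VmRSS line not found in status text")
    return int(match.group(1))


def read_rss_gib(pid: int) -> float:
    """Read the current RSS of a process in GiB (Linux only)."""
    status = Path(f"/proc/{pid}/status").read_text()
    return parse_vm_rss_kib(status) / (1024 * 1024)


def watch(pid: int, interval_s: float = 600.0, samples: int = 42) -> None:
    """Print one timestamped RSS sample per interval (hypothetical defaults)."""
    for _ in range(samples):
        stamp = time.strftime("%Y-%m-%d %H:%M:%S")
        print(f"{stamp} rss={read_rss_gib(pid):.2f} GiB", flush=True)
        time.sleep(interval_s)
```

Calling `watch(pid)` with the PID of the etcd process prints a series that should stay flat on a healthy member and climb steadily on a member showing the leak.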

Anything else we need to know?

The tracing collector endpoint doesn't need to be configured or listening. Having otelcol on 4317 doesn't change anything (beyond actually making tracing work).

Etcd version (please run commands below)

$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05
Go Version: go1.16.6
Go OS/Arch: linux/amd64

$ etcdctl version
etcdctl version: 3.5.0
API version: 3.5

Etcd configuration (command line flags or environment variables)

etcd, Kubernetes, OKD / Openshift 4.9, 3 members.

etcd --experimental-enable-distributed-tracing --logger=zap --log-level=info --initial-advertise-peer-urls=https://10.10.0.102:2380 --cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.crt --key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-serving-master-1.example.com.key --trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-serving-ca/ca-bundle.crt --client-cert-auth=true --peer-cert-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com..crt --peer-key-file=/etc/kubernetes/static-pod-certs/secrets/etcd-all-certs/etcd-peer-master-1.example.com.key --peer-trusted-ca-file=/etc/kubernetes/static-pod-certs/configmaps/etcd-peer-client-ca/ca-bundle.crt --peer-client-cert-auth=true --advertise-client-urls=https://10.10.0.102:2379 --listen-client-urls=https://0.0.0.0:2379,unixs://10.10.0.102:0 --listen-peer-urls=https://0.0.0.0:2380 --metrics=extensive --listen-metrics-urls=https://0.0.0.0:9978

running in cri-o

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

[root@master-1 /]# etcdctl member list -w table
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
|        ID        | STATUS  |             NAME             |        PEER ADDRS        |       CLIENT ADDRS       | IS LEARNER |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+
| 10f8cf6269xxx    | started | master-2.example.com         | https://10.10.0.103:2380 | https://10.10.0.103:2379 |      false |
| a2bbe7149xxx     | started | master-1.example.com         | https://10.10.0.102:2380 | https://10.10.0.102:2379 |      false |
| acb2c160xxx      | started | master-0.example.com         | https://10.10.0.101:2380 | https://10.10.0.101:2379 |      false |
+------------------+---------+------------------------------+--------------------------+--------------------------+------------+

Relevant log output

No fatal issues in the logs.
@baryluk
Author

baryluk commented Apr 25, 2022

Memory usage (master-1 is the member where tracing was enabled, around 1:00 PM CEST, which is the middle of the timeline on the graph):

Screenshot at 2022-04-25 19-35-34

pprof heap snapshots without and with tracing enabled, taken every hour (starting just after etcd started) for 7 hours:
etcd_pprof_issue_13990.tar.gz
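
For anyone wanting to collect a comparable series, here is a minimal sketch of hourly heap snapshot collection. It rests on several assumptions: the thread does not say how the attached snapshots were taken; the sketch assumes etcd was started with `--enable-pprof` (which serves profiles under `/debug/pprof/` on the client URL and is not part of the command line shown in this report); and the certificate paths, function names, and file naming scheme are hypothetical.

```python
import ssl
import time
import urllib.request


def heap_profile_url(client_url: str) -> str:
    """Build the heap profile URL for a client URL (assumes --enable-pprof)."""
    return client_url.rstrip("/") + "/debug/pprof/heap"


def snapshot_filename(index: int, epoch_s: float) -> str:
    """Name a snapshot by sequence number and UTC timestamp (illustrative scheme)."""
    stamp = time.strftime("%Y%m%dT%H%M%SZ", time.gmtime(epoch_s))
    return f"heap_{index:02d}_{stamp}.pb.gz"


def fetch_heap_profile(url: str, ca_file: str, cert_file: str, key_file: str,
                       timeout_s: float = 30.0) -> bytes:
    """Download one heap profile using client-certificate TLS."""
    ctx = ssl.create_default_context(cafile=ca_file)
    ctx.load_cert_chain(certfile=cert_file, keyfile=key_file)
    with urllib.request.urlopen(url, context=ctx, timeout=timeout_s) as resp:
        return resp.read()


def collect(client_url: str, ca_file: str, cert_file: str, key_file: str,
            count: int = 8, interval_s: float = 3600.0) -> None:
    """Save `count` snapshots, one per interval, starting immediately."""
    url = heap_profile_url(client_url)
    for i in range(count):
        data = fetch_heap_profile(url, ca_file, cert_file, key_file)
        with open(snapshot_filename(i, time.time()), "wb") as out:
            out.write(data)
        if i < count - 1:
            time.sleep(interval_s)
```

The saved files can then be compared with `go tool pprof`, for example by diffing the first snapshot against a later one to see which allocation sites grow.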

@eval-exec
Contributor

eval-exec commented Apr 27, 2022

$ etcd --version
etcd Version: 3.5.0
Git SHA: f99cada05

I can't find the Git SHA you provided in the etcd-io/etcd git log. Does this etcd build contain custom commits of yours?

@baryluk
Author

baryluk commented Apr 27, 2022

I can't find the Git SHA you provided in the etcd-io/etcd git log. Does this etcd build contain custom commits of yours?

No custom commits by me. This is the etcd distributed as part of OKD v1.22.1-1839. I am not familiar enough with their build to know what changes (if any) were made.

@ahrtr
Member

ahrtr commented Apr 27, 2022

I can't find the Git SHA you provided in the etcd-io/etcd git log. Does this etcd build contain custom commits of yours?

No custom commits by me. This is the etcd distributed as part of OKD v1.22.1-1839. I am not familiar enough with their build to know what changes (if any) were made.

It looks like OpenShift customized etcd? @hexfusion could you confirm this?

I just downloaded the official 3.5.0, and did a quick verification below.

$ ./etcd --version
etcd Version: 3.5.0
Git SHA: 946a5a6f2
Go Version: go1.16.3
Go OS/Arch: linux/amd64

$ git log --pretty=oneline | grep 946a5a6f2
946a5a6f25c3b6b89408ab447852731bde6e6289 version: 3.5.0
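
The same lookup can be scripted when several SHAs need checking. The sketch below mirrors the `git log --pretty=oneline` check above, with one difference that is a design choice of this example: it matches the short SHA only as a prefix of the commit hash, whereas `grep` matches anywhere in the line. The function names and the repository path argument are illustrative.

```python
import subprocess
from typing import Iterable, List, Optional


def find_commit(oneline_log: Iterable[str], short_sha: str) -> Optional[str]:
    """Return the first oneline log entry whose hash starts with short_sha."""
    prefix = short_sha.strip().lower()
    if not prefix:
        raise ValueError("short_sha must not be empty")
    for line in oneline_log:
        if line.lower().startswith(prefix):
            return line.rstrip("\n")
    return None


def git_oneline_log(repo_dir: str) -> List[str]:
    """Run `git log --pretty=oneline` in repo_dir and return its lines."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "log", "--pretty=oneline"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.splitlines()
```

Given a local clone, `find_commit(git_oneline_log(path), "946a5a6f2")` returns the matching log line, and a SHA that is absent from the history yields `None`.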

@ahrtr
Member

ahrtr commented Apr 27, 2022

cc @lilic

@hexfusion
Contributor

It looks like OpenShift customized etcd? @hexfusion could you confirm this?

I can confirm it's not an upstream binary; this is the downstream repo the build comes from [1]. The changes to etcd itself would be minimal. 3.5.0 uses a pretty old version of otel (pre-v1), so it's possible that it had a bug as well.

[1] https://github.com/openshift/etcd

@stale

stale bot commented Jul 31, 2022

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale bot added the stale label Jul 31, 2022
ahrtr added stage/tracked and removed stale labels Jul 31, 2022
@serathius
Member

Fixing this is required to graduate distributed tracing.

@ahrtr
Member

ahrtr commented Sep 8, 2022

We have already bumped otel to 1.0.1 in #14312. @baryluk Could you please double check whether you can still see this issue? thx

@baryluk
Author

baryluk commented Sep 8, 2022

We have already bumped otel to 1.0.1 in #14312. @baryluk Could you please double check whether you can still see this issue? thx

Sure, I can try on Monday to test it.

@serathius
Member

ping @baryluk

baryluk closed this as not planned Oct 27, 2023
@serathius
Member

Hey @baryluk, can you confirm that the issue was addressed? You closed the issue as "not planned", so I wanted to double check.

@serathius
Member

serathius commented Nov 8, 2023

cc @dashpole

@baryluk
Author

baryluk commented Nov 8, 2023

@serathius I was not able to reproduce the issue.

@serathius
Member

Great, closing issue as fixed. Thanks for looking into this.
