
Sudden high memory usage. #7747

Closed
LoonyRules opened this issue Oct 4, 2021 · 15 comments
Labels
  • kind/bug: Categorizes issue or PR as related to a bug.
  • lifecycle/rotten: Denotes an issue or PR that has aged beyond stale and will be auto-closed.
  • needs-priority
  • needs-triage: Indicates an issue or PR lacks a `triage/foo` label and requires one.

Comments

@LoonyRules

NGINX Ingress controller version: Chart Revision: v4.0.3 | Chart App Version: 1.0.2 | Nginx version: 1.19.9
Kubernetes version: 1.20.2-do.0

Environment:

  • Cloud provider or hardware configuration: Digital Ocean Managed Kubernetes

  • OS (e.g. from /etc/os-release): Pod runs Alpine Linux v3.14.2

  • Kernel (e.g. uname -a): 4.19.0-11-amd64

  • Install tools: N/A

  • Other: N/A

  • How was the ingress-nginx-controller installed:

    • helm list output: NAME=ingress-nginx | NAMESPACE=ingress-controller | REVISION=2 | UPDATED=2021-10-01 09:12:17.6727727 +0100 BST | STATUS=deployed | CHART=ingress-nginx-4.0.3 | APP VERSION=1.0.2
    • values.yaml:
controller:
  admissionWebhooks:
    enabled: false
  config:
    compute-full-forwarded-for: true
    forwarded-for-header: CF-Connecting-IP
    proxy-real-ip-cidr: <cf-cidrs>
    use-forwarded-headers: true
    use-proxy-protocol: true
  extraArgs:
    default-ssl-certificate: <default_cert_location>
  extraInitContainers:
  - command:
    - sh
    - -c
    - sysctl -w net.core.somaxconn=32768; sysctl -w net.ipv4.ip_local_port_range='1024 65000'
    image: alpine:3.13
    name: sysctl
    securityContext:
      privileged: true
  hostPort:
    enabled: true
  kind: DaemonSet
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  service:
    type: ClusterIP
  watchIngressWithoutClass: true
  • Current State of the controller: Perfectly fine until the issue arises.

What happened:
NGINX ingress pods suddenly go from ~500MB of RAM usage on average (stable for multiple days at a time) to slowly climbing to 1.4GB over the span of a few hours, until the pod is SystemOOM-killed due to a lack of available memory on the node it was assigned to.

Our connections and other metrics do not spike out of the ordinary during this time; it is simply memory usage climbing and then dropping. There's no pattern to the crashing: we can go days without the issue and then it happens all of a sudden, typically after 2-9 days. Once the issue arises on one controller, we can expect the others to follow suit, usually an hour or two after one another.
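(Side note: since metrics and a ServiceMonitor are enabled in the values above, something like the minimal PrometheusRule sketch below could catch the ramp before the node OOMs. The pod-name regex, container label, and 1GB threshold are illustrative assumptions, not values taken from our cluster.)

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: ingress-nginx-memory   # hypothetical name
spec:
  groups:
  - name: ingress-nginx.memory
    rules:
    - alert: IngressNginxMemoryRamp
      # Standard cAdvisor working-set metric; the pod regex is a guess at the pod names.
      expr: container_memory_working_set_bytes{pod=~"ingress-nginx-controller.*",container="controller"} > 1e9
      for: 15m
      labels:
        severity: warning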

Our pod logs are spammed with request logs, so we are unable to tell whether nginx itself is logging any errors. When we view the pod logs we usually only see the last minute or so, sometimes only the last 40 seconds even if we time the saving of logs perfectly. This makes it hard to know whether nginx throws errors during the spike and the crash. Any recommendations for making this easier would be highly appreciated.
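(For reference, a minimal sketch of how the access-log noise could be reduced so error-level output stays visible, assuming the documented disable-access-log ConfigMap key, set through the same controller.config block as in the values above:)

controller:
  config:
    # Drop per-request access logging so only nginx error/warning output remains
    # in the pod logs; access logs can still be re-enabled per Ingress with the
    # nginx.ingress.kubernetes.io/enable-access-log annotation.
    disable-access-log: "true"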

Just in case this was a core-dump issue (#6896), we upgraded to v1.0.2 (https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.2). The issue then showed itself within 32 hours of the deploy, when we were at 15,000 concurrent open WebSockets across 3 controller pods. We typically do not see the issue at the weekend, where our peak is 15,000; the mid-week peak is typically around the 10,000 mark.

Controller 1 that died:
[screenshot]

Controllers 2 & 3 that started following the same pattern:
[screenshot]
[screenshot]

Note: this has been discussed with @strongjz in the #ingress-nginx-users channel on the Kubernetes Slack: https://kubernetes.slack.com/archives/CANQGM8BA/p1632951733103200

What you expected to happen:
Memory not to spike randomly and crash the pod, regardless of whether we have 6,000, 10,000, or 25,000 connections.

How to reproduce it:
I have no idea how we are even triggering the issue, so sadly I am unable to give exact reproduction steps.

/kind bug

LoonyRules added the kind/bug label on Oct 4, 2021
k8s-ci-robot added the needs-triage label on Oct 4, 2021
@k8s-ci-robot
Contributor

@LoonyRules: This issue is currently awaiting triage.

If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@LoonyRules
Author

I just stumbled across #7647 and I feel this could be related, given the graphs shared by Ivauvillier in #7647 (comment).

I will review our graphs and wait for the bug to arise again; once it does, I'll do more digging with memory tools to see whether the same trend can be found in both of our issues.

We also use SSL for all of our ingress objects.

@strongjz
Member

strongjz commented Oct 4, 2021

Can you test with v1.0.3, which was released today? The base image was updated via #7647 (comment)

@LoonyRules
Author

Upgrade finished just now, will let you know of the results hopefully in the next 7 days. 🤞

@BSWANG

BSWANG commented Oct 9, 2021

Can you test with v1.0.3, which was released today? The base image was updated via #7647 (comment)

@strongjz But in v1.0.3, nginx is still dynamically linked against Alpine's OpenSSL, not OpenResty's OpenSSL library:
[screenshot]

@rikatz
Contributor

rikatz commented Oct 10, 2021

We didn't merge the one with OpenResty.

We've found this can be related to the builtin ssl_session_cache, so a new image with that disabled by default will be released.

The next step/release will be to add the OpenResty patches :)

@LoonyRules
Author

LoonyRules commented Oct 12, 2021

In the last 7 days we have not had the issue mentioned in the initial post, where memory usage suddenly starts climbing over a span of a few hours until it reaches 1.4GB, causing the pod to crash with an OOM error.

After reading both of your comments, I am more concerned and confused as to why the problem has suddenly stopped if the suspected cause has not actually been fixed. It would be typical for the issue to show itself again right after I post this comment, and I secretly hope that is the case.

-- However --

Since the upgrades to v1.0.2 and v1.0.3 we have noticed that our WebSocket servers (WSS) have been terminating connections because the WebSocket clients (WSC) do not respond to keep-alive packets in time. The client gets 60 seconds to respond before the server closes the connection itself with a custom status code 4003, indicating that a keep-alive acknowledgement took too long to arrive. It is possible this was an issue before the upgrade and we only just noticed it.

This, to me, would indicate that the nginx ingress controller has stopped relaying the packets towards the service, making the WSS believe that every user (~15,000 at peak) has a dead connection, so it terminates them. All 15,000 connections then reconnect at once, putting a decent load on the nginx pods.

I have checked the nginx ingress pod logs when this occurs and there are absolutely zero error or warning logs printed by the process itself. Note that our 0.95 and 0.99 latency percentiles spike to 10 seconds at least once an hour regardless of the load on our nginx ingress pods, which does not feel normal to me either - although I would expect it when 15,000 people are reconnecting, and it does happen then.

When this issue happens, it typically occurs 3 times within an hour or so, once for each ingress pod we have. Memory and CPU usage typically sit around ~500MB and ~0.2 vCPU, so we should have plenty of resources available.

Not sure what to do about this - shall I create another issue here on GitHub, as this could be a separate problem? And does anyone have any idea what is happening here, or a way to look into it more in depth?

@Lyt99
Contributor

Lyt99 commented Oct 14, 2021

@rikatz we've tried setting 'ssl-session-cache' to false in the ConfigMap of version 0.44.0 and running it for a few days, and we have no memory issue now (although with higher CPU usage). So I think it can be concluded that something in the SSL session cache is causing this issue.
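(For anyone wanting to try the same workaround, a minimal sketch of the ConfigMap change, assuming the documented ssl-session-cache key; the ConfigMap name and namespace below are placeholders that depend on the install:)

apiVersion: v1
kind: ConfigMap
metadata:
  name: ingress-nginx-controller   # placeholder; must match the controller's --configmap flag
  namespace: ingress-nginx         # placeholder
data:
  # Disables nginx's shared ssl_session_cache, trading the memory growth for some
  # extra TLS handshake CPU since sessions are no longer resumed from the cache.
  ssl-session-cache: "false"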

@LoonyRules
Author

-- Upgrading to https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.4 to see if this has any effect on our initial issue. As for the connection drops mentioned in #7747 (comment), I still have no clue why they are happening; we have even increased the proxy-connect-timeout and proxy-read-timeout values to rule out the default 60s timeouts causing the issue (roughly as sketched below). I'm not sure what else to look at in that regard.
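(A sketch of roughly what that timeout bump looks like through controller.config in our values.yaml; the numbers here are illustrative rather than the exact values we settled on:)

controller:
  config:
    # Values are in seconds, raised well above the defaults to rule out nginx
    # reaping idle WebSocket connections before the keep-alive round trip completes.
    proxy-connect-timeout: "30"
    proxy-read-timeout: "3600"
    proxy-send-timeout: "3600"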

@boomaker

I've upgraded to v1.0.4 and the issue persists:

bash-5.1$ /nginx-ingress-controller --version
-------------------------------------------------------------------------------
NGINX Ingress controller
  Release:       v1.0.4
  Build:         9b78b6c197b48116243922170875af4aa752ee59
  Repository:    https://github.com/kubernetes/ingress-nginx
  nginx version: nginx/1.19.9

-------------------------------------------------------------------------------
bash-5.1$ cat nginx.conf |grep ssl_session_cache
        ssl_session_cache shared:SSL:10m;
Mem: 15922652K used, 471644K free, 36092K shrd, 449876K buff, 4172056K cached
CPU:  38% usr   8% sys   0% nic  48% idle   1% io   0% irq   3% sirq
Load average: 2.11 2.52 3.15 8/1695 58168
  PID  PPID USER     STAT   VSZ %VSZ CPU %CPU COMMAND
58134    28 www-data R    3668m  21%   1   3% nginx: worker process
58101    28 www-data S    3668m  21%   1   3% nginx: worker process
   28     6 www-data S    3652m  21%   0   0% nginx: master process /usr/local/nginx/sbin/nginx -c /etc/nginx/nginx.conf
58167    28 www-data S    3650m  21%   1   0% nginx: cache manager process
    6     1 www-data S     795m   5%   2   1% /nginx-ingress-controller --publish-service=kube-system/ingress-nginx-controller --election-id=ingress-controller-leader --controller-class=k8s.io/ingress-nginx --configmap=kube-syste
57329 57322 www-data S     258m   2%   1   0% bash
58168 57329 www-data R     257m   2%   0   0% top
57322     0 www-data S     257m   2%   1   0% sh -c clear; (bash || ash || sh)
    1     0 www-data S      204   0%   2   0% /usr/bin/dumb-init -- /nginx-ingress-controller --publish-service=kube-system/ingress-nginx-controller --election-id=ingress-controller-leader --controller-class=k8s.io/ingress-nginx

[screenshot]

I'd rather not totally disable ssl-session-cache, since we have a heavy load and that could drastically impact CPU.
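(For reference, the shared:SSL:10m zone in the output above corresponds to the ssl-session-cache-size ConfigMap key; below is a sketch, in the same Helm values form used earlier in this thread, of shrinking the cache rather than disabling it outright. It assumes the documented keys and makes no claim that it avoids the leak.)

controller:
  config:
    ssl-session-cache: "true"
    # Keep TLS session resumption but cap the shared zone below the 10m default;
    # purely illustrative, not a verified mitigation.
    ssl-session-cache-size: "4m"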

@LoonyRules
Author

Thought I would provide an update seeing as it's been a while.

The issue of memory usage spiking and then crashing once no memory is left available seems to be resolved since our upgrade to 1.0.4; however, there's no guarantee this version fixed the issue directly. We perhaps just didn't run older versions (e.g. 1.0.2) long enough to find out whether it was still an issue in those. We will continue to keep as up to date as possible with the releases and the issues outlined in the conversations here, to see if there's any kind of regression in future updates.

We do still have an issue where some connections are dropped from time to time, specifically because keep-alive packets reach the websocket server far too late, causing us to clean up those connections and force-close them. We are not sure which piece of software is at fault (ingress-nginx or the Digital Ocean Load Balancer), because when the issue occurs only a handful of connections from each nginx pod and load balancer get dropped. So it's not a case of all connections on a specific ingress-nginx pod or load balancer being dropped; it's a random handful that is almost always evenly split. If 75 people disconnect, it's typically 25 from each load balancer and ingress-nginx pod. We are going to add more statistics, such as geo data, to see whether it could be down to internet routing or tunnelling for specific users, and see where that leads us next -- but I think it is safe to say it is unlikely to be an nginx issue.

-- Due to this, I am willing to close this issue as "potentially resolved", but would love others' input, as the base issue could be affecting other users on a much wider and more frequent basis.

@hicaistar

@rikatz we've tried setting 'ssl-session-cache' to false in the ConfigMap of version 0.44.0 and running it for a few days, and we have no memory issue now (although with higher CPU usage). So I think it can be concluded that something in the SSL session cache is causing this issue.

I've hit the same issue; does 'ssl-session-cache'=false still work?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on May 11, 2022
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label on Jun 10, 2022
@LoonyRules
Author

I'm going to save the triage robot some time and close this issue now, as I haven't come across this problem since we updated to 1.0.4, as mentioned above. If anyone has issues in the future, please open your own issue and link back to this one if you feel it's the same problem.
