Sudden high memory usage. #7747
Comments
@LoonyRules: This issue is currently awaiting triage. If Ingress contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
I just stumbled across #7647 and I feel like this could be related, given the graphs shared there. I will review our graphs and wait for the bug to arise again; once it does, I'll do more memory-tool digging to see if the same trend can be found in both of our issues. We also use SSL for all of our ingress objects.
Can you test with v1.0.3, which was released today? The base image was updated via #7647 (comment).
Upgrade finished just now; I'll let you know of the results, hopefully within the next 7 days. 🤞
@strongjz But in v1.0.3 the nginx OpenSSL dynamic link is still Alpine's OpenSSL, not OpenResty's OpenSSL library.
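For anyone who wants to verify this themselves, here is a rough way to check which OpenSSL the controller's nginx binary is dynamically linked against. The namespace and label selector below are assumptions based on the Helm install shown later in this issue; adjust them to your own deployment.

```shell
# Pick one controller pod (namespace/labels assumed from the standard Helm chart).
POD=$(kubectl -n ingress-controller get pods \
  -l app.kubernetes.io/component=controller \
  -o jsonpath='{.items[0].metadata.name}')

# List the shared libraries the nginx binary resolves and keep the SSL/crypto ones.
kubectl -n ingress-controller exec "$POD" -- \
  sh -c 'ldd "$(command -v nginx)" | grep -Ei "ssl|crypto"'
```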
We didn't merge the one with OpenResty. We've found this can be related to the builtin ssl_session_cache, so a new image with that disabled by default will be released. The next step/release will be to add the OpenResty patches :)
At this moment in time, in the last 7 days we have not had the issue mentioned in the initial post, where memory usage suddenly starts spiking over a span of a few hours until it reaches 1.4GB, causing the pod to crash with an OOM error. After reading both of your comments, I am more concerned and confused as to why this issue has suddenly stopped if the suspected cause has not been resolved. It would be typical if the issue were to show itself again right after I post this comment, and I secretly hope that is the case.

However, since the upgrade to v1.0.2 and v1.0.3 we have noticed that our Web Socket Servers (WSS) have been terminating connections because the Web Socket Clients (WSC) are not responding to keep-alive packets in time. The client gets 60 seconds to respond before the server closes the connection itself with a custom status code. This, to me, indicates that the nginx ingress controller has stopped forwarding the packets to the service, making the WSS believe that every user (~15,000 at peak) has a dead connection, so it terminates them. This then causes all 15,000 connections to reconnect and puts a decent load on the nginx pods. I have checked the nginx ingress pod logs when this occurs and there are absolutely zero error or warning logs printed by the process itself.

Note that our 0.95 and 0.99 latency percentiles constantly spike to 10 seconds at least once an hour regardless of the load on our nginx ingress pods, which does not feel normal to me either - I would expect this when 15,000 people are reconnecting, and it does happen then. When this issue occurs, it typically happens 3 times within an hour or so: once for each ingress pod we have. Memory and CPU usage typically sit around ~500MB and ~0.2 vCPU, so we should have plenty of resources available.

Not sure what to do about this - shall I create another issue here on GitHub, as this could be a separate problem? Does anyone have any idea of what is happening here, or a way to look into this more in depth?
@rikatz we've tried setting 'ssl-session-cache' to false in the ConfigMap of version 0.44.0 and running it for a few days, and we have no memory issue now (although with higher CPU usage). So I think it can be concluded that there's something in the SSL session cache that makes this issue happen.
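For reference, a minimal sketch of how that could be applied on a Helm-based install like the one in this issue. The ConfigMap and namespace names are assumptions taken from the helm list output later in the thread; `ssl-session-cache` is a documented ingress-nginx ConfigMap key.

```shell
# Turn off the built-in SSL session cache via the controller ConfigMap.
# Resource names assume a Helm release "ingress-nginx" in namespace "ingress-controller".
kubectl -n ingress-controller patch configmap ingress-nginx-controller \
  --type merge -p '{"data":{"ssl-session-cache":"false"}}'

# The controller picks up ConfigMap changes and reloads nginx; confirm the rendered config:
kubectl -n ingress-controller exec deploy/ingress-nginx-controller -- \
  grep ssl_session_cache /etc/nginx/nginx.conf
```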
Upgrading to https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.4 to see if this has any effect on our initial issue. As for the connection drops mentioned in #7747 (comment), I still have no clue why they're happening; we have even increased the proxy-connect-timeout and proxy-read-timeout values to rule out the default 60s values causing the issue (see the sketch below). I'm not sure what else to look at in that regard.
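For completeness, the timeout bump described above would look roughly like this as global ConfigMap options. The values are illustrative; `proxy-connect-timeout`, `proxy-read-timeout` and `proxy-send-timeout` are standard ingress-nginx keys, while the resource names are the same assumptions as earlier.

```shell
# Raise the global proxy timeouts (seconds). nginx caps the useful connect
# timeout at ~75s, so only the read/send values really help long-lived websockets.
kubectl -n ingress-controller patch configmap ingress-nginx-controller \
  --type merge \
  -p '{"data":{"proxy-connect-timeout":"75","proxy-read-timeout":"3600","proxy-send-timeout":"3600"}}'
```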
I've upgraded to v1.0.4 and the issue persists.
I'd rather not totally disable ssl-session-cache since we have a heavy load, and that could drastically impact CPU.
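A middle ground, instead of disabling the cache entirely, would be shrinking it and its lifetime via the `ssl-session-cache-size` and `ssl-session-timeout` ConfigMap options. A sketch with illustrative values, resource names assumed as above:

```shell
# Keep the shared SSL session cache but make it smaller and shorter-lived.
kubectl -n ingress-controller patch configmap ingress-nginx-controller \
  --type merge \
  -p '{"data":{"ssl-session-cache-size":"5m","ssl-session-timeout":"10m"}}'
```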
Thought I would provide an update seeing as it's been a while.

The issue of memory usage spiking and then crashing once no memory is left seems to be resolved since our upgrade to 1.0.4; however, there's no guarantee this version fixed it directly. We perhaps just didn't run older versions (e.g. 1.0.2) long enough to find out whether it was still an issue in those. We will continue to keep as up to date as possible with the releases and issues mentioned in this conversation, to see if there's any kind of regression in future updates.

We do still have an issue where some connections are dropped from time to time, specifically because keep-alive packets reach the websocket server far too late, causing us to clean up those connections and force-close them. We are not sure which piece of software is at fault (ingress-nginx or the DigitalOcean Load Balancer), because when the issue occurs, a handful of connections from each nginx pod and each load balancer get dropped. So it's not a case where all connections on a specific ingress-nginx pod or load balancer are dropped; it's a random handful that is almost always split evenly. If 75 people disconnect, it's typically 25 from each load balancer and ingress-nginx pod. We are going to add some more statistics, like geo data, to see if it could be down to internet routing or tunnelling for specific users, and see where that leads us next - but I think it is safe to say this is unlikely to be an nginx issue.

Because of this, I am willing to close this issue as "Potentially Resolved", but I would love input from others, as the base issue could be affecting other users on a much wider and more frequent basis.
I've hit the same issue. Does setting 'ssl-session-cache' to false still work?
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle stale
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:
- Mark this issue or PR as fresh with /remove-lifecycle rotten
- Close this issue or PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten
I'm going to save the triage robot some time and close this issue now, as I haven't come across the problem since we updated to 1.0.4, as mentioned here. If anyone hits this in the future, please open your own issue and link back to this one if you feel it's the same problem.
NGINX Ingress controller version: Chart Revision: v4.0.3 | Chart App Version: 1.0.2 | Nginx version: 1.19.9
Kubernetes version: 1.20.2-do.0
Environment:
Cloud provider or hardware configuration: Digital Ocean Managed Kubernetes
OS (e.g. from /etc/os-release): Pod runs Alpine Linux v3.14.2
Kernel (e.g. uname -a): 4.19.0-11-amd64
Install tools: N/A
Other: N/A
How was the ingress-nginx-controller installed:
NAME: ingress-nginx | NAMESPACE: ingress-controller | REVISION: 2 | UPDATED: 2021-10-01 09:12:17.6727727 +0100 BST | STATUS: deployed | CHART: ingress-nginx-4.0.3 | APP VERSION: 1.0.2
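The version bumps mentioned throughout this thread (1.0.2 → 1.0.3 → 1.0.4) would be applied with a Helm upgrade along these lines, assuming the usual `ingress-nginx` repo alias and the release/namespace shown above; check the chart's appVersion to confirm which controller it ships.

```shell
# Upgrade the existing release in place, keeping the current values.
helm repo update
helm upgrade ingress-nginx ingress-nginx/ingress-nginx \
  --namespace ingress-controller \
  --version 4.0.6 \
  --reuse-values

# Chart 4.0.6 shipped controller v1.0.4 at the time of this thread; verify with:
helm show chart ingress-nginx/ingress-nginx --version 4.0.6 | grep appVersion
```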
What happened:
Nginx ingress pods suddenly go from ~500MB of RAM usage on average (sustained for multiple days at a time) to slowly climbing to 1.4GB over the span of a few hours, until the pod is SystemOOM-killed due to a lack of available memory on the node it was assigned to.
Our connections and other metrics do not spike out of the ordinary during this time; it is simply memory usage suddenly climbing and then dropping. There's no pattern to the crashing: we can go days without the issue and then it happens all of a sudden, typically after 2-9 days. Once the issue arises, we can expect the other controllers to follow suit, typically an hour or two after one another.
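One low-effort way to capture the slow climb so it survives the OOM kill is to poll metrics-server and append to a file. A rough sketch, assuming metrics-server is available and the same namespace/labels as elsewhere in this issue:

```shell
# Record per-pod controller memory/CPU once a minute.
while true; do
  date -u +"%Y-%m-%dT%H:%M:%SZ"
  kubectl -n ingress-controller top pods -l app.kubernetes.io/component=controller
  sleep 60
done >> ingress-nginx-usage.log
```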
Our pod logs are spammed with request logs, so we are unable to see whether nginx itself is logging any errors. When we view our pod logs we usually only see the last minute of output, sometimes only the last 40 seconds if we manage to perfectly time saving the logs when the crash happens. This makes it hard to know whether nginx throws errors during the spike and the crash. Any recommendations for making this easier would be highly appreciated.
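One option that doesn't require changing the logging setup is to filter the nginx error levels out of the access-log noise, and to pull the previous container's logs after the OOM kill. A sketch, with the same assumed namespace and deployment names as above:

```shell
# Keep only nginx error-level lines from the last hour (one pod of the deployment).
kubectl -n ingress-controller logs deploy/ingress-nginx-controller --since=1h \
  | grep -E '\[(emerg|alert|crit|error|warn)\]'

# After a crash, the previous container's output is still retrievable.
kubectl -n ingress-controller logs deploy/ingress-nginx-controller --previous \
  | grep -E '\[(emerg|alert|crit|error)\]'
```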
Just in case this was a core dump issue (#6896), we upgraded to v1.0.2 (https://github.com/kubernetes/ingress-nginx/releases/tag/controller-v1.0.2). The issue then showed itself within 32 hours of the deploy, when we were at 15,000 concurrent open web sockets across 3 controller pods. We typically do not see the issue on a weekend, where our peak is 15,000; it usually shows up around the 10,000 mark, which is a typical mid-week peak.
Controller 1 that died:
Controller 2 & 3 that started following the same patterns:
Note: This has been discussed with @strongjz in the Kubernetes #ingress-nginx-users channel on Slack: https://kubernetes.slack.com/archives/CANQGM8BA/p1632951733103200
What you expected to happen:
Memory not to randomly spike and crash, regardless of whether we have 6,000, 10,000, or 25,000 connections.
How to reproduce it:
I have no idea how we are even triggering the issue, so sadly I am unable to give exact reproduction steps.
/kind bug