Adjust default memory/cpu requests and limits #115
Conversation
There's been evidence that the Cloud Controller API and registry-buddy components have been getting OOMKilled under typical usage in production environments. This change adjusts the resource limits to be more in line with what our monit limits historically were in the capi-release BOSH release (e.g. https://github.com/cloudfoundry/capi-release/blob/6f9b4d5ec3ba46adf39d7de2d60e1b58a4511784/jobs/cloud_controller_ng/spec#L893). Some memory requests were adjusted up and down to better account for typical/minimum usage of these components.

This change also adjusts some of the CPU limits to be more generous for critical processes such as nginx and the cf-api-controllers reconciler.

Overall there is a +210Mi increase in memory requests (only +10Mi if you ignore the short-lived ccdb migration Job). If operators want to tweak these values further, they can follow the approach in our vertical scaling docs.

[#177468691](https://www.pivotaltracker.com/story/show/177468691)

Authored-by: Tim Downey <tdowney@vmware.com>
```diff
@@ -40,10 +40,10 @@ spec:
       resources:
         requests:
           cpu: 500m
-          memory: 300Mi
+          memory: 500Mi
```
In an environment with no load, Cloud Controller uses ~400Mi of RAM. For decent quality of service in light environments it should have at least 500Mi. In practice I've seen it reach a steady state of ~3.5Gi of RAM on CF for VMs environments, with occasional surges past 4Gi, but these were typically anomalous cases in very large scale environments.
Do we have any recommendations somewhere about increasing requests for larger environments where we can mention the ~3.5GB need?
We vaguely hint at frequent restarts being due to "memory pressure" in this doc: https://docs.cloudfoundry.org/running/managing-cf/scaling-cloud-controller-k8s.html#cf-api-server

AFAIK there isn't any "production cluster sizing" guidance apart from what's mentioned in https://github.com/cloudfoundry/cf-for-k8s/blob/819bc1c981153df6801d07e747b6ef4809189d24/docs/platform_operators/scaling.md#scaling-cf-for-k8s.

It's also a pretty hand-wavy number, based on what I've seen in typical CF for VMs environments with 4GB (swap enabled) VMs given to the Cloud Controllers. It's possible the number might bloat higher on a larger VM, or that the usage patterns on CF for K8s are different enough that it doesn't reach those levels often.
Does anyone know what the minimum node memory size we recommend is? As long as it's 8-16GB I feel like there should be enough headroom for this to be OK.
Ahh neat, okay, the docs are accessible from cf-for-k8s 👍. According to our deploy docs, we require a minimum of 15GB of memory (https://github.com/cloudfoundry/cf-for-k8s/blob/819bc1c981153df6801d07e747b6ef4809189d24/docs/deploy.md).
```diff
@@ -68,10 +68,10 @@ spec:
       resources:
         requests:
           cpu: 100m
-          memory: 300Mi
+          memory: 250Mi
```
The local worker doesn't do much in cf-for-k8s. On CF for VMs it is ferrying large files to the blobstore, but in cf-for-k8s it's just moving a file on disk to a mount that the registry-buddy has access to. I've lowered both the limit and the requests to account for this.
```diff
+          cpu: 1500m
+          memory: 1Gi
```
I think this CPU limit must have been a typo before. I've adjusted it to a more reasonable multi-core limit so that it can get some concurrency if it needs to. I've also adjusted the memory limit to 1Gi.

The registry-buddy converts a zip file containing package source code into a tar file/single-layer OCI image and sends it to a container registry. For apps with large individual files (e.g. a Java app with jars) we've seen memory surge past the 150Mi this had originally, and the `cf-api-server` Pod would get OOMKilled. I'm considering 2Gi to account for the default max package size, though I do not think it should actually use that much memory based on my understanding of the code:

```go
func WriteZipToTar(tw TarWriter, srcZip, basePath string, uid, gid int, mode int64, normalizeModTime bool, fileFilter func(string) bool) error {
```

Thoughts on this?
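For intuition, here's a minimal sketch of what a streaming zip-to-tar copy looks like (my own illustration using Go's standard library, not the actual registry-buddy code; the function name is made up). Each entry is piped through `io.Copy`, which reuses a small internal buffer, so peak memory shouldn't grow with the size of any individual file:

```go
package sketch

import (
	"archive/tar"
	"archive/zip"
	"io"
)

// writeZipToTar streams each zip entry into the tar writer without ever
// holding an entire file in memory at once.
func writeZipToTar(tw *tar.Writer, srcZip string) error {
	zr, err := zip.OpenReader(srcZip)
	if err != nil {
		return err
	}
	defer zr.Close()

	for _, f := range zr.File {
		hdr := &tar.Header{
			Name: f.Name,
			Mode: int64(f.Mode().Perm()),
			Size: int64(f.UncompressedSize64),
		}
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		rc, err := f.Open()
		if err != nil {
			return err
		}
		// io.Copy moves data in small (~32KB) chunks, so memory stays
		// roughly flat even for large entries like fat jars.
		if _, err := io.Copy(tw, rc); err != nil {
			rc.Close()
			return err
		}
		rc.Close()
	}
	return nil
}
```

If the real code follows this shape, the 2Gi limit would mostly be headroom for the Go runtime and compression buffers rather than for the package contents themselves.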
```diff
        limits:
-          cpu: 500m
+          cpu: 1000m
```
Increased this since nginx is in the critical path for API server requests. It's a concurrent web server so this could technically be increased further if we'd like.
```diff
@@ -29,7 +29,7 @@ spec:
          memory: 300Mi
        limits:
          cpu: 1000m
-          memory: 1Gi
+          memory: 4Gi
```
I think it's unlikely that the database migrations use more than 4Gi -- my gut tells me it would probably be ~2Gi max on a large env -- but it would be awful if migrations got OOMKilled. Increased this to 4Gi since that is around what it would use on BOSH (kinda -- monit would have allowed it to temporarily burst higher for a few cycles).
```diff
        limits:
          cpu: 1000m
-          memory: 1Gi
+          memory: 4Gi
```
The Clock performs a periodic bulk sync of apps and CF tasks that are running on the platform. While doing this it loads information about all the running LRPs into memory to compare against what's in CCDB. This can be pretty memory intensive on large environments so we should be more generous here.
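As a rough illustration of why this pattern is memory-hungry (hypothetical Go; the actual Clock lives in Cloud Controller, and the names here are invented), a "materialize everything, then diff" sync holds both full snapshots in memory at once:

```go
package sketch

// LRP is a stand-in for the per-workload state the sync compares.
type LRP struct {
	ProcessGUID string
	Instances   int
}

// bulkSync loads every running LRP and every desired record before diffing,
// so peak memory grows linearly with the number of workloads on the platform.
func bulkSync(fetchRunning, fetchDesired func() []LRP) (missing, extra []LRP) {
	running := fetchRunning() // full snapshot of what's running
	desired := fetchDesired() // full snapshot of what CCDB says should run

	byGUID := make(map[string]LRP, len(running))
	for _, l := range running {
		byGUID[l.ProcessGUID] = l
	}
	for _, d := range desired {
		if _, ok := byGUID[d.ProcessGUID]; ok {
			delete(byGUID, d.ProcessGUID) // matched; nothing to do
		} else {
			missing = append(missing, d) // desired but not running
		}
	}
	for _, l := range byGUID {
		extra = append(extra, l) // running but no longer desired
	}
	return missing, extra
}
```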
```diff
@@ -43,11 +43,11 @@ spec:
          value: #@ data.values.workloads_namespace
        resources:
          limits:
-          cpu: 100m
-          memory: 200Mi
+          cpu: 1000m
```
This controller is responsible for staging (it watches kpack builds/images) and some route reconciliation. We should definitely let it use at least a full CPU, if not more. I think this must have been a typo for it to be so low.
```diff
-          cpu: 100m
-          memory: 200Mi
+          cpu: 1000m
+          memory: 1Gi
```
The SharedInformer cache for this controller needs memory to be able to load information about all the CRDs it's watching. For larger environments with a lot of routes and/or kpack builds/images, it's very likely to get OOMKilled with a 200Mi limit.
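For context, here's a minimal client-go sketch of why informer-backed controllers scale memory with watched-object count (illustrative only; cf-api-controllers watches its own CRDs, whereas this example uses Pods). The shared informer keeps a deserialized copy of every matched object in its in-memory store for the life of the process:

```go
package main

import (
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Every object matched by the watch is cached locally; more routes
	// and kpack builds/images means a proportionally larger resident set.
	factory := informers.NewSharedInformerFactory(client, 30*time.Second)
	informer := factory.Core().V1().Pods().Informer()

	stop := make(chan struct{})
	defer close(stop)
	factory.Start(stop)
	factory.WaitForCacheSync(stop)

	_ = informer // a real controller registers event handlers here
}
```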
I added some comments explaining some of the choices I made when tweaking the numbers. There are some containers that may benefit from being given even higher CPU limits, so I left some comments for discussion on those as well.
Seems reasonable to me. I don't have any comments about the code you linked because I wasn't really seeing how it says anything about size.
The registry-buddy code I linked is only relevant because it is reading (potentially large) files into memory and writing them into a tar file. My understanding is that it should be a buffered copy, though, and shouldn't actually use that much memory. Maybe https://github.com/cloudfoundry/capi-k8s-release/blob/5878960c0ad229fbd5832c728e3addf686e2f18f/src/registry-buddy/package_upload/package_upload.go is the culprit. I'm not that familiar with how that code works, though.

EDIT: Maybe that's the source of the registry-buddy's mysterious memory usage. 🤔 💭
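To make the suspicion concrete, here's a hypothetical before/after sketch (my own illustration, not the actual package_upload.go): if the package gets read fully into a byte slice before being handed to the registry client, peak memory tracks package size; streaming from the file keeps it flat:

```go
package sketch

import (
	"bytes"
	"io"
	"os"
)

// Suspected pattern: the entire package ends up resident in memory.
func wholeFileReader(path string) (io.Reader, error) {
	data, err := os.ReadFile(path) // allocates len(file) bytes at once
	if err != nil {
		return nil, err
	}
	return bytes.NewReader(data), nil
}

// Streaming alternative: the consumer pulls small chunks straight from disk,
// so memory use is independent of package size.
func streamPackage(path string, dst io.Writer) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(dst, f)
	return err
}
```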
Oh interesting! I wonder why the conversion between zip/tar needs to be done… I guess the question has more to do with compression vs. archiving, though. UPDATE: […]
Yeah, for more context there around why it is a `tar`: OCI image layers are tar archives, so the zip package has to be repacked before it can be pushed. I think potentially in this code we could have written the […]. It looks like in cf-for-k8s the max package size is hardcoded to be ~1GB, so I think this limit should be fine. We've got one local worker container, so the […].