-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Resource quota enforcement webhook #544
Resource quota enforcement webhook #544
Conversation
@khogeland can you rebase this PR to the latest master to pick up the merged webhook changes? |
} | ||
|
||
// TODO: There appears to be a deadlock in cache.WaitForCacheSync. Possibly related? https://github.com/kubernetes/kubernetes/issues/71450 | ||
// For now, return immediately. There will be a short window after startup where quota calcuation is incorrect. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll try to hunt this down and file an issue against k8s (or determine if it's been fixed in a later version).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please also delete the binary spark-on-k8s-operator
.
* Cert configuration and reloading * Add support for strict webhook error handling * Improve webhook error handling * Don't deregister the webhook when failure policy is strict * standard error message capitalization * have the webhook parse its own configuration from flags * clean up cert provider code * Add explanation for skipping deregistration * Resource Quota enforcement webhook * Fix bad merge * Cleanup, fixes * Cleanup * Document the quota enforcer
* Cert configuration and reloading * Add support for strict webhook error handling * Improve webhook error handling * Don't deregister the webhook when failure policy is strict * standard error message capitalization * have the webhook parse its own configuration from flags * clean up cert provider code * Add explanation for skipping deregistration * Resource Quota enforcement webhook * Fix bad merge * Cleanup, fixes * Cleanup * Document the quota enforcer
Presented for comment 👀 Will update with documentation. I've based this PR on one of my other PR branches that this depends on, so here is the isolated diff: khogeland/spark-on-k8s-operator@webhook-enhancements...khogeland:resource-quota-enforcement
This is a validating webhook for enforcing Resource Quotas at the SparkApplication level. This serves two main purposes: preventing deadlocks (see this JIRA item) and providing early backpressure to users who are hitting their quota limits.
It operates in basically the same way that the k8s pod-level quota controller does, using an informer to keep a running tally of (expected) resource usage in the background, with the validating webhook comparing the requested resources to the most recent usage. It calculates the expected usage for a given SparkApplication using the same logic that Spark uses internally, and also counts non-Spark pods in each namespace to allow for simple mixed workloads.
FWIW, this has been tested in several high-throughput clusters (although is unfortunately disabled due to a bug in the k8s version we're on).