Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Resource quota enforcement webhook #544

Merged
merged 22 commits into from
Aug 28, 2019

Conversation

khogeland
Copy link
Contributor

@khogeland khogeland commented Jul 4, 2019

Presented for comment 👀 Will update with documentation. I've based this PR on one of my other PR branches that this depends on, so here is the isolated diff: khogeland/spark-on-k8s-operator@webhook-enhancements...khogeland:resource-quota-enforcement

This is a validating webhook for enforcing Resource Quotas at the SparkApplication level. This serves two main purposes: preventing deadlocks (see this JIRA item) and providing early backpressure to users who are hitting their quota limits.

It operates in basically the same way that the k8s pod-level quota controller does, using an informer to keep a running tally of (expected) resource usage in the background, with the validating webhook comparing the requested resources to the most recent usage. It calculates the expected usage for a given SparkApplication using the same logic that Spark uses internally, and also counts non-Spark pods in each namespace to allow for simple mixed workloads.

FWIW, this has been tested in several high-throughput clusters (although is unfortunately disabled due to a bug in the k8s version we're on).

@liyinan926
Copy link
Collaborator

@khogeland can you rebase this PR to the latest master to pick up the merged webhook changes?

pkg/webhook/resourceusage/enforcer.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/enforcer.go Show resolved Hide resolved
pkg/webhook/resourceusage/enforcer.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/enforcer.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/handlers.go Show resolved Hide resolved
pkg/webhook/resourceusage/util.go Show resolved Hide resolved
pkg/webhook/resourceusage/watcher.go Outdated Show resolved Hide resolved
pkg/webhook/webhook.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/watcher.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/handlers.go Outdated Show resolved Hide resolved
}

// TODO: There appears to be a deadlock in cache.WaitForCacheSync. Possibly related? https://github.com/kubernetes/kubernetes/issues/71450
// For now, return immediately. There will be a short window after startup where quota calcuation is incorrect.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try to hunt this down and file an issue against k8s (or determine if it's been fixed in a later version).

Copy link
Collaborator

@liyinan926 liyinan926 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also delete the binary spark-on-k8s-operator.

main.go Show resolved Hide resolved
pkg/webhook/resourceusage/handlers.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/handlers.go Outdated Show resolved Hide resolved
pkg/webhook/resourceusage/util.go Outdated Show resolved Hide resolved
pkg/webhook/webhook.go Outdated Show resolved Hide resolved
pkg/webhook/webhook.go Outdated Show resolved Hide resolved
pkg/webhook/webhook.go Outdated Show resolved Hide resolved
pkg/webhook/webhook.go Outdated Show resolved Hide resolved
main.go Show resolved Hide resolved
@liyinan926 liyinan926 merged commit edcf4cd into kubeflow:master Aug 28, 2019
akhurana001 pushed a commit to lyft/spark-on-k8s-operator that referenced this pull request May 5, 2020
* Cert configuration and reloading

* Add support for strict webhook error handling

* Improve webhook error handling

* Don't deregister the webhook when failure policy is strict

* standard error message capitalization

* have the webhook parse its own configuration from flags

* clean up cert provider code

* Add explanation for skipping deregistration

* Resource Quota enforcement webhook

* Fix bad merge

* Cleanup, fixes

* Cleanup

* Document the quota enforcer
jbhalodia-slack pushed a commit to jbhalodia-slack/spark-operator that referenced this pull request Oct 4, 2024
* Cert configuration and reloading

* Add support for strict webhook error handling

* Improve webhook error handling

* Don't deregister the webhook when failure policy is strict

* standard error message capitalization

* have the webhook parse its own configuration from flags

* clean up cert provider code

* Add explanation for skipping deregistration

* Resource Quota enforcement webhook

* Fix bad merge

* Cleanup, fixes

* Cleanup

* Document the quota enforcer
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants