After concurrently creating 100 PVCs, external-provisioner crashes #322
cc @jsafrane do we really need the default worker thread count to be so high?
100 was requested by @saad-ali in https://docs.google.com/document/d/1wyq_9-EFsr7U90JMYXOHxoJlChwWXJqWsah3ctmuDDo/edit?disco=AAAACevk7-4. We do not need 100, but then we need to come up with another number. @GongWilliam, what CSI driver do you use? Can you check why the provisioner crashed? I created 500 PVCs with the mock driver and got them provisioned (with a lot of API server throttling) in ~4 minutes without any issues. It was a VM with 4 CPU cores.
The crash happens because the provisioner cannot acquire the lease; it then quits and waits for the kubelet to restart it.
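For context, here is a minimal sketch of how client-go leader election behaves when the lease cannot be renewed. The lock name, namespace, and timings below are made up for illustration, and this is not the exact external-provisioner wiring; the relevant point is the OnStoppedLeading callback, which is conventionally used to exit the process so the kubelet restarts the container.

```go
// Minimal sketch of client-go leader election; lock name/namespace/timings
// are illustrative, not the external-provisioner's actual configuration.
package main

import (
	"context"
	"log"
	"os"
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		log.Fatal(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	id, _ := os.Hostname()
	lock, err := resourcelock.New(resourcelock.LeasesResourceLock,
		"default", "external-provisioner-demo",
		client.CoreV1(), client.CoordinationV1(),
		resourcelock.ResourceLockConfig{Identity: id})
	if err != nil {
		log.Fatal(err)
	}

	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 15 * time.Second,
		RenewDeadline: 10 * time.Second,
		RetryPeriod:   2 * time.Second,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// The provisioning controller loop would start here.
				<-ctx.Done()
			},
			// If the lease cannot be renewed (for example because the renew
			// goroutine is starved by busy workers), the process exits and the
			// kubelet restarts the container, which looks like a "crash".
			OnStoppedLeading: func() {
				log.Fatal("lost leader election lease, exiting")
			},
		},
	})
}
```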
I also encountered the same problem: after concurrently creating 50 PVCs, the external-provisioner crashed.
Perhaps it happens when too many threads read/write the ConfigMap concurrently.
Is there any stacktrace of the crash?
logs:
It sounds like the default of 100 threads is too much for machines with fewer cores. Capturing a CPU/memory profile during a stress test would be useful for comparing different values. I think lowering the default is reasonable, but we will need to wait until we're ready for a 2.0, since changing the default is a breaking change. For now, you can set the --worker-threads flag lower as needed.
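For readers wondering what the worker-thread count actually governs: each worker is, roughly, a goroutine pulling claim keys off a shared work queue, so 100 workers can mean up to 100 concurrent provision operations plus the API traffic that goes with them. The following is a hedged sketch of that pattern using client-go's workqueue; it is an illustration of the pattern, not the actual external-provisioner code.

```go
// Hedged sketch: N worker goroutines draining a shared work queue, roughly the
// pattern that a --worker-threads style setting bounds. Illustration only.
package main

import (
	"fmt"
	"sync"
	"time"

	"k8s.io/client-go/util/workqueue"
)

func main() {
	const workerThreads = 4 // e.g. the value the reporter lowered the flag to

	queue := workqueue.New()
	for i := 0; i < 100; i++ {
		queue.Add(fmt.Sprintf("pvc-%d", i)) // stand-ins for PVC keys
	}

	var wg sync.WaitGroup
	for w := 0; w < workerThreads; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for {
				key, shutdown := queue.Get()
				if shutdown {
					return
				}
				// A real worker would call CSI CreateVolume and create the PV
				// here; each in-flight item also means API-server requests.
				time.Sleep(10 * time.Millisecond)
				queue.Done(key)
			}
		}()
	}

	// Let the workers drain the queue, then shut it down so Get unblocks.
	for queue.Len() > 0 {
		time.Sleep(50 * time.Millisecond)
	}
	queue.ShutDown()
	wg.Wait()
}
```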
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/remove-lifecycle rotten
Need help running some performance benchmarking to determine what a good default number would be.
@msau42: Please ensure the request meets the requirements listed here. If this request no longer meets these requirements, the label can be removed. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
The safest would probably be to set the default to some multiple of the CPU count. Without priorities on goroutines, you always risk the lease routine not getting scheduled and thus losing the leader election.
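A hedged sketch of what that suggestion could look like; defaultWorkerThreads is a hypothetical helper, not an existing function or flag behavior in the sidecar, and the multiplier and floor values are invented for illustration.

```go
// Hypothetical defaulting helper: derive the worker count from the available
// CPUs unless the flag is set explicitly. Illustration only.
package main

import (
	"flag"
	"fmt"
	"runtime"
)

func defaultWorkerThreads(flagValue, perCPU, floor int) int {
	if flagValue > 0 {
		return flagValue // an explicit --worker-threads value wins
	}
	n := runtime.NumCPU() * perCPU
	if n < floor {
		n = floor
	}
	return n
}

func main() {
	workerThreads := flag.Int("worker-threads", 0, "number of provisioning workers (0 = auto)")
	flag.Parse()
	fmt.Println("using", defaultWorkerThreads(*workerThreads, 10, 4), "worker threads")
}
```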
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
I attempted to repro this issue several different ways, but without success.
This was benchmarked by creating and deleting 5000 PVCs using the external-provisioner and mock-driver on a nodepool of GCE g1-smalls (1 vCPU). Creating the PVCs took ~2m, and creating all PVs took an additional ~20+ minutes, but during this time I only ever saw logs for acquiring the lease:
but never for the lease expiring. There were also no pod restarts.
@chrishenzie: You can't close an active issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Thanks for investigating! It seems we can't repro the issue on a 1-CPU machine, and separating the leader-election client from the provisioning Kubernetes client should help avoid API throttling, which could be one avenue of starvation. /close
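The client split mentioned above can be pictured as building two clientsets from copies of the same rest.Config, so lease renewals get their own client-side QPS/Burst budget and cannot be throttled behind provisioning traffic. A rough sketch under that assumption follows; the QPS/Burst numbers are illustrative, not the sidecar's actual defaults.

```go
// Sketch: two clientsets from copies of the same rest.Config, giving the
// leader-election client its own rate-limit budget, independent of the much
// busier provisioning client. Values are illustrative.
package main

import (
	"log"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func buildClients() (provisioning, leaderElection *kubernetes.Clientset, err error) {
	base, err := rest.InClusterConfig()
	if err != nil {
		return nil, nil, err
	}

	provCfg := rest.CopyConfig(base)
	provCfg.QPS = 100 // provisioning generates the bulk of the API traffic
	provCfg.Burst = 200

	leCfg := rest.CopyConfig(base)
	leCfg.QPS = 5 // lease renewals are tiny but must not wait behind provisioning
	leCfg.Burst = 10

	provisioning, err = kubernetes.NewForConfig(provCfg)
	if err != nil {
		return nil, nil, err
	}
	leaderElection, err = kubernetes.NewForConfig(leCfg)
	if err != nil {
		return nil, nil, err
	}
	return provisioning, leaderElection, nil
}

func main() {
	if _, _, err := buildClients(); err != nil {
		log.Fatal(err)
	}
}
```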
@msau42: Closing this issue. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Check my large-scale test result; I got a relatively accurate memory limit for
After concurrently creating 100 PVCs, the external-provisioner crashes.
The sidecar uses leader election (lease).
I changed the worker-thread number to 4, and it works!
So the question is: is my CPU too weak, or is the default number of worker threads too big? Why is 100 the default number of worker threads?