-
Notifications
You must be signed in to change notification settings - Fork 999
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
docs: RFC for disrupted EBS-Backed StatefulSet delays #6336
Conversation
✅ Deploy Preview for karpenter-docs-prod canceled.
|
Pull Request Test Coverage Report for Build 9423842569Details
💛 - Coveralls |
9fd61ec
to
955acf6
Compare
Thank you @FernandoMiguel for the edits! |
Pull Request Test Coverage Report for Build 9488442511Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9488799347Details
💛 - Coveralls |
9b64f29
to
9ca551c
Compare
Pull Request Test Coverage Report for Build 9489219507Details
💛 - Coveralls |
9ca551c
to
267e74c
Compare
Pull Request Test Coverage Report for Build 9500606241Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9572946601Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
94d8dbb
to
46b0252
Compare
Pull Request Test Coverage Report for Build 9573214398Details
💛 - Coveralls |
designs/statefulset-disruption.md
Outdated
|
||
**Problem B. If step 3 doesn't happen before step 4, there will be a 1+ minute delay** | ||
|
||
If Karpenter calls EC2 TerminateInstance **before** EC2 DetachVolume calls from EBS CSI Driver Controller pod finish, then the volumes won't be detached **until the old instance terminates**.This delay depends on how long it takes the underlying instance to enter the `terminated` state, which depends on the instance type. Typically 1 minute for `m5a.large`, up to 15 minutes for certain Metals/GPU/Windows instances. See [appendix D1](#d1-ec2-termination--ec2-detachvolume-relationship-) for more context. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mention "From my brief experiments, Typically 1 minute for 5ma.large..."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variability is really unfortunate. I wonder if you see more consistent behavior with ec2:StopInstances
? I assume you can detach a volume when an instance is stopped? If so, maybe stopping prior to termination would at least put a more reasonable upper bound on the delay in this case
edit: I see in the appendix:
The guest OS can take a long time to complete shutting down.
But that sounds like it's worth digging into, not sure how/why the OS shutdown sequence could take 15 minutes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @cartermckinnon, two notes for you.
I assume you can detach a volume when an instance is stopped? If so, maybe stopping prior to termination would at least put a more reasonable upper bound on the delay in this case
-
Great idea on
ec2:StopInstances
, but I'm not sure if the extra step is worth it considering that one can only detach a volume after instance reachesstopped
state (which would take an extra ~8 seconds). I measured the difference between stopping/terminating durations for a couple different instance types here. Looks likestopped
is 10-40 seconds faster depending on instance type or EC2 quirks. -
My original document was misleading when talking about GPU/Windows instance termination times. Most GPU/Windows instance types are closer to linux termination times than metal instances. I have added an instance termination timings section where I tested a few instance types.
The guest OS can take a long time to complete shutting down.
I can try digging into this with the folks at EC2.
designs/statefulset-disruption.md
Outdated
- Configure Kubelet for Graceful Node Shutdown | ||
- Enable Karpenter Spot Instance interruption handling | ||
- Use EBS CSI Driver ≥ `v1.x` in order use the [PreStop Lifecycle Hook](https://github.com/kubernetes-sigs/aws-ebs-csi-driver/blob/master/docs/faq.md#what-is-the-prestop-lifecycle-hook) | ||
- Set `.node.tolerateAllTaints=false` when deploying the EBS CSI Driver |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If graceful node shutdown is configured, wouldn't it be fine for the driver to tolerate Karpenter's disruption taint?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For 6+ min delays: In theory yes, in practice I have needed to turn tolerateAllTaints off in order to not see 6+min delays. We can trade notes on this.
|
||
See [a proof-of-concept implementation of A2 & B1 in PR #1294](https://github.com/kubernetes-sigs/karpenter/pull/1294) | ||
|
||
Finally, we should add the following EBS x Karpenter end-to-end test in karpenter-provider-aws to catch regressions between releases of Karpenter or EBS CSI Driver: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Really like that we are scoping in testing to this RFC! This is definitely something that we should be actively monitoring!
|
||
This means that our sequence of events will match the ideal diagram from section [Ideal Graceful Shutdown for Stateful Workloads][#ideal-graceful-shutdown-for-stateful-workloads] | ||
|
||
We can use similar logic to [today's proof-of-concept implementation](https://github.com/kubernetes-sigs/karpenter/pull/1294), but move it to karpenter-provider-aws and check for `node.Status.VolumesInUse` instead of listing volumeattachment objects. A 20 second max wait was sufficient to prevent delays with m5a instance type, but further testing is needed to ensure it is enough for Windows/GPU instance types. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any kind of signal that the EBS CSI driver could give us to know that everything is detached before terminating? One option is that Karpenter does some work from its neutral code; another option is that the EBS CSI driver injects a finalizer and only removes it when it knows that it has finished the work that it needs to do. This, of course, requires us to create a finalizer that we wait for instance termination. There are similar-ish requirements from ALB though with waiting on instance termination before the target group is deregistered and I'm wondering how much overlap that we have here on these two problems
|
||
Wait for volumes to detach before terminating the instance. | ||
|
||
We can do this by waiting for all volumes of drain-able nodes to be marked as not be in use nor attached before terminating the node in c.cloudProvider.Delete (until a maximum of 20 seconds). See [Appendix D3 for the implementation details of this wait](#d) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is kind of weird IMO. This basically means that the Delete call is going to return an error until we are good to actually delete the instance. I wonder if there's other ways that we could hook into the deletion flow without having to hack the cloudprovider delete call so that it orchestrates the flow that we're looking for
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem today is that we only call delete once we get a success, which means that you are either going to have to make this synchronous or you are going to have to do some kind of weird erroring so that we back-off until we actually do the termination
Pull Request Test Coverage Report for Build 9670727891Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
Pull Request Test Coverage Report for Build 9670778619Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
/remove-lifecycle stale |
/hold Edits |
This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity. |
@jmdeal Since this change ended up being done at the k-sigs/karpenter level, and we don't plan on making it part of the eventual "pre-node-deletion-hooks" plan, should I move an up-to-date and shortened version of this document to that repo? |
Sorry for the late response @AndrewSirenko, if you wanted to do that that would be great! I'll close this PR out and we can track over in k-sigs. |
Fixes #N/A
Description
Create RFC for disrupted EBS-Backed StatefulSet delays
How was this change tested?
N/A
Does this change impact docs?
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.