[Feature] To discuss the impact of PR #383 (Add owner checks and taking of final snapshots) on Multi-Node Etcd #242
In a test with @ishan16696, @abdasgupta, @aaronfern and @stoyanr we simulated the impact of doing owner checks on every member. This will be necessary to cut off active client connections to the member by killing the etcd process once. In a multi-node setup killing the etcd process (process is started again afterwards) will have the following impact:
--> Ideally, the final full snapshot is only taken after the owner check of the last member fails. Unfortunately, this is not guaranteed at all times. Example:
The given example results in a loss of any data that is written between T5 and T6.
Yeah, but I think the upper bound on the data loss between T5 and T6 will be around ~5 minutes, and I think that's acceptable, as we currently have deltaSnapshotPeriod scheduled for every 5 minutes, so in the worst case we can also lose 5 minutes of data there.
Thanks for bringing this up. IIUC, the Snapshotter is stopped as long as the owner check is failing, i.e. there will be no delta snapshots taken. Can you double-check this?
Yes, I have double-checked ... this block of code will not let the snapshotter loop start if the owner check fails. The only thing we need to discuss is how to cut off the client traffic in such a way that the etcd peer communication isn't affected, so that there will be no quorum loss.
And AFAIK configuring
So your proposal is to cut off traffic as soon as one member has a failing owner check, and thus there will be no changes any more?
If you want to drop the
Yes, I think it would suffice. Let's take this scenario under some assumptions.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------
T2 [etcd process of member2 is killed ---> it will cut off client ingress traffic]
------------------
etcd-0: Leader
etcd-1: Follower (Owner check fails)
etcd-2: Follower
------------------
T3 [etcd process of member1 is killed --> leads to leader-election]
------------------
etcd-0: Leader (Owner check fails)
etcd-1: Follower (Owner check fails)
etcd-2: Follower
------------------
T4(1) [etcd-1 becomes leader --> takes final full snapshot]
------------------
etcd-0: Follower (Owner check fails)
etcd-1: Leader (Owner check fails)
etcd-2: Follower
------------------
T4(2) [etcd-2 becomes leader --> eventually the owner check will fail --> etcd process of member3 will be killed --> somebody will become leader --> take final full snapshot]
------------------
etcd-0: Follower (Owner check fails)
etcd-1: Follower (Owner check fails)
etcd-2: Leader
------------------
There is also another way to do it: we can use the MoveLeader API call to transfer the leadership to the etcd member which first detects that the owner check has failed. The advantage of this method is that we don't have to worry about scenarios like the one above.

if (owner check fails && current etcd == Follower) {
    etcd is killed and restarted
    then transfer the leadership using the MoveLeader API call
    take the final full snapshot
    cut off the client traffic
} else if (owner check fails && current etcd == Leader) {
    etcd is killed and restarted --> leads to leader-election
    then transfer the leadership using the MoveLeader API call
    take the final full snapshot
    cut off the client traffic
}
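The decision flow above can be sketched as pure logic (a hypothetical illustration; `actionsOnOwnerCheckFailure` and the step descriptions are made up for this sketch, and a real backup sidecar would invoke etcd's `MoveLeader` maintenance API and the snapshotter instead of returning strings):

```go
package main

import "fmt"

// Role of the local etcd member at the time the owner check fails.
type Role int

const (
	Follower Role = iota
	Leader
)

// actionsOnOwnerCheckFailure mirrors the pseudocode above: it returns the
// ordered steps the backup sidecar would take once the owner check fails.
// Restarting a leader additionally triggers a leader election first.
func actionsOnOwnerCheckFailure(role Role) []string {
	steps := []string{"kill and restart etcd process"}
	if role == Leader {
		steps = append(steps, "wait for leader election")
	}
	steps = append(steps,
		"transfer leadership via the MoveLeader API",
		"take the final full snapshot",
		"cut off client traffic",
	)
	return steps
}

func main() {
	for _, step := range actionsOnOwnerCheckFailure(Follower) {
		fmt.Println(step)
	}
}
```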
I tried to capture the above proposal of using the MoveLeader API call described here #242 (comment) in a picture for better understanding:
Thanks for the proposal @ishan16696. Let me address two doubts:
IMO yes, it is possible ... the thing is, when we change the pod selector in the etcd-service we will be able to cut off the ingress client traffic (refer here) ... I have also already tried to simulate this behaviour (although I used a k8s LoadBalancer service to cut off the client requests).
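For illustration, cutting off ingress client traffic by switching the Service selector could look roughly like this (a hedged sketch; the Service name, labels, and ports are assumptions, not the actual etcd-druid manifests):

```yaml
# Hypothetical client Service for the etcd cluster.
apiVersion: v1
kind: Service
metadata:
  name: etcd-client
spec:
  # Normal operation would use a selector matching the etcd pods, e.g.:
  #   selector:
  #     app: etcd
  # To cut off ingress client traffic, the selector is switched to a label
  # no pod carries, so the Endpoints are removed and new client connections
  # fail, while etcd peer communication (via the peer service) is untouched.
  selector:
    app: etcd-traffic-disabled
  ports:
  - name: client
    port: 2379
    targetPort: 2379
```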
they will eventually kill their corresponding etcd member when they detect that the OwnerCheck fails as
⬆️ What about this point? Surviving problems in one AZ is one of the reasons for doing multi-node etcd.
I think a similar question was also raised by @abdasgupta ... I forgot what the counter-argument was, @abdasgupta do you remember?
AFAIK an owner check failure is not a very frequent scenario, it happens rarely, so it is not like we completely lose this advantage of having a
cc @stoyanr
We have the following 2 cases:
Note that from the perspective of the source cluster, it's not possible to make a clear distinction between the 2 cases. The DNS resolution can fail and the control plane migration can happen at the same time, so the source cluster must be disabled when the owner check fails, no matter the actual error. This was discussed a few times and is clearly pointed out in GEP-17 as well. I don't think it would be possible to fall back from this point without compromising the entire "bad case" design idea (to which we agreed after several phases of PoC and a lot of discussions).

From the discussion so far I believe the first case is covered. In the second case, the interesting question is what the behavior should be if the DNS resolution only fails in a single zone. If it fails in all (or more than one) zones then the cluster should be disabled, and it's probably not really usable in this case anyway. Ideally, if the DNS resolution fails in a single zone, we should be able to contain the failure to that zone only. Let's discuss in a meeting if this would be possible and how to achieve it.
@ishan16696 I wouldn't think so. The leaders can change/aren't stable. Why aren't we letting druid do it? Is that some "shortcut", so that the final snapshot is taken? But that can also be instrumented by druid, or remain implicit: instead of performing an owner check, it could be based on checking whether traffic was cut off, i.e. druid cuts off traffic, the backup sidecar notices that via the changed service and takes a full snapshot, because it is obvious/in a sense logical to now cease operations and close shop, which equals one last final full snapshot.
Uh, what? Why do we consider data loss acceptable (even 5 minutes) without proving it is unavoidable? In other words, unless there is no technical solution or no practical one with acceptable effort (considering the severity of "data loss", we will usually go to greater lengths than for any other feature in Gardener), data loss is not acceptable. We first need to be certain that it cannot be helped, but just like that (without a very detailed explanation), data loss is not acceptable, I would think.
Terminology question: When you say "the check fails", you do not mean the "check fails", but that the check shows that this ETCD cluster is no longer responsible for that cluster, right? Because if the "check fails", this should not have consequences and over-eagerly cut off traffic. But the main point is that I do not see why the individual ETCD instances should each have their own checks and not druid be in control. Druid is the master orchestrator here, including cutting off traffic, I would think?
Yes @timuthy. See above. Absolutely not. If a zone is segregated and the check fails (not: the check shows this ETCD lost responsibility for that shoot), whether the ETCD instance can then reach the control plane or not, it shall NEVER cut off traffic. I understand that there may be cases where DNS is malfunctioning (though probably calls will time out or deliver stale records, which is likely, and it is then even more worrying to cut off traffic, see e.g. our own CoreDNS issues in the past and present) AND the ETCD cluster has lost ownership, but that is a corner case of a corner case of a corner case, and the risk is much, much smaller to NOT cut off client traffic than to cut it off too often and render control planes broken, as we have seen in the past when we over-eagerly (e.g. because backups failed) shut down our ETCD.

In that sense, https://github.com/gardener/gardener/blob/master/docs/proposals/17-shoot-control-plane-migration-bad-case.md#handling-inability-to-resolve-the-owner-dns-record is pretty strong, and I thought (but again didn't read the GEP in detail) we agreed to ONLY cut off traffic if it is clear that ownership is lost (not failed) @stoyanr? If druid happens to be in the "broken" zone, it will/should probably fail its own readiness check as well and then come up on another (ready) node that will (eventually) be in a healthy zone, and if the ETCD cluster then indeed (for real) lost ownership, it can cut off the traffic.
To still have the advantage of having
Let's take this scenario under some assumptions.
T1
------------------
etcd-0: Leader
etcd-1: Follower
etcd-2: Follower
------------------
T2 [etcd process of member2 is killed ---> it will let the other members know that `it has detected that the owner check has failed`]
------------------
etcd-0: Leader
etcd-1: Follower (Owner check fails)
etcd-2: Follower
------------------
Consensus: The owner check hasn't failed yet, as the other 2 etcd members haven't detected an owner check failure.
etcd is still able to serve the incoming traffic.
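The consensus idea above could be sketched as follows (a hypothetical illustration; `ownerCheckQuorumFailed` is an invented helper, and how each member's owner check result gets shared between members, e.g. via etcd itself, is left open here):

```go
package main

import "fmt"

// ownerCheckQuorumFailed reports whether the cluster as a whole should treat
// the owner check as failed: only when a majority (quorum) of members have
// individually detected a failure. With 3 members, quorum is 2.
func ownerCheckQuorumFailed(memberFailed map[string]bool) bool {
	failed := 0
	for _, hasFailed := range memberFailed {
		if hasFailed {
			failed++
		}
	}
	quorum := len(memberFailed)/2 + 1
	return failed >= quorum
}

func main() {
	// Matches the T2 state above: only etcd-1 has detected a failure,
	// so the cluster keeps serving traffic.
	state := map[string]bool{"etcd-0": false, "etcd-1": true, "etcd-2": false}
	fmt.Println(ownerCheckQuorumFailed(state))
}
```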
To complement the previous comment about how a consensus for owner check results can be implemented:
(we can also work with etcd
What about using memberID instead of the member name?
I have one concern regarding the
I'm under the impression that we are not going to use
No clear preference, this was only an example. Please keep in mind that garbage collection will be more complex with memberID.
I'd rather use a combination of quorum && timeouts because a sole quorum can result in taking the final snapshot prematurely.
At the moment I don't see any other proposed solution, so I used what is already there, which is
@timuthy That is what I would rather leave to the quorum/majority than to all as discussed in the meeting and also @ishan16696 pointed out.
@timuthy Maybe you meant it, but what @ishan16696 and I wondered is the "all", but above you write "quorum". Quorum sounds good, timeouts may help even more as (as discussed in the call, but now in combination with quorum to safe-guard it even more). The general T5/T6 problem from above is already averted by the quorum trick. So there is now an even smaller risk (must think). The original T5 case was anyway a bit "constructed", because if 2/3 pods see lost or failed ownership, it is unlikely that 1/3 pods will see ownership. That means, that 2/3 pods failed to get the DNS record, but that it was still unchanged and the ownership remained with the seed (or a TTL issue maybe, which is a special form of failure, so quite "constructed" I would think). A detail question while trying to play through the different cases: I think @stoyanr explained in the case of ownership loss/failure, that the readiness probe returns (503) and then the etcd process is terminated to terminate all existing connections. Since the Kubelet takes time to detect the failed readiness probe and report it back and then KCM to update the endpoints, how is it ensured that the endpoint is first removed and doesn’t come back? If you restart the etcd process too fast, the endpoint is still up. While you can make some assumptions about the Kubelet, KCM is even more unpredictable. Removing the endpoint yourself runs the danger of racing with KCM. What is the trick here? |
@stoyanr replied out-of-band that the delays add up to the time the kubelet/KCM will usually take to remove the endpoint:
And...
Maybe that can be further safe-guarded, but considering the delays above, it’s probably not worth it. E.g., the moment we let the readiness probe fail, we know the Kubelet will report it whenever it sees it next time (there is still a tiny uncertainty if it invoked the readiness probe that succeeded and has not yet reported it back). Knowing that the Kubelet will report false next time, one can “already” pro-actively update the pod status ready condition to false and also the endpoint as well. That’s it. If the Kubelet reports false, it’s already false, and if KCM checks, the endpoint is already gone. As said, there is this time window of uncertainty where we do not know for sure whether the Kubelet will race with us if it has just called the readiness probe successfully and has not yet reported it back.
A few cases (time progresses vertically), e.g.: Zone outage/network partitioned -> leader election
Case: Loss of ownership -> traffic cut-off and final full snapshot backup
Case: Zone outage/network partitioned -> leader election &
There are many more (corner cases), but when I think of some (similar to the T5/T6 case that should no longer be possible?), e.g. the leader loses leadership while in full backup, a follower that gets elected to become the next leader should check the owner loss quorum before opening up for traffic (pass or fail the readiness probe from then on depending on the result) and that would prevent intermediate etcd updates, right?
Something like that?
The constructed case was less about whether 1/3 will keep seeing ownership, but rather about when it will start seeing a lost ownership, because of the deviated owner check intervals and the involved TTLs. @stoyanr then suggested to have a second safeguard which the
Do I get it correctly that with your cases above you don't see the necessity for this second safeguard?
Yes, I understood, but if two pods see already lost ownership, the third one, checking even later, should see it as well. The chance that it doesn't is very small, is it not? Maybe a TTL issue because the record was fetched right before it was switched, something like that. Anyway, I didn't say impossible. And, with quorum, it doesn't matter anymore.
No, that shouldn't imply it. I don't know how exactly it is implemented, the termination, the readiness probe, etc. So, I was vague because of lack of knowledge, but maybe we can fill in the details together and make it safer?

The above has helped me see some things more clearly (like termination will of course cause loss of leadership or leader election), but when that happens, the sections where I write "Terminate ETCD process"/"Fail readiness probe", etc. are too vague for me. It really depends now on the details, but as said yesterday (not helpful as that's not a concrete statement), I don't think it's far now anymore (like also Stoyan said). It's basically a quorum-based "distributed transaction". The pattern is known, the details must be clarified now. What happens chronologically when, storing data, terminating ETCD, checking that data after restart and before ever passing (or not passing) a readiness probe, etc. This information I didn't have to make it more concrete.
As discussed, we have decided to turn off the owner check in the multi-node etcd case, as owner checks introduce complexities which are difficult to manage in a multi-node etcd, and with HA control planes the "bad case" control plane migration would be triggered very rarely.
/close
Feature (What you would like to be added):
How is etcd-druid going to react to this new change in the workflow of the backup-leader?
Motivation (Why is this needed?):
PR #383 (Add owner checks and taking of final snapshots) wants to disconnect the api-server and etcd; for that it kills the etcd process and fails the readiness probe. Now, in multi-node etcd, do we want to kill/disconnect all etcd members?
If we go with a change where the backup-leader kills/disconnects the api-server and the etcd cluster loses its quorum, then we also have to take care of etcd-druid's reaction to the quorum loss.
cc @stoyanr
Approach/Hint to the implement solution (optional):