Introduce cluster-ignore-disk-write-errors to prevent the process from exiting when saving fails #1032
base: unstable
Conversation
…ss from exiting when saving fails

In clusterSaveConfigOrDie, we exit directly when saving fails. In the case of disk failure, the cluster node will exit immediately if there are changes around the cluster. A passive exit may bring unexpected effects, such as the cluster going down. Provide the exit-on-cluster-config-file-save-error configuration item to control this behavior. Users who do not rely on nodes.conf can actively migrate the node later (or during low traffic periods).

Signed-off-by: Binbin <binloveplay1314@qq.com>
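For context, here is a rough sketch of the kind of change being discussed, assuming the new config is backed by a server.cluster_ignore_disk_write_errors flag (an assumed field name) and reusing the rate-limited logging pattern that appears in the review snippet further down; the actual patch may differ:

    /* Rough sketch only; the field name server.cluster_ignore_disk_write_errors
     * is an assumption based on the proposed config name. */
    void clusterSaveConfigOrDie(int do_fsync) {
        if (clusterSaveConfig(do_fsync) == -1) {
            if (!server.cluster_ignore_disk_write_errors) {
                serverLog(LL_WARNING, "Fatal: can't update cluster config file.");
                exit(1);
            } else {
                /* Keep running, but warn at most once every 10 seconds. */
                static mstime_t last_log_time_ms = 0;
                const mstime_t log_interval_ms = 10000;
                if (mstime() - last_log_time_ms >= log_interval_ms) {
                    last_log_time_ms = mstime();
                    serverLog(LL_WARNING, "Can't update cluster config file, "
                                          "continuing since cluster-ignore-disk-write-errors is set.");
                }
            }
        }
    }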
@PingXie This is a change I wanted to make a while ago; we'll discuss it in a later version.
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1032 +/- ##
============================================
- Coverage 70.96% 70.96% -0.01%
============================================
Files 120 120
Lines 65061 65067 +6
============================================
+ Hits 46171 46174 +3
- Misses 18890 18893 +3
This adds more states we have to reason about; what exactly is the motivation behind being able to continue without saving to disk? It seems like we should maybe support a mode where we don't have to commit to disk at all. I've also seen issues with stuck IO causing the fsync to stall, which can also cause an outage.
The motivation is not to exit immediately, because the exit is passive (if it is a primary, it may cause the cluster to go down and be unable to accept writes for a certain period of time; if it is a replica, it may cause some read requests to fail in a read-write separation scenario). Not exiting immediately allows us to do a manual exit later, so that we have enough time to handle it.
This is also a problem. I think we should figure out a way to get rid of this latency; we don't rely on the nodes.conf very much, and the latency is killing us.
I'd really appreciate it if you guys could give some input on this one @zuiderkwast @hpatro @PingXie @madolson
If saving nodes.conf succeeds once and later it fails, then next time we reboot we can end up with an outdated cluster config. I think this is a problem? Isn't it better to have a mode that doesn't even try to read or write nodes.conf at all, a diskless cluster mode? With some external control plane like a Kubernetes operator, you don't really need a persistent nodes.conf. If a node disappears or reboots, it's just replaced by a new node. I understand the wish to run a node without this storage bottleneck.
Yes, that will be a problem, so people who enable this config should not rely on a reboot and should not rely on the nodes.conf.
This seems like a good idea to me.
Yes, this is right. Some cases are difficult to handle; sometimes a disk error is discovered and a node needs to be migrated. But the node may update nodes.conf at any time before the migration (or during the migration, since a migration will for sure trigger a nodes.conf update). If the update fails, the node will exit, the primary goes down, the cluster goes into a fail state, and we will need a node-timeout to trigger the failover. e231406. On the other hand, I want to first add latency tracking around the nodes.conf save (WDYT?). We have indeed encountered this save function blocking for 100ms+, or the open call blocking for a few seconds (in the case of logging).
The "diskless cluster mode," in my view, takes a different approach by delegating configuration management to an external component. However, I don’t think it would work for users who prefer a "self-contained" deployment and servers automatically restarting after a crash, which is the use case I'm considering below. Allowing a node to continue operating despite failing to save its nodes.conf updates feels similar to a I agree stuck IOs are a bit worse. The failover will still happen, which is good, but the old primary who get stuck in the nodes.conf disk IO won't rejoin the shard as a replica. We could consider making these writes async so we can time out nodes.conf updates. I also find it problematic that the current design mixes configurations (e.g., IP/port, slots) with states (e.g., FAIL/PFAIL) in the same file. I wonder if separating these concerns might reduce the likelihood of server crashes caused by disk failures. |
Yes, it does delegate the conf to an external component and doesn't benefit these users. Offering a switch seems OK to me, unless there is a better solution.
This is a good statement, so it should not read or write at all.
So what happens if an async write fails? Should we still exit when the async write gets stuck and eventually fails? Also, if the cluster conf changes multiple times during the async write, this will make nodes.conf stale as well (we can argue it is only within a short window).
Yes, I think it can reduce the likelihood, but we are still facing the same problem.
The async write is just a means for the server to be able to time out stuck nodes.conf file updates. The update semantics should still be sync on a high level.
Right, it is a mitigation, but it might be good enough in practice.
OK, I see, so we set a timeout and then exit before it gets stuck any longer, right?
The reason is that although we split it into two files, when something happens (conf or state changes) we still need to update a file, which means we still face the exit problem. Unless we treat some hot updates as non-exit updates, we will still hit the crash caused by disk failures, since we need to save the files anyway (one big file or two small files).
correct.
Yeah, it can still fail, but my point is that 100% uptime is neither a goal nor attainable. At the end of the day, anything we depend on could and will fail. With the reduced update frequency, I wonder if we could reason that the probability of the server crashing due to disk IO failures is on par with the probability of (bare-metal or virtual) machine failure?
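To make the timed-out write idea above concrete, a standalone sketch follows. It is not tied to the Valkey code base, and every name in it (save_job_t, SAVE_TIMEOUT_MS, do_disk_write, saveConfigWithTimeout) is hypothetical: the write runs on a helper thread and the caller gives up once a deadline passes, so a stuck disk cannot block the caller indefinitely.

    /* Hypothetical sketch of a timed-out config save; not Valkey code. */
    #include <errno.h>
    #include <pthread.h>
    #include <stdlib.h>
    #include <time.h>

    #define SAVE_TIMEOUT_MS 5000 /* hypothetical deadline for one save */

    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t cond;
        int done;      /* worker finished the write */
        int success;   /* result of the write */
        int abandoned; /* caller timed out; worker frees the job */
    } save_job_t;

    static int do_disk_write(void) { return 1; } /* stand-in for the real save */

    static void *save_worker(void *arg) {
        save_job_t *job = arg;
        int ok = do_disk_write();
        pthread_mutex_lock(&job->lock);
        job->done = 1;
        job->success = ok;
        int abandoned = job->abandoned;
        pthread_cond_signal(&job->cond);
        pthread_mutex_unlock(&job->lock);
        if (abandoned) free(job); /* caller already gave up */
        return NULL;
    }

    /* Returns 1 on success, 0 on failure or timeout. */
    int saveConfigWithTimeout(void) {
        save_job_t *job = calloc(1, sizeof(*job));
        pthread_mutex_init(&job->lock, NULL);
        pthread_cond_init(&job->cond, NULL);
        pthread_t tid;
        if (pthread_create(&tid, NULL, save_worker, job) != 0) {
            free(job);
            return 0;
        }
        pthread_detach(tid); /* a stuck worker is left to finish on its own */

        struct timespec deadline;
        clock_gettime(CLOCK_REALTIME, &deadline);
        deadline.tv_sec += SAVE_TIMEOUT_MS / 1000;
        deadline.tv_nsec += (SAVE_TIMEOUT_MS % 1000) * 1000000L;
        if (deadline.tv_nsec >= 1000000000L) {
            deadline.tv_sec++;
            deadline.tv_nsec -= 1000000000L;
        }

        pthread_mutex_lock(&job->lock);
        while (!job->done) {
            if (pthread_cond_timedwait(&job->cond, &job->lock, &deadline) == ETIMEDOUT) break;
        }
        int ok = job->done && job->success;
        if (!job->done) job->abandoned = 1; /* hand cleanup over to the worker */
        int finished = job->done;
        pthread_mutex_unlock(&job->lock);
        if (finished) free(job);
        return ok;
    }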
Another thought I want to mention is that a lot of normal users (not the self-contained ones) think Redis/Valkey is an in-memory database. When persistence (RDB and AOF) is disabled, they find it difficult to understand and accept a cluster failure caused by a conf file failing to update, and mostly the conf file is maintained by us (or their admins).

And for a "self-contained" deployment, they will eventually find out that if there is a disk failure, the server will still crash after the reboot, and they will find out the solution is to migrate the node (with its nodes.conf and data perhaps, or just add a new replica node on a new machine after the failover).

The disks we use are not reliable over time, which is a pain for me, so I want a way to solve or mitigate this problem, to avoid the stuck IO (latency) or the exit.

I would also like to point out that mostly (I guess, or at least for us) people may deploy multiple servers on a single machine, so when a disk failure occurs, multiple nodes (or multiple instances) may be affected.
Yes, that is a disconnect. As I mentioned earlier, I don't think the current cluster design requires nodes.conf updates to be synchronous, and synchronous nodes.conf updates don't prevent staleness either, since a down node can re-join the cluster at any time. The epoch mechanism is there to prevent nodes with stale cluster views from messing up the global cluster state. I'm open to the idea of introducing this configuration, provided it defaults to off.
I am glad we are on the same page. The default of this configuration is the same as before; that is, if the update fails, the node will exit.
We can probably rename it to something like cluster-ignore-disk-write-error, just like replica-ignore-disk-write-errors.
Interesting! New words like this are always fun. Now, you gave me another idea: bounded staleness. Let's say we can tolerate write failures for, say, 5 minutes (or configurable). If the last successful write was older than that, we can crash. Then we can guarantee the file is at least relatively new. How about that?
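A minimal sketch of that bounded-staleness idea, assuming a record of the last successful save and a hypothetical max-staleness threshold (neither exists in the code base):

    /* Hypothetical bounded-staleness variant of the save path. */
    static mstime_t last_successful_save_ms = 0;

    void clusterSaveConfigBounded(int do_fsync) {
        const mstime_t max_stale_ms = 5 * 60 * 1000; /* e.g. 5 minutes, could be a config */
        if (clusterSaveConfig(do_fsync) == 0) {
            last_successful_save_ms = mstime();
            return;
        }
        if (last_successful_save_ms != 0 &&
            mstime() - last_successful_save_ms > max_stale_ms) {
            serverLog(LL_WARNING, "nodes.conf writes have been failing for too long, exiting.");
            exit(1);
        }
        /* Otherwise tolerate the failure: the last successful write is recent
         * enough that the file's staleness stays bounded by max_stale_ms. */
    }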
I was also having similar thoughts after dealing with an issue where an incorrect shard-id was temporarily persisted to disk and the server crashed; after that, the server on restart read the nodes.conf file and failed to start up due to a failing validation check. Even though it's a bug in the code, the dependency on nodes.conf leads to this issue. I would also prefer to have a config to disable both reads and writes of nodes.conf. However, the risk is that all the information is lost and the node after restart would start with a new node id, so it may need to undergo the MEET process with the rest of the cluster.
I don't think it will help much. When a disk error happens, it is usually not recoverable, except when the disk is exhausted; in that case, we can delete some files. But I guess in most cases it is unlikely to recover unless we migrate the disk or the machine. So the n-minutes window doesn't look that useful.
Yes, I also like that idea. We should try to find a way to drop the nodes.conf in this mode; some people who have their own control plane might like this.
I'm OK with cluster-ignore-disk-write-errors.

For completely disabling nodes.conf, yes, the node needs CLUSTER MEET again to join the cluster, and it gets a new node-id. We use this with Kubernetes already. Our (internal) Kubernetes operator creates a new node to replace any node that disappears. Persistent storage in Kubernetes can be problematic, especially in the case of a netsplit when the operator can't reach the failing node. In this case, the operator can't be sure whether the storage used by the node can safely be reused by another new node or whether the node is still alive and using the storage. Avoiding persistent storage in Kubernetes avoids this problem.

(FWIW, I think we should aim for an official Kubernetes operator as an open source control plane in the long term. It's a very non-trivial job though, because some users want persistent storage and others don't, cluster vs standalone, etc. There's a discussion here: https://github.com/orgs/valkey-io/discussions/746.)
I would like to start by adding some latency tracking around it, see #1534
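For reference, "adding latency around it" could look roughly like the following, using the existing latency monitor macros; the event name and the actual shape of #1534 are assumptions:

    /* Sketch: sample how long the cluster config save takes, so slow disks
     * show up in LATENCY HISTORY output. The event name is made up. */
    void clusterSaveConfigOrDie(int do_fsync) {
        mstime_t latency;
        latencyStartMonitor(latency);
        int ret = clusterSaveConfig(do_fsync);
        latencyEndMonitor(latency);
        latencyAddSampleIfNeeded("cluster-config-save", latency);
        if (ret == -1) {
            serverLog(LL_WARNING, "Fatal: can't update cluster config file.");
            exit(1);
        }
    }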
…exit Signed-off-by: Binbin <binloveplay1314@qq.com>
Signed-off-by: Binbin <binloveplay1314@qq.com>
exit(1);
} else {
    static mstime_t last_log_time_ms = 0;
    const mstime_t log_interval_ms = 10000;
We may need a configuration item to control this in the future; currently this logic is copied from replica-ignore-disk-write-errors.
@@ -196,7 +196,7 @@ start_cluster 1 1 {tags {"latency-monitor cluster external:skip needs:latency"}
 # We don't assert anything since we can't be sure whether it will be counted.
 R 0 cluster saveconfig
 R 1 cluster saveconfig
-R 1 cluster failover force
+R 1 cluster failover takeover
The reason for changing it is that in the previous test case, force does not have enough votes to finish the election.
In clusterSaveConfigOrDie, we exit directly when saving
fails. In the case of disk failure, the cluster node will exit
immediately if there are changes around the cluster.
A passive exit may bring unexpected effects, such as the cluster
going down.
Provide the cluster-ignore-disk-write-errors configuration item
to control this behavior. Users who do not rely on nodes.conf
can actively migrate the node later (or during low traffic
periods).
The cluster configuration file is the metadata "database" for
the cluster. This option is somewhat like
replica-ignore-disk-write-errors, and it is not recommended to
change the default. The default is no, and the behavior is the
same as before; be aware that the cluster metadata may become
inconsistent if it is set to yes.
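For illustration only, enabling the option would presumably be a single line in valkey.conf (per the message above, the default is no):

    # Keep the node running even when nodes.conf cannot be written.
    # The default, no, preserves the old exit-on-failure behavior.
    cluster-ignore-disk-write-errors yes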