etcd on reboot fails to form a cluster #884
Comments
It is fixed by #881.
Hello, thanks for the suggestion, but unfortunately none of the files in the data directory have a "Running" attribute, and there is no file named "standby_info" at all. Thanks.
I am seeing the same issue, and again my data_dir has no standby_info file.
I ran into the same issue when I did a rolling update from 0.4.3 to 0.4.6. The first two etcd instances updated just fine, but the third hit the same problems: the third instance logged errors, and the master instance (etcd-1 in this case) flooded its log with them. EDIT: I found a workaround by using the node deletion API to remove the problematic node: curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3 and then starting the previously problematic etcd-3 again. I'm using the coreos/etcd Docker container.
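For anyone else in this state, a minimal sketch of that workaround, assuming the etcd 0.4.x v2 admin API is reachable on the leader's peer port (7001) and that the stuck member is named etcd-3:

```sh
# Inspect the members the leader currently knows about.
curl -L http://etcd-1:7001/v2/admin/machines

# Drop the problematic member from the cluster configuration.
curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3

# Afterwards, start etcd-3 again with its usual flags so it rejoins.
```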
Same issue here (367.1.0 stable), but neither the XDELETE command nor the standby_info trick solved the problem. I stopped etcd on every node, removed the contents of /var/lib/etcd (conf, log and snapshot - yes, quite brutal) and restarted etcd, and now everything is back to normal.
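For completeness, a rough sketch of that last-resort reset, assuming etcd runs as the systemd unit etcd.service and keeps its state under /var/lib/etcd; it throws away the node's etcd state, so only use it when that data is expendable:

```sh
# Run on every node in the cluster.
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*    # removes conf, log and snapshot - destructive
sudo systemctl start etcd
```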
Same here with CoreOS 435.0.0. |
We have reached a point with etcd 0.4.x where the workarounds described here are required in cases like these. Many if not all of these issues have been fixed on master and are available for testing in the etcd 0.5.0 alpha. Thanks for reporting issues like this, as it has helped us make 0.5.x more solid and influenced our design around snapshots and cluster config. I'm closing this issue due to age and my belief that this issue is resolved on master and in 0.5.0.
This is with four etcd instances running in a cluster, then power cycling all four machines. After reboot and restart I end up with three machines that keep reporting

```
WARNING | fail getting leader from cluster
```

and one that claims

```
WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
```
All of them start with the following command line (the first time they got started with the appropriate -peers option):

```
etcd -name n3 -data-dir /var/etcd -addr 172.28.254.23:4001 -peer-addr 172.28.254.23:7001 -peer-heartbeat-interval 200
```
(the IP address is of course the local one)
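For context, a sketch of how an instance like n3 is typically bootstrapped the first time versus restarted, assuming another peer is reachable at 172.28.254.21:7001 (that peer list is an assumption for illustration, not taken from this report):

```sh
# First start of n3: join the existing cluster through a known peer.
etcd -name n3 -data-dir /var/etcd \
  -addr 172.28.254.23:4001 -peer-addr 172.28.254.23:7001 \
  -peer-heartbeat-interval 200 \
  -peers 172.28.254.21:7001   # assumed peer address, adjust to your cluster

# Subsequent restarts use the command line above without -peers;
# cluster membership is recovered from the data dir (/var/etcd).
```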
Any idea what might be wrong, especially with the one that complains about the mismatch?
thanks