
etcd on reboot fails to form a cluster #884

Closed · fleiner opened this issue Jul 9, 2014 · 7 comments
Labels: bug

@fleiner commented Jul 9, 2014

This is with four etcd instances running in a cluster, then power-cycling all four machines. After reboot and restart, three machines are stuck logging

WARNING | fail getting leader from cluster

and one claims

WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

All of them start with the following command line (the first time, they were started with the appropriate -peers option):

etcd -name n3 -data-dir /var/etcd -addr 172.28.254.23:4001 -peer-addr 172.28.254.23:7001 -peer-heartbeat-interval 200

(the IP address is of course the local one)

Any idea what might be wrong, especially with the one that complains about the mismatch?

thanks

@yichengq (Contributor) commented Jul 9, 2014

This is fixed by #881.
The way to recover for now is:

  1. Stop all etcd instances.
  2. Open $data_dir/standby_info and manually change "Running":true to "Running":false for every instance (a sketch follows below).
  3. Restart each etcd with its original command line.

It should work fine after the fix. Sorry for the inconvenience.
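
For step 2, a minimal sketch of the edit (assuming standby_info is a small JSON file containing a "Running" field and that GNU sed is available; the data-dir path is illustrative and matches whatever was passed via -data-dir):

    # run on every node after stopping etcd
    data_dir=/var/etcd
    sed -i 's/"Running":true/"Running":false/' "$data_dir/standby_info"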

@fleiner (Author) commented Jul 10, 2014

Hello,

thanks for the suggestion, but unfortunately none of the files in the data directory has a "Running" attribute, and there is no file named "standby_info" at all.

thanks

@mcqj commented Jul 10, 2014

I am seeing the same issue, and again my data_dir has no standby_info file.

@garo commented Aug 7, 2014

I ran into the same issue when I did a rolling update from 0.4.3 to 0.4.6. The first two etcd instances updated just fine, but the third hit the same problem.

The 3rd instance logs:
[etcd] Aug 7 05:34:36.663 INFO | Send Join Request to http://etcd-1:7001/join
[etcd] Aug 7 05:34:36.676 INFO | etcd-3 joined the cluster via peer etcd-1:7001
[etcd] Aug 7 05:34:36.678 INFO | etcd server [name etcd-3, listen on :4001, advertised url http://172.16.6.185:4001]
[etcd] Aug 7 05:34:36.678 INFO | peer server [name etcd-3, listen on :7001, advertised url http://172.16.6.185:7001]
[etcd] Aug 7 05:34:36.678 INFO | etcd-3 starting in peer mode
[etcd] Aug 7 05:34:36.678 INFO | etcd-3: state changed from 'initialized' to 'follower'.
[etcd] Aug 7 05:34:36.723 INFO | etcd-3: state changed from 'follower' to 'snapshotting'.
[etcd] Aug 7 05:34:36.761 INFO | etcd-3: peer added: 'etcd-2'
[etcd] Aug 7 05:34:36.762 INFO | etcd-3: peer added: 'etcd-1'
[etcd] Aug 7 05:34:39.710 INFO | etcd-3: snapshot of 288185054 events at index 288185054 completed
[etcd] Aug 7 05:35:38.472 WARNING | [ss] Error: nil response
[etcd] Aug 7 05:35:38.521 WARNING | [ss] Error: nil response

And the master instance (etcd-1 in this case) flooded its log with:
WARNING | transporter.ss.decoding.error:proto: field/encoding mismatch: wrong type for field

EDIT: I found a workaround: use the machine deletion API to remove the problematic node, then start the previously problematic etcd-3 again:

curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3

I'm using the coreos/etcd Docker container.
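
A sketch of that full sequence, assuming etcd-3 runs as a Docker container named etcd-3 and rejoins via its original -peers setting (the container name and URLs are illustrative):

    # stop the broken member, drop it from the cluster, then rejoin
    docker stop etcd-3
    curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3
    docker start etcd-3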

@mkaag commented Aug 9, 2014

Same issue here (367.1.0 stable), but neither the XDELETE command nor the standby_info trick solved the problem. I stopped etcd on every node, removed the /var/lib/etcd content (conf, log, and snapshot; yes, quite brutal), and restarted etcd. Now everything is back to normal.
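
For reference, a sketch of that full reset, assuming etcd runs under systemd and uses /var/lib/etcd as its data directory (run on every node; this destroys all etcd state, including stored keys):

    # stop etcd on all nodes before wiping any data
    sudo systemctl stop etcd
    sudo rm -rf /var/lib/etcd/*    # removes conf, log, and snapshot
    sudo systemctl start etcd      # the cluster re-forms from scratch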

@yichengq added the bug label Aug 28, 2014
@mark-kubacki commented

Same here with CoreOS 435.0.0.

@kelseyhightower (Contributor) commented

We have reached a point with etcd 0.4.x where the workarounds described here are required in cases like these. Many, if not all, of these issues have been fixed on master and are available for testing in the etcd 0.5.0 alpha. Thanks for reporting issues like this; it has helped us make 0.5.x more solid and influenced our design around snapshots and cluster config.

I'm closing this issue due to age and my belief that this issue is resolved on master and 0.5.0.
