
etcd on reboot fails to form a cluster #884

Closed · fleiner opened this issue Jul 9, 2014 · 7 comments
Labels: bug

@fleiner commented Jul 9, 2014

This is with four etcd instances running in a cluster, then power-cycling all four machines. After reboot and restart, three machines are stuck logging

WARNING | fail getting leader from cluster

and one claims

WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

All of them start with the following command line (the first time, they were started with the appropriate -peers option):

etcd -name n3 -data-dir /var/etcd -addr 172.28.254.23:4001 -peer-addr 172.28.254.23:7001 -peer-heartbeat-interval 200

(the IP address is of course the local one)

Any idea what might be wrong, especially with the one that complains about the mismatch?

thanks

@yichengq (Contributor) commented Jul 9, 2014

This is fixed by #881.
The way to recover for now is:

  1. Stop all etcd instances.
  2. Open $data_dir/standby_info and manually change "Running":true to "Running":false for every instance (a sketch follows below).
  3. Restart each etcd with its original command line.

It should work fine after the fix. Sorry for the inconvenience.
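
For step 2, a minimal sketch of the edit (assuming standby_info is a small JSON file containing a "Running" field and that GNU sed is available; the data-dir path is illustrative and matches whatever was passed via -data-dir):

    # run on every node after stopping etcd
    data_dir=/var/etcd
    sed -i 's/"Running":true/"Running":false/' "$data_dir/standby_info"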

@fleiner (Author) commented Jul 10, 2014

Hello,

thanks for the suggestion, but unfortunately none of the files in the data directory has a "Running" attribute, and there is no file named "standby_info" at all.

thanks

@mcqj commented Jul 10, 2014

I am seeing the same issue, and again my data_dir has no standby_info file.

@garo commented Aug 7, 2014

I ran into the same issue when I did a rolling update from 0.4.3 to 0.4.6. The first two etcd instances updated just fine, but the third hit the same problem.

The 3rd instance logs:
[etcd] Aug 7 05:34:36.663 INFO | Send Join Request to http://etcd-1:7001/join
[etcd] Aug 7 05:34:36.676 INFO | etcd-3 joined the cluster via peer etcd-1:7001
[etcd] Aug 7 05:34:36.678 INFO | etcd server [name etcd-3, listen on :4001, advertised url http://172.16.6.185:4001]
[etcd] Aug 7 05:34:36.678 INFO | peer server [name etcd-3, listen on :7001, advertised url http://172.16.6.185:7001]
[etcd] Aug 7 05:34:36.678 INFO | etcd-3 starting in peer mode
[etcd] Aug 7 05:34:36.678 INFO | etcd-3: state changed from 'initialized' to 'follower'.
[etcd] Aug 7 05:34:36.723 INFO | etcd-3: state changed from 'follower' to 'snapshotting'.
[etcd] Aug 7 05:34:36.761 INFO | etcd-3: peer added: 'etcd-2'
[etcd] Aug 7 05:34:36.762 INFO | etcd-3: peer added: 'etcd-1'
[etcd] Aug 7 05:34:39.710 INFO | etcd-3: snapshot of 288185054 events at index 288185054 completed
[etcd] Aug 7 05:35:38.472 WARNING | [ss] Error: nil response
[etcd] Aug 7 05:35:38.521 WARNING | [ss] Error: nil response

And the master instance (etcd-1 in this case) flooded its log with:
WARNING | transporter.ss.decoding.error:proto: field/encoding mismatch: wrong type for field

EDIT: I found a workaround: use the machine deletion API to remove the problematic node, then start the previously problematic etcd-3 again:

curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3

I'm using the coreos/etcd Docker container.
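
A sketch of that full sequence, assuming etcd-3 runs as a Docker container named etcd-3 and rejoins via its original -peers setting (the container name and URLs are illustrative):

    # stop the broken member, drop it from the cluster, then rejoin
    docker stop etcd-3
    curl -L -XDELETE http://etcd-1:7001/v2/admin/machines/etcd-3
    docker start etcd-3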

@mkaag commented Aug 9, 2014

Same issue here (367.1.0 stable), but neither the XDELETE command nor the standby_info trick solved the problem. I stopped etcd on every node, removed the /var/lib/etcd content (conf, log, and snapshot; yes, quite brutal), and restarted etcd. Now everything is back to normal.
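
For reference, a sketch of that full reset, assuming etcd runs under systemd and uses /var/lib/etcd as its data directory (run on every node; this destroys all etcd state, including stored keys):

    # stop etcd on all nodes before wiping any data
    sudo systemctl stop etcd
    sudo rm -rf /var/lib/etcd/*    # removes conf, log, and snapshot
    sudo systemctl start etcd      # the cluster re-forms from scratch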

@yichengq added the bug label Aug 28, 2014
@mark-kubacki commented

Same here with CoreOS 435.0.0.

@kelseyhightower (Contributor) commented

We have reached a point with etcd 0.4.x where the workarounds described here are required in cases like these. Many, if not all, of these issues have been fixed on master and are available for testing in the etcd 0.5.0 alpha. Thanks for reporting issues like this; it has helped us make 0.5.x more solid and influenced our design around snapshots and cluster config.

I'm closing this issue due to age and my belief that this issue is resolved on master and 0.5.0.
