Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vote endpoint returns 404 on new cluster #815

Closed
jefferai opened this issue May 30, 2014 · 49 comments · Fixed by #818
Closed

vote endpoint returns 404 on new cluster #815

jefferai opened this issue May 30, 2014 · 49 comments · Fixed by #818
Labels

Comments

@jefferai
Copy link

I created a fresh cluster of three machines yesterday, using the 0.4.1 tag. I added one key to the cluster (/global/serfkey) and then left it alone overnight.

It's a very simple cluster:

etcdctl -debug -peers http://others:4001 ls --recursive /_etcd/machines
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X GET http://10.8.8.108:4001/v2/keys/_etcd/machines?consistent=true&recursive=true&sorted=false
/_etcd/machines/others
/_etcd/machines/hurley

etcdctl -debug -peers http://others:4001 ls --recursive /
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X GET http://10.8.8.108:4001/v2/keys/?consistent=true&recursive=true&sorted=false
/global
/global/serfkey

Then I tried to simply add a directory:

etcdctl -debug -peers http://others:4001 mkdir /roles/eventcenter
Cluster-Peers: http://10.8.8.108:4001 http://10.8.8.106:4001
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false

Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Curl-Example: curl -X PUT http://10.8.8.108:4001/v2/keys/roles/eventcenter?dir=true&prevExist=false
Error: 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]

One of my three machines (hurley at 10.8.8.108) is now spitting out:

[etcd] May 30 14:17:51.322 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.579 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.579 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.900 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:51.900 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.197 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.197 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.537 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.537 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.882 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[etcd] May 30 14:17:52.882 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

Restarting etcd doesn't help. The other two machines can't connect to it, fail to generate a quorum between them, and the whole cluster is unusable (and I'm hoping not doomed).

@jefferai
Copy link
Author

Here is output with debugging set up high:

From the machines not outputting the above output:

[etcd] May 30 14:48:26.124 DEBUG     | Error parsing response to etcd error: json: cannot unmarshal number into Go value of type error.Error
[etcd] May 30 14:48:26.124 DEBUG     | fail getting machine messages from http://10.8.8.110:7001
[etcd] May 30 14:48:26.124 DEBUG     | fail getting machine messages from http://10.8.8.108:7001
[etcd] May 30 14:48:26.125 DEBUG     | Error parsing response to etcd error: invalid character '<' looking for beginning of value
[etcd] May 30 14:48:26.125 DEBUG     | fail getting machine messages from http://10.8.8.106:7001
[etcd] May 30 14:48:26.126 WARNING   | fail syncing cluster([http://10.8.8.110:7001 http://10.8.8.108:7001 http://10.8.8.106:7001]): unreachable cluster
[etcd] May 30 14:48:31.128 DEBUG     | Error parsing response to etcd error: json: cannot unmarshal number into Go value of type error.Error
[etcd] May 30 14:48:31.128 DEBUG     | fail getting machine messages from http://10.8.8.110:7001
[etcd] May 30 14:48:31.128 DEBUG     | fail getting machine messages from http://10.8.8.108:7001
[etcd] May 30 14:48:31.129 DEBUG     | Error parsing response to etcd error: invalid character '<' looking for beginning of value
[etcd] May 30 14:48:31.129 DEBUG     | fail getting machine messages from http://10.8.8.106:7001
[etcd] May 30 14:48:31.129 WARNING   | fail syncing cluster([http://10.8.8.110:7001 http://10.8.8.108:7001 http://10.8.8.106:7001]): unreachable cluster

And from the machine outputting the above output:

[etcd] May 30 14:49:16.987 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:16.987438 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:16.987 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:16.988 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:16.988917 peer.vote.failed:  hurley <- others
[etcd] May 30 14:49:16.989 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:16.989714 peer.vote.failed:  hurley <- desmond
[raft]14:49:17.384085 peer.vote:  hurley -> others
[etcd] May 30 14:49:17.384 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:17.384166 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:17.384 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:17.385 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.385406 peer.vote.failed:  hurley <- desmond
[etcd] May 30 14:49:17.385 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.385961 peer.vote.failed:  hurley <- others
[raft]14:49:17.588297 peer.vote:  hurley -> others
[etcd] May 30 14:49:17.588 DEBUG     | Send Vote from hurley to http://10.8.8.106:7001
[raft]14:49:17.588398 peer.vote:  hurley -> desmond
[etcd] May 30 14:49:17.588 DEBUG     | Send Vote from hurley to http://10.8.8.110:7001
[etcd] May 30 14:49:17.589 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]14:49:17.589725 peer.vote.failed:  hurley <- others
[etcd] May 30 14:49:17.590 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

@jefferai
Copy link
Author

Something is really wrong here; here is HTTP-level output between the instances:

###
T 10.8.8.108:60866 -> 10.8.8.106:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.106:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.108:45244 -> 10.8.8.110:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.110:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.106:7001 -> 10.8.8.108:60866 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

##
T 10.8.8.110:7001 -> 10.8.8.108:45244 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

#####
T 10.8.8.106:52793 -> 10.8.8.108:7001 [AP]
GET /v2/admin/machines HTTP/1.1.
Host: 10.8.8.108:7001.
User-Agent: Go 1.1 package http.
Accept-Encoding: gzip.
.

##
T 10.8.8.108:7001 -> 10.8.8.106:52793 [AP]
HTTP/1.1 200 OK.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Type: text/plain; charset=utf-8.
Transfer-Encoding: chunked.
.
14a.
[{"name":"desmond","state":"follower","clientURL":"http://10.8.8.110:4001","peerURL":"http://10.8.8.110:7001"},{"name":"hurley","state":"follower","clientURL":"http://10.8.8.108:4001","peerURL":"http://10.8.8.108:7001"},{"name":"others","state":"follower","clientURL":"http://10.8.8.106:4001","peerURL":"http://10.8.8.106:7001"}]
.

#
T 10.8.8.108:7001 -> 10.8.8.106:52793 [AP]
0.
.

#########
T 10.8.8.106:52794 -> 10.8.8.108:7001 [AP]
GET /v2/admin/config HTTP/1.1.
Host: 10.8.8.108:7001.
User-Agent: Go 1.1 package http.
Accept-Encoding: gzip.
.

##
T 10.8.8.108:7001 -> 10.8.8.106:52794 [AP]
HTTP/1.1 200 OK.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Type: text/plain; charset=utf-8.
Transfer-Encoding: chunked.
.
32.
{"activeSize":9,"removeDelay":5,"syncInterval":5}
.

#
T 10.8.8.108:7001 -> 10.8.8.106:52794 [AP]
0.
.

######
T 10.8.8.108:60866 -> 10.8.8.106:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.106:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.108:45244 -> 10.8.8.110:7001 [AP]
POST /vote HTTP/1.1.
Host: 10.8.8.110:7001.
User-Agent: Go 1.1 package http.
Content-Length: 17.
Accept-Encoding: gzip.
.
.........".hurley
#
T 10.8.8.110:7001 -> 10.8.8.108:45244 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

##
T 10.8.8.106:7001 -> 10.8.8.108:60866 [AP]
HTTP/1.1 404 Not Found.
Content-Type: text/plain; charset=utf-8.
Date: Fri, 30 May 2014 15:14:08 GMT.
Content-Length: 19.
.
404 page not found

@jefferai jefferai changed the title Serious error when running mkdir on a new cluster vote endpoint returns 404 on new cluster May 30, 2014
@yichengq
Copy link
Contributor

@jefferai
This is a severe bug.
Could you do me a favor to grep the standby_info, conf and log under all three etcd directories?
The scenario is that two nodes come into standby node, while the third one has no ideas about that at all.

@jefferai
Copy link
Author

@unihorn
You can find the files from all three here: (removed)

Please let me know when you have them, and I will remove the link.

If you need the snapshots, let me know how to best get them to you privately (I could send a link via GitHub message for instance).

@tarnfeld
Copy link
Contributor

tarnfeld commented Jul 8, 2014

Hey guys, looks like this was fixed and release – however i've just run into this exact issue after some unexpected network latency. Took the whole cluster down (all 8 nodes, with an active set of 3) on version 0.4.3.

I managed to get the cluster back up by removing standby_info from every etcd directory on the cluster, and that brought things back.

@djannot
Copy link

djannot commented Jul 20, 2014

Same issue on 0.4.5. It happened 3 times during the last 2 weeks. Removing standby_info on each node and rebooting them also fixed the issue.

@litch
Copy link

litch commented Jul 21, 2014

I'm having what i think may be the same issue. I have a 6-node cluster on Rackspace that's been up for several hours etcd just stopped working completely. I'm on:

NAME=CoreOS
VERSION=353.0.0

Output of journalctl -u etcd on one node (web2)

Jul 21 21:48:10 web2 etcd[1081]: [etcd] Jul 21 21:48:10.785 WARNING   | [vote] Error: nil response
Jul 21 21:48:12 web2 etcd[1081]: [etcd] Jul 21 21:48:12.080 WARNING   | [vote] Error: nil response
Jul 21 21:48:13 web2 etcd[1081]: [etcd] Jul 21 21:48:13.350 WARNING   | [vote] Error: nil response
Jul 21 21:48:15 web2 etcd[1081]: [etcd] Jul 21 21:48:15.523 WARNING   | [vote] Error: nil response
Jul 21 21:48:17 web2 etcd[1081]: [etcd] Jul 21 21:48:17.131 WARNING   | [vote] Error: nil response
Jul 21 21:48:19 web2 etcd[1081]: [etcd] Jul 21 21:48:19.131 WARNING   | [vote] Error: nil response
Jul 21 21:48:21 web2 etcd[1081]: [etcd] Jul 21 21:48:21.129 WARNING   | [vote] Error: nil response
Jul 21 21:48:22 web2 etcd[1081]: [etcd] Jul 21 21:48:22.617 WARNING   | [vote] Error: nil response
Jul 21 21:48:23 web2 etcd[1081]: [etcd] Jul 21 21:48:23.236 WARNING   | [vote] Error: nil response
Jul 21 21:48:24 web2 etcd[1081]: [etcd] Jul 21 21:48:24.520 WARNING   | [vote] Error: nil response
Jul 21 21:48:26 web2 etcd[1081]: [etcd] Jul 21 21:48:26.098 WARNING   | [vote] Error: nil response

And another (web3):

Jul 21 21:42:39 web3 etcd[716]: [etcd] Jul 21 21:42:39.831 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #782 started.
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.387 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: state changed from 'follower' to 'candidate'.
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.389 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.389 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.390 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:41 web3 etcd[716]: [etcd] Jul 21 21:42:41.390 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jul 21 21:42:43 web3 etcd[716]: [etcd] Jul 21 21:42:43.367 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: state changed from 'candidate' to 'follower'.
Jul 21 21:42:43 web3 etcd[716]: [etcd] Jul 21 21:42:43.367 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #784 started.
Jul 21 21:42:45 web3 etcd[716]: [etcd] Jul 21 21:42:45.128 INFO      | af5a2d3e5791488b9b38bf1a8edfe53d: term #785 started.

@robszumski robszumski reopened this Jul 24, 2014
@yichengq yichengq added the bug label Aug 28, 2014
@nnnnathann
Copy link

+1 on this, happening to me also

@payneio
Copy link

payneio commented Sep 2, 2014

+1

@ndemoor
Copy link

ndemoor commented Sep 16, 2014

+1 (v0.4.6)

@omegahm
Copy link

omegahm commented Sep 24, 2014

+1

@zeisss
Copy link

zeisss commented Sep 24, 2014

+1, but removing standby_info didn't help.

@johnae
Copy link

johnae commented Sep 24, 2014

+1 same problem seemingly here. I can't even find a standby_info anywhere.

@memcmp
Copy link

memcmp commented Sep 26, 2014

+1

@nnnnathann
Copy link

Is everyone here on Rackspace? I suspect these issues stem from the cloud-init process with Rackspace, but not sure.

@zeisss
Copy link

zeisss commented Sep 26, 2014

No, we are on AWS

@tarnfeld
Copy link
Contributor

We're running across AWS and a private cloud, it seems to occur on both.

@jefferai
Copy link
Author

Running on KVM-based virtual machines on a local machine. No cloud-init.

@mzsanford
Copy link

Seeing this starting two etcd instances1 on EC2. Hope fully this reproducible test case will help:

  1. Fetch etcd v0.4.6 from the downloads page
  2. Start etcd on EC2 instance A (etcd -discovery https://discovery.etcd.io/$TOKEN -peer-addr=$(curl -qs http://169.254.169.254/latest/meta-data/local-ipv4):7001)
  3. Start etcd on EC2 instance B (same command)
  4. Stop etcd on EC2 instance A
  5. Start etcd on EC2 instance A (same command)
  6. Observe broken state.
    • Instance A reporting:
[etcd] Oct  1 21:24:08.770 DEBUG     | Send Vote from ip-10-0-X-A to http://10.0.X.B:7001
[etcd] Oct  1 21:24:08.771 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
[raft]21:24:08.771653 peer.vote.failed:  ip-10-0-X-A <- ip-10-0-X-B
[raft]21:24:08.998215 peer.vote:  ip-10-0-X-A -> ip-10-0-X-B
  • Instance B reporting:
[etcd] Oct  1 21:24:10.797 DEBUG     | [recv] POST http://10.0.X.B:7001/vote [ip-10-0-X-A]
[etcd] Oct  1 21:24:10.797 WARNING   | [vote] Error: nil response

Note that error does not appear if steps 3 & 4 are performed on EC2 instance B. This is reproducible in this case when the initial leader is restarted.

1 I am planning a cluster of 3+ machines but I was using 2 to test connectivity when I found this. Two nodes is not a production setup but I can't imagine this failure mode is expected even with only two nodes.

Updated: Added debug flags and added more log information.

@ryantanner
Copy link

Seeing the same problem on EC2. Having extreme difficulty standing up an etcd cluster today whereas the exact same configuration worked flawlessly a few days ago.

@philips
Copy link
Contributor

philips commented Oct 1, 2014

@mzsanford @ryantanner Can you ensure that the machines are not configured with the same name? "ETCD_NAME" and "-name". We are ensuring this mis-configuration can't happen in etcd 0.5 but it is in the users hand to ensure this doesn't happen in 0.4

@mzsanford
Copy link

My discovery endpoint was showing the EC2 generated host names (ip-10-0-X-X) in the key so that didn't seem the likely culprit. However, adding -name $(curl -qs http://169.254.169.254/latest/meta-data/instance-id) to my command line (thus using the EC2 instance id as a name) seems to correct the issue. At least after a few dozen restarts using a new discovery token I haven't seen it. @ryantanner can you try the same (or something similar)?

@philips
Copy link
Contributor

philips commented Oct 2, 2014

@mzsanford I don't know how you are running etcd but if you do not specify the name flag then it will default to the hostname. And depending on the OS, network config, etc the hostname may end up being non-unique. On CoreOS we use /etc/machine-id which is a UUID-ish thing generated on the first boot of a host.

@zeisss
Copy link

zeisss commented Oct 5, 2014

I just saw this again on one of our aws clusters. As far as I understand a reboot happened to the leader. Another node took over.
When the previous leader then tries to rejoin the cluster, it began to log errors and

Oct 05 08:41:41 ip-172-31-28-172.<snip> etcd[485]: [etcd] Oct  5 08:41:41.593 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

What I found curious is that he first tried to reconnect with itself:

Oct 05 08:41:39 ip-172-31-28-172.<snip> etcd[485]: [etcd] Oct  5 08:41:39.533 INFO      | 58618282333c43cd96bea999eac1b4a5 attempted to join via 172.31.28.172:7001 failed: fail checking join version: Client Internal Error (Get http://172.31.28.172:7001/version: dial tcp 172.31.28.172:7001: connection refused)

The newly elected leader now began panicing with panic: assertion failed: leader.elected.at.same.term.2. The third node in the cluster began logging [vote] Error: nil response.

You can find the logs of the three nodes here: https://gist.github.com/ZeissS/3fd8cb73dc6d59bf4ae0

@zeisss
Copy link

zeisss commented Oct 5, 2014

PS: I found a way to get the cluster back to running: stop all etcd nodes, start the old leader first, then the one node that didn't took over, wait a few seconds, then the temporary leader. This worked for me and etcdctl ls showed me the keys as expected.

@mzsanford
Copy link

@philips it is started exactly as stated in my original report. I am just starting a test cluster so this is fetched with wget, extracted with tar, change to that directory and run ./etcd with the parameters shown (where that curl command returns the internal IP address). Based on the debug log messages in step 6 of my original report it would seem that the name was correctly deduced as the EC2 instance hostnames follow the ip-X-X-X-X format and this what is shown. After adding -name I have not seen the issue again but since the logged names are legit I fear this will occur again once I rely on the cluster. I'll keep prodding it and keep an eye out.

@zeisss that matches exactly what I had seen. Are you running etcd with -name/ETCD_NAME or letting it be auto-populated?

@zeisss
Copy link

zeisss commented Oct 7, 2014

@mzsanford Whatever the default was on CoreOS 402.0.0 (yeah a rather old image, I know). We didn't temper with the etcd.service file

@chakri-n
Copy link

I narrowed this down to permission problem on my etcd cluster on fedora. The data directory used by etcd is not owned by user etcd.

@jefferai
Copy link
Author

FWIW permission problems (as @chakri-n ) was definitely not a cause of the issue for me.

@coozy
Copy link

coozy commented Oct 22, 2014

Hi, I ran into the same issue as well.
root@[ip-172-31-2-96 ~]# etcd -version
etcd version 0.4.6

  • Two machines in the cluster - both of them 'followers', the etcd names are unique *
[root@ip-172-31-2-96 ~]# curl -L -s 127.0.0.1:7001/v2/admin/machines
[{"name":"1237ec5afe8e4f1c8ed92500b1f74a70","state":"follower","clientURL":"http://172.31.4.149:4001","peerURL":"http://172.31.4.149:7001"},{"name":"b5bb9c3de09449b09bd4ebb9c0c14978","state":"follower","clientURL":"http://172.31.2.96:4001","peerURL":"http://172.31.2.96:7001"}]
  • First machine is flooded with the following mesasge *
Oct 22 07:17:42 ip-172-31-2-96 etcd[19237]: [etcd] Oct 22 07:17:42.372 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Oct 22 07:17:43 ip-172-31-2-96 etcd[19237]: [etcd] Oct 22 07:17:43.928 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
  • Second machine is flooded with the following mesasge *
Oct 22 07:17:02 ip-172-31-4-149 etcd[8833]: [etcd] Oct 22 07:17:02.445 WARNING   | [vote] Error: nil response
Oct 22 07:17:04 ip-172-31-4-149 etcd[8833]: [etcd] Oct 22 07:17:04.843 WARNING   | [vote] Error: nil response

@shuoli84
Copy link

same here
0.4.6 using vagrant on OS X.

Oct 26 16:38:39 core-01 etcd[563]: [etcd] Oct 26 16:38:39.740 WARNING   | [vote] Error: nil response
Oct 26 16:38:40 core-01 etcd[563]: [etcd] Oct 26 16:38:40.083 WARNING   | [vote] Error: nil response

@kelseyhightower
Copy link
Contributor

@shuoli84 @coozy @jefferai Can we make sure we are using the -name flag with etcd?

@jefferai
Copy link
Author

jefferai commented Nov 5, 2014

I was, yes.

@btrepp
Copy link

btrepp commented Dec 19, 2014

I'm using coreos stable (494.5.0).

etcd --version
etcd version 0.4.6

Getting similar entries

One machine has

Dec 19 06:07:02 core1 etcd[584]: [etcd] Dec 19 06:07:02.638 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field
Dec 19 06:07:03 core1 etcd[584]: [etcd] Dec 19 06:07:03.151 WARNING   | transporter.ae.decoding.error:proto: field/encoding mismatch: wrong type for field

The other two

Dec 19 06:11:23 core2 etcd[1848]: [etcd] Dec 19 06:11:23.714 WARNING   | [ae] Error: nil response
Dec 19 06:11:24 core2 etcd[1848]: [etcd] Dec 19 06:11:24.229 WARNING   | [ae] Error: nil response

This is one baremetal (well actually VMs inside microsofts hyperV, but following baremetal style installs).

Systemd is coming from cloud config I think, I have these style files. Systemctl shows me it's using this unit

 systemctl cat etcd
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s

# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=192.168.0.2:4001"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/718b5f19527426c6a61fae70d27b3221"
Environment="ETCD_NAME=87e89652b7d9483492764d231ab1e515"
Environment="ETCD_PEER_ADDR=192.168.0.2:7001"
Environment="ETCD_PEER_ELECTION_TIMEOUT=2000"
Environment="ETCD_PEER_HEARTBEAT_INTERVAL=500"

With a different ETCD_ADDR, ETCD_NAME, ETCD_PEER_ADDR for each machine.

Seems etcd 0.4 is somewhat broken?. 0.4.6 is the latest version in any coreos release. So I'm a tad stumped on what to do.

EDIT:

So kicking the leader in the teeth (restarting it via systemctl) seems to have worked. Seems to be some sort of conditional bug in how the leader processes messages from other nodes. So when it occurs it can't decode anything a peer is sending, and it still thinks its the leader and everything breaks.

@rca
Copy link
Contributor

rca commented Jan 5, 2015

I see the same errors as @coozy above:

Jan 05 03:27:57 ip-172-30-0-11.ec2.internal etcd[572]: [etcd] Jan  5 03:27:57.339 WARNING   | [vote] Error: nil response
Jan 05 03:28:32 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:32.262 INFO      | 55feb942c50942aebc9406fa236d62c6: state changed from 'candidate' to 'follower'.
Jan 05 03:28:32 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:32.262 INFO      | 55feb942c50942aebc9406fa236d62c6: term #2308 started.
Jan 05 03:28:33 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:33.965 INFO      | 55feb942c50942aebc9406fa236d62c6: state changed from 'follower' to 'candidate'.
Jan 05 03:28:33 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:33.967 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 05 03:28:35 ip-172-30-0-125.ec2.internal etcd[9798]: [etcd] Jan  5 03:28:35.380 WARNING   | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

@piaohai
Copy link

piaohai commented Jan 11, 2015

Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.730 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828030 950 logs.go:39] etcd DEBUG: get /registry/nodes/172.17.8.103/boundpods [http://127.0.0.1:4001]
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828187 950 logs.go:39] etcd DEBUG: Connecting to etcd: attempt1forkeys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.828213 950 logs.go:39] etcd DEBUG: send.request.to http://127.0.0.1:4001/v2/keys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false | method GET
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.849250 950 logs.go:39] etcd DEBUG: network error:Get /v2/keys/registry/nodes/172.17.8.103/boundpods?consistent=true&recursive=false&sorted=false: stopped after 10 redirects
Jan 11 09:10:06 core-03 kubelet[950]: E0111 09:10:06.851528 950 etcd_tools_watch.go:171] watch was unable to retrieve the current index for the provided key ("/registry/nodes/172.17.8.103/boundpods"): 501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]
Jan 11 09:10:06 core-03 kubelet[950]: I0111 09:10:06.851776 950 etcd.go:71] Watch closed (&api.Status{TypeMeta:api.TypeMeta{Kind:"", APIVersion:""}, ListMeta:api.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"501: All the given peers are not reachable (Tried to connect to each peer twice and failed) [0]", Reason:"", Details:(*api.StatusDetails)(nil), Code:0}). Reopening.
Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.961 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.962 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.303 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.305 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.664 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field
Jan 11 09:10:07 core-03 etcd[456]: [etcd] Jan 11 09:10:07.668 WARNING | transporter.vr.decoding.error:proto: field/encoding mismatch: wrong type for field

i have met same problem too atfer server crash,

@joshuarubin
Copy link

+1

@xiang90
Copy link
Contributor

xiang90 commented Jan 16, 2015

This is a go-raft problem. And go-raft is unmaintained now. We are using new raft now. So this problem should be solved in etcd 2.0. Thanks!

@xiang90 xiang90 closed this as completed Jan 16, 2015
@david-gurley
Copy link

+1

When will etcd 2.0 get integrated in the alpha channel?

@jpmx
Copy link

jpmx commented Jan 29, 2015

same problem, +1 for etcd 2.0 on alpha

@yichengq
Copy link
Contributor

@david-gurley It plans to happen in Feb.
@jpmx If you meet it in 2.0, please open a new issue with details about it.

@stianstr
Copy link

stianstr commented Feb 5, 2015

So it's a go-raft problem, what are the details of that problem? Does it shed any light on a possible workaround for us who want to use etcd, but do not want to run Alpha in production? Because I assume it will take some time before etcd 2.0 makes it way to Stable?

@jefferai
Copy link
Author

jefferai commented Feb 6, 2015

@stianstr For now (and maybe permanently) we've moved to Consul. etcd has great features but it will be a while before I feel good about its raft implementation; not just because of go-raft but because of how new and untested its new implementation is.

@brandonkruse
Copy link

Consul is really fantastic, however, fleet is tied so closely with etcd that I'd like to stay with etcd. +1 for alpha coreos with etcd 2.x

@mosquitorat
Copy link

I got the same problem with etcd 0.4.6, but after remove all data in /var/lib/etcd/*, I cloud start the etcd cluster.

@wutil
Copy link

wutil commented Mar 16, 2015

@mosquitorat 3Q

@wptad
Copy link

wptad commented Mar 23, 2015

Works for me !! @mosquitorat

Thanks a lot

@wangzi19870227
Copy link

Works for me too !! @mosquitorat

Thanks a lot

@jefferai
Copy link
Author

@mosquitorat Which removes all of the cluster data, correct?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Development

Successfully merging a pull request may close this issue.