vote endpoint returns 404 on new cluster #815
Here is output with debug logging turned up high. From the machines not producing the output above:
And from the machine that is producing it:
Something is really wrong here; here is HTTP-level output between the instances:
@jefferai
@unihorn Please let me know when you have them, and I will remove the link. If you need the snapshots, let me know how to best get them to you privately (I could send a link via GitHub message, for instance).
Hey guys, it looks like this was fixed and released; however, I've just run into this exact issue after some unexpected network latency. It took the whole cluster down (all 8 nodes, with an active set of 3) on version 0.4.3. I managed to get the cluster back up by removing standby_info on each node.
Same issue on 0.4.5. It happened three times during the last two weeks. Removing standby_info on each node and rebooting them also fixed the issue.
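A minimal sketch of that recovery procedure, assuming etcd is managed by systemd and keeps its state under /var/lib/etcd (the unit name and path are assumptions; adjust them to your setup):

```sh
# Run on every node in the cluster.
sudo systemctl stop etcd
sudo rm -f /var/lib/etcd/standby_info   # drop the stale standby metadata
sudo systemctl start etcd
```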
I'm having what I think may be the same issue. I have a 6-node cluster on Rackspace that had been up for several hours when etcd just stopped working completely. I'm on NAME=CoreOS. Output from one of the machines:
And another (web3):
+1 on this, happening to me also
+1
+1 (v0.4.6)
+1
+1, but removing standby_info didn't help.
+1, seemingly the same problem here. I can't even find a standby_info anywhere.
+1
Is everyone here on Rackspace? I suspect these issues stem from the cloud-init process with Rackspace, but I'm not sure.
No, we are on AWS.
We're running across AWS and a private cloud, and it seems to occur on both.
Running on KVM-based virtual machines on a local machine. No cloud-init.
Seeing this when starting two etcd instances[1] on EC2. Hopefully this reproducible test case will help:
Note that the error does not appear if steps 3 & 4 are performed on EC2 instance B. In this case it is reproducible when the initial leader is restarted.
[1] I am planning a cluster of 3+ machines, but I was using 2 to test connectivity when I found this. Two nodes is not a production setup, but I can't imagine this failure mode is expected even with only two nodes.
Updated: added debug flags and more log information.
Seeing the same problem on EC2. Having extreme difficulty standing up an etcd cluster today, whereas the exact same configuration worked flawlessly a few days ago.
@mzsanford @ryantanner Can you ensure that the machines are not configured with the same name ("ETCD_NAME" / "-name")? We are making sure this misconfiguration can't happen in etcd 0.5, but in 0.4 it is in the user's hands to ensure it doesn't happen.
My discovery endpoint was showing the EC2-generated host names.
@mzsanford I don't know how you are running etcd, but if you do not specify the name flag then it will default to the hostname, and depending on the OS, network config, etc., the hostname may end up being non-unique. On CoreOS we use /etc/machine-id, which is a UUID-ish value generated on the first boot of a host.
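For anyone hitting the duplicate-name case, here is a minimal sketch of pinning a unique, stable name per node from /etc/machine-id (the data directory and launch style are assumptions; adjust to however you start etcd):

```sh
# /etc/machine-id is unique per host on systemd-based systems,
# so it makes a safer node name than the hostname.
ETCD_NAME="$(cat /etc/machine-id)"

# etcd 0.4 accepts the name either as a flag...
etcd -name "$ETCD_NAME" -data-dir /var/lib/etcd

# ...or via the environment:
# export ETCD_NAME
# etcd -data-dir /var/lib/etcd
```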
I just saw this again on one of our AWS clusters. As far as I understand, a reboot happened on the leader and another node took over.
What I found curious is that it first tried to reconnect to itself:
The newly elected leader then began panicking. You can find the logs of the three nodes here: https://gist.github.com/ZeissS/3fd8cb73dc6d59bf4ae0
PS: I found a way to get the cluster back to running: stop all etcd nodes, start the old leader first, then the one node that didn't take over, wait a few seconds, then the temporary leader. This worked for me.
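A rough sketch of that ordered restart, assuming three systemd-managed nodes (the host names are placeholders for the old leader, the node that did not take over, and the temporary leader):

```sh
# Stop etcd everywhere first.
for host in old-leader quiet-node temp-leader; do
  ssh "$host" 'sudo systemctl stop etcd'
done

# Bring the nodes back in the order described above.
ssh old-leader  'sudo systemctl start etcd'
ssh quiet-node  'sudo systemctl start etcd'
sleep 10
ssh temp-leader 'sudo systemctl start etcd'
```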
@philips it is started exactly as stated in my original report; I am just starting a test cluster. @zeisss that matches exactly what I had seen. Are you running etcd with a non-default configuration?
@mzsanford Whatever the default was on CoreOS 402.0.0 (yeah, a rather old image, I know). We didn't tamper with the defaults.
I narrowed this down to a permissions problem on my etcd cluster on Fedora. The data directory used by etcd was not owned by the etcd user.
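If you want to rule out the same cause, a quick check along these lines may help (the data directory path and unit name are assumptions; use whatever -data-dir points at):

```sh
# Check who owns the etcd data directory.
ls -ld /var/lib/etcd

# If it is not owned by the etcd user, fix ownership and restart.
sudo chown -R etcd:etcd /var/lib/etcd
sudo systemctl restart etcd
```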
FWIW, permission problems (as @chakri-n describes) were definitely not the cause of the issue for me.
Hi, I ran into the same issue as well.
same here
I was, yes.
I'm using CoreOS stable (494.5.0).
Getting similar entries. One machine has:
The other two:
This is on bare metal (well, actually VMs inside Microsoft's Hyper-V, but following bare-metal-style installs). The systemd setup is coming from cloud-config, I think; I have these style files. systemctl shows me it's using this unit:
With a different ETCD_ADDR, ETCD_NAME, and ETCD_PEER_ADDR for each machine. It seems etcd 0.4 is somewhat broken? 0.4.6 is the latest version in any CoreOS release, so I'm a tad stumped on what to do.
EDIT: Kicking the leader in the teeth (restarting it via systemctl) seems to have worked. It looks like some sort of conditional bug in how the leader processes messages from other nodes: when it occurs, the leader can't decode anything a peer is sending, yet it still thinks it's the leader, and everything breaks.
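For context, the per-machine settings mentioned above look roughly like this in shell form (the values are placeholders; each node needs its own name, client address, and peer address):

```sh
# Per-machine etcd 0.4 settings (placeholders; must be unique per node).
export ETCD_NAME="node1"
export ETCD_ADDR="10.0.0.11:4001"        # client-facing address
export ETCD_PEER_ADDR="10.0.0.11:7001"   # peer/raft address
```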
I see the same errors as @coozy above:
Jan 11 09:10:06 core-03 etcd[456]: [etcd] Jan 11 09:10:06.730 WARNING | transporter.vr.decoding.error: proto: field/encoding mismatch: wrong type for field
I have met the same problem too, after a server crash.
+1 |
This is a go-raft problem, and go-raft is unmaintained now. We are using the new raft implementation now, so this problem should be solved in etcd 2.0. Thanks!
+1 When will etcd 2.0 get integrated into the alpha channel?
same problem, +1 for etcd 2.0 on alpha |
@david-gurley It is planned to happen in Feb.
So it's a go-raft problem; what are the details of that problem? Does it shed any light on a possible workaround for those of us who want to use etcd but do not want to run Alpha in production? Because I assume it will take some time before etcd 2.0 makes its way to Stable?
@stianstr For now (and maybe permanently) we've moved to Consul. etcd has great features but it will be a while before I feel good about its raft implementation; not just because of go-raft but because of how new and untested its new implementation is. |
Consul is really fantastic; however, fleet is tied so closely to etcd that I'd like to stay with etcd. +1 for alpha CoreOS with etcd 2.x.
I got the same problem with etcd 0.4.6, but after removing all data in /var/lib/etcd/*, I could start the etcd cluster.
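For reference, a sketch of that last-resort reset, assuming a systemd-managed etcd with its data under /var/lib/etcd. Note that this wipes the node's log, snapshots, and cluster membership state, so only use it if you can afford to rebuild the cluster's data:

```sh
sudo systemctl stop etcd
sudo rm -rf /var/lib/etcd/*   # destroys this node's raft log, snapshots, and config
sudo systemctl start etcd     # the node comes back with empty state
```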
@mosquitorat Thank you!
Works for me!! @mosquitorat, thanks a lot.
Works for me too!! @mosquitorat, thanks a lot.
@mosquitorat Which removes all of the cluster data, correct? |
I created a fresh cluster of three machines yesterday, using the 0.4.1 tag. I added one key to the cluster (/global/serfkey) and then left it alone overnight.
It's a very simple cluster:
Then I tried to simply add a directory:
One of my three machines (hurley at 10.8.8.108) is now spitting out:
Restarting etcd doesn't help. The other two machines can't connect to it, fail to form a quorum between them, and the whole cluster is unusable (and, I'm hoping, not doomed).
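If you are trying to confirm the same symptom, a quick probe of each node may help (the host names are placeholders, and the ports assume the etcd 0.4 defaults of 4001 for clients and 7001 for peers):

```sh
for host in node-a node-b node-c; do
  echo "== $host =="
  curl -s "http://$host:4001/version"; echo          # version each node reports
  curl -s "http://$host:4001/v2/stats/self"; echo    # node state (leader/follower) and raft stats
done
```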