Cannot upgrade cluster from 3.4 to 3.5. #11264

Closed
YoyinZyc opened this issue Oct 16, 2019 · 8 comments · Fixed by #11274

Comments

@YoyinZyc
Contributor

YoyinZyc commented Oct 16, 2019

I followed the upgrade instructions to upgrade the cluster from 3.4 to 3.5. The cluster becomes unhealthy while it has a mix of 3.4 and 3.5 nodes.
Here are my findings:

  1. When I stop one 3.4 node and restart it with 3.5, the 3.5 node cannot publish itself and keeps sending vote requests to the other two servers. The 3.5 node is unhealthy while the other two are healthy. Here is the log of the 3.5 node.

{"level":"warn","ts":"2019-10-16T11:44:33.748-0700","caller":"etcdserver/server.go:2045","msg":"failed to publish local member to cluster through raft","local-member-id":"91bc3c398fb3c146","local-member-attributes":"{Name:infra2 ClientURLs:[http://127.0.0.1:22379]}","request-path":"/0/members/91bc3c398fb3c146/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:923","msg":"91bc3c398fb3c146 is starting a new election at term 21"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:713","msg":"91bc3c398fb3c146 became candidate at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:824","msg":"91bc3c398fb3c146 received MsgVoteResp from 91bc3c398fb3c146 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:811","msg":"91bc3c398fb3c146 [logterm: 12, index: 9] sent MsgVote request to 8211f1d0f64f3269 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:811","msg":"91bc3c398fb3c146 [logterm: 12, index: 9] sent MsgVote request to fd422379fda50e48 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:35.435-0700","caller":"raft/raft.go:923","msg":"91bc3c398fb3c146 is starting a new election at term 22"}

The other two 3.4 servers ignore the vote requests (see the sketch of this lease check after the list below).

{"level":"info","ts":"2019-10-16T11:44:33.944-0700","caller":"raft/raft.go:844","msg":"8211f1d0f64f3269 [logterm: 12, index: 9, vote: 8211f1d0f64f3269] ignored MsgVote from 91bc3c398fb3c146 [logterm: 12, index: 9] at term 12: lease is not expired (remaining ticks: 10)"}

  2. When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy. Here is the log of the 3.4 node.

{"level":"warn","ts":"2019-10-16T12:09:44.872-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997634337s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}
{"level":"warn","ts":"2019-10-16T12:09:46.875-0700","caller":"etcdserver/v3_server.go:708","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}

  3. Everything works well if I upgrade the cluster from 3.3 to 3.4 and from 3.3 to 3.5.
    m1.log
    m2.log
    m3.log
    operation.txt
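
The "lease is not expired" message in item 1 comes from raft's lease-based vote rejection: with CheckQuorum enabled, a follower that has recently heard from a leader ignores vote requests until its election timeout elapses, so the restarted 3.5 member cannot win an election while the existing leader's lease is active. A minimal illustrative sketch of that check; the type, field, and function names below are simplifications, not etcd's actual raft code:

```go
package main

import "fmt"

// node holds only the fields relevant to the lease check; these names are
// hypothetical simplifications of the raft state.
type node struct {
	lead            uint64 // currently known leader ID, 0 if none
	electionElapsed int    // ticks since the last message from the leader
	electionTimeout int    // election timeout in ticks
	checkQuorum     bool   // lease-based vote rejection is tied to CheckQuorum
}

// shouldIgnoreVote returns true while the leader lease has not expired,
// i.e. the node heard from a leader within the election timeout.
func (n *node) shouldIgnoreVote() (bool, int) {
	if n.checkQuorum && n.lead != 0 && n.electionElapsed < n.electionTimeout {
		return true, n.electionTimeout - n.electionElapsed
	}
	return false, 0
}

func main() {
	// A follower that just heard from leader 8211f1d0f64f3269.
	n := &node{lead: 0x8211f1d0f64f3269, electionElapsed: 0, electionTimeout: 10, checkQuorum: true}
	if ignore, left := n.shouldIgnoreVote(); ignore {
		fmt.Printf("ignored MsgVote: lease is not expired (remaining ticks: %d)\n", left)
	}
}
```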
@jingyih
Contributor

jingyih commented Oct 16, 2019

cc @gyuho

@gyuho
Contributor

gyuho commented Oct 16, 2019

When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy.

So, the error logs are coming from the last 3.4 node, correct?

@YoyinZyc
Contributor Author

When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy.

So, the error logs are coming from the last 3.4 node, correct?

yep.

@gyuho
Contributor

gyuho commented Oct 16, 2019

Ok, must be a bug...

Can you help debug by providing more logs from that 3.4 node?

@gyuho
Contributor

gyuho commented Oct 16, 2019

Could be some bug around

"3.5.0": {AuthCapability: true, V3rpcCapability: true},

@YoyinZyc
Contributor Author

Ok, must be a bug...

Can you help debug by providing more logs from that 3.4 node?

attached all the log files.

@gyuho
Contributor

gyuho commented Oct 17, 2019

@YoyinZyc If we look at the log in m1.log

{"level":"warn","ts":"2019-10-16T17:11:06.112-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997768406s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}

The request was sent while a leader election was happening.

{"level":"info","ts":"2019-10-16T17:10:49.183-0700","caller":"raft/node.go:325","msg":"raft.node: 8211f1d0f64f3269 elected leader 91bc3c398fb3c146 at term 31"}

Can we check whether m1 is still unhealthy afterwards?

@YoyinZyc
Contributor Author

@YoyinZyc If we look at the log in m1.log

{"level":"warn","ts":"2019-10-16T17:11:06.112-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997768406s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}

The request was sent while a leader election was happening.

{"level":"info","ts":"2019-10-16T17:10:49.183-0700","caller":"raft/node.go:325","msg":"raft.node: 8211f1d0f64f3269 elected leader 91bc3c398fb3c146 at term 31"}

Can we check whether m1 is still unhealthy afterwards?

I tried the health check again. It responds with the same error and is still unhealthy.

{"level":"warn","ts":"2019-10-17T10:26:08.449-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.998418765s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}
{"level":"warn","ts":"2019-10-17T10:26:10.451-0700","caller":"etcdserver/v3_server.go:708","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}
