Cannot upgrade cluster from 3.4 to 3.5. #11264

Closed
YoyinZyc opened this issue Oct 16, 2019 · 8 comments · Fixed by #11274

Comments

@YoyinZyc
Contributor

YoyinZyc commented Oct 16, 2019

I followed the upgrade instructions to upgrade the cluster from 3.4 to 3.5. The cluster becomes unhealthy while it has a mix of 3.4 and 3.5 nodes.
Here are my findings:

  1. When I stop one 3.4 node and restart it with 3.5, the 3.5 node cannot publish itself and keeps sending vote requests to the other two servers. The 3.5 node is unhealthy while the other two are healthy. Here is the log of the 3.5 node.

{"level":"warn","ts":"2019-10-16T11:44:33.748-0700","caller":"etcdserver/server.go:2045","msg":"failed to publish local member to cluster through raft","local-member-id":"91bc3c398fb3c146","local-member-attributes":"{Name:infra2 ClientURLs:[http://127.0.0.1:22379]}","request-path":"/0/members/91bc3c398fb3c146/attributes","publish-timeout":"7s","error":"etcdserver: request timed out"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:923","msg":"91bc3c398fb3c146 is starting a new election at term 21"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:713","msg":"91bc3c398fb3c146 became candidate at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:824","msg":"91bc3c398fb3c146 received MsgVoteResp from 91bc3c398fb3c146 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:811","msg":"91bc3c398fb3c146 [logterm: 12, index: 9] sent MsgVote request to 8211f1d0f64f3269 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:33.935-0700","caller":"raft/raft.go:811","msg":"91bc3c398fb3c146 [logterm: 12, index: 9] sent MsgVote request to fd422379fda50e48 at term 22"}
{"level":"info","ts":"2019-10-16T11:44:35.435-0700","caller":"raft/raft.go:923","msg":"91bc3c398fb3c146 is starting a new election at term 22"}

The other two 3.4 servers ignore the vote requests (see the sketch of this lease check after the list below).

{"level":"info","ts":"2019-10-16T11:44:33.944-0700","caller":"raft/raft.go:844","msg":"8211f1d0f64f3269 [logterm: 12, index: 9, vote: 8211f1d0f64f3269] ignored MsgVote from 91bc3c398fb3c146 [logterm: 12, index: 9] at term 12: lease is not expired (remaining ticks: 10)"}

  2. When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy. Here is the log of the 3.4 node.

{"level":"warn","ts":"2019-10-16T12:09:44.872-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997634337s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}
{"level":"warn","ts":"2019-10-16T12:09:46.875-0700","caller":"etcdserver/v3_server.go:708","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}

  3. Everything works well if I upgrade the cluster from 3.3 to 3.4 and from 3.3 to 3.5.
    m1.log
    m2.log
    m3.log
    operation.txt
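
The "lease is not expired" message in item 1 comes from raft's lease-based vote rejection: with CheckQuorum enabled, a follower that has recently heard from a leader ignores vote requests until its election timeout elapses, so the restarted 3.5 member cannot win an election while the existing leader's lease is active. A minimal illustrative sketch of that check; the type, field, and function names below are simplifications, not etcd's actual raft code:

```go
package main

import "fmt"

// node holds only the fields relevant to the lease check; these names are
// hypothetical simplifications of the raft state.
type node struct {
	lead            uint64 // currently known leader ID, 0 if none
	electionElapsed int    // ticks since the last message from the leader
	electionTimeout int    // election timeout in ticks
	checkQuorum     bool   // lease-based vote rejection is tied to CheckQuorum
}

// shouldIgnoreVote returns true while the leader lease has not expired,
// i.e. the node heard from a leader within the election timeout.
func (n *node) shouldIgnoreVote() (bool, int) {
	if n.checkQuorum && n.lead != 0 && n.electionElapsed < n.electionTimeout {
		return true, n.electionTimeout - n.electionElapsed
	}
	return false, 0
}

func main() {
	// A follower that just heard from leader 8211f1d0f64f3269.
	n := &node{lead: 0x8211f1d0f64f3269, electionElapsed: 0, electionTimeout: 10, checkQuorum: true}
	if ignore, left := n.shouldIgnoreVote(); ignore {
		fmt.Printf("ignored MsgVote: lease is not expired (remaining ticks: %d)\n", left)
	}
}
```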
@jingyih
Contributor

jingyih commented Oct 16, 2019

cc @gyuho

@gyuho
Contributor

gyuho commented Oct 16, 2019

When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy.

So, the error logs are coming from the last 3.4 node, correct?

@YoyinZyc
Contributor Author

When I stop another 3.4 node and restart it with 3.5, the two 3.5 nodes become healthy but the remaining 3.4 node is unhealthy.

So, the error logs are coming from the last 3.4 node, correct?

yep.

@gyuho
Contributor

gyuho commented Oct 16, 2019

Ok, must be a bug...

Can you help debug by providing more logs from that 3.4 node?

@gyuho
Contributor

gyuho commented Oct 16, 2019

Could be some bug around

"3.5.0": {AuthCapability: true, V3rpcCapability: true},

@YoyinZyc
Contributor Author

Ok, must be a bug...

Can you help debug by providing more logs from that 3.4 node?

attached all the log files.

@gyuho
Contributor

gyuho commented Oct 17, 2019

@YoyinZyc If we look at the log in m1.log

{"level":"warn","ts":"2019-10-16T17:11:06.112-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997768406s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}

The request was sent while a leader election was happening.

{"level":"info","ts":"2019-10-16T17:10:49.183-0700","caller":"raft/node.go:325","msg":"raft.node: 8211f1d0f64f3269 elected leader 91bc3c398fb3c146 at term 31"}

Can we check whether m1 is still unhealthy afterwards?

@YoyinZyc
Contributor Author

@YoyinZyc If we look at the log in m1.log

{"level":"warn","ts":"2019-10-16T17:11:06.112-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.997768406s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}

The request was sent while a leader election was happening.

{"level":"info","ts":"2019-10-16T17:10:49.183-0700","caller":"raft/node.go:325","msg":"raft.node: 8211f1d0f64f3269 elected leader 91bc3c398fb3c146 at term 31"}

Can we check whether m1 is still unhealthy afterwards?

I tried the health check again. It responds with the same error and is still unhealthy.

{"level":"warn","ts":"2019-10-17T10:26:08.449-0700","caller":"etcdserver/util.go:144","msg":"apply request took too long","took":"4.998418765s","expected-duration":"100ms","prefix":"read-only range ","request":"key:"health" ","response":"","error":"context canceled"}
{"level":"warn","ts":"2019-10-17T10:26:10.451-0700","caller":"etcdserver/v3_server.go:708","msg":"timed out waiting for read index response (local node might have slow network)","timeout":"7s"}
