[ci] :x-pack:rolling-upgrade:with-system-key times out when starting oneThirdUpgradedTestCluster node0 #32566

Closed
andyb-elastic opened this issue Aug 1, 2018 · 5 comments
Labels
:ml Machine learning >test-failure Triaged test failures from CI

Comments

@andyb-elastic
Contributor

andyb-elastic commented Aug 1, 2018

This happened in a CI intake job and on a PR job, and I was able to reproduce it locally.

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/2448/console

CI log
build-2448.txt

Test cluster logs
v6.5.0-SNAPSHOT#oldClusterTestCluster-node0.log
v6.5.0-SNAPSHOT#oldClusterTestCluster-node1.log
v6.5.0-SNAPSHOT#oldClusterTestCluster-node2.log
v6.5.0-SNAPSHOT#oneThirdUpgradedTestCluster-node0.log

I'm not sure whether this deserialization error is the real cause, but it appeared in the cluster logs of all three instances I looked at. There may have been some recent changes in this area (for example #32319).

[2018-08-01T20:22:35,314][WARN ][o.e.d.z.ZenDiscovery     ] [node-1] failed to validate incoming join request from node [{upgraded-node-0}{_DluXPheQ3q0NQzXEPKpzQ}{pb58T916Q0azZH_9t9KsVw}{127.0.0.1}{127.0.0.1:44168}{testattr=test, upgraded=true, ml.machine_memory=31606448128, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}]
org.elasticsearch.transport.RemoteTransportException: [upgraded-node-0][127.0.0.1:44168][internal:discovery/zen/join/validate]
Caused by: java.lang.IllegalStateException: unexpected byte [0x04]
        at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:439) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readBoolean(StreamInput.java:429) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.StreamInput.readOptionalLong(StreamInput.java:322) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.xpack.core.ml.job.config.Job.<init>(Job.java:242) ~[?:?]
        at org.elasticsearch.xpack.core.ml.MlMetadata.<init>(MlMetadata.java:140) ~[?:?]
        at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:46) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.NamedWriteableAwareStreamInput.readNamedWriteable(NamedWriteableAwareStreamInput.java:39) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.MetaData.readFrom(MetaData.java:834) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.cluster.ClusterState.readFrom(ClusterState.java:727) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.discovery.zen.MembershipAction$ValidateJoinRequest.readFrom(MembershipAction.java:173) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.common.io.stream.Streamable.lambda$newWriteableReader$0(Streamable.java:51) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.RequestHandlerRegistry.newRequest(RequestHandlerRegistry.java:56) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.handleRequest(TcpTransport.java:1633) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1501) ~[elasticsearch-6.5.0-SNAPSHOT.jar:6.5.0-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) ~[?:?]

Bonus logs from my local reproduction
oldClusterTestCluster node0 run.log
oldClusterTestCluster node1 run.log
oldClusterTestCluster node2 run.log
oneThirdUpgradedTestCluster node0 run.log

@andyb-elastic andyb-elastic added >test-failure Triaged test failures from CI :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. :ml Machine learning labels Aug 1, 2018
@elasticmachine
Collaborator

Pinging @elastic/ml-core

@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@ywelsch ywelsch removed the :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. label Aug 2, 2018
@ywelsch
Contributor

ywelsch commented Aug 2, 2018

@dimitris-athanasiou can you take a look?

@dimitris-athanasiou
Contributor

@ywelsch That's weird. I haven't even merged the PR I was talking about! Looking.

@dimitris-athanasiou
Contributor

Ok. This is my bad. Somehow I cherry-picked and merged the commit from #32496 into 6.x before ever merging the PR. I must have done so accidentally when I was backporting some other bug fixes. I don't recall at all how it happened :-(. I've now merged #32496 on master, which should fix the build. Apologies for the noise.
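
For anyone hitting a similar trace: an `unexpected byte [0x04]` thrown from a boolean read inside `readOptionalLong` usually means the writing and reading nodes disagree about the field layout, which is presumably what happened here while the commit existed on only one of the two branches. Below is a minimal standalone sketch of that failure mode. It is not Elasticsearch code: `WireMismatchSketch`, `writeWithExtraField`, `readWithoutExtraField`, and `readWithExtraField` are hypothetical names, and the extra one-byte field is an assumed stand-in for whatever differed between the two builds.

```java
import java.nio.ByteBuffer;
import java.util.Locale;

public class WireMismatchSketch {

    // Strict boolean read: only 0 and 1 are accepted, anything else fails loudly.
    static boolean readBoolean(ByteBuffer in) {
        byte value = in.get();
        if (value == 0) {
            return false;
        }
        if (value == 1) {
            return true;
        }
        throw new IllegalStateException(
            String.format(Locale.ROOT, "unexpected byte [0x%02x]", value));
    }

    // An optional long is a presence flag followed by the value if present.
    static Long readOptionalLong(ByteBuffer in) {
        if (readBoolean(in)) {
            return in.getLong();
        }
        return null;
    }

    static void writeOptionalLong(ByteBuffer out, Long value) {
        out.put((byte) (value != null ? 1 : 0));
        if (value != null) {
            out.putLong(value);
        }
    }

    // Hypothetical "newer" writer: emits an extra one-byte field first.
    static ByteBuffer writeWithExtraField(byte extra, Long value) {
        ByteBuffer out = ByteBuffer.allocate(16);
        out.put(extra);
        writeOptionalLong(out, value);
        out.flip();
        return out;
    }

    // Hypothetical "older" reader: does not know about the extra field,
    // so it interprets that byte as the presence flag.
    static Long readWithoutExtraField(ByteBuffer in) {
        return readOptionalLong(in);
    }

    // Reader that agrees with the writer: skips the extra field first.
    static Long readWithExtraField(ByteBuffer in) {
        in.get();
        return readOptionalLong(in);
    }

    public static void main(String[] args) {
        try {
            readWithoutExtraField(writeWithExtraField((byte) 4, 42L));
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // prints: unexpected byte [0x04]
        }
        System.out.println(readWithExtraField(writeWithExtraField((byte) 4, 42L))); // prints: 42
    }
}
```

The strict boolean check is what turns a silent misread into an immediate failure, which is why the mismatch surfaced as a rejected join request rather than as corrupted cluster state.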
