
TLS connection error on the transport protocol during rolling upgrades #2823

Closed
sebgl opened this issue Apr 6, 2020 · 9 comments · Fixed by #2831
Labels
>bug Something isn't working v1.1.0

@sebgl
Contributor

sebgl commented Apr 6, 2020

I observed several E2E tests failing during a rolling upgrade (example: https://devops-ci.elastic.co/job/cloud-on-k8s-e2e-tests-stack-versions/38/).

I think it randomly impacts all rolling upgrades where node IPs are replaced.

Symptoms:

  • one of the nodes never joins the cluster; it cannot connect to the other nodes:

ES 7.1.1 logs:

{"type": "server", "timestamp": "2020-04-06T02:26:55,898+0000", "level": "INFO", "component": "o.e.c.c.JoinHelper", "cluster.name": "test-reverted-mutation-x9g8", "node.name": "test-reverted-mutation-x9g8-es-masterdata-2",  "message": "failed to join {test-reverted-mutation-x9g8-es-masterdata-0}{kZoqIftrTdyNcZkZ3HziSA}{l3XsMzFhSwWyCAiRwZpDhQ}{10.1.178.50}{10.1.178.50:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true} with JoinRequest{sourceNode={test-reverted-mutation-x9g8-es-masterdata-2}{5ac1-E9KQ4OsnLsszGdbrg}{XCCcVXlpRwKtdRjcMj67_Q}{10.1.177.44}{10.1.177.44:9300}{ml.machine_memory=2147483648, xpack.installed=true, box_type=mixed, ml.max_open_jobs=20}, optionalJoin=Optional[Join{term=2, lastAcceptedTerm=1, lastAcceptedVersion=29, sourceNode={test-reverted-mutation-x9g8-es-masterdata-2}{5ac1-E9KQ4OsnLsszGdbrg}{XCCcVXlpRwKtdRjcMj67_Q}{10.1.177.44}{10.1.177.44:9300}{ml.machine_memory=2147483648, xpack.installed=true, box_type=mixed, ml.max_open_jobs=20}, targetNode={test-reverted-mutation-x9g8-es-masterdata-0}{kZoqIftrTdyNcZkZ3HziSA}{l3XsMzFhSwWyCAiRwZpDhQ}{10.1.178.50}{10.1.178.50:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true}}]}" , 
"stacktrace": ["org.elasticsearch.transport.RemoteTransportException: [test-reverted-mutation-x9g8-es-masterdata-0][10.1.178.50:9300][internal:cluster/coordination/join]",
"Caused by: org.elasticsearch.transport.ConnectTransportException: [test-reverted-mutation-x9g8-es-masterdata-2][10.1.177.44:9300] general node connection failure",
"at org.elasticsearch.transport.TcpTransport$ChannelsConnectedListener$1.onFailure(TcpTransport.java:1284) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportHandshaker$HandshakeResponseHandler.handleLocalException(TransportHandshaker.java:155) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.TransportHandshaker.lambda$sendHandshake$0(TransportHandshaker.java:67) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.ActionListener.lambda$wrap$0(ActionListener.java:83) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:61) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.action.ActionListener.lambda$toBiConsumer$2(ActionListener.java:97) ~[elasticsearch-7.1.1.jar:7.1.1]",
"at org.elasticsearch.common.concurrent.CompletableContext.lambda$addListener$0(CompletableContext.java:39) ~[elasticsearch-core-7.1.1.jar:7.1.1]",
"at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:859) ~[?:?]",
"at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:837) ~[?:?]",
"at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:506) ~[?:?]",
"at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2144) ~[?:?]",
"at org.elasticsearch.common.concurrent.CompletableContext.complete(CompletableContext.java:61) ~[elasticsearch-core-7.1.1.jar:7.1.1]",
"at org.elasticsearch.transport.netty4.Netty4TcpChannel.lambda$new$0(Netty4TcpChannel.java:51) ~[transport-netty4-client-7.1.1.jar:7.1.1]",
"at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:511) ~[netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners0(DefaultPromise.java:504) ~[netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:483) ~[netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:424) ~[netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.util.concurrent.DefaultPromise.trySuccess(DefaultPromise.java:103) ~[netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.DefaultChannelPromise.trySuccess(DefaultChannelPromise.java:84) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannel$CloseFuture.setClosed(AbstractChannel.java:1152) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannel$AbstractUnsafe.doClose0(AbstractChannel.java:768) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:744) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannel$AbstractUnsafe.close(AbstractChannel.java:615) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.close(DefaultChannelPipeline.java:1376) ~[netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeClose(AbstractChannelHandlerContext.java:624) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:608) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.close(AbstractChannelHandlerContext.java:465) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.ssl.SslUtils.handleHandshakeFailure(SslUtils.java:350) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.ssl.SslHandler.setHandshakeFailure(SslHandler.java:1581) ~[netty-handler-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.ssl.SslHandler.handleUnwrapThrowable(SslHandler.java:1239) [netty-handler-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1209) [netty-handler-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1247) [netty-handler-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]",
"at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]",
"at java.lang.Thread.run(Thread.java:835) [?:?]",
"Caused by: org.elasticsearch.transport.TransportException: handshake failed because connection reset",

ES 6.8.5 logs:

[2020-04-06T01:27:41,409][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [test-es-keystore-mkv6-es-masterdata-2] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/10.103.34.20:9300, remoteAddress=/10.103.32.34:55346}
[2020-04-06T01:27:41,485][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [test-es-keystore-mkv6-es-masterdata-2] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/10.103.34.20:9300, remoteAddress=/10.103.32.34:55348}
[2020-04-06T01:27:41,489][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [test-es-keystore-mkv6-es-masterdata-2] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/10.103.34.20:9300, remoteAddress=/10.103.32.34:55344}
[2020-04-06T01:27:41,612][WARN ][o.e.x.c.s.t.n.SecurityNetty4Transport] [test-es-keystore-mkv6-es-masterdata-2] client did not trust this server's certificate, closing connection Netty4TcpChannel{localAddress=/10.103.34.20:9300, remoteAddress=/10.103.32.34:55350}
[2020-04-06T01:27:41,622][INFO ][o.e.d.z.ZenDiscovery     ] [test-es-keystore-mkv6-es-masterdata-2] failed to send join request to master [{test-es-keystore-mkv6-es-masterdata-1}{T1n_KWimSuOlGd2pTbfbWg}{ULIVyNJ6QKyJCP6zrRX66w}{10.103.32.34}{10.103.32.34:9300}{ml.machine_memory=2147483648, ml.max_open_jobs=20, xpack.installed=true, ml.enabled=true}], reason [RemoteTransportException[[test-es-keystore-mkv6-es-masterdata-1][10.103.32.34:9300][internal:discovery/zen/join]]; nested: ConnectTransportException[[test-es-keystore-mkv6-es-masterdata-2][10.103.34.20:9300] general node connection failure]; nested: TransportException[handshake failed because connection reset]; ]
  • Other nodes report they cannot trust the new node's IP address:
[2020-04-06T01:36:32,392][WARN ][o.e.t.TcpTransport       ] [test-es-keystore-mkv6-es-masterdata-1] exception caught on transport layer [Netty4TcpChannel{localAddress=/10.103.32.34:34222, remoteAddress=10.103.34.20/10.103.34.20:9300}], closing connection
io.netty.handler.codec.DecoderException: javax.net.ssl.SSLHandshakeException: No subject alternative names matching IP address 10.103.34.20 found
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:472) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) ~[netty-codec-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1434) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:965) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:656) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:556) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:510) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:470) [netty-transport-4.1.32.Final.jar:4.1.32.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:909) [netty-common-4.1.32.Final.jar:4.1.32.Final]
	at java.lang.Thread.run(Thread.java:830) [?:?]
@sebgl sebgl added the >bug Something isn't working label Apr 6, 2020
@sebgl sebgl self-assigned this Apr 6, 2020
@sebgl sebgl added the v1.1.0 label Apr 6, 2020
@sebgl
Contributor Author

sebgl commented Apr 6, 2020

May be related to #2659 where we introduced full TLS verification.

@sebgl
Contributor Author

sebgl commented Apr 6, 2020

I managed to reproduce this locally by triggering some rolling upgrades on a 3-node cluster.

Pod elasticsearch-sample-es-default-1 fails to join the cluster.
Checking the reported TLS cert on the transport protocol:

⟩ openssl s_client -connect localhost:9300
CONNECTED(00000005)
depth=1 OU = elasticsearch-sample, CN = elasticsearch-sample-transport
verify error:num=19:self signed certificate in certificate chain
verify return:0
4672036460:error:1401E412:SSL routines:CONNECT_CR_FINISHED:sslv3 alert bad certificate:/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/libressl/libressl-47.100.4/libressl-2.8/ssl/ssl_pkt.c:1200:SSL alert number 42
4672036460:error:1401E0E5:SSL routines:CONNECT_CR_FINISHED:ssl handshake failure:/AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/libressl/libressl-47.100.4/libressl-2.8/ssl/ssl_pkt.c:585:
---
Certificate chain
 0 s:/OU=elasticsearch-sample/CN=elasticsearch-sample-es-default-1.node.elasticsearch-sample.default.es.local
   i:/OU=elasticsearch-sample/CN=elasticsearch-sample-transport
 1 s:/OU=elasticsearch-sample/CN=elasticsearch-sample-transport
   i:/OU=elasticsearch-sample/CN=elasticsearch-sample-transport
---
Server certificate
-----BEGIN CERTIFICATE-----
MIIEazCCA1OgAwIBAgIQBJaL7HaZXTa3CRDd8J5JNjANBgkqhkiG9w0BAQsFADBI
MR0wGwYDVQQLExRlbGFzdGljc2VhcmNoLXNhbXBsZTEnMCUGA1UEAxMeZWxhc3Rp
Y3NlYXJjaC1zYW1wbGUtdHJhbnNwb3J0MB4XDTIwMDQwNjEwMDg0NloXDTIxMDQw
NjEwMTg0NlowdjEdMBsGA1UECxMUZWxhc3RpY3NlYXJjaC1zYW1wbGUxVTBTBgNV
BAMTTGVsYXN0aWNzZWFyY2gtc2FtcGxlLWVzLWRlZmF1bHQtMS5ub2RlLmVsYXN0
aWNzZWFyY2gtc2FtcGxlLmRlZmF1bHQuZXMubG9jYWwwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDXzjg0VqcDy1yUYCHdSFz7bWDtRBejhabALi/r+/3r
w1icF1Vlk2u9iwxUYD5paWZ/fF9bvDf6ehvRJlf3umFuPZkyWupY06WUK63hEnLA
Bw6IUBSMb0gWqUZh7thUDh4BQTt2FM7l4yKYFcF9SK2xhIrIlKvlyte7951LZpPu
LWcHzwssvETnrpg/d62pq1j3YpKWVMB1TbllfUg+nV4POyi3vRVT1jps823WrfMN
Wg8hv5vLCW/TMPOia1ujzvqTBHldeTxr0UGNs6U2HerkBW7E/NCs2i4HAG+dNTP0
xGjJgtzTT108NrpLgj3q/DItc7YGSShTO/ufu0qEntCpAgMBAAGjggEhMIIBHTAO
BgNVHQ8BAf8EBAMCBaAwHQYDVR0lBBYwFAYIKwYBBQUHAwEGCCsGAQUFBwMCMIHr
BgNVHREEgeMwgeCgVQYDVQQDoE4MTGVsYXN0aWNzZWFyY2gtc2FtcGxlLWVzLWRl
ZmF1bHQtMS5ub2RlLmVsYXN0aWNzZWFyY2gtc2FtcGxlLmRlZmF1bHQuZXMubG9j
YWyCTGVsYXN0aWNzZWFyY2gtc2FtcGxlLWVzLWRlZmF1bHQtMS5ub2RlLmVsYXN0
aWNzZWFyY2gtc2FtcGxlLmRlZmF1bHQuZXMubG9jYWyCLWVsYXN0aWNzZWFyY2gt
c2FtcGxlLWVzLXRyYW5zcG9ydC5kZWZhdWx0LnN2Y4cEChIRjYcEfwAAATANBgkq
hkiG9w0BAQsFAAOCAQEActo/najsuoZ09yDp5jzpPq4pjRogZNBrb5Dgii7j3l15
jAg6gP2ipQPe41TeCy+FItCLBsx4+ffinKZjwcCVHZmQ9cbjxxWrEGyhyn1GXaNq
V7nyGc5AExkiyF53/hoRMb4pdVnVQQkF8/TBUCWyYqwuzTaarVCfKHKyLV6RrxuZ
hOnYjjtmvCdaMsx+MTXOEWj2tOXO+KaOAIiVKgLBupwt0WDS6zxBuYqrHRwfLENj
XUnn+8ro3feXr8Rtvh8lbv0nyr8dbhkXmgWM/9AZh8Hs+ZrALVM55EeFavha/K5F
R/4+4+EgRXyOATDFDtl5TAvYtZ7iAvGcPq8NL2BaZw==
-----END CERTIFICATE-----
subject=/OU=elasticsearch-sample/CN=elasticsearch-sample-es-default-1.node.elasticsearch-sample.default.es.local
issuer=/OU=elasticsearch-sample/CN=elasticsearch-sample-transport
---
Acceptable client certificate CA names
/OU=elasticsearch-sample/CN=elasticsearch-sample-transport
Server Temp Key: ECDH, X25519, 253 bits
---
SSL handshake has read 2535 bytes and written 161 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-SHA384
    Session-ID: E127EC6C4208ECE99816901B482C991E89439F3337AE97D1E9708DA7D25CE52F
    Session-ID-ctx: 
    Master-Key: 8E3062A8A7356A042DB04FA3FFEF9D18ADAA6C58A5F6E8D1F767B672C39B37C2D12FED2BD8E9C7421038105C8CB0F087
    Start Time: 1586168827
    Timeout   : 7200 (sec)
    Verify return code: 19 (self signed certificate in certificate chain)
---

Pasting this certificate on https://www.sslshopper.com/certificate-decoder.html reports:

Subject Alternative Names: othername:<unsupported>, elasticsearch-sample-es-default-1.node.elasticsearch-sample.default.es.local, elasticsearch-sample-es-transport.default.svc, IP Address:10.18.17.141, IP Address:127.0.0.1

The IP address 10.18.17.141 does not match the Pod IP address 10.18.17.142.
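
A quicker way to read the SANs straight from the served certificate, without the web decoder (a rough sketch reusing the openssl client from above; output formatting may differ slightly between OpenSSL and LibreSSL):

⟩ openssl s_client -connect localhost:9300 </dev/null 2>/dev/null \
    | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"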

Inspecting the content of secret elasticsearch-sample-es-transport-certificates:

⟩ k get secret elasticsearch-sample-es-transport-certificates -o json | jq -r '.data["elasticsearch-sample-es-default-1.tls.crt"]' | base64 -D
-----BEGIN CERTIFICATE-----
MIIEbDCCA1SgAwIBAgIRALyVFBQLhJIKlRB8V/5uwogwDQYJKoZIhvcNAQELBQAw
SDEdMBsGA1UECxMUZWxhc3RpY3NlYXJjaC1zYW1wbGUxJzAlBgNVBAMTHmVsYXN0
aWNzZWFyY2gtc2FtcGxlLXRyYW5zcG9ydDAeFw0yMDA0MDYxMDEzMjdaFw0yMTA0
MDYxMDIzMjdaMHYxHTAbBgNVBAsTFGVsYXN0aWNzZWFyY2gtc2FtcGxlMVUwUwYDVQQDE0xlbGFzdGljc2VhcmNoLXNhbXBsZS1lcy1kZWZhdWx0LTEubm9kZS5lbGFz
dGljc2VhcmNoLXNhbXBsZS5kZWZhdWx0LmVzLmxvY2FsMIIBIjANBgkqhkiG9w0B
AQEFAAOCAQ8AMIIBCgKCAQEA1844NFanA8tclGAh3Uhc+21g7UQXo4WmwC4v6/v9
68NYnBdVZZNrvYsMVGA+aWlmf3xfW7w3+nob0SZX97phbj2ZMlrqWNOllCut4RJy
wAcOiFAUjG9IFqlGYe7YVA4eAUE7dhTO5eMimBXBfUitsYSKyJSr5crXu/edS2aT
7i1nB88LLLxE566YP3etqatY92KSllTAdU25ZX1IPp1eDzsot70VU9Y6bPNt1q3z
DVoPIb+bywlv0zDzomtbo876kwR5XXk8a9FBjbOlNh3q5AVuxPzQrNouBwBvnTUz9MRoyYLc009dPDa6S4I96vwyLXO2BkkoUzv7n7tKhJ7QqQIDAQABo4IBITCCAR0w
DgYDVR0PAQH/BAQDAgWgMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjCB
6wYDVR0RBIHjMIHgoFUGA1UEA6BODExlbGFzdGljc2VhcmNoLXNhbXBsZS1lcy1k
ZWZhdWx0LTEubm9kZS5lbGFzdGljc2VhcmNoLXNhbXBsZS5kZWZhdWx0LmVzLmxv
Y2FsgkxlbGFzdGljc2VhcmNoLXNhbXBsZS1lcy1kZWZhdWx0LTEubm9kZS5lbGFz
dGljc2VhcmNoLXNhbXBsZS5kZWZhdWx0LmVzLmxvY2Fsgi1lbGFzdGljc2VhcmNo
LXNhbXBsZS1lcy10cmFuc3BvcnQuZGVmYXVsdC5zdmOHBAoSEY6HBH8AAAEwDQYJ
KoZIhvcNAQELBQADggEBAFszVENR3iIfC3s9SY2r44HF8kwduqs64Y9TRxurSXVV
4aStJffMxFTirDufZlNtbeRrOpxotbwFDJ7Vp8pfZQuKtMuLA2dkamM4+PFOjpeT
h0IbQQPFxzDk9Ye9slyRK0i423iESB35pDnJ7sIHKxswbXRdTwi8yLyjO39DJlr8
4po7HDb4h+q7pxGMDwg1hyqBdMkaR8YdbdhPceEg8Yh/deYVCdk9X5Q9zjeIIuv3
492FmUKdkLfnlgR6fJ8/juX09yI+hKqQjZqZu6KiaAsbkb2tNig74zDWWqqTUtA6
wCnmoQCLl1tSFY3tzAjIhIT70xWgkJ3r9zUAp9TTZeo=
-----END CERTIFICATE-----
-----BEGIN CERTIFICATE-----
MIIDXDCCAkSgAwIBAgIQDYdFu94IedEPYnpbn0TEPjANBgkqhkiG9w0BAQsFADBI
MR0wGwYDVQQLExRlbGFzdGljc2VhcmNoLXNhbXBsZTEnMCUGA1UEAxMeZWxhc3Rp
Y3NlYXJjaC1zYW1wbGUtdHJhbnNwb3J0MB4XDTIwMDQwNjEwMDgxMloXDTIwMDQw
NjIwMTgxMlowSDEdMBsGA1UECxMUZWxhc3RpY3NlYXJjaC1zYW1wbGUxJzAlBgNV
BAMTHmVsYXN0aWNzZWFyY2gtc2FtcGxlLXRyYW5zcG9ydDCCASIwDQYJKoZIhvcN
AQEBBQADggEPADCCAQoCggEBAKacNEB7mNMeLQfhZrnwGMbhJ0CKrLmeuvBGYoJJ
2HUyD1DIXADqZro27VHZzE+IQ4j83ZcHWmqY8R8iVSrfYYV4jKDsa1zk2cJOI9Fi
VKHFgPPJyNo2Db3Cb3P30QhzHpD30vkuh2dldoUpaNhrgRGc1YZHnaYLehCgpqv9
FDOwROUg5Ilq9rnKn6RQsq00jSV6bVHjOSs8LTc5t4Fp5IVEma7gEAkxgr7X8dxL
ZmXRNOM9SzmhtY0E6uDTTqLUIpo+DtnFlTkqnaW6naLCsQ7432nyo4VinRRR/pn1
QEND8pld5iCYuPvlXv8qSKrPdXcIseT33u1sT3NIZA6TKi0CAwEAAaNCMEAwDgYD
VR0PAQH/BAQDAgKEMB0GA1UdJQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjAPBgNV
HRMBAf8EBTADAQH/MA0GCSqGSIb3DQEBCwUAA4IBAQAsj0TAOCtSgLkZ0/Itv0Fl
qOsGZLskStnxNgmznFLaBZdFRxRDwa4ULhKRZVLzNy/YeQ4k2Pcg5qgdOWpgnMdZ
nD3jZRYzD6tbn22zofAzeyBaD3/GGnmBP1U7242yqilw5kQK1o1XeyvxafqX0j64
8I0vGZoSJ09RcI1q6BBkkMyrac1dgMCHZ/+3VqfPnEbGidCdn5GC4n6yOrwyNddS
TcPxnBbL26dZ6LXA80nonsZGAM/eYS9PSRlV3KTr+sbPoUVeCM/tTY5B3qjq5382
UMru4J7Jdu2BlUvdqVsa8Y5IWMb7k0HSn++H6NHLVE3QufnmFWqTLepTXyoC0kw8
-----END CERTIFICATE-----

It reports IP Address:10.18.17.142, which is the correct one, and not the one I get when I query the ES server.
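
The SANs in the secret can be checked the same way without pasting the certificate anywhere (sketch; openssl only reads the first PEM block, i.e. the node certificate, and base64 -d is the Linux equivalent of -D):

⟩ k get secret elasticsearch-sample-es-transport-certificates -o json \
    | jq -r '.data["elasticsearch-sample-es-default-1.tls.crt"]' | base64 -D \
    | openssl x509 -noout -text | grep -A1 "Subject Alternative Name"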

So it looks like ES did not reload its transport cert file and still has the "old" one loaded in memory?

@sebgl
Contributor Author

sebgl commented Apr 6, 2020

At this point in time, if I manually delete the Pod certificate entry from the secret, which triggers a new one to be recreated automatically, the situation is unlocked. The ES process loads the new certificate file and we move on with the rolling upgrade.

@sebgl
Contributor Author

sebgl commented Apr 6, 2020

I'm wondering if the following may happen when Elasticsearch starts (after a rolling upgrade):

  1. ES loads the certificates it finds on disk, which are not up-to-date yet with the new pod IP
  2. The updated certs are propagated to the filesystem from the updated K8s secret.
  3. ES starts watching the filesystem for changes to the certificate files, but misses the event from the update at step 2 because the watch was not yet in place.
  4. Any cert update from now on is correctly accounted for.

I'm looking into ES code to figure it out.
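
One way to check that the file on disk was indeed updated while ES kept serving the old certificate would be to compare fingerprints (sketch only; the in-Pod path of the transport certificates is an assumption here):

⟩ openssl s_client -connect localhost:9300 </dev/null 2>/dev/null | openssl x509 -noout -fingerprint -sha256
⟩ k exec elasticsearch-sample-es-default-1 -- cat /usr/share/elasticsearch/config/transport-certs/elasticsearch-sample-es-default-1.tls.crt | openssl x509 -noout -fingerprint -sha256

If the two fingerprints differ, the file was propagated but never reloaded by ES.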

@anyasabo
Contributor

anyasabo commented Apr 6, 2020

At this point in time, if I manually delete the Pod certificate entry from the secret, which triggers a new one to be recreated automatically, the situation is unlocked.

I'm assuming it also works if you kill the pod and it restarts? That might be an easier repro step.

@sebgl
Contributor Author

sebgl commented Apr 7, 2020

I discussed this with ES devs offline, who confirmed there's a time window (~5 sec) between loading the certs for the first time and watching the filesystem for changes, during which any update to the files is ignored. This can lead to the correct cert not being served if the update lands in exactly that window.

I created an issue in the ES repo.

Until this gets resolved, I think we can improve the init container startup script accordingly.
Currently it waits until certificate files are created on disk (from the k8s secret updated by the operator). This covers the initial creation of the node, but not the restart/upgrade case, where the certificate exists but may no longer be valid if the IP changed. We could add an additional check to wait until the certificate contains the current node IP.
Edit: not that simple, since we don't have openssl installed. We could find another way to get the IP address details (e.g. mount a file in the Secret containing the IP address), or look for other alternatives (stop using IP addresses and rely on DNS names instead?). A rough sketch of what such a check could look like is below.
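
Something along these lines, assuming we add a per-Pod IP entry to the transport certificates secret (the file name, mount path and POD_NAME variable below are hypothetical; only $POD_IP exists today):

# Hypothetical addition to the prepare-fs script: block until the mounted
# transport secret references the IP of this Pod.
CERTS_DIR="/mnt/elastic-internal/transport-certificates"   # assumed mount path
IP_FILE="${CERTS_DIR}/${POD_NAME}.ip"                      # entry the operator would need to write
while [ ! -f "${IP_FILE}" ] || [ "$(cat "${IP_FILE}")" != "${POD_IP}" ]; do
    echo "waiting for the transport certificates secret to reference ${POD_IP}"
    sleep 2
done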

@sebgl
Contributor Author

sebgl commented Apr 7, 2020

We discussed this with @nkvoll and @pebrc and came up with the following:

  • Short-term plan (1.1 release): revert TLS verification to partial, similar to what it was in the 1.0 release. It's not great for security, but no worse than what we had in the last release.
  • Short/middle-term plan (1.1 release?): adapt the prepare-fs script in the init container to wait for the certificates to contain the correct IP. Since we don't have openssl available, we may want to add additional files to the secret, containing the pod IPs, so the script can check they match the expected $POD_IP. We could also rename the existing entries to include the pod IP, but that would probably break existing clusters as soon as the upgrade goes through? /!\ see this issue
  • Longer-term plan: Elasticsearch will include a bugfix in a future version. However, it still feels wrong that a Pod (re)starts with the wrong certificate for a short while, which should be prevented by the point above.
  • Longer-term plan: re-enable full TLS verification.

Upgrading existing clusters (this gets more complicated):

We must pay extra attention to existing clusters out there when we decide to re-enable full TLS verification. Flipping that setting will trigger a rolling upgrade of the cluster. The moment a Pod restarts with full TLS verification enabled, it may no longer be able to contact other Pods in the cluster if they have been impacted by the bug described in this issue, in which case the rolling upgrade will never complete.

So we probably need to make sure we only switch to full TLS verification on a cluster that could not be impacted by the bug: either because it's running a fixed Elasticsearch version, or because it has been created/upgraded by a fixed ECK version (with the init container fix).
To detect the latter case we may want to include a "marker" in the fixed init container so that we can tell whether or not Pods are running a fixed version.
This would lead to the following high-level flow:

  • a cluster was created with ECK 1.0; it may or may not include Pods impacted by the bug, serving a certificate with the wrong IP
  • the ECK version is upgraded to one that includes the init container fix
  • the cluster gets a rolling upgrade; once done, we are fairly confident no Pod is impacted by the TLS certs bug. At this point TLS verification is still set to "partial".
  • the cluster gets a second rolling upgrade to set TLS verification to "full", because we know no Pod is impacted by the TLS certs bug.

@anyasabo
Contributor

anyasabo commented Apr 7, 2020

As discussed out of band, long term we may also want to stop advertising IP addresses (network.publish_host) entirely and just use the pod host name, which should obviate this issue. I looked back pretty far in the history to see why we set this to the IP address and it wasn't clear to me.
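
For illustration, publishing a DNS name instead of the Pod IP could look something like the following elasticsearch.yml sketch (assumes POD_NAME is injected via the downward API and that per-Pod DNS records exist under the headless transport service; this is not what ECK configures today):

# advertise a stable DNS name instead of the (changing) Pod IP,
# so transport certificates no longer need to embed an IP address
network.publish_host: ${POD_NAME}.elasticsearch-sample-es-transport.default.svc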

@sebgl
Contributor Author

sebgl commented Apr 8, 2020

I created a bunch of issues to eventually handle this situation correctly.
In the short-term, #2831 should fix the observed problem and E2E test failures.
