[ML] Continuous data frame should be more robust to new and deleted indices #43992

sophiec20 · 2019-07-04T17:45:01Z

Found in 7.3.0 "build_hash" : "f8fd432", "build_date" : "2019-07-03T15:05:06.452272Z",

3 node cluster.
Index template for temp-* has 3 shards and 1 replica.
New index temp-100? is being created every 12 seconds with a bulk upload of 4000 documents.

When polling GET _data_frame/transforms/blah*/_stats, periodic checkpoint exceptions occur. These are displayed in the UI transform list as generic server error 500 toast messages, provided the page refresh cycle coincides with the failure.

Index temp_1013 has just been created. There is a small window during which this index's health is yellow. It might also be that the replica is not yet ready (I am not sure whether health is considered yellow in this case).

  "node_failures": [
    {
      "type": "failed_node_exception",
      "reason": "Failed to retrieve checkpointing info",
      "node_id": "qMS4vptxxxkr7baDqqqq",
      "caused_by": {
        "type": "checkpoint_exception",
        "reason": "checkpoint_exception: Failure during source checkpoint info retrieval",
        "caused_by": {
          "type": "index_not_found_exception",
          "reason": "no such index [temp_1013]",
          "index_uuid": "_na_",
          "resource.type": "index_or_alias",
          "resource.id": "temp_1013",
          "index": "temp_1013"
        }
      }
    },
    {
      "type": "failed_node_exception",
      "reason": "Failed to retrieve checkpointing info",
      "node_id": "qMS4vptxxxkr7baDqqqq",
      "caused_by": {
        "type": "checkpoint_exception",
        "reason": "checkpoint_exception: Failure during source checkpoint info retrieval",
        "caused_by": {
          "type": "null_pointer_exception",
          "reason": null
        }
      }
    }

The Elasticsearch logs contained repeated messages:

[2019-07-04T17:32:36,894][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:33:07,026][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:35:59,204][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check
[2019-07-04T17:38:01,976][ERROR][o.e.x.d.t.DataFrameTransformTask] [node3] failure in update check

Expected behavior
New source index creation is likely for continuous data frames. Continuous data frames should be tolerant of this.

elasticmachine · 2019-07-04T17:45:02Z

Pinging @elastic/ml-core

sophiec20 · 2019-07-05T09:48:41Z

Full exception from "failure in update check"

[2019-07-05T00:00:58,252][ERROR][o.e.x.d.t.DataFrameTransformTask] [node1] failure in update check
java.lang.NullPointerException: null
        at org.elasticsearch.xpack.dataframe.checkpoint.DataFrameTransformsCheckpointService.extractIndexCheckPoints(DataFrameTransformsCheckpointService.java:236) ~[?:?]
        at org.elasticsearch.xpack.dataframe.checkpoint.DataFrameTransformsCheckpointService.lambda$getCheckpoint$0(DataFrameTransformsCheckpointService.java:115) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onCompletion(TransportBroadcastByNodeAction.java:383) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction.onNodeResponse(TransportBroadcastByNodeAction.java:352) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:324) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.broadcast.node.TransportBroadcastByNodeAction$AsyncAction$1.handleResponse(TransportBroadcastByNodeAction.java:314) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1101) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler$1.doRun(InboundHandler.java:224) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.handleResponse(InboundHandler.java:216) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:141) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:105) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:660) [elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:62) [transport-netty4-client-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:323) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:297) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:241) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1478) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1227) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1274) [netty-handler-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:502) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:441) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:278) [netty-codec-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:352) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1408) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:374) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:360) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:930) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:163) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:682) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:582) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:536) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:496) [netty-transport-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:906) [netty-common-4.1.36.Final.jar:4.1.36.Final]
        at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) [netty-common-4.1.36.Final.jar:4.1.36.Final]
        at java.lang.Thread.run(Thread.java:835) [?:?]

sophiec20 · 2019-07-05T09:58:05Z

Furthermore, if the setup also periodically deletes trailing indices that fall within the pattern, the checkpoint progress fails to move forward. "operations_behind" : -1 is reported, which seems to stop progress.

For the most recent test run, progress has stalled in the high 99.x% range.

sophiec20 · 2019-07-05T09:59:20Z

Additional exception snippet. This occurs less frequently.

[2019-07-05T07:31:08,381][ERROR][o.e.x.d.t.DataFrameTransformTask] [node1] failure in update check
org.elasticsearch.transport.RemoteTransportException: [node2][127.0.0.1:9352][indices:admin/get]
Caused by: org.elasticsearch.index.IndexNotFoundException: no such index [gallery-temp_4639]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.indexNotFoundException(IndexNameExpressionResolver.java:761) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.innerResolve(IndexNameExpressionResolver.java:713) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver$WildcardExpressionResolver.resolve(IndexNameExpressionResolver.java:669) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndices(IndexNameExpressionResolver.java:163) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:142) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.cluster.metadata.IndexNameExpressionResolver.concreteIndexNames(IndexNameExpressionResolver.java:75) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.get.TransportGetIndexAction.checkBlock(TransportGetIndexAction.java:77) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.admin.indices.get.TransportGetIndexAction.checkBlock(TransportGetIndexAction.java:50) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.doStart(TransportMasterNodeAction.java:170) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.start(TransportMasterNodeAction.java:161) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:138) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.master.TransportMasterNodeAction.doExecute(TransportMasterNodeAction.java:58) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:145) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$apply$0(SecurityActionFilter.java:86) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$authorizeRequest$4(SecurityActionFilter.java:172) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$15(AuthorizationService.java:341) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
        at org.elasticsearch.xpack.security.authz.interceptor.SearchRequestInterceptor.intercept(SearchRequestInterceptor.java:19) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.BulkShardRequestInterceptor.intercept(BulkShardRequestInterceptor.java:71) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:144) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:117) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.action.StepListener.onResponse(StepListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.xpack.security.authz.interceptor.FieldAndDocumentLevelSecurityRequestInterceptor.intercept(FieldAndDocumentLevelSecurityRequestInterceptor.java:61) ~[?:?]
        at org.elasticsearch.xpack.security.authz.interceptor.UpdateRequestInterceptor.intercept(UpdateRequestInterceptor.java:23) ~[?:?]
        at org.elasticsearch.xpack.security.authz.AuthorizationService.lambda$runRequestInterceptors$14(AuthorizationService.java:336) ~[?:?]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture$1.run(ListenableFuture.java:99) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.EsExecutors$DirectExecutorService.execute(EsExecutors.java:193) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:92) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at org.elasticsearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:84) ~[elasticsearch-7.3.0-SNAPSHOT.jar:7.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1540) ~[?:?]
...

droberts195 · 2019-07-05T12:38:45Z

At a code level I think the problems observed are:

DataFrameTransformsCheckpointService.getCheckpoint makes a GetIndexRequest, then for each index found it makes an IndicesStatsRequest. Indices can be deleted between the two calls. Therefore an IndexNotFoundException in response to any single IndicesStatsRequest should just be treated as though that index didn't exist in the original GetIndexRequest result.
In DataFrameTransformsCheckpointService.extractIndexCheckPoints calls are made to shard.getSeqNoStats(). This can return null, and that is the cause of the NPE. A check for this situation has recently been added, so the NPE will no longer occur. However, a different exception is now thrown instead, and it aborts the entire checkpoint. This is also unfriendly to transforms whose input index pattern matches a dynamically changing set of indices, for example one managed by ILM.

We need to find a way to make checkpoints robust to indices entering or leaving the set of source indices. When an index enters or leaves the set it's reasonable to treat this as meaning there's been a change since the previous checkpoint. But for the indices that do still exist and are still open it's still possible to calculate checkpoint stats.
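As a rough illustration of the idea above (not the actual fix), the sketch below computes per-index checkpoints only for indices that still exist. Everything in it is hypothetical: `CheckpointSketch` and `extractIndexCheckpoints` are made-up names, plain Java collections stand in for the real Elasticsearch request and response types, an index missing from the stats map models one deleted between the two requests, and a `null` entry models `shard.getSeqNoStats()` returning null. Mapping a missing value to checkpoint 0 follows the "treat missing seqNoStats as just created indices" wording of the linked change.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical stand-in for the checkpoint service; not the real Elasticsearch code.
final class CheckpointSketch {

    // indexNames:   names returned by the first (get index) call.
    // statsByIndex: per-shard global checkpoints from the later stats calls.
    //               A missing key models an index deleted in between;
    //               a null list entry models missing sequence number stats.
    static Map<String, long[]> extractIndexCheckpoints(
            List<String> indexNames, Map<String, List<Long>> statsByIndex) {
        Map<String, long[]> checkpoints = new TreeMap<>();
        for (String index : indexNames) {
            List<Long> shards = statsByIndex.get(index);
            if (shards == null) {
                // Index vanished between the two calls: skip it instead of
                // failing the whole checkpoint.
                continue;
            }
            long[] perShard = new long[shards.size()];
            for (int i = 0; i < shards.size(); i++) {
                Long globalCheckpoint = shards.get(i);
                // Missing stats: assume a just-created shard (checkpoint 0).
                perShard[i] = globalCheckpoint == null ? 0L : globalCheckpoint;
            }
            checkpoints.put(index, perShard);
        }
        return checkpoints;
    }
}
```

Because the set of keys in the result can differ from the previous checkpoint, a caller could treat any added or removed index as a change while still comparing the shard checkpoints of the indices that remain.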

droberts195 · 2019-07-05T14:54:16Z

I discussed this with @hendrikmuhs. For 7.3 some simple bug fixes we could do are:

Don't spam the log with huge stack traces when likely problems occur during checkpoint calculation like indices being created, closed or deleted. A single line debug message would suffice when checkpoint calculation is complicated by these events.
Alter the "has anything changed" check from "changes > 0" to "changes != 0" so that it treats "couldn't calculate the checkpoint" as a change. Or alternatively "changes > 0" could be altered to "changes > 0 for the indices that are currently searchable".
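The difference between the two forms of the check can be shown with a small sketch. It assumes that a negative value, such as the -1 reported for "operations_behind" earlier in this issue, is what "couldn't calculate the checkpoint" looks like; the class and method names are hypothetical and are not taken from the transform code.

```java
// Hypothetical illustration of the "has anything changed" check.
final class ChangeCheckSketch {

    // Assumed sentinel for "couldn't calculate the checkpoint".
    static final long UNKNOWN = -1L;

    // Current form: a negative (unknown) value is treated as "no change",
    // so the transform never moves forward.
    static boolean hasChangedStrict(long changes) {
        return changes > 0;
    }

    // Proposed form: any non-zero value, including the unknown sentinel,
    // counts as a change and triggers another run.
    static boolean hasChanged(long changes) {
        return changes != 0;
    }
}
```

With this sketch, `hasChangedStrict(UNKNOWN)` is false while `hasChanged(UNKNOWN)` is true; both agree for zero and for positive counts.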

However, this also interacts quite heavily with solving the 65000 terms problem. So the timeline and mechanism for fixing that affects the decision of what to do about this problem.

make checkpointing more robust:
- do not let checkpointing fail if indexes got deleted
- treat missing seqNoStats as just created indices (checkpoint 0)
- loglevel: do not treat failed updated checks as error
fixes #43992

sophiec20 added >bug :ml Machine learning :ml/Transform Transform v7.3.0 labels Jul 4, 2019

sophiec20 changed the title ~~[ML] Data frame failed to retrieve checkpointing info exceptions on newly created indices~~ [ML] Continuous data frame should be more robust to new and deleted indices Jul 5, 2019

jpountz removed v7.3.0 :ml Machine learning labels Jul 5, 2019

droberts195 mentioned this issue Jul 5, 2019

testDataFrameTransformCrud failed with "Failed to retrieve checkpointing info" #44011

Closed

droberts195 self-assigned this Jul 5, 2019

hendrikmuhs added the v7.3.0 label Jul 9, 2019

hendrikmuhs assigned hendrikmuhs and unassigned droberts195 Jul 15, 2019

hendrikmuhs mentioned this issue Jul 15, 2019

[ML-DataFrame] make checkpointing more robust #44344

Merged

hendrikmuhs closed this as completed in #44344 Jul 16, 2019

This was referenced Jul 16, 2019

[7.4][ML-DataFrame] make checkpointing more robust (#44344) #44414

Merged

[7.3][ML-DataFrame] make checkpointing more robust (#44344) #44415

Merged
