
Telemetry plugin is crashing functional test server #122146

Closed
jportner opened this issue Dec 29, 2021 · 4 comments · Fixed by #122148
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)


jportner commented Dec 29, 2021

Kibana version:

I am experiencing this on the main branch (HEAD: 6185ddc).

Describe the bug:

I noticed this behavior starting a few weeks ago, though admittedly I haven't attempted to run functional tests locally for several months, so I'm not sure when this started.

Steps to reproduce:

It appears that this error happens during the SO migration process, which is initiated by the ES archiver that we use to load fixtures for most of our test suites.

The strange thing is, while I can reliably reproduce this locally, CI doesn't seem to run into this problem. I have reproduced this with two different functional test suites, saved_object_api_integration and spaces_api_integration.

  1. Start the functional test server (FTS) for a given suite, for example: node scripts/functional_tests_server.js --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
  2. Wait for the server to finish loading; you'll see a message like this:
    Elasticsearch and Kibana are ready for functional testing. Start the functional tests
    in another terminal session by running this command from this directory:
    
        node scripts/functional_test_runner --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
    
  3. Start the functional test runner (FTR) for the same suite, for example: node scripts/functional_test_runner.js --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
  4. Observe tests running normally (see logs below)
  5. Wait a bit; the crash usually happens in less than a minute
  6. Observe test error (see logs below)
Normal FTR logs
       └-> "before all" hook: beforeTestSuite.trigger for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
       └-> "before all" hook for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "mappings.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "data.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_task_manager_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Created index ".kibana_8.1.0_001"
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] ".kibana_8.1.0_001" settings {"index":{"auto_expand_replicas":"0-1","number_of_replicas":"0","number_of_shards":"1"}}
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Indexed 37 docs into ".kibana"
         │ debg Migrating saved objects
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Migrated Kibana index after loading Kibana data
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Ensured that default space exists in .kibana
       └-> should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]
         └-> "before each" hook: global before each for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         └- ✓ pass  (125ms)
       └-> "after all" hook for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Unloading indices from "mappings.json"
         │ warn since spaces are enabled, all objects other than the default space were deleted from .kibana rather than deleting the whole index
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Unloading indices from "data.json"
       └-> "after all" hook: afterTestSuite.trigger for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
     └-> "after all" hook: afterTestSuite.trigger in "_bulk_resolve"
Normal FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/-VEfrTVtSHSgYTB17bF2aQ] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/FWsDyhcZTOugR9MRc5Xrog] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:10.026-05:00][INFO ][savedobjects-service] [.kibana_task_manager] INIT -> CREATE_NEW_TARGET. took: 2ms.
   │ proc [kibana] [2021-12-29T14:21:10.027-05:00][INFO ][savedobjects-service] [.kibana] INIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:10.029-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 2ms.
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_task_manager_8.1.0_001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [node-01] updating number_of_replicas to [0] for indices [.kibana_task_manager_8.1.0_001]
   │ proc [kibana] [2021-12-29T14:21:10.033-05:00][INFO ][savedobjects-service] [.kibana] Starting to process 9 documents.
   │ proc [kibana] [2021-12-29T14:21:10.033-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_TRANSFORM. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:10.036-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_TRANSFORM -> TRANSFORMED_DOCUMENTS_BULK_INDEX. took: 3ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_task_manager_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_task_manager_8.1.0_001][0]]"
   │ proc [kibana] [2021-12-29T14:21:11.068-05:00][INFO ][savedobjects-service] [.kibana_task_manager] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 1042ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:11.761-05:00][INFO ][savedobjects-service] [.kibana_task_manager] MARK_VERSION_INDEX_READY -> DONE. took: 693ms.
   │ proc [kibana] [2021-12-29T14:21:11.761-05:00][INFO ][savedobjects-service] [.kibana_task_manager] Migration completed after 1737ms
   │ proc [kibana] [2021-12-29T14:21:11.838-05:00][INFO ][savedobjects-service] [.kibana] TRANSFORMED_DOCUMENTS_BULK_INDEX -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 1802ms.
   │ proc [kibana] [2021-12-29T14:21:11.842-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:11.844-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> OUTDATED_DOCUMENTS_REFRESH. took: 2ms.
   │ proc [kibana] [2021-12-29T14:21:11.872-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_REFRESH -> UPDATE_TARGET_MAPPINGS. took: 28ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:11.976-05:00][INFO ][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS -> UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK. took: 104ms.
   │ info [o.e.t.LoggingTaskListener] [node-01] 3939 finished with response BulkByScrollResponse[took=110.2ms,timed_out=false,sliceId=null,updated=37,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[]]
   │ proc [kibana] [2021-12-29T14:21:12.185-05:00][INFO ][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> DONE. took: 209ms.
   │ proc [kibana] [2021-12-29T14:21:12.185-05:00][INFO ][savedobjects-service] [.kibana] Migration completed after 2162ms
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
Error FTR logs
       └-> "before all" hook for "should return 200 success [isolatedtype/space1-isolatedtype-id,sharedtype/all_spaces,sharedtype/default_and_space_1,sharedtype/only_space_1,sharecapabletype/only_space_1,globaltype/globaltype-id] and not found [isolatedtype/defaultspace-isolatedtype-id,isolatedtype/space2-isolatedtype-id,sharedtype/only_space_2,sharecapabletype/only_default_space,hiddentype/any,dashboard/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "mappings.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "data.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_task_manager_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Created index ".kibana_8.1.0_001"
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] ".kibana_8.1.0_001" settings {"index":{"auto_expand_replicas":"0-1","number_of_replicas":"0","number_of_shards":"1"}}
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Indexed 37 docs into ".kibana"
         │ debg Migrating saved objects
         │ERROR [migrate saved objects] request failed (attempt=1/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=2/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=3/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=4/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=5/5): connect ECONNREFUSED 127.0.0.1:5620
         └- ✖ fail: saved objects spaces only enabled _bulk_update within the space_1 space "before all" hook for "should return 200 success [isolatedtype/space1-isolatedtype-id,sharedtype/all_spaces,sharedtype/default_and_space_1,sharedtype/only_space_1,sharecapabletype/only_space_1,globaltype/globaltype-id] and not found [isolatedtype/defaultspace-isolatedtype-id,isolatedtype/space2-isolatedtype-id,sharedtype/only_space_2,sharecapabletype/only_default_space,hiddentype/any,dashboard/does-not-exist]"
         │      Error: [migrate saved objects] request failed (attempt=5/5): connect ECONNREFUSED 127.0.0.1:5620 -- and ran out of retries
         │       at KbnClientRequester.request (node_modules/@kbn/test/target_node/kbn_client/kbn_client_requester.js:158:15)
         │       at runMicrotasks (<anonymous>)
         │       at processTicksAndRejections (node:internal/process/task_queues:96:5)
         │       at KbnClientSavedObjects.migrate (node_modules/@kbn/test/target_node/kbn_client/kbn_client_saved_objects.js:51:9)
         │       at migrateKibanaIndex (node_modules/@kbn/es-archiver/target_node/lib/indices/kibana_index.js:67:3)
         │       at loadAction (node_modules/@kbn/es-archiver/target_node/actions/load.js:84:5)
         │       at EsArchiver.load (node_modules/@kbn/es-archiver/target_node/es_archiver.js:84:12)
         │       at Object.apply (node_modules/@kbn/test/target_node/functional_test_runner/lib/mocha/wrap_function.js:87:16)
Error FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/zy6LgiktQoKhy2U85FfHQQ] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/kQs6lzRsTDuyXE3tz5O6oQ] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:217) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:202) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:258) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1432) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:47) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:763) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
   │            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
   │            at java.lang.Thread.run(Thread.java:833) [?:?]
   │      Caused by: org.elasticsearch.transport.RemoteTransportException: [node-01][127.0.0.1:9300][indices:data/read/get[s]]
   │      Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │            at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:2075) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:1194) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:179) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:95) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:79) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:113) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:34) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:104) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:775) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            ... 3 more
   │ proc [kibana] Unhandled Promise rejection detected:
   │ proc [kibana] 
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"
   │ proc [kibana] ResponseError: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │ proc [kibana]     at KibanaTransport.request (/Users/joe/GitHub/kibana-3/node_modules/@elastic/transport/src/Transport.ts:517:17)
   │ proc [kibana]     at runMicrotasks (<anonymous>)
   │ proc [kibana]     at processTicksAndRejections (node:internal/process/task_queues:96:5)
   │ proc [kibana]     at ApmClient.GetApi [as get] (/Users/joe/GitHub/kibana-3/node_modules/@elastic/elasticsearch/src/api/api/get.ts:59:10) {
   │ proc [kibana]   meta: {
   │ proc [kibana]     body: { error: [Object], status: 503 },
   │ proc [kibana]     statusCode: 503,
   │ proc [kibana]     headers: {
   │ proc [kibana]       'x-opaque-id': 'unknownId',
   │ proc [kibana]       'x-elastic-product': 'Elasticsearch',
   │ proc [kibana]       'content-type': 'application/json;charset=utf-8',
   │ proc [kibana]       'content-length': '663'
   │ proc [kibana]     },
   │ proc [kibana]     meta: {
   │ proc [kibana]       context: null,
   │ proc [kibana]       request: [Object],
   │ proc [kibana]       name: 'elasticsearch-js',
   │ proc [kibana]       connection: [Object],
   │ proc [kibana]       attempts: 0,
   │ proc [kibana]       aborted: false
   │ proc [kibana]     },
   │ proc [kibana]     warnings: [Getter]
   │ proc [kibana]   },
   │ proc [kibana]   isBoom: true,
   │ proc [kibana]   isServer: true,
   │ proc [kibana]   data: null,
   │ proc [kibana]   output: {
   │ proc [kibana]     statusCode: 503,
   │ proc [kibana]     payload: {
   │ proc [kibana]       message: 'No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]',
   │ proc [kibana]       statusCode: 503,
   │ proc [kibana]       error: 'Service Unavailable'
   │ proc [kibana]     },
   │ proc [kibana]     headers: {}
   │ proc [kibana]   },
   │ proc [kibana]   [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/esUnavailable'
   │ proc [kibana] }
   │ proc [kibana] 
   │ proc [kibana] Terminating process...
   │ proc [kibana]  server crashed  with status code 1
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/bHurwnsBTn2m5Akuc_ZXzA] update_mapping [_doc]

Expected behavior:

Kibana functional tests should not fail when run locally.

Any additional context:

The interesting thing to me in the logs is that, when the FTS errors out, I always see these lines, without fail:

   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]

I believe that's originating from the Telemetry plugin's getIsOptedIn function here:

const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);

This isn't enclosed in a try/catch, but if I add one:

diff --git a/src/plugins/telemetry/server/plugin.ts b/src/plugins/telemetry/server/plugin.ts
index aa22410358f..22294b03c63 100644
--- a/src/plugins/telemetry/server/plugin.ts
+++ b/src/plugins/telemetry/server/plugin.ts
@@ -134,20 +134,24 @@ export class TelemetryPlugin implements Plugin<TelemetryPluginSetup, TelemetryPl
     return {
       getIsOptedIn: async () => {
         const internalRepository = new SavedObjectsClient(savedObjectsInternalRepository);
-        const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);
-
-        const config = await this.config$.pipe(take(1)).toPromise();
-        const allowChangingOptInStatus = config.allowChangingOptInStatus;
-        const configTelemetryOptIn = typeof config.optIn === 'undefined' ? null : config.optIn;
-        const currentKibanaVersion = this.currentKibanaVersion;
-        const isOptedIn = getTelemetryOptIn({
-          currentKibanaVersion,
-          telemetrySavedObject,
-          allowChangingOptInStatus,
-          configTelemetryOptIn,
-        });
-
-        return isOptedIn === true;
+        try {
+          const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);
+
+          const config = await this.config$.pipe(take(1)).toPromise();
+          const allowChangingOptInStatus = config.allowChangingOptInStatus;
+          const configTelemetryOptIn = typeof config.optIn === 'undefined' ? null : config.optIn;
+          const currentKibanaVersion = this.currentKibanaVersion;
+          const isOptedIn = getTelemetryOptIn({
+            currentKibanaVersion,
+            telemetrySavedObject,
+            allowChangingOptInStatus,
+            configTelemetryOptIn,
+          });
+          return isOptedIn === true;
+        } catch (err) {
+          this.logger.error('WTF: ' + err.message);
+        }
+        return false;
       },
     };
   }

my functional tests work fine, and I see the caught error in the FTS log:

Patched FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/K1ZYrqBPTuufNdeTzBQ_LA] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/8X3-zVeiQoax82cS8WUvlw] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:217) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:202) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:258) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1432) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:47) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:763) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
   │            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
   │            at java.lang.Thread.run(Thread.java:833) [?:?]
   │      Caused by: org.elasticsearch.transport.RemoteTransportException: [node-01][127.0.0.1:9300][indices:data/read/get[s]]
   │      Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │            at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:2075) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:1194) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:179) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:95) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:79) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:113) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:34) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:104) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:775) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            ... 3 more
   │ proc [kibana] [2021-12-29T14:46:59.760-05:00][ERROR][plugins.telemetry] WTF: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"

The only other thing that stands out is that there was a recent addition to the Fleet plugin which now consumes this function from the Telemetry plugin (this was added ~2 months ago via #115180):

this.isOptedIn = await this.isTelemetryOptedIn();

This call is not enclosed in a try/catch, whereas the other usages of this function are.
So it appears this function is getting called in the middle of the migration process, and the resulting error crashes the server.

What I can't work out is: why isn't this error happening in CI?
Is the Fleet plugin disabled in CI? Edit: it looks like the Fleet plugin can't be disabled. But the Telemetry plugin can be disabled, so maybe it's disabled in CI?

At any rate, while the Fleet plugin appears to be triggering this bug, I think the root cause is really in the Telemetry plugin; IMO this function should fail gracefully.

jportner added the bug and Team:Core labels on Dec 29, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@lukeelmers

Added a note on the PR, but after multiple attempts I am unable to reproduce this. Would be interested to see if anyone else can.

@afharo

afharo commented Jan 4, 2022

Is the fleet plugin disabled in CI? Edit: it looks like the Fleet plugin can't be disabled. But the Telemetry plugin can be disabled, so maybe that's disabled in CI?

According to the FT config files, it's enabled and opted-in for your specific config file (because it inherits it from the x-pack/test/api_integration/config.ts file).

I just ran it with the latest main and it didn't fail for me either. Probably the upstream user of the API fixed their unhandled promises?

@jportner

jportner commented Jan 4, 2022

I just ran it with the latest main and it didn't fail for me either. Probably the upstream user of the API fixed their unhandled promises?

I pulled the latest main just now and reproduced it again locally. I also took a look at the problematic consumer (the Fleet plugin), and nothing there has changed.
I guess there is some sort of race condition causing this function to be called during the migration process; when that happens, the error crashes the Kibana server.
