
Telemetry plugin is crashing functional test server #122146

Closed
jportner opened this issue Dec 29, 2021 · 4 comments · Fixed by #122148
Labels: bug (Fixes for quality problems that affect the customer experience), Team:Core (Core services & architecture: plugins, logging, config, saved objects, http, ES client, i18n, etc)


jportner commented Dec 29, 2021

Kibana version:

I am experiencing this on the main branch (HEAD: 6185ddc).

Describe the bug:

I noticed this behavior starting a few weeks ago, though admittedly I haven't attempted to run functional tests locally for several months, so I'm not sure when this started.

Steps to reproduce:

It appears that this error happens during the SO migration process, which is initiated by the ES archiver that we use to load fixtures for most of our test suites.

The strange thing is, while I can reliably reproduce this locally, CI doesn't seem to run into this problem. I have reproduced this with two different functional test suites, saved_object_api_integration and spaces_api_integration.

  1. Start the functional test server (FTS) for a given suite, for example: node scripts/functional_tests_server.js --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
  2. Wait for the server to finish loading; you'll see a message like this:
    Elasticsearch and Kibana are ready for functional testing. Start the functional tests
    in another terminal session by running this command from this directory:
    
        node scripts/functional_test_runner --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
    
  3. Start the functional test runner (FTR) for the same suite, for example: node scripts/functional_test_runner.js --config x-pack/test/saved_object_api_integration/spaces_only/config.ts
  4. Observe tests running normally (see logs below)
  5. Wait a bit; the crash usually happens in less than a minute
  6. Observe test error (see logs below)
Normal FTR logs
       └-> "before all" hook: beforeTestSuite.trigger for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
       └-> "before all" hook for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "mappings.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "data.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_task_manager_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Created index ".kibana_8.1.0_001"
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] ".kibana_8.1.0_001" settings {"index":{"auto_expand_replicas":"0-1","number_of_replicas":"0","number_of_shards":"1"}}
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Indexed 37 docs into ".kibana"
         │ debg Migrating saved objects
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Migrated Kibana index after loading Kibana data
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Ensured that default space exists in .kibana
       └-> should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]
         └-> "before each" hook: global before each for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         └- ✓ pass  (125ms)
       └-> "after all" hook for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Unloading indices from "mappings.json"
         │ warn since spaces are enabled, all objects other than the default space were deleted from .kibana rather than deleting the whole index
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Unloading indices from "data.json"
       └-> "after all" hook: afterTestSuite.trigger for "should return 200 success [resolvetype/exact-match,resolvetype/conflict] and bad request [hiddentype/any] and not found [resolvetype/alias-match,resolvetype/disabled,resolvetype/does-not-exist]"
     └-> "after all" hook: afterTestSuite.trigger in "_bulk_resolve"
Normal FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/-VEfrTVtSHSgYTB17bF2aQ] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/FWsDyhcZTOugR9MRc5Xrog] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:10.026-05:00][INFO ][savedobjects-service] [.kibana_task_manager] INIT -> CREATE_NEW_TARGET. took: 2ms.
   │ proc [kibana] [2021-12-29T14:21:10.027-05:00][INFO ][savedobjects-service] [.kibana] INIT -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:10.029-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 2ms.
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_task_manager_8.1.0_001] creating index, cause [api], templates [], shards [1]/[1]
   │ info [o.e.c.r.a.AllocationService] [node-01] updating number_of_replicas to [0] for indices [.kibana_task_manager_8.1.0_001]
   │ proc [kibana] [2021-12-29T14:21:10.033-05:00][INFO ][savedobjects-service] [.kibana] Starting to process 9 documents.
   │ proc [kibana] [2021-12-29T14:21:10.033-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_TRANSFORM. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:10.036-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_TRANSFORM -> TRANSFORMED_DOCUMENTS_BULK_INDEX. took: 3ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_task_manager_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_task_manager_8.1.0_001][0]]"
   │ proc [kibana] [2021-12-29T14:21:11.068-05:00][INFO ][savedobjects-service] [.kibana_task_manager] CREATE_NEW_TARGET -> MARK_VERSION_INDEX_READY. took: 1042ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:11.761-05:00][INFO ][savedobjects-service] [.kibana_task_manager] MARK_VERSION_INDEX_READY -> DONE. took: 693ms.
   │ proc [kibana] [2021-12-29T14:21:11.761-05:00][INFO ][savedobjects-service] [.kibana_task_manager] Migration completed after 1737ms
   │ proc [kibana] [2021-12-29T14:21:11.838-05:00][INFO ][savedobjects-service] [.kibana] TRANSFORMED_DOCUMENTS_BULK_INDEX -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 1802ms.
   │ proc [kibana] [2021-12-29T14:21:11.842-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 4ms.
   │ proc [kibana] [2021-12-29T14:21:11.844-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> OUTDATED_DOCUMENTS_REFRESH. took: 2ms.
   │ proc [kibana] [2021-12-29T14:21:11.872-05:00][INFO ][savedobjects-service] [.kibana] OUTDATED_DOCUMENTS_REFRESH -> UPDATE_TARGET_MAPPINGS. took: 28ms.
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
   │ proc [kibana] [2021-12-29T14:21:11.976-05:00][INFO ][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS -> UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK. took: 104ms.
   │ info [o.e.t.LoggingTaskListener] [node-01] 3939 finished with response BulkByScrollResponse[took=110.2ms,timed_out=false,sliceId=null,updated=37,created=0,deleted=0,batches=1,versionConflicts=0,noops=0,retries=0,throttledUntil=0s,bulk_failures=[],search_failures=[]]
   │ proc [kibana] [2021-12-29T14:21:12.185-05:00][INFO ][savedobjects-service] [.kibana] UPDATE_TARGET_MAPPINGS_WAIT_FOR_TASK -> DONE. took: 209ms.
   │ proc [kibana] [2021-12-29T14:21:12.185-05:00][INFO ][savedobjects-service] [.kibana] Migration completed after 2162ms
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/_i26TM2mRj6709Up8xm4VQ] update_mapping [_doc]
Error FTR logs
       └-> "before all" hook for "should return 200 success [isolatedtype/space1-isolatedtype-id,sharedtype/all_spaces,sharedtype/default_and_space_1,sharedtype/only_space_1,sharecapabletype/only_space_1,globaltype/globaltype-id] and not found [isolatedtype/defaultspace-isolatedtype-id,isolatedtype/space2-isolatedtype-id,sharedtype/only_space_2,sharecapabletype/only_default_space,hiddentype/any,dashboard/does-not-exist]"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "mappings.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Loading "data.json"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_task_manager_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Deleted existing index ".kibana_8.1.0_001"
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Created index ".kibana_8.1.0_001"
         │ debg [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] ".kibana_8.1.0_001" settings {"index":{"auto_expand_replicas":"0-1","number_of_replicas":"0","number_of_shards":"1"}}
         │ info [x-pack/test/saved_object_api_integration/common/fixtures/es_archiver/saved_objects/spaces] Indexed 37 docs into ".kibana"
         │ debg Migrating saved objects
         │ERROR [migrate saved objects] request failed (attempt=1/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=2/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=3/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=4/5): connect ECONNREFUSED 127.0.0.1:5620
         │ERROR [migrate saved objects] request failed (attempt=5/5): connect ECONNREFUSED 127.0.0.1:5620
         └- ✖ fail: saved objects spaces only enabled _bulk_update within the space_1 space "before all" hook for "should return 200 success [isolatedtype/space1-isolatedtype-id,sharedtype/all_spaces,sharedtype/default_and_space_1,sharedtype/only_space_1,sharecapabletype/only_space_1,globaltype/globaltype-id] and not found [isolatedtype/defaultspace-isolatedtype-id,isolatedtype/space2-isolatedtype-id,sharedtype/only_space_2,sharecapabletype/only_default_space,hiddentype/any,dashboard/does-not-exist]"
         │      Error: [migrate saved objects] request failed (attempt=5/5): connect ECONNREFUSED 127.0.0.1:5620 -- and ran out of retries
         │       at KbnClientRequester.request (node_modules/@kbn/test/target_node/kbn_client/kbn_client_requester.js:158:15)
         │       at runMicrotasks (<anonymous>)
         │       at processTicksAndRejections (node:internal/process/task_queues:96:5)
         │       at KbnClientSavedObjects.migrate (node_modules/@kbn/test/target_node/kbn_client/kbn_client_saved_objects.js:51:9)
         │       at migrateKibanaIndex (node_modules/@kbn/es-archiver/target_node/lib/indices/kibana_index.js:67:3)
         │       at loadAction (node_modules/@kbn/es-archiver/target_node/actions/load.js:84:5)
         │       at EsArchiver.load (node_modules/@kbn/es-archiver/target_node/es_archiver.js:84:12)
         │       at Object.apply (node_modules/@kbn/test/target_node/functional_test_runner/lib/mocha/wrap_function.js:87:16)
Error FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/zy6LgiktQoKhy2U85FfHQQ] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/kQs6lzRsTDuyXE3tz5O6oQ] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:217) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:202) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:258) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1432) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:47) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:763) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
   │            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
   │            at java.lang.Thread.run(Thread.java:833) [?:?]
   │      Caused by: org.elasticsearch.transport.RemoteTransportException: [node-01][127.0.0.1:9300][indices:data/read/get[s]]
   │      Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │            at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:2075) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:1194) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:179) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:95) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:79) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:113) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:34) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:104) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:775) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            ... 3 more
   │ proc [kibana] Unhandled Promise rejection detected:
   │ proc [kibana] 
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"
   │ proc [kibana] ResponseError: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │ proc [kibana]     at KibanaTransport.request (/Users/joe/GitHub/kibana-3/node_modules/@elastic/transport/src/Transport.ts:517:17)
   │ proc [kibana]     at runMicrotasks (<anonymous>)
   │ proc [kibana]     at processTicksAndRejections (node:internal/process/task_queues:96:5)
   │ proc [kibana]     at ApmClient.GetApi [as get] (/Users/joe/GitHub/kibana-3/node_modules/@elastic/elasticsearch/src/api/api/get.ts:59:10) {
   │ proc [kibana]   meta: {
   │ proc [kibana]     body: { error: [Object], status: 503 },
   │ proc [kibana]     statusCode: 503,
   │ proc [kibana]     headers: {
   │ proc [kibana]       'x-opaque-id': 'unknownId',
   │ proc [kibana]       'x-elastic-product': 'Elasticsearch',
   │ proc [kibana]       'content-type': 'application/json;charset=utf-8',
   │ proc [kibana]       'content-length': '663'
   │ proc [kibana]     },
   │ proc [kibana]     meta: {
   │ proc [kibana]       context: null,
   │ proc [kibana]       request: [Object],
   │ proc [kibana]       name: 'elasticsearch-js',
   │ proc [kibana]       connection: [Object],
   │ proc [kibana]       attempts: 0,
   │ proc [kibana]       aborted: false
   │ proc [kibana]     },
   │ proc [kibana]     warnings: [Getter]
   │ proc [kibana]   },
   │ proc [kibana]   isBoom: true,
   │ proc [kibana]   isServer: true,
   │ proc [kibana]   data: null,
   │ proc [kibana]   output: {
   │ proc [kibana]     statusCode: 503,
   │ proc [kibana]     payload: {
   │ proc [kibana]       message: 'No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]',
   │ proc [kibana]       statusCode: 503,
   │ proc [kibana]       error: 'Service Unavailable'
   │ proc [kibana]     },
   │ proc [kibana]     headers: {}
   │ proc [kibana]   },
   │ proc [kibana]   [Symbol(SavedObjectsClientErrorCode)]: 'SavedObjectsClient/esUnavailable'
   │ proc [kibana] }
   │ proc [kibana] 
   │ proc [kibana] Terminating process...
   │ proc [kibana]  server crashed  with status code 1
   │ info [o.e.c.m.MetadataMappingService] [node-01] [.kibana_8.1.0_001/bHurwnsBTn2m5Akuc_ZXzA] update_mapping [_doc]

Expected behavior:

Kibana functional tests should not fail when run locally.

Any additional context:

The interesting thing to me in the logs is that, when the FTS errors out, I always see these lines, without fail:

   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]

I believe that's originating from the Telemetry plugin's getIsOptedIn function here:

const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);

This isn't enclosed in a try/catch, but if I add one:

diff --git a/src/plugins/telemetry/server/plugin.ts b/src/plugins/telemetry/server/plugin.ts
index aa22410358f..22294b03c63 100644
--- a/src/plugins/telemetry/server/plugin.ts
+++ b/src/plugins/telemetry/server/plugin.ts
@@ -134,20 +134,24 @@ export class TelemetryPlugin implements Plugin<TelemetryPluginSetup, TelemetryPl
     return {
       getIsOptedIn: async () => {
         const internalRepository = new SavedObjectsClient(savedObjectsInternalRepository);
-        const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);
-
-        const config = await this.config$.pipe(take(1)).toPromise();
-        const allowChangingOptInStatus = config.allowChangingOptInStatus;
-        const configTelemetryOptIn = typeof config.optIn === 'undefined' ? null : config.optIn;
-        const currentKibanaVersion = this.currentKibanaVersion;
-        const isOptedIn = getTelemetryOptIn({
-          currentKibanaVersion,
-          telemetrySavedObject,
-          allowChangingOptInStatus,
-          configTelemetryOptIn,
-        });
-
-        return isOptedIn === true;
+        try {
+          const telemetrySavedObject = await getTelemetrySavedObject(internalRepository);
+
+          const config = await this.config$.pipe(take(1)).toPromise();
+          const allowChangingOptInStatus = config.allowChangingOptInStatus;
+          const configTelemetryOptIn = typeof config.optIn === 'undefined' ? null : config.optIn;
+          const currentKibanaVersion = this.currentKibanaVersion;
+          const isOptedIn = getTelemetryOptIn({
+            currentKibanaVersion,
+            telemetrySavedObject,
+            allowChangingOptInStatus,
+            configTelemetryOptIn,
+          });
+          return isOptedIn === true;
+        } catch (err) {
+          this.logger.error('WTF: ' + err.message);
+        }
+        return false;
       },
     };
   }

my functional tests work fine, and I see the caught error in the FTS log:

Patched FTS logs
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_task_manager_8.1.0_001/K1ZYrqBPTuufNdeTzBQ_LA] deleting index
   │ info [o.e.c.m.MetadataDeleteIndexService] [node-01] [.kibana_8.1.0_001/8X3-zVeiQoax82cS8WUvlw] deleting index
   │ info [o.e.c.m.MetadataCreateIndexService] [node-01] [.kibana_8.1.0_001] creating index, cause [api], templates [], shards [1]/[0]
   │ info [r.suppressed] [node-01] path: /.kibana_8.1.0/_doc/telemetry%3Atelemetry, params: {index=.kibana_8.1.0, id=telemetry:telemetry}
   │      org.elasticsearch.action.NoShardAvailableActionException: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:217) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.onFailure(TransportSingleShardAction.java:202) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction$2.handleException(TransportSingleShardAction.java:258) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1349) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.processException(TransportService.java:1458) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TransportService$DirectResponseChannel.sendResponse(TransportService.java:1432) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.transport.TaskTransportChannel.sendResponse(TaskTransportChannel.java:50) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.ChannelActionListener.onFailure(ChannelActionListener.java:47) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.onFailure(ActionRunnable.java:77) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onFailure(ThreadContext.java:763) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:28) [elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
   │            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
   │            at java.lang.Thread.run(Thread.java:833) [?:?]
   │      Caused by: org.elasticsearch.transport.RemoteTransportException: [node-01][127.0.0.1:9300][indices:data/read/get[s]]
   │      Caused by: org.elasticsearch.index.shard.IllegalIndexShardStateException: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │            at org.elasticsearch.index.shard.IndexShard.readAllowed(IndexShard.java:2075) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.shard.IndexShard.get(IndexShard.java:1194) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.innerGet(ShardGetService.java:179) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:95) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.index.get.ShardGetService.get(ShardGetService.java:79) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:113) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.get.TransportGetAction.shardOperation(TransportGetAction.java:34) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.lambda$asyncShardOperation$0(TransportSingleShardAction.java:104) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:775) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.1.0-SNAPSHOT.jar:8.1.0-SNAPSHOT]
   │            ... 3 more
   │ proc [kibana] [2021-12-29T14:46:59.760-05:00][ERROR][plugins.telemetry] WTF: No shard available for [get [.kibana_8.1.0][telemetry:telemetry]: routing [null]]: no_shard_available_action_exception: [illegal_index_shard_state_exception] Reason: CurrentState[RECOVERING] operations only allowed when shard state is one of [POST_RECOVERY, STARTED]
   │ info [o.e.c.r.a.AllocationService] [node-01] current.health="GREEN" message="Cluster health status changed from [YELLOW] to [GREEN] (reason: [shards started [[.kibana_8.1.0_001][0]]])." previous.health="YELLOW" reason="shards started [[.kibana_8.1.0_001][0]]"

The only other thing that stands out is that there was a recent addition to the Fleet plugin which now consumes this function from the Telemetry plugin (this was added ~2 months ago via #115180):

this.isOptedIn = await this.isTelemetryOptedIn();

This call is not enclosed in a try/catch, whereas the other usages of this function are.
So it appears this function is getting called in the middle of the migration process, and the resulting error crashes the server.

What I can't work out is: why isn't this error happening in CI?
Is the Fleet plugin disabled in CI? Edit: it looks like the Fleet plugin can't be disabled. But the Telemetry plugin can be disabled, so maybe it's disabled in CI?

At any rate, while the Fleet plugin appears to be triggering this bug, I think the root cause is really in the Telemetry plugin; IMO this function should fail gracefully.

jportner added the bug and Team:Core labels on Dec 29, 2021
@elasticmachine

Pinging @elastic/kibana-core (Team:Core)

@lukeelmers

Added a note on the PR, but after multiple attempts I am unable to reproduce this. Would be interested to see if anyone else can.

@afharo

afharo commented Jan 4, 2022

Is the fleet plugin disabled in CI? Edit: it looks like the Fleet plugin can't be disabled. But the Telemetry plugin can be disabled, so maybe that's disabled in CI?

According to the FT config files, it's enabled and opted-in for your specific config file (because it inherits it from the x-pack/test/api_integration/config.ts file).

I just ran it with the latest main and it didn't fail for me either. Probably the upstream user of the API fixed their unhandled promises?

@jportner

jportner commented Jan 4, 2022

I just ran it with the latest main and it didn't fail for me either. Probably the upstream user of the API fixed their unhandled promises?

I pulled the latest main just now and reproduced it again locally. I also took a look at the problematic consumer (the Fleet plugin), and nothing there has changed.
I guess there is some sort of race condition causing this function to be called during the migration process; when that happens, the error crashes the Kibana server.
