7.12.0 upgrade migrations fail with timeout_exception or receive_timeout_transport_exception #95321

rudolf · 2021-03-24T15:24:13Z

v7.12.0 migrations can fail if the .kibana index has a large number of saved objects or the Elasticsearch cluster is under heavy load.

This causes error logs like:

[.kibana] [receive_timeout_transport_exception]: [instance-0000000002][10.42.1.112:19541][cluster:monitor/task/get] request_id [2648] timed out after [59940ms]
[.kibana] [timeout_exception]: Timed out waiting for completion of [org.elasticsearch.index.reindex.BulkByScrollTask@6a74c54]

The root cause is a bug in the v2 migrations: Kibana waits only 60s for some of the steps. When a cluster is busy or there are a lot of fleet-agent-events, _reindex or _update_by_query tasks can take longer than 60s, leading to a timeout.

Workaround for large fleet-agent-events

If you've used Fleet and have a large number of fleet-agent-events, follow this workaround.

First, establish that there is a large number of fleet-agent-events:

POST .kibana/_search?filter_path=aggregations
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}
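
For anyone scripting this check, here is a minimal Python sketch that reads the document count for one saved object type out of the parsed JSON response to the search above. The function name `saved_object_count` is hypothetical, and the sketch assumes the usual terms aggregation response shape (a `buckets` list of `key`/`doc_count` pairs under `aggregations.saved_object_type`). A terms aggregation returns only the top buckets by default, so a result of 0 means the type is not among the largest types, not necessarily that it is absent.

```python
def saved_object_count(search_response: dict, type_name: str = "fleet-agent-events") -> int:
    """Return the doc_count for type_name from the aggregation response, or 0 if not listed."""
    buckets = (
        search_response.get("aggregations", {})
        .get("saved_object_type", {})
        .get("buckets", [])
    )
    for bucket in buckets:
        if bucket.get("key") == type_name:
            return bucket.get("doc_count", 0)
    # Not among the returned buckets (terms aggregations only return the top types).
    return 0
```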

Shut down Kibana

Wait for any reindex tasks started by Kibana to complete. The output of the following request shouldn't show any reindex tasks for the .kibana* indices:

GET _tasks?actions=*update/byquery,*reindex&group_by=parents&detailed
# Cancel any tasks where the "description" field shows it's operating on a `.kibana*` index
# e.g. `"description": "update-by-query [.kibana_7.12.0_001]"`
POST _tasks/<task id from above>/_cancel
# Verify that all tasks have been cancelled:
GET _tasks?actions=*update/byquery,*reindex&group_by=parents&detailed
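
If the task list is long, picking out the relevant tasks by eye is error-prone. The following is a rough Python sketch that collects the IDs of tasks whose description mentions a `.kibana` index from the parsed JSON of the task list request. The function name `kibana_task_ids` is hypothetical, and the sketch assumes the `group_by=parents` response shape, where `tasks` is an object of parent tasks that each carry `node`, `id`, an optional `description`, and an optional `children` list.

```python
from typing import List


def kibana_task_ids(tasks_response: dict) -> List[str]:
    """Return "<node>:<id>" for every task whose description mentions a .kibana index."""
    found: List[str] = []

    def visit(task: dict) -> None:
        description = task.get("description") or ""
        if ".kibana" in description:
            found.append(f"{task['node']}:{task['id']}")
        # Child tasks are nested under their parent when grouping by parents.
        for child in task.get("children", []):
            visit(child)

    for task in tasks_response.get("tasks", {}).values():
        visit(task)
    return found
```

Each returned ID can then be used in the `_cancel` request; an empty list corresponds to the state the workaround asks you to wait for.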

Delete fleet-agent-events

POST .kibana,.kibana_7.12.0_*/_delete_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": {
    "bool": {
      "must": 
        {
          "term": {
            "type": "fleet-agent-events"
          }
        }
    }
  }
}

Check that the task returned from the above operation has completed before proceeding

GET _tasks/<id>
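
Checking the task by hand can take a while on a large index. Below is a small Python sketch of a polling loop under some assumptions: `get_task` is a hypothetical callable that you supply, which performs the `GET _tasks/<id>` request and returns the parsed JSON, and the response is assumed to carry a boolean `completed` field. The polling interval and retry limit are arbitrary illustrative values.

```python
import time
from typing import Callable


def wait_for_task(get_task: Callable[[], dict],
                  poll_seconds: float = 5.0,
                  max_polls: int = 120) -> dict:
    """Call get_task() until the response reports completed, then return that response."""
    for _ in range(max_polls):
        status = get_task()
        if status.get("completed"):
            return status
        time.sleep(poll_seconds)
    raise TimeoutError(f"task still running after {max_polls} polls")
```

Keeping the HTTP call outside the loop means the same function works with whichever client and authentication setup you already use.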

Restart Kibana

PR #95305

elasticmachine · 2021-03-29T11:22:46Z

Pinging @elastic/kibana-core (Team:Core)

hueyg · 2021-06-17T14:49:56Z

Forgive my ignorance, but how do you carry out these POST and GET commands if you cannot run Kibana to use the Console? I assume you can use curl, but what exactly am I running these commands against? What port?

I came here from the 7.13 documentation and am suffering from this exact issue on a cluster that I have had around since the 6.x days. The upgrades were carried out from the official Elasticsearch repos enabled on Red Hat systems.

pgayvallet · 2021-06-17T15:01:06Z

> but how do you carry out these POST and GET commands if you cannot run Kibana to use the Console

Any CLI or GUI client, such as curl or Postman.

E.g. for curl:

curl -XPOST "https://localhost:9200/.kibana/_search?filter_path=aggregations" -H "Content-Type: application/json" -d'
{
  "aggs": {
    "saved_object_type": {
      "terms": {"field": "type"}
    }
  }
}'

> What port?

Your configured ES port. The default is 9200

hueyg · 2021-06-17T15:04:20Z

Thank you pgayvallet!

wonderland14 · 2022-07-14T15:23:44Z

Hi @pgayvallet, may I have a sample POST request for Postman? I am trying to open the index because my Kibana is not accessible: I accidentally closed the kibana_security index and, I think, some other Kibana indices. I have done some research but am still confused about how to get all the indices re-opened. Sorry for my ignorance about this.

rudolf added the bug, v7.12.0, and project:ResilientSavedObjectMigrations labels Mar 24, 2021

rudolf mentioned this issue Mar 24, 2021: Add docs for v2 migration timeouts related to fleet-agent-events #95370 (Merged)

joshdover added the Team:Core label Mar 29, 2021

joshdover mentioned this issue Mar 30, 2021: migrations v2: Retry tasks that timeout #95305 (Merged)

rudolf closed this as completed in #95305 Apr 2, 2021

joshdover mentioned this issue Aug 5, 2021: Use a scripted reindex for re-writing document _ids for shared saved object types #107740 (Closed)

marekpow mentioned this issue Nov 29, 2021: Can't run kibana on ES 7.15.2 #119902 (Closed)
