
[Fleet]: Kibana upgrade failed from 8.7.1>8.8.0 BC8 when multiple agent policies with integrations exist. #158361

Closed
amolnater-qasource opened this issue May 24, 2023 · 67 comments · Fixed by #159807 or #161249
Assignees
Labels: bug (Fixes for quality problems that affect the customer experience), impact:high (Addressing this issue will have a high level of impact on the quality/strength of our product), QA:Validated (Issue has been validated by QA), Team:Fleet (Team label for Observability Data Collection Fleet team)

Comments

@amolnater-qasource

amolnater-qasource commented May 24, 2023

Kibana Build details:

VERSION: 8.8 BC8 Kibana cloud environment
BUILD: 63142
COMMIT: 2973fcc10d985e4ab94e5eeef976aad0046c6cce
Elasticsearch: 02 zones
Kibana:  01 zone
Integrations Server: 01 zone
Enterprise Server: 01 zone

Host OS and Browser version: All, All

Preconditions:

  1. 8.7.1 production Kibana cloud environment should be available.

Steps to reproduce:

  1. Create an agent policy and add various integrations: System, Endpoint, Nginx, Apache, Linux and Windows.
  2. Duplicate the above policy. [We created 21 duplicates of the above policy; one way to script this is sketched after this list.]
  3. Install 1 agent with one of the policies. [We installed 1 Ubuntu 22 .tar agent.]
  4. Trigger the Kibana upgrade to 8.8.
  5. Observe that after some time the Kibana upgrade fails and the user is unable to access Kibana.
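For step 2, the duplicates can also be scripted rather than clicked through the UI. A minimal sketch, assuming the Fleet agent policy copy endpoint (the one behind the Duplicate action); host, credentials and the policy id are placeholders:

    # Duplicate an existing agent policy 20 times (placeholder host/credentials/policy id)
    for i in $(seq 1 20); do
      curl -sk -XPOST --user elastic:changeme \
        -H 'kbn-xsrf: true' -H 'Content-Type: application/json' \
        "http://localhost:5601/api/fleet/agent_policies/<POLICY_ID>/copy" \
        -d "{\"name\": \"Agent policy copy $i\", \"description\": \"\"}"
    done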

What's working fine:

  • Kibana upgrades successfully from 8.7.1 to 8.8.0 BC8 when no data is available (blank Kibana).

Note:

  • We were able to reproduce this issue on two different environments.

Expected:
The Kibana upgrade should succeed from 8.7.1 to 8.8.0 BC8 when multiple agent policies with integrations exist.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-05-24.15-26-58.mp4


@amolnater-qasource amolnater-qasource added bug Fixes for quality problems that affect the customer experience impact:high Addressing this issue will have a high level of impact on the quality/strength of our product. Team:Fleet Team label for Observability Data Collection Fleet team labels May 24, 2023
@elasticmachine
Contributor

Pinging @elastic/fleet (Team:Fleet)

@amolnater-qasource
Author

@manishgupta-qasource Please review.

@manishgupta-qasource

Secondary review for this ticket is Done

@jlind23
Contributor

jlind23 commented May 24, 2023

@amolnater-qasource @manishgupta-qasource Am I correct to assume that this is a Kibana rather than a Fleet issue?

@juliaElastic
Contributor

I think we have to check the logs for the root cause.
@amolnater-qasource Could you share the cloud admin link for this instance?

@amolnater-qasource
Author

Hi @jlind23 @juliaElastic

Thank you for looking into this issue.
A blank Kibana (with no changes) upgraded successfully from 8.7.1 to 8.8.0.
However, the issue is only reproducible when the data shared above was added.

Further, we observed the failure specifically in the Integrations Server.
Screenshots: [attached]

The other 03 modules, Elasticsearch, Kibana and Enterprise Server, upgraded successfully. [screenshots attached]

Further, we are sharing the cloud admin link over Slack. @juliaElastic

Please let us know if anything else is required from our end.
Thanks!

@juliaElastic
Contributor

juliaElastic commented May 24, 2023

Checked the logs and saw this error in the Integrations Server logs around 10am when the restart happened:
Kibana Fleet setup failed: http POST request to http://0631b66d429f426086af0c70b1f7b2fb.containerhost:9244/api/fleet/setup fails: The requested resource is currently unavailable.: %!w(<nil>). Response: {"ok":false,"message":"The requested resource is currently unavailable."}

In the Kibana logs it looks like Kibana is continuously being killed and restarted; there are a lot of logs around SO migrations.

/app/kibana.sh: line 80:    17 Killed                  NODE_OPTIONS="--max-old-space-size=800" ${KIBANA_HOME}/bin/kibana -c /app/config/kibana.yml $*



May 25, 2023 @ 11:33:43.948  INFO  Starting [139] plugins: [usageCollection,telemetryCollectionManager,telemetryCollectionXpack,taskManager,kibanaUsageCollection,cloud,translations,share,screenshotMode,newsfeed,savedObjectsFinder,monitoringCollection,licensing,mapsEms,globalSearch,globalSearchProviders,features,guidedOnboarding,banners,licenseApiGuard,customBranding,ftrApis,fieldFormats,expressions,screenshotting,dataViews,cloudExperiments,charts,esUiShared,customIntegrations,home,searchprofiler,painlessLab,management,cloudDataMigration,advancedSettings,spaces,security,telemetry,licenseManagement,snapshotRestore,lists,files,encryptedSavedObjects,cloudFullStory,cloudChat,eventLog,actions,notifications,grokdebugger,console,contentManagement,bfetch,data,watcher,unifiedFieldList,savedSearch,savedObjectsTagging,savedObjectsManagement,unifiedSearch,graph,embeddable,uiActionsEnhanced,presentationUtil,expressionShape,expressionRevealImage,expressionRepeatImage,expressionMetric,expressionImage,controls,fileUpload,ingestPipelines,eventAnnotation,ecsDataQualityDashboard,dataViewFieldEditor,visualizations,visTypeXy,visTypeVislib,visTypeVega,visTypeTimeseries,visTypeTimelion,visTypeTagcloud,visTypeTable,visTypeMetric,visTypeMarkdown,visTypeHeatmap,expressionXY,expressionTagcloud,expressionPartitionVis,visTypePie,expressionMetricVis,expressionLegacyMetricVis,expressionHeatmap,expressionGauge,visTypeGauge,dashboard,lens,maps,aiops,dashboardEnhanced,dataViewManagement,alerting,triggersActionsUi,transform,stackConnectors,stackAlerts,ruleRegistry,cases,timelines,sessionView,kubernetesSecurity,threatIntelligence,discover,reporting,canvas,fleet,osquery,indexManagement,rollup,remoteClusters,crossClusterReplication,indexLifecycleManagement,cloudSecurityPosture,cloudDefend,exploratoryView,observability,observabilityOnboarding,discoverEnhanced,dataVisualizer,ml,synthetics,securitySolution,infra,upgradeAssistant,monitoring,logstash,enterpriseSearch,apm,assetManager]
May 25, 2023 @ 11:33:43.947  INFO  Stopping all plugins.
May 25, 2023 @ 11:33:43.895  INFO  [.kibana_task_manager] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 1ms.
May 25, 2023 @ 11:33:43.895  INFO  [.kibana_task_manager] Migration completed after 187ms
May 25, 2023 @ 11:33:43.894  INFO  [.kibana_task_manager] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 0ms.
May 25, 2023 @ 11:33:43.894  INFO  [.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 5ms.
May 25, 2023 @ 11:33:43.889  INFO  [.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 11ms.
May 25, 2023 @ 11:33:43.878  INFO  [.kibana_task_manager] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 6ms.
May 25, 2023 @ 11:33:43.876  INFO  [.kibana_ingest] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 0ms.
May 25, 2023 @ 11:33:43.876  INFO  [.kibana_ingest] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 10ms.
May 25, 2023 @ 11:33:43.876  INFO  [.kibana_ingest] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 0ms.
May 25, 2023 @ 11:33:43.876  INFO  [.kibana_ingest] Migration completed after 114ms
May 25, 2023 @ 11:33:43.874  INFO  [.kibana_alerting_cases] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 0ms.
May 25, 2023 @ 11:33:43.874  INFO  [.kibana_alerting_cases] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 18ms.
May 25, 2023 @ 11:33:43.874  INFO  [.kibana_alerting_cases] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 0ms.
May 25, 2023 @ 11:33:43.874  INFO  [.kibana_alerting_cases] Migration completed after 118ms
May 25, 2023 @ 11:33:43.872  INFO  [.kibana_task_manager] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 10ms.
May 25, 2023 @ 11:33:43.872  INFO  [.kibana_task_manager] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 0ms.
May 25, 2023 @ 11:33:43.866  INFO  [.kibana_ingest] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 17ms.
May 25, 2023 @ 11:33:43.865  INFO  [.kibana_security_solution] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 1ms.
May 25, 2023 @ 11:33:43.865  INFO  [.kibana_security_solution] Migration completed after 113ms
May 25, 2023 @ 11:33:43.864  INFO  [.kibana_security_solution] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 13ms.
May 25, 2023 @ 11:33:43.864  INFO  [.kibana_security_solution] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 0ms.
May 25, 2023 @ 11:33:43.862  INFO  [.kibana_task_manager] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 154ms.
May 25, 2023 @ 11:33:43.861  INFO  [.kibana] Migration completed after 157ms
May 25, 2023 @ 11:33:43.860  INFO  [.kibana] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 6ms.
May 25, 2023 @ 11:33:43.860  INFO  [.kibana] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 0ms.
May 25, 2023 @ 11:33:43.860  INFO  [.kibana] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 0ms.
May 25, 2023 @ 11:33:43.856  INFO  [.kibana_alerting_cases] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 36ms.
May 25, 2023 @ 11:33:43.854  INFO  [.kibana] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 39ms.
May 25, 2023 @ 11:33:43.851  INFO  [.kibana_security_solution] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 42ms.
May 25, 2023 @ 11:33:43.849  INFO  [.kibana_ingest] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 35ms.
May 25, 2023 @ 11:33:43.846  INFO  [.kibana_analytics] Migration completed after 102ms
May 25, 2023 @ 11:33:43.845  INFO  [.kibana_analytics] CHECK_TARGET_MAPPINGS -> CHECK_VERSION_INDEX_READY_ACTIONS. took: 1ms.
May 25, 2023 @ 11:33:43.845  INFO  [.kibana_analytics] CHECK_VERSION_INDEX_READY_ACTIONS -> DONE. took: 0ms.
May 25, 2023 @ 11:33:43.844  INFO  [.kibana_analytics] OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT -> CHECK_TARGET_MAPPINGS. took: 26ms.
May 25, 2023 @ 11:33:43.820  INFO  [.kibana_alerting_cases] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 9ms.
May 25, 2023 @ 11:33:43.818  INFO  [.kibana_analytics] OUTDATED_DOCUMENTS_SEARCH_READ -> OUTDATED_DOCUMENTS_SEARCH_CLOSE_PIT. took: 12ms.
May 25, 2023 @ 11:33:43.815  INFO  [.kibana] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 11ms.
May 25, 2023 @ 11:33:43.814  INFO  [.kibana_ingest] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 1ms.
May 25, 2023 @ 11:33:43.813  INFO  [.kibana_ingest] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 12ms.
May 25, 2023 @ 11:33:43.811  INFO  [.kibana_alerting_cases] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 0ms.
May 25, 2023 @ 11:33:43.811  INFO  [.kibana_alerting_cases] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 21ms.
May 25, 2023 @ 11:33:43.809  INFO  [.kibana_security_solution] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 10ms.
May 25, 2023 @ 11:33:43.806  INFO  [.kibana_analytics] OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT -> OUTDATED_DOCUMENTS_SEARCH_READ. took: 10ms.
May 25, 2023 @ 11:33:43.804  INFO  [.kibana] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 1ms.
May 25, 2023 @ 11:33:43.803  INFO  [.kibana] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 22ms.
May 25, 2023 @ 11:33:43.801  INFO  [.kibana_ingest] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 39ms.
May 25, 2023 @ 11:33:43.799  INFO  [.kibana_security_solution] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 1ms.
May 25, 2023 @ 11:33:43.798  INFO  [.kibana_security_solution] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 12ms.
May 25, 2023 @ 11:33:43.796  INFO  [.kibana_analytics] UPDATE_SOURCE_MAPPINGS_PROPERTIES -> OUTDATED_DOCUMENTS_SEARCH_OPEN_PIT. took: 1ms.
May 25, 2023 @ 11:33:43.795  INFO  [.kibana_analytics] WAIT_FOR_YELLOW_SOURCE -> UPDATE_SOURCE_MAPPINGS_PROPERTIES. took: 11ms.
May 25, 2023 @ 11:33:43.790  INFO  [.kibana_alerting_cases] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 34ms.
May 25, 2023 @ 11:33:43.787  INFO  [.kibana_security_solution] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 34ms.
May 25, 2023 @ 11:33:43.784  INFO  [.kibana_analytics] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 40ms.
May 25, 2023 @ 11:33:43.781  INFO  [.kibana] INIT -> WAIT_FOR_YELLOW_SOURCE. took: 77ms.
May 25, 2023 @ 11:33:43.677  INFO  Waiting until all Elasticsearch nodes are compatible with Kibana before starting saved objects migrations...
May 25, 2023 @ 11:33:43.677  Starting saved objects migrations


@juliaElastic
Contributor

@amolnater-qasource Is this issue only happening in cloud? Does the upgrade succeed on-prem with this many policies and integrations?

@amolnater-qasource
Author

Hi @juliaElastic

We have revalidated this issue on an on-prem setup and found it reproducible there too.

Steps followed:

  • Set up an on-prem 8.7.1 environment.
  • Installed Fleet Server and created 21 policies with the System, Fleet Server, Endpoint, Nginx, Apache, Linux and Windows integrations.
  • Copied the 8.7.1 config and data folders from Elasticsearch and Kibana to the 8.8.0 BC8 artifacts.
  • We observed that the upgrade gets stuck and Kibana never becomes accessible.

Elasticsearch Logs:
Elasticsearch bat logs.txt

Kibana Logs:
Kibana bat logs.txt

Screen Recording:

Amol.Self-Win.1.-.ec2-3-91-98-132.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-05-25.15-25-24.mp4


Please let us know if anything else is required from our end.
Thanks!

@juliaElastic
Contributor

The on-prem error seems to be different; it times out on the .kibana index split. I'll try to reproduce it locally.

[2023-05-25T10:12:44.207+00:00][ERROR][savedobjects-service] [.kibana] Action failed with '[index_not_green_timeout] Timeout waiting for the status of the [.kibana_8.8.0_001] index to become 'green' Refer to https://www.elastic.co/guide/en/kibana/8.8/resolve-migrations-failures.html#_repeated_time_out_requests_that_eventually_fail for information on how to resolve the issue.'. Retrying attempt 1 in 2 seconds.
[2023-05-25T10:12:44.212+00:00][INFO ][savedobjects-service] [.kibana] CLONE_TEMP_TO_TARGET -> CLONE_TEMP_TO_TARGET. took: 91628ms.
[2023-05-25T10:12:44.226+00:00][ERROR][savedobjects-service] [.kibana_ingest] Action failed with '[index_not_green_timeout] Timeout waiting for the status of the [.kibana_ingest_8.8.0_001] index to become 'green' Refer to https://www.elastic.co/guide/en/kibana/8.8/resolve-migrations-failures.html#_repeated_time_out_requests_that_eventually_fail for information on how to resolve the issue.'. Retrying attempt 1 in 2 seconds.
[2023-05-25T10:12:44.230+00:00][INFO ][savedobjects-service] [.kibana_ingest] CLONE_TEMP_TO_TARGET -> CLONE_TEMP_TO_TARGET. took: 91629ms.

@juliaElastic
Contributor

juliaElastic commented May 25, 2023

On the cloud issue, it seems that Kibana was going OOM on upgrade; after giving it more memory (4GB), the setup completed successfully.
Also tested on an instance with 2GB of Kibana memory, and the upgrade worked fine. So it seems we hit the OOM issue on 1GB only (with the test case above: 20 agent policies, 120 integration policies).

@amolnater-qasource
Author

Hi Team,

We have also validated this issue when upgrading from 8.6.2 to 8.7.1 on a production Kibana cloud environment and found it not reproducible there.

Setup Details:

  • Kibana RAM: 1GB
  • Agent Policies: 22
  • Agent: 01

Observations:

  • Kibana upgraded successfully to the latest version.

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-05-25.18-32-00.mp4

Please let us know if anything else is required from our end.
Thanks!

@juliaElastic juliaElastic self-assigned this May 26, 2023
@jlind23 jlind23 assigned nchaulet and unassigned juliaElastic Jun 13, 2023
@nchaulet
Member

@amolnater-qasource I tried to reproduce this issue against 8.8.1 and it does not seem reproducible (and I saw that some changes to SO migrations related to memory usage have been merged: #157494).

@juliaElastic
Contributor

juliaElastic commented Jun 14, 2023

@nchaulet Have you tried on self-managed or cloud?
I could reproduce it before in cloud with 1GB of Kibana memory.
On self-managed, I couldn't reproduce it on my local machine, but there the Kibana migration was causing the issue, which might have been improved by the linked GH issue.

@amolnater-qasource
Author

Hi @nchaulet

We have revalidated this issue on upgrade from 8.7.1 to 8.8.1 and found it still reproducible with the steps below:

  1. Create an agent policy and add various integrations: System, Endpoint, Nginx, Apache, Linux and Windows.
  2. Duplicate the above policy. [We created 21 duplicates of the above policy.]
  3. Install 1 agent with one of the policies. [We installed 1 Ubuntu 22 .tar agent.]
  4. Trigger the Kibana upgrade to 8.8.1.
  5. Observe that after some time the Kibana upgrade fails and the user is unable to access Kibana.

Setup Details:

  • Kibana RAM: 1GB
  • Agent Policies: 22
  • Agent: 01

Screen Recording:

Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-06-14.12-01-52.mp4


cc: @jlind23 @juliaElastic

Please let us know if anything else is required from our end.
Thanks

@juliaElastic
Contributor

juliaElastic commented Jun 14, 2023

Investigating on the latest cluster shared by Amol: https://admin.found.no/deployments/c3cf9384322f4686946b78f0e2c6147c

In the Kibana logs, we see a constant restart every 1-2 minutes or so. The logs are mainly about Kibana SO migrations; there is no clear error.

Checked the Kibana metrics dashboard of the Application containers, and see 100% CPU usage that correlates with the restarts:
https://platform-logging.kb.europe-west1.gcp.cloud.es.io/app/dashboards#/view/Compute-Applications-containers-overview?_g=h@c823129
Memory also jumps to around 80%.
[screenshot]
Previously we had confirmation that the container is being restarted due to an OOM issue.

The question is whether this is caused by the Kibana migration or by something in Fleet. I don't think the cause is the SO migration in cloud: even though there are logs about it on every restart, each step takes about 0ms, which means the migration already happened and the logic only checks that there is nothing left to migrate.

The issue experienced here is different from the OOMs seen in SDHs, where ES was reaching OOM and logging circuit breaker errors. Here I don't see the ES containers reaching CPU or memory limits, so the issue is on the Kibana side.

@juliaElastic
Contributor

juliaElastic commented Jun 14, 2023

I think I found something.
Noticed this in the logs: Found 22 outdated agent policies
This is logged by the logic that updates agent policies on a schema update.

There was a change to the agent policy schema version in 8.8, so that's why the update is triggered.

I removed the agent policies from the .kibana_ingest index, and after this Kibana went back to healthy.
So I think it's possible that updating all policies at once in this logic uses too much memory.
I'll add some tracing and memory measurement to see if this is really the culprit.

There is already a batch size of 100 for these updates, and it is configurable: #150688
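For reference, a minimal sketch of lowering that batch size in kibana.yml (the setting name is the one used later in this thread; the value of 10 is only an illustration):

    # kibana.yml -- illustrative value only
    xpack.fleet.setup.agentPolicySchemaUpgradeBatchSize: 10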

@jlind23
Contributor

jlind23 commented Jun 14, 2023

@juliaElastic We know that we have user deployments with hundreds of policies or more; we should probably lower the default batch size value. Wdyt?

@juliaElastic
Contributor

Let me test with a lower setting to see if it helps. Then we can lower the default.

@rudolf
Contributor

rudolf commented Jun 14, 2023

To investigate further I would suggest first reproducing locally by starting Kibana with --max-old-space-size=800 (the value used on cloud 1GB instances).

Then create some heap snapshots:

  1. Increase the --max-old-space-size to something big like 2GB and enable heap snapshots with --heapsnapshot-signal=SIGUSR2
  2. Enable the metrics logger by adding the following to kibana.yml so we can track memory usage:

     logging.loggers:
       - name: metrics.ops
         level: debug

  3. Take a first heap snapshot after Kibana first becomes available (which will be used as a reference) with sudo kill -USR2 $(cat /run/kibana/kibana.pid)
  4. Take a second heap snapshot when Kibana is using a significant amount of memory with sudo kill -USR2 $(cat /run/kibana/kibana.pid)

(Depending on how you install/run Kibana, /run/kibana/kibana.pid might not be available, but you can substitute $(cat /run/kibana/kibana.pid) with the PID of the Kibana process.)
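Putting those steps together, a rough sketch (the 2GB value and paths are illustrative; adjust them to how Kibana is installed and run):

    # Start Kibana with a larger heap and heap snapshots enabled (illustrative values/paths)
    NODE_OPTIONS="--max-old-space-size=2048 --heapsnapshot-signal=SIGUSR2" ./bin/kibana -c config/kibana.yml

    # Take a reference heap snapshot once Kibana becomes available
    sudo kill -USR2 $(cat /run/kibana/kibana.pid)

    # Take a second snapshot once memory usage is high, then diff the two in Chrome DevTools
    sudo kill -USR2 $(cat /run/kibana/kibana.pid)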

@nchaulet
Member

I found a memory leak in the Fleet API that seems to be related to our usage of AsyncLocalStorage. I created issue #159762 and will add some heap dumps and more details there. But I'm not sure if it's related to this issue.

@juliaElastic
Contributor

juliaElastic commented Jun 15, 2023

Started Kibana locally (latest main) with these settings: export NODE_OPTIONS="--max-old-space-size=800 --heapsnapshot-signal=SIGUSR2" and started adding integrations to reproduce the issue. Added system, nginx, apache, linux, windows. When I tried to add the endpoint integration, Kibana crashed with OOM.
I started to see high "Event loop utilization" even before the crash.

[2023-06-15T09:18:16.882+02:00][WARN ][http.server.Kibana] Event loop utilization for /julia/api/fleet/epm/packages/endpoint/8.8.0 exceeded threshold of 250ms (1023ms out of 1086ms) and 15% (94%)
<--- Last few GCs --->

[61867:0x7fad80008000]  1115881 ms: Mark-sweep 746.4 (834.7) -> 736.0 (841.4) MB, 162.3 / 0.2 ms  (average mu = 0.871, current mu = 0.391) allocation failure scavenge might not succeed
[61867:0x7fad80008000]  1116183 ms: Mark-sweep 751.7 (841.4) -> 742.3 (846.7) MB, 288.9 / 0.1 ms  (average mu = 0.695, current mu = 0.046) allocation failure scavenge might not succeed


<--- JS stacktrace --->

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
 1: 0x107f81675 node::Abort() (.cold.1) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 2: 0x106c79f59 node::Abort() [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 3: 0x106c7a13e node::OOMErrorHandler(char const*, bool) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 4: 0x106def100 v8::Utils::ReportOOMFailure(v8::internal::Isolate*, char const*, bool) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 5: 0x106def0c3 v8::internal::V8::FatalProcessOutOfMemory(v8::internal::Isolate*, char const*, bool) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 6: 0x106f90815 v8::internal::Heap::FatalProcessOutOfMemory(char const*) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 7: 0x106f8f19c v8::internal::Heap::CollectGarbage(v8::internal::AllocationSpace, v8::internal::GarbageCollectionReason, v8::GCCallbackFlags) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 8: 0x106f9ba40 v8::internal::Heap::AllocateRawWithLightRetrySlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 9: 0x106f9bac1 v8::internal::Heap::AllocateRawWithRetryOrFailSlowPath(int, v8::internal::AllocationType, v8::internal::AllocationOrigin, v8::internal::AllocationAlignment) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
10: 0x106f60ff4 v8::internal::FactoryBase<v8::internal::Factory>::NewFixedArrayWithFiller(v8::internal::Handle<v8::internal::Map>, int, v8::internal::Handle<v8::internal::Oddball>, v8::internal::AllocationType) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
11: 0x1071fc4b6 v8::internal::Handle<v8::internal::NameDictionary> v8::internal::HashTable<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::EnsureCapacity<v8::internal::Isolate>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::NameDictionary>, int, v8::internal::AllocationType) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
12: 0x1071fcf43 v8::internal::Handle<v8::internal::NameDictionary> v8::internal::Dictionary<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::Add<v8::internal::Isolate>(v8::internal::Isolate*, v8::internal::Handle<v8::internal::NameDictionary>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyDetails, v8::internal::InternalIndex*) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
13: 0x1071fce9f v8::internal::BaseNameDictionary<v8::internal::NameDictionary, v8::internal::NameDictionaryShape>::Add(v8::internal::Isolate*, v8::internal::Handle<v8::internal::NameDictionary>, v8::internal::Handle<v8::internal::Name>, v8::internal::Handle<v8::internal::Object>, v8::internal::PropertyDetails, v8::internal::InternalIndex*) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
14: 0x107327667 v8::internal::Runtime_AddDictionaryProperty(int, unsigned long*, v8::internal::Isolate*) [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
15: 0x1076c93d9 Builtins_CEntry_Return1_DontSaveFPRegs_ArgvOnStack_NoBuiltinExit [/Users/juliabardi/.nvm/versions/node/v16.19.1/bin/node]
 please hold  Kibana server is not ready so requests have been paused

I did 2 heap dumps, the first after Kibana started, the second after creating a few integrations. Checked them in Chrome dev tools, and I see some big deltas, but so far I don't recognize anything related to Fleet.
Uploaded them here:
curl -L -H 'Authorization: 812223999755b259' -o 'heapdumps.zip' https://upload.elastic.co/d/79252cf4002a59014dc9348fb919a205adc9b9ef318746b4f3c35177779c9c92

Also tried commenting out the usage of AsyncLocalStorage in audit_logging.ts, but still got an OOM after adding one more endpoint integration policy.

@jlind23
Contributor

jlind23 commented Jun 15, 2023

@juliaElastic did you reduce the SO batch size, or is it not related to the OOM you faced?

@juliaElastic
Contributor

juliaElastic commented Jun 23, 2023

Seeing the same logs in Kibana showing that it is constantly restarted: https://34390efcbaff40e8a6bc54f9f17e93f9.eastus2.staging.azure.foundit.no:9243/app/r/s/n9Dbd

So I think the AsyncLocalStorage fix helps with the Kibana Integrations degradation over time, as it fixes a memory leak.

The stack upgrade issue looks different: something uses more memory during the Kibana upgrade, making 1GB insufficient.
It is weird that I don't see high memory usage in the Kibana metrics: https://34390efcbaff40e8a6bc54f9f17e93f9.eastus2.staging.azure.foundit.no:9243/app/monitoring#/kibana/instances/c667fde5-af47-457b-b3e0-a4fa239cf67e

@rudolf
Contributor

rudolf commented Jun 23, 2023

would we be able to get a memory dump of this deployment?

@jlind23 Yes I'll work with @juliaElastic and SRE to get a heap snapshot taken.

@rudolf
Contributor

rudolf commented Jun 23, 2023

Kibana does not run out of heap but gets killed by the OOM-killer after RSS exceeds 1GB. This likely means it's not running out of heap (which is controlled by V8) but consuming too much external memory. Likely causes could be unzipping large files, lots of concurrent connections consuming a lot of external memory for the sockets, or page fragmentation.
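For context, Node itself reports the split between V8 heap and total resident memory via process.memoryUsage(); a quick, illustrative way to compare rss, heapUsed and external for a process:

    node -e 'const m = process.memoryUsage(); console.log({ rssMB: Math.round(m.rss / 1e6), heapUsedMB: Math.round(m.heapUsed / 1e6), externalMB: Math.round(m.external / 1e6) });'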

@rudolf
Contributor

rudolf commented Jun 23, 2023

Confirmed that the OOM-killer is killing Kibana because it has a very large RSS even when the heap is small. When using a 4GB container, the total heap is ~400MB while RSS is more than 1GB larger.

Screenshot 2023-06-24 at 00 51 46

@rudolf
Contributor

rudolf commented Jun 23, 2023

Node.js/V8 does not have any good tools to further analyse why the RSS might be so high.

Suggested next steps:

  • Create a detailed explanation of what Fleet is doing during this upgrade and why. How could 20 policies create so much more memory load?
  • Isolate potential PRs that could have caused this. Can we reproduce this on older versions; could the same happen from 8.6 -> 8.7? If this is definitely an 8.8 regression, what are the most likely PRs?
  • Rule out glibc memory fragmentation by running on macOS with stack monitoring and comparing kibana_stats.process.memory.resident_set_size_in_bytes (I can export the dashboard I created on Julia's monitoring cluster: https://34390efcbaff40e8a6bc54f9f17e93f9.eastus2.staging.azure.foundit.no:9243/app/r/s/ihDao). A query sketch follows this list.
  • Did we introduce or start heavily relying on any dependencies that use native Node.js modules?
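A rough sketch of pulling that RSS series out of stack monitoring data (the index pattern, timestamp field, host and credentials are assumptions; the RSS field name is the one quoted above):

    curl -s -u elastic:changeme -H 'Content-Type: application/json' \
      'http://localhost:9200/.monitoring-kibana-*/_search' -d '
    {
      "size": 0,
      "aggs": {
        "rss_over_time": {
          "date_histogram": { "field": "timestamp", "fixed_interval": "1m" },
          "aggs": {
            "max_rss": { "max": { "field": "kibana_stats.process.memory.resident_set_size_in_bytes" } }
          }
        }
      }
    }'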

@sophiec20
Contributor

What is the latest status of this high-impact issue?

@jlind23
Contributor

jlind23 commented Jun 28, 2023

So far @rudolf and @juliaElastic have worked on this; here are the findings:

  • This happens only on 1GB Kibana instances because there is not enough memory available.
  • This happens on clusters that contain many integrations and policies, which might indicate that this is caused by one of the Fleet features introduced recently.
  • This happens only on cloud, as we have a process watching the resident memory and killing the container if it goes above a given threshold.

@juliaElastic will reproduce this locally as soon as she can, observe the resident memory allocation, and see what could be causing this.

@amolnater-qasource amolnater-qasource removed the QA:Ready for Testing Code is merged and ready for QA to validate label Jun 29, 2023
@juliaElastic
Contributor

juliaElastic commented Jun 30, 2023

Started reproducing this locally and am seeing a jump of about 200MB in heap memory and 600MB in RSS after the 8.8 upgrade (at 13:14 on the dashboard below).

[dashboard screenshot]

I'm going to test reverting a few features in 8.8 to see if it makes a difference.

EDIT: I did another test, commenting out the calls to MessageSigningService added in 8.8 (which generates a key pair and adds a signature to agent policies), and repeated the upgrade to 8.8; I am not seeing any memory increase compared to 8.7.

[dashboard screenshot]

@elastic/security-defend-workflows Any ideas what could cause the significant increase in memory from message signing? Do you know if any profiling was done for this feature?

EDIT: I tried to reproduce the issue again by enabling message signing and doing the upgrade again, but I am not seeing those high memory values now.
I am wondering if the issue could be related to EPR packages, where we have some caching (in Kibana memory and also in the browser).

@kevinlog
Contributor

@juliaElastic - thanks for bubbling this up.

EDIT: I tried to reproduce the issue again by enabling message signing and doing the upgrade again, but I am not seeing those high memory values now.

Is the current thinking that the message signing is not causing the problem?

@joeypoon - any ideas about the message signing service and potential for high memory usage?

@juliaElastic
Contributor

@kevinlog Yeah, I can't confirm that message signing is causing a problem; it is strange that I can't reproduce the high memory usage again.
I tried clearing the cache in Chrome, and Kibana is being restarted, so the memory cache should be cleared.

@joeypoon
Member

@joeypoon - any ideas about the message signing service and potential for high memory usage?

We did bump the agent policy schema version for 8.8, so all policies will get upgraded. The upgrade logic itself isn't new, but maybe the default batch size is too large for smaller instances with many policies? There was discussion about making the policy deploy logic more efficient by introducing bulking logic, as it's making individual calls per policy right now.

@juliaElastic
Contributor

juliaElastic commented Jul 3, 2023

I tested a few more things:

  • In cloud, 8.7 uses about 500MB RSS, and it stays about the same when upgrading to 8.8 without any agent policies. Upgrading with 20 agent policies makes RSS jump to about 1.2GB.

  • I could reproduce locally with the downloaded 8.7.1 and 8.8.2 versions (not the build from source):

    • 8.7.1 used 500MB RSS, and it went up to 1.6GB RSS with the upgrade. The SO migration took about 35s at 14:57.
    • Interestingly, when running Kibana from source (latest 8.8 branch) I didn't see such a big jump (it went from 800MB in 8.7 to about 900-1000MB in 8.8). One difference I noticed is that the source version was missing bundled packages.

  • I tried again with Kibana from source and the bundled packages copied in, and am now seeing a big jump to more than 2GB RSS, so I think bundled packages definitely have something to do with it. I'll keep looking.

@jlind23
Contributor

jlind23 commented Jul 3, 2023

@juliaElastic any chance we can change the batch size for the SO migration in order to avoid such a memory increase?

@juliaElastic
Contributor

juliaElastic commented Jul 4, 2023

We did bump the agent policy schema version for 8.8, so all policies will get upgraded. The upgrade logic itself isn't new, but maybe the default batch size is too large for smaller instances with many policies? There was discussion about making the policy deploy logic more efficient by introducing bulking logic, as it's making individual calls per policy right now.

Thanks for the tip. I think updating the agent policy schema version has something to do with this: I tried the same 20-agent-policy update with batch size 1, and am seeing a slower increase in RSS with a lower peak.
I think we are loading the packages from EPR multiple times and not using the cache, because the calls per agent policy follow each other too quickly.

[screenshot]
[2023-07-04T08:47:26.135+02:00][DEBUG][plugins.fleet] setting file list to the cache for apache-1.13.0
Trace: 
    at setPackageInfo (cache.ts:82:11)
    at getEsPackage (storage.ts:225:17)
    at runMicrotasks (<anonymous>)
    at processTicksAndRejections (node:internal/process/task_queues:96:5)
    at getPackageFromSource (get.ts:359:13)
    at getPackageInfo (get.ts:250:31)
    at full_agent_policy.ts:81:27
    at async Promise.all (index 4)
    at getFullAgentPolicy (full_agent_policy.ts:77:3)
    at async Promise.all (index 0)
    at AgentPolicyService.deployPolicies (agent_policy.ts:810:26)
    at upgradeAgentPolicySchemaVersion (upgrade_agent_policy_schema_version.ts:42:5)
    at createSetupSideEffects (setup.ts:181:3)
    at awaitIfPending (setup_utils.ts:30:20)
    at plugin.ts:496:9

I think we have to improve the getFullAgentPolicy function, which reads each package from EPR; this function is called for each of the 20 agent policies. A rough sketch of the caching idea is below.
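A minimal sketch of the idea, not the actual Kibana code: memoize the per-package lookups so that deploying many policies that share the same packages only hits EPR/ES once per package. The PackageInfo type and fetchPackageInfoFromSource helper below are placeholders for whatever the real code uses.

    // Hypothetical illustration only.
    interface PackageInfo {
      name: string;
      version: string;
      // ...other fields
    }

    // Placeholder for the real lookup that reads a package from EPR or ES.
    declare function fetchPackageInfoFromSource(name: string, version: string): Promise<PackageInfo>;

    const packageInfoCache = new Map<string, Promise<PackageInfo>>();

    function getPackageInfoCached(name: string, version: string): Promise<PackageInfo> {
      const key = `${name}-${version}`;
      let cached = packageInfoCache.get(key);
      if (!cached) {
        // Store the promise so concurrent callers for the same package share one fetch.
        cached = fetchPackageInfoFromSource(name, version);
        packageInfoCache.set(key, cached);
      }
      return cached;
    }

Storing the promise (rather than the resolved value) matters here, since the per-policy calls fire almost concurrently and would otherwise all miss the cache.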

@juliaElastic
Contributor

juliaElastic commented Jul 4, 2023

Tested the downloaded 8.8.2 Kibana with the xpack.fleet.setup.agentPolicySchemaUpgradeBatchSize: 1 config; the RSS reaches about 1GB (previously it was 1.6GB), and I can see in the logs that the logic uses the cache more this time. Still a 500MB jump; I'll test whether this goes through in cloud on a 1GB Kibana instance.

[2023-07-04T10:16:44.227+02:00][DEBUG][plugins.fleet] Found 21 outdated agent policies
[2023-07-04T10:16:44.254+02:00][WARN ][plugins.fleet] Unable to get fleet server hosts for policy 9defe570-fb9f-11ed-87b3-0fbf68cde96d: Default Fleet Server host is not setup
[2023-07-04T10:16:44.516+02:00][DEBUG][plugins.fleet] setting file list to the cache for linux-0.6.9
[2023-07-04T10:16:44.518+02:00][DEBUG][plugins.fleet] setting package info to the cache for linux-0.6.9
[2023-07-04T10:16:44.519+02:00][DEBUG][plugins.fleet] retrieved installed package linux-0.6.9 from ES
[2023-07-04T10:16:44.530+02:00][DEBUG][plugins.fleet] setting file list to the cache for windows-1.22.0
[2023-07-04T10:16:44.534+02:00][DEBUG][plugins.fleet] setting package info to the cache for windows-1.22.0
[2023-07-04T10:16:44.534+02:00][DEBUG][plugins.fleet] retrieved installed package windows-1.22.0 from ES
[2023-07-04T10:16:44.551+02:00][DEBUG][plugins.fleet] setting file list to the cache for nginx-1.11.1
[2023-07-04T10:16:44.553+02:00][DEBUG][plugins.fleet] setting package info to the cache for nginx-1.11.1
[2023-07-04T10:16:44.553+02:00][DEBUG][plugins.fleet] retrieved installed package nginx-1.11.1 from ES
[2023-07-04T10:16:44.568+02:00][DEBUG][plugins.fleet] setting file list to the cache for system-1.29.0
[2023-07-04T10:16:44.572+02:00][DEBUG][plugins.fleet] setting package info to the cache for system-1.29.0
[2023-07-04T10:16:44.572+02:00][DEBUG][plugins.fleet] retrieved installed package system-1.29.0 from ES
[2023-07-04T10:16:44.660+02:00][DEBUG][plugins.fleet] setting file list to the cache for apache-1.13.0
[2023-07-04T10:16:44.662+02:00][DEBUG][plugins.fleet] setting package info to the cache for apache-1.13.0
[2023-07-04T10:16:44.662+02:00][DEBUG][plugins.fleet] retrieved installed package apache-1.13.0 from ES
[2023-07-04T10:16:45.016+02:00][DEBUG][plugins.fleet] setting file list to the cache for endpoint-8.7.1
[2023-07-04T10:16:45.079+02:00][DEBUG][plugins.fleet] setting package info to the cache for endpoint-8.7.1
[2023-07-04T10:16:45.080+02:00][DEBUG][plugins.fleet] retrieved package endpoint-8.7.1 from registry
[2023-07-04T10:16:45.258+02:00][DEBUG][plugins.fleet] setting file list to the cache for elastic_agent-1.8.0
[2023-07-04T10:16:45.260+02:00][DEBUG][plugins.fleet] setting package info to the cache for elastic_agent-1.8.0
[2023-07-04T10:16:45.261+02:00][DEBUG][plugins.fleet] retrieved installed package elastic_agent-1.8.0 from ES
[2023-07-04T10:16:46.336+02:00][WARN ][plugins.fleet] Unable to get fleet server hosts for policy 9defe570-fb9f-11ed-87b3-0fbf68cde96d: Default Fleet Server host is not setup
[2023-07-04T10:16:46.493+02:00][DEBUG][plugins.fleet] retrieved installed package linux-0.6.9 from cache
[2023-07-04T10:16:46.495+02:00][DEBUG][plugins.fleet] retrieved installed package apache-1.13.0 from cache
[2023-07-04T10:16:46.496+02:00][DEBUG][plugins.fleet] retrieved installed package nginx-1.11.1 from cache
[2023-07-04T10:16:46.497+02:00][DEBUG][plugins.fleet] retrieved package endpoint-8.7.1 from cache
[2023-07-04T10:16:46.498+02:00][DEBUG][plugins.fleet] retrieved installed package windows-1.22.0 from cache
[2023-07-04T10:16:46.500+02:00][DEBUG][plugins.fleet] retrieved installed package system-1.29.0 from cache
[2023-07-04T10:16:46.626+02:00][DEBUG][plugins.fleet] retrieved installed package elastic_agent-1.8.0 from cache
[2023-07-04T10:16:47.107+02:00][DEBUG][plugins.fleet] Upgraded 1 agent policies in 2880ms, 20 remaining
...
[2023-07-04T10:17:41.372+02:00][DEBUG][plugins.fleet] Upgraded 1 agent policies in 2049ms, 0 remaining
[2023-07-04T10:17:41.372+02:00][DEBUG][plugins.fleet] Setting up Fleet enrollment keys
[2023-07-04T10:17:41.413+02:00][INFO ][plugins.fleet] Fleet setup completed

@juliaElastic
Contributor

Testing on cloud with batch size 1, I found that it can be overridden in Advanced Edit:

    "kibana": {
      "user_settings_override_json": {
        "xpack.fleet.setup.agentPolicySchemaUpgradeBatchSize": 1
      }
    }

Good news: the upgrade succeeded this time, with RSS reaching about 600MB.

[screenshot]

@juliaElastic
Contributor

For reference, I found an easier way to reproduce the issue on an existing 8.8 cluster with 20 policies with 5 integrations each:

  • Enable stack monitoring to start ingesting Kibana metrics.
  • Import the dashboard to see RSS: OOM 8.7.ndjson.zip
  • Update the agent policies to have a different schema_version (for example with the _update_by_query below) and trigger Fleet setup (a sketch of the setup call follows this list):

        curl -sk -XPOST --user fleet_superuser:pw -H 'content-type:application/json' \
          -H 'x-elastic-product-origin:fleet' \
          http://localhost:9200/.kibana_ingest/_update_by_query -d '
        {
          "query": {
            "match": {
              "type": "ingest-agent-policies"
            }
          },
          "script": {
            "source": "ctx._source[\"ingest-agent-policies\"].schema_version = \"1.0.0\"",
            "lang": "painless"
          }
        }'

  • Watch on the dashboard how much RSS is being used.
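To then trigger Fleet setup without restarting Kibana, a sketch assuming a local Kibana with placeholder credentials (the /api/fleet/setup path is the one from the Integrations Server error above):

    curl -sk -XPOST --user elastic:changeme \
      -H 'kbn-xsrf: true' \
      http://localhost:5601/api/fleet/setup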

@juliaElastic
Contributor

juliaElastic commented Jul 4, 2023

Created 2 draft PRs to improve the memory usage:

I would go ahead with the second approach; it just needs some tests and fixing the build.

@juliaElastic
Contributor

Created a KB article with the known issue in 8.8: https://support.elastic.dev/knowledge/view/3687cd1e

@amolnater-qasource amolnater-qasource added the QA:Ready for Testing Code is merged and ready for QA to validate label Jul 16, 2023
@amolnater-qasource
Author

Hi Team,

We have revalidated this issue on the latest 8.9.0 BC4 Kibana cloud environment and found it fixed.

Observations:

  • Kibana upgrades successfully from 8.8.2 to 8.9.0 BC4 when multiple agent policies with integrations exist.

Screen Recording:

Agent.policies.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-19.13-46-39.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-19.14-00-48.mp4

Build details:
VERSION: 8.9.0 BC4
BUILD: 64661
COMMIT: ddf0c19

Hence we are marking this issue as QA:Validated.

Thanks!

@amolnater-qasource amolnater-qasource added QA:Validated Issue has been validated by QA and removed QA:Ready for Testing Code is merged and ready for QA to validate labels Jul 19, 2023
@harshitgupta-qasource

harshitgupta-qasource commented Jan 24, 2024

Bug Conversion

  • A test case is not required, as this particular checkpoint is already covered in exploratory testing.

Thanks!
