[Fleet]: Kibana upgrade failed from 8.7.1>8.8.0 BC8 when multiple agent policies with integrations exist. #158361
Comments
Pinging @elastic/fleet (Team:Fleet)
@manishgupta-qasource Please review.
Secondary review for this ticket is Done.
@amolnater-qasource @manishgupta-qasource Am I correct to assume that this is a Kibana rather than a Fleet issue?
I think we have to check the logs for the root cause.
Thank you for looking into this issue. We have observed the failure specifically in the Integrations Server; the other three modules (Elasticsearch, Kibana, and Enterprise Search) upgraded successfully. We are sharing the cloud admin link over Slack. @juliaElastic Please let us know if anything else is required from our end.
Checked the logs and I'm seeing this error in the Integrations Server logs around 10am when the restart happened. In the Kibana logs it looks like Kibana is continuously killed and restarted; there are a lot of logs around SO (saved object) migrations.
@amolnater-qasource Is this issue only happening in cloud? Does the upgrade succeed on-prem with this many policies and integrations?
We have revalidated this issue on an on-prem setup and found it reproducible there too. Steps followed:
Elasticsearch Logs:
Kibana Logs:
Screen Recording: Amol.Self-Win.1.-.ec2-3-91-98-132.compute-1.amazonaws.com.-.Remote.Desktop.Connection.2023-05-25.15-25-24.mp4
Please let us know if anything else is required from our end.
The on-prem error seems to be different; it times out on the
On the cloud issue, it seems that Kibana was going OOM on upgrade; after giving it more memory (4 GB), the setup completed successfully.
Hi Team, we have also validated this issue on an upgrade from 8.6.2>8.7.1 on a production Kibana cloud environment and found it not reproducible there. Setup Details:
Observations:
Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-05-25.18-32-00.mp4
Please let us know if anything else is required from our end.
@amolnater-qasource I tried to reproduce the issue against 8.8.1 and it does not seem reproducible (and I saw that some changes to SO migrations related to memory usage have been merged: #157494).
@nchaulet Have you tried on self-managed or cloud?
Hi @nchaulet We have revalidated this issue on an upgrade from 8.7.1 to 8.8.1 and found the issue still reproducible with the steps below:
Setup Details:
Screen Recording: Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-06-14.12-01-52.mp4
Please let us know if anything else is required from our end.
Investigating on the latest cluster shared by Amol: https://admin.found.no/deployments/c3cf9384322f4686946b78f0e2c6147c

In the Kibana logs I'm seeing a constant restart, every 1-2 minutes or so. The logs are mainly about Kibana SO migrations; there is no clear error. Checked the Kibana metrics dashboard of the application containers, and I'm seeing 100% CPU usage that correlates with the restarts.

The question is whether this is caused by the Kibana migration or something in Fleet. I think the cause is not the SO migration in cloud: even though there are logs about it on every restart, each takes about 0ms, which means the migration already happened and the logic only checks that there is nothing left to migrate.

The issue experienced here is different from the OOMs seen in SDHs, where ES was reaching OOM and logging circuit breaker errors. Here I don't see the ES containers reaching CPU or memory limits, so the issue is on the Kibana side.
I think I found something. There was a change in the agent schema version in 8.8 (here), so that's why the update is triggered. I've removed the agent policies from the ... There is already a batch size of 100 for these updates, and it is configurable: #150688
@juliaElastic We know that we have user deployments with many hundreds of policies; we should probably lower the default batch size value. Wdyt?
Let me test with a lower setting to see if it helps. Then we can lower the default.
To investigate further, I would suggest first reproducing this locally by starting Kibana with ... Then create some heap snapshots:
(Depending on how you install/run Kibana, /run/kibana/kibana.pid might not be available, but you can substitute ...)
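As a side note (my addition, not from the comment above): if triggering a snapshot via a signal on the Kibana PID isn't convenient, a heap snapshot can also be written programmatically from inside a Node.js process or a small repro script using the built-in v8 module. A minimal sketch, assuming Node 12+:

```typescript
// Minimal sketch: write a heap snapshot from inside a Node.js process using
// the built-in v8 module, as an alternative to triggering one with a signal
// sent to the Kibana PID. The resulting .heapsnapshot file can be loaded in
// Chrome DevTools > Memory and diffed against a later snapshot.
import { writeHeapSnapshot } from 'v8';

function captureHeapSnapshot(label: string): string {
  const file = writeHeapSnapshot(`${label}-${Date.now()}.heapsnapshot`);
  console.log(`Heap snapshot written to ${file}`);
  return file;
}

captureHeapSnapshot('before');
// ... run the workload under investigation (e.g. create integrations) ...
captureHeapSnapshot('after');
```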
I found a memory leak in the Fleet API that seems to be related to our usage of AsyncLocalStorage. I created issue #159762 and will add some heap dumps and more details there. But I'm not sure if it's related to this issue.
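For context (my addition, not Fleet's actual code): AsyncLocalStorage is Node's built-in mechanism for propagating per-request context through async calls, which is roughly the kind of usage the comment above refers to. A generic sketch of the pattern:

```typescript
// Illustrative only: the general AsyncLocalStorage pattern for propagating
// per-request context in Node.js (built-in async_hooks module).
// This is NOT Fleet's actual code.
import { AsyncLocalStorage } from 'async_hooks';

interface RequestContext {
  requestId: string;
}

const requestContextStorage = new AsyncLocalStorage<RequestContext>();

function handleRequest(requestId: string, work: () => void): void {
  // Everything (sync or async) called from `work` can read the context
  // without it being threaded through function arguments.
  requestContextStorage.run({ requestId }, work);
}

function logWithContext(message: string): void {
  const ctx = requestContextStorage.getStore();
  console.log(`[${ctx?.requestId ?? 'no-request'}] ${message}`);
}

handleRequest('req-1', () => logWithContext('processing request'));
```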
Started Kibana locally (latest main) with these settings:
I did two heap dumps: the first after Kibana started, the second after creating a few integrations. Checked them in Chrome dev tools, and I'm seeing some big deltas, but so far I don't recognize anything related to Fleet. Also tried commenting out the usage of
@juliaElastic did you reduce the SO batch size, or is it not related to the OOM you faced?
Seeing the same logs in Kibana showing that it is constantly restarted: https://34390efcbaff40e8a6bc54f9f17e93f9.eastus2.staging.azure.foundit.no:9243/app/r/s/n9Dbd So I think the AsyncLocalStorage fix helps with the Kibana Integrations degradation over time, as it fixes a memory leak. The stack upgrade issue looks different: something uses more memory during the Kibana upgrade, making 1 GB not sufficient.
@jlind23 Yes, I'll work with @juliaElastic and SRE to get a heap snapshot taken.
Kibana does not run out of heap but gets killed by the OOM-killer after RSS reaches > 1 GB. This likely means it's not running out of heap (which is controlled by the JavaScript runtime) but is consuming too much external memory. Likely causes could be unzipping large files, lots of concurrent connections consuming a lot of external memory for sockets, or page fragmentation.
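To make the heap-vs-RSS distinction concrete (a minimal sketch of my own, not part of the original comment): sampling process.memoryUsage() while reproducing the upgrade shows whether growth is in the JavaScript heap or in external/native memory.

```typescript
// Sketch: distinguish JavaScript heap usage from overall process memory (RSS)
// in Node.js. If rss keeps growing while heapUsed stays flat, the growth is
// in external/native memory, matching the behaviour described above.
function logMemoryUsage(label: string): void {
  const { rss, heapTotal, heapUsed, external, arrayBuffers } = process.memoryUsage();
  const mb = (bytes: number) => `${(bytes / 1024 / 1024).toFixed(1)} MB`;
  console.log(
    `[${label}] rss=${mb(rss)} heapTotal=${mb(heapTotal)} ` +
      `heapUsed=${mb(heapUsed)} external=${mb(external)} arrayBuffers=${mb(arrayBuffers)}`
  );
}

// Example: sample every 10 seconds while reproducing the upgrade locally.
setInterval(() => logMemoryUsage('kibana-repro'), 10_000);
```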
Node.js/V8 does not have any good tools to further analyse why the RSS might be so high. Suggested next steps:
What is the latest status of this high-impact issue?
So far @rudolf and @juliaElastic have worked on this, and here are the findings:
@juliaElastic will reproduce this locally as soon as she can, observe the resident memory allocation, and see what could be causing this.
Started reproducing this locally and I'm seeing a jump of about 200 MB in heap memory in 8.8, and 600 MB in RSS after the 8.8 upgrade (at 13:14 on the dashboard below). I'm going to test reverting a few features in 8.8 to see if it makes a difference.

EDIT: I did another test by commenting out the calls to MessageSigningService added in 8.8 (which generates a key pair and adds a signature to agent policies), repeated the upgrade to 8.8, and I'm not seeing any memory increase compared to 8.7. @elastic/security-defend-workflows Any ideas what could cause the significant increase in memory when using message signing? Do you know if any profiling was done with this feature?

EDIT: Tried to reproduce the issue again by enabling message signing and doing the upgrade again, but I'm not seeing those high memory values now.
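For readers unfamiliar with what the signing step involves (my sketch, not the actual MessageSigningService implementation): generating a key pair and signing a serialized agent policy with Node's built-in crypto module looks roughly like this.

```typescript
// Illustrative only: generate an EC key pair and sign a serialized agent
// policy payload, roughly the kind of work a message-signing service performs.
// This is NOT the actual Fleet MessageSigningService code.
import { generateKeyPairSync, sign, verify } from 'crypto';

const { publicKey, privateKey } = generateKeyPairSync('ec', { namedCurve: 'prime256v1' });

const policyPayload = Buffer.from(JSON.stringify({ id: 'policy-1', revision: 2 }));

// Sign the payload; the signature would be attached to the policy document.
const signature = sign('sha256', policyPayload, privateKey);

// Agents (or tests) can verify the signature with the public key.
const isValid = verify('sha256', policyPayload, publicKey, signature);
console.log(`signature valid: ${isValid}, length: ${signature.length} bytes`);
```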
@juliaElastic - thanks for bubbling this up.
Is the current thinking that the message signing is not causing the problem? @joeypoon - any ideas about the message signing service and potential for high memory usage?
@kevinlog Yeah, I can't confirm that message signing is causing a problem; it is strange that I can't reproduce the high memory usage again.
We did bump the agent policy schema version for 8.8, so all policies will get upgraded. The upgrading logic itself isn't new, but maybe the default batch size is too large for smaller instances with many policies? There was discussion about making the policy deploy logic more efficient by introducing bulking logic, as it's making individual calls per policy right now.
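As an aside (my sketch, not Fleet's actual implementation): the batching idea discussed here amounts to bounding how many policies are upgraded concurrently, so peak memory scales with the batch size rather than the total policy count.

```typescript
// Illustrative sketch only: upgrade policies in small, configurable batches
// instead of all at once, so peak memory stays bounded by the batch size.
async function upgradeInBatches<T>(
  items: T[],
  batchSize: number,
  upgradeOne: (item: T) => Promise<void>
): Promise<void> {
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    // Each batch is processed concurrently, but only one batch is in flight
    // at a time, limiting memory use and load on Elasticsearch.
    await Promise.all(batch.map(upgradeOne));
  }
}

// Example: with 500 policies and batchSize = 10, at most 10 policy upgrades
// (and their full agent policy payloads) are held in memory at once.
```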
@juliaElastic any chance we can change the batch size for the SO migration in order to avoid such a memory increase?
Thanks for the tip. I think updating the agent policy schema has something to do with this; I tried the same 20-agent-policy update with
I think we have to improve the getFullAgentPolicy function, which reads each package from EPR, and this function is called for each of the 20 agent policies.
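One direction for that improvement (my sketch only, using a hypothetical fetchPackageInfoFromRegistry placeholder, not the real Fleet code) is caching package metadata so that building the full agent policy for many policies fetches each package from the registry only once:

```typescript
// Hedged sketch, not the real Fleet code: cache package metadata fetched from
// the package registry (EPR) so that building the full agent policy for many
// policies fetches each package only once instead of once per policy.
interface PackageInfo {
  name: string;
  version: string;
}

// Placeholder for the real registry call; in Kibana this would hit EPR.
async function fetchPackageInfoFromRegistry(name: string, version: string): Promise<PackageInfo> {
  return { name, version };
}

const packageInfoCache = new Map<string, Promise<PackageInfo>>();

function getPackageInfoCached(name: string, version: string): Promise<PackageInfo> {
  const key = `${name}@${version}`;
  let cached = packageInfoCache.get(key);
  if (!cached) {
    // Store the promise itself so concurrent callers share one in-flight request.
    cached = fetchPackageInfoFromRegistry(name, version);
    packageInfoCache.set(key, cached);
  }
  return cached;
}
```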
Testing on cloud with batch size 1, I found that it can be overridden in Advanced Edit.
Good news: the upgrade succeeded this time, with RSS reaching about 600 MB.
For reference, I found an easier way to reproduce the issue on an existing 8.8 cluster with 20 policies with 5 integrations each.
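To set up that precondition from scratch (a rough sketch of my own, not the reporter's exact steps; it assumes a local Kibana at http://localhost:5601 with basic auth, Node 18+ for the global fetch, and simplified Fleet API payloads that may differ by version), creating many agent policies with several integrations each can be scripted against the Fleet HTTP API:

```typescript
// Rough sketch: create many agent policies, each with several integrations,
// via Fleet's HTTP API. Credentials, package versions, and response shapes
// are assumptions for illustration only.
const KIBANA = 'http://localhost:5601';
const AUTH = 'Basic ' + Buffer.from('elastic:changeme').toString('base64');
const HEADERS = {
  Authorization: AUTH,
  'kbn-xsrf': 'true',
  'Content-Type': 'application/json',
};

async function createAgentPolicy(name: string): Promise<string> {
  const res = await fetch(`${KIBANA}/api/fleet/agent_policies`, {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({ name, namespace: 'default', monitoring_enabled: ['logs', 'metrics'] }),
  });
  const json = (await res.json()) as { item: { id: string } };
  return json.item.id;
}

async function addIntegration(policyId: string, pkgName: string, pkgVersion: string, name: string) {
  await fetch(`${KIBANA}/api/fleet/package_policies`, {
    method: 'POST',
    headers: HEADERS,
    body: JSON.stringify({
      name,
      policy_id: policyId,
      package: { name: pkgName, version: pkgVersion },
      inputs: [],
    }),
  });
}

async function main() {
  // Example packages; versions are placeholders and would need to match what
  // the package registry serves for the target stack version.
  const packages: Array<[string, string]> = [
    ['system', '1.25.0'],
    ['nginx', '1.12.0'],
    ['apache', '1.10.0'],
  ];
  for (let i = 1; i <= 20; i++) {
    const policyId = await createAgentPolicy(`Test policy ${i}`);
    for (const [pkg, version] of packages) {
      await addIntegration(policyId, pkg, version, `${pkg}-policy-${i}`);
    }
  }
}

main().catch(console.error);
```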
Created two draft PRs to improve the memory usage:
I would go ahead with the second approach; it just needs some tests and fixing the build.
Created a KB article for the known issue in 8.8: https://support.elastic.dev/knowledge/view/3687cd1e
Hi Team, we have revalidated this issue on the latest 8.9.0 BC4 Kibana cloud environment and found it fixed now. Observations:
Screen Recordings:
Agent.policies.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-19.13-46-39.mp4
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-07-19.14-00-48.mp4
Build details:
Hence we are marking this issue as QA:Validated. Thanks!
Kibana Build details:
Host OS and Browser version: All, All
Preconditions:
Steps to reproduce:
What's working fine:
Note:
Expected:
Kibana upgrade should be successful from 8.7.1>8.8.0 BC8 when multiple agent policies with integrations exist.
Screen Recording:
Agents.-.Fleet.-.Elastic.-.Google.Chrome.2023-05-24.15-26-58.mp4