
KQL expression parsing is slow #76811

Closed · kobelb opened this issue Sep 4, 2020 · 5 comments · Fixed by #93319

Labels: discuss, Feature:KQL, KQL, performance, PR sent, SharedUX/fix-it-week

kobelb commented Sep 4, 2020

Parsing even a simple KQL expression string is slow and a complex KQL expression is disastrously slow:

```js
// Benchmark.js is the "benchmark" package from npm; esKuery is Kibana's KQL
// helper module (the exact import path depends on the plugin, so it's omitted here).
const Benchmark = require('benchmark');

const suite = new Benchmark.Suite();

suite
  .add('parse simple KQL', function () {
    esKuery.fromKueryExpression(
      'not fleet-agent-actions.attributes.sent_at: * and fleet-agent-actions.attributes.agent_id:1234567'
    );
  })
  .add('parse complex KQL', function () {
    esKuery.fromKueryExpression(
      `((alert.attributes.alertTypeId:.index-threshold and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:siem.signals and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:siem.notifications and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:metrics.alert.threshold and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:metrics.alert.inventory.threshold and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:logs.alert.document.count and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_cluster_health and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_license_expiration and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_cpu_usage and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_nodes_changed and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_logstash_version_mismatch and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_kibana_version_mismatch and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:monitoring_alert_elasticsearch_version_mismatch and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:apm.transaction_duration and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:apm.transaction_duration_anomaly and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:apm.error_rate and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:xpack.uptime.alerts.monitorStatus and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:xpack.uptime.alerts.tls and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)) or (alert.attributes.alertTypeId:xpack.uptime.alerts.durationAnomaly and alert.attributes.consumer:(alerts or builtInAlerts or siem or infrastructure or logs or monitoring or apm or uptime)))`
    );
  })
  .on('cycle', function (event) {
    console.log(String(event.target));
  })
  .run({ async: true, minSamples: 10 });
```

```
parse simple KQL x 999 ops/sec ±2.00% (74 runs sampled)
parse complex KQL x 4.08 ops/sec ±2.22% (15 runs sampled)
```

@elastic/kibana-app-arch it appears that the majority of the slowness comes directly from the code that pegjs generates. Have you all investigated the performance of KQL previously? It also looks like we're on version 0.9.0 of pegjs; is there any chance that an upgrade to 0.10.0 would improve performance? There's also a `--cache` option which has fixed performance issues for at least one person; however, this contradicts the documentation:

> `--cache` — makes the parser cache results, avoiding exponential parsing time in pathological cases but making the parser slower
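
For reference, here's a minimal sketch of enabling that option through the PEG.js API rather than the CLI; the grammar and output file names are placeholders, not Kibana's actual build step:

```js
// Regenerate the KQL parser with result caching enabled. pegjs exposes
// generate(grammar, options); cache: true is the programmatic equivalent of --cache.
const fs = require('fs');
const peg = require('pegjs');

const grammar = fs.readFileSync('kuery.peg', 'utf8'); // placeholder grammar file
const parserSource = peg.generate(grammar, {
  cache: true,        // memoize intermediate results instead of re-parsing them
  output: 'source',   // emit parser source code rather than a parser object
  format: 'commonjs', // so the generated file can be required directly
});

fs.writeFileSync('kuery_parser.js', parserSource); // placeholder output file
```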

If we're unable to drastically improve the performance of KQL, it'll be unsuitable for use in a lot of scenarios. When parsing KQL expressions server-side, this can block the event loop and prevent all other operations from running. The Fleet team has had to switch from using a KQL string to building the KueryNode manually, and we can potentially adapt this approach elsewhere. However, we can't use it everywhere if we want end-users to be able to specify KQL strings themselves.
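
For illustration, here's a rough sketch of the manual-construction approach, assuming the `nodeTypes` helpers that Kibana exports alongside `esKuery` (the exact helper names and import path are assumptions):

```js
// Hypothetical: build the KueryNode directly instead of parsing a KQL string,
// mirroring the simple expression from the benchmark above.
const agentIdFilter = nodeTypes.function.buildNode(
  'is',
  'fleet-agent-actions.attributes.agent_id',
  '1234567'
);
const notSentAt = nodeTypes.function.buildNode(
  'not',
  nodeTypes.function.buildNode('is', 'fleet-agent-actions.attributes.sent_at', '*')
);

// Equivalent to `not ...sent_at: * and ...agent_id:1234567`, with no PEG parsing involved.
const filter = nodeTypes.function.buildNode('and', [notSentAt, agentIdFilter]);
```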

CC'ing a few people...

@elastic/kibana-platform because SavedObjectsClient#find supports a KQL expression string
@elastic/kibana-alerting-services because your authorization is building a complex KQL expression string here
@elastic/siem because you all are using KQL to query Elasticsearch data-indices

Related: #69649, #89473

mshustov commented Sep 6, 2020

We could evaluate other alternatives as a last resort. There's an interesting benchmark available at https://sap.github.io/chevrotain/performance/

Dosant commented Sep 8, 2020

I don't have much experience in this area, but we could probably either unblock the event loop or actually speed up the parsing:

Client:

  1. use the JS implementation in a web worker?
  2. or take a C++/Rust PEG implementation and compile it to a WASM module?

Node:

  1. unblock the loop with a worker thread? (experimental in Node 10, but stable in 12 :( ) See the sketch after this list.
  2. a native C++/Rust Node.js module with a PEG implementation? It might be faster.
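
Here's a minimal sketch of the worker-thread option, assuming Node 12+ `worker_threads`; the worker file path and the esKuery import inside it are hypothetical:

```js
// Main thread: offload KQL parsing so the event loop stays responsive while
// the parser burns CPU.
const { Worker } = require('worker_threads');

function parseKqlInWorker(expression) {
  return new Promise((resolve, reject) => {
    // kql_worker.js is a hypothetical script that calls
    // esKuery.fromKueryExpression(workerData) and posts the result back, e.g.:
    //   const { parentPort, workerData } = require('worker_threads');
    //   parentPort.postMessage(esKuery.fromKueryExpression(workerData));
    const worker = new Worker('./kql_worker.js', { workerData: expression });
    worker.once('message', resolve);
    worker.once('error', reject);
  });
}
```

Spawning a worker per call has its own startup cost, so in practice a long-lived worker (or a small pool) receiving expressions over `postMessage` would make more sense.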

kobelb commented Sep 8, 2020

Before implementing either web-workers or worker-threads, I think we should see if we can optimize the implementation itself. Even if it's not blocking the main event-loop, it's still consuming CPU. It's common for Kibana to be deployed to servers with only one or two CPUs that are shared for all operations, so we're still consuming resources that are used for other operations.

kobelb commented Sep 23, 2020

@elastic/kibana-app-arch are there any short-term improvements planned for KQL parsing?

If there aren't any improvements we can make, KQL is unsuitable for use in a lot of scenarios. At a minimum, we'll likely want to remove the ability for consumers of SavedObjectsClient::find to specify a KQL string and instead require a KueryNode, to reduce the likelihood of someone else tripping on this hazard.
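
For illustration, a rough sketch of what a KueryNode-only `find` call might look like for a consumer, assuming the `filter` option accepts a pre-built node (the saved-object type and field are made up):

```js
// Hypothetical caller: hand SavedObjectsClient.find a pre-built KueryNode via
// the filter option, so no KQL parsing happens on the request path.
const filter = nodeTypes.function.buildNode('is', 'dashboard.attributes.title', 'Sales');

const { saved_objects: dashboards } = await savedObjectsClient.find({
  type: 'dashboard',
  filter,
});
```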

@lukeelmers commented:

We intend to dive further into this, but there are not currently plans to take this on in the short-term (7.11).

If this is something that is blocking you on a project that needs to happen in a 7.11 timeframe, please let us know -- otherwise we will follow up as soon as we have more concrete info on what to expect.

@Dosant added the SharedUX/fix-it-week label on Mar 1, 2021
lukasolson added a commit that referenced this issue Apr 27, 2024
## Summary

Resolves #143335.

Some history: A similar issue was reported a few years back
(#76811). The solution
(#93319) was to use the `--cache`
PEG.js [parameter](https://pegjs.org/documentation#generating-a-parser)
when generating the parser. Back when this was added, we were still
manually building the parser on demand when it was changed. Eventually
we added support for dynamically building the parser during the build
process (#145615). I'm not sure
where along the way the `cache` parameter got lost, but it didn't
appear to be used when we switched.

This PR re-adds the parameter, which increases performance considerably
(metrics shown in ops/sec):

```
Before using cache:

  ● kuery AST API › fromKueryExpression › performance › with simple expression
    Received:   7110.68990544415

  ● kuery AST API › fromKueryExpression › performance › with complex expression
    Received:   40.51361746242248

  ● kuery AST API › fromKueryExpression › performance › with many subqueries
    Received:   17.071767133068473

After using cache:

  ● kuery AST API › fromKueryExpression › performance › with simple expression
    Received:   8275.49109867502

  ● kuery AST API › fromKueryExpression › performance › with complex expression
    Received:   447.0459218892934

  ● kuery AST API › fromKueryExpression › performance › with many subqueries
    Received:   115852.43643466769
```
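
The numbers above come from Jest performance tests around `fromKueryExpression`. For context, here's a rough sketch of how such an ops/sec measurement could be asserted with Benchmark.js; the threshold and import path are illustrative, not the actual test:

```js
// Hypothetical Jest test guarding against parser performance regressions.
const Benchmark = require('benchmark');
const { fromKueryExpression } = require('./ast'); // illustrative import path

it('parses a simple expression above a minimum ops/sec threshold', (done) => {
  new Benchmark('simple KQL', () => {
    fromKueryExpression('response:200 and extension:php');
  })
    .on('complete', function () {
      expect(this.hz).toBeGreaterThan(1000); // illustrative threshold
      done();
    })
    .run();
});
```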

### Checklist

- [x] [Unit or functional tests](https://www.elastic.co/guide/en/kibana/master/development-tests.html) were updated or added to match the most common scenarios