[Security Solution] Adds event log telemetry specific for security solution rules #128216
Conversation
Pinging @elastic/security-solution (Team: SecuritySolution)
x-pack/plugins/security_solution/server/usage/detections/rules/types.ts (outdated; resolved)
{
  range: {
    '@timestamp': {
      gte: 'now-24h',
No action is needed here, but I'm starting to wonder if we need to start finding the previous usage collector run timestamp and use that. I have noticed unusual collection cadences within some of the security clusters.
@pjhampton after #121656 clusters should report only once every 24h. Just FYI :)
Had the same question above 😅 https://github.com/elastic/kibana/pull/128216/files#r837923596
If consistency can't be guaranteed (the collector doesn't run at the same time each day, introducing drift), it would be nice to have a service that pushes the right day-window datetimes down for these queries to consume.
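As a rough illustration of that idea, the hard-coded `now-24h` window could be replaced by a helper that derives the range filter from the previous collector run. This is a speculative sketch, not the PR's implementation: `buildRangeFilter` and the idea of a persisted `lastRunAt` are assumptions about what such a service might expose.

```typescript
// Hypothetical sketch: derive the range filter from the previous collector
// run instead of a hard-coded 'now-24h'. `lastRunAt` would come from wherever
// the usage-collection service persists its last run time (assumed, not an
// existing Kibana API).
interface RangeFilter {
  range: { '@timestamp': { gte: string } };
}

const buildRangeFilter = (lastRunAt?: Date): RangeFilter => {
  // Fall back to the fixed 24h window when no previous run is recorded.
  if (!lastRunAt) {
    return { range: { '@timestamp': { gte: 'now-24h' } } };
  }
  // Use the exact previous run time so a drifting collection cadence
  // neither misses events nor double-counts them.
  return { range: { '@timestamp': { gte: lastRunAt.toISOString() } } };
};
```

With a service like this, every telemetry query would consume the same window regardless of when the collector actually fired.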
x-pack/plugins/security_solution/server/usage/queries/utils/get_search_for_custom_rules.ts (resolved)
🌔 🚀 ✨ LGTM ✨ 🚀 🌔
Code looks solid as always, but this amount of new data in the usage collector makes me a little nervous for a new release. I'm starting to wonder if we can add this as a task instead. Regardless, as long as Kibana core is good with it, so am I!
It looks good to me. I don't think this increase will make us hit the current payload limit.
However, I'd like to challenge a decision in the schema. Is there a particular reason why we want to report it as 1, 2, 3, ... vs. an array?
IMO, using an array may be more useful because an error message may be top 1 for some clusters and top 10 for some others. However, in the big picture, we would be more interested in tackling the ones that occur in most clusters, wouldn't we?
It will also simplify the schema by removing a lot of fields 😇
What do you think?
top_failed: {
  '1': {
    message: {
      type: 'keyword',
      _meta: { description: 'Top 1 failed rule message' },
    },
  },
Q: Would it make sense to declare these as arrays? This way we can query detection_rule_status.all_rules.top_failed: "error message" and it will show up regardless of whether it's the top 1 or top N.
Suggested change:
- top_failed: {
-   '1': {
-     message: {
-       type: 'keyword',
-       _meta: { description: 'Top 1 failed rule message' },
-     },
-   },
+ top_failed: {
+   type: 'array',
+   items: {
+     message: {
+       type: 'keyword',
+       _meta: { description: 'One of top 10 failed rule messages' },
+     },
+   },
I will update the schema as I made a mistake: what I left out was the count. It really should be this:
"1": {
  "properties": {
    "message": {
      "type": "keyword",
      "_meta": {
        "description": "Top 1 partial failed rule message"
      }
    },
    "count": {
      "type": "long",
      "_meta": {
        "description": "Top 1 partial failed rule count"
      }
    }
  }
}
I haven't done a visualization with arrays that contain objects before. Is that well supported if the array contains objects? I know that nested fields are still lacking support, which is being tracked here:
#1084
Are nested types supported in this schema? I didn't see an example of it.
If I changed this to an array of objects, I don't know how easy that would be to visualize compared to just numbering them and stopping at 10.
> Are nested types supported in this schema? I didn't see an example of it.
The schema is not a direct ES mapping (even though it looks pretty similar 😅). You can specify the array of objects like this:
Suggested change:
- top_failed: {
-   '1': {
-     message: {
-       type: 'keyword',
-       _meta: { description: 'Top 1 failed rule message' },
-     },
-   },
+ top_failed: {
+   type: 'array',
+   items: {
+     message: {
+       type: 'keyword',
+       _meta: { description: 'One of top 10 failed rule messages' },
+     },
+     count: {
+       type: 'long',
+       _meta: { description: 'Number of times the message occurred' },
+     },
+   },
Then on the receiving end, we can work out different ways of storing it: for instance, I'd say that for easier consumption, we could loop through the array and index one document per object in the array with some common fields for correlation.
This way you get a "simple" event-like index to query and aggregate without depending on tree structures or nested types.
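The flattening idea above can be sketched in a few lines. This is a hedged illustration of the receiving-end processing being proposed, not shipped code: the field names (`cluster_uuid`, `timestamp`) are assumed correlation fields, and `TopFailed` mirrors the `{ message, count }` shape from the suggested schema.

```typescript
// Speculative sketch: flatten the top_failed array into one document per
// entry, carrying common fields for correlation, so the receiving index
// stays "event-like" and avoids nested types. All names are illustrative.
interface TopFailed {
  message: string;
  count: number;
}

interface FlattenedDoc {
  cluster_uuid: string; // assumed common correlation field
  timestamp: string;
  message: string;
  count: number;
}

const flattenTopFailed = (
  clusterUuid: string,
  timestamp: string,
  topFailed: TopFailed[]
): FlattenedDoc[] =>
  topFailed.map(({ message, count }) => ({
    cluster_uuid: clusterUuid,
    timestamp,
    message,
    count,
  }));
```

Each resulting document can then be queried and aggregated directly, without depending on tree structures or nested mappings.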
As a separate comment, I was wondering whether it makes sense to send this type of data as events with our Event-Based Telemetry client coming in 8.3 (meta issue here).
@FrankHassanabad @pjhampton @peluja1012 wdyt? Is it worth the wait for one extra minor?
Ok, I will change this to being an array with objects and update the code and tests with the array. Sounds good to me 👍 .
@afharo Thank you for reviewing this PR and for your suggestion of taking advantage of the new Telemetry client. While I think it might make sense to use this new client in the future, we have been waiting for the Telemetry data this PR provides for a couple of releases. It's important that we start collecting it as soon as possible so we can get ahead of customer issues related to rule health and performance. I suggest we go with the approach presented in this PR in 8.2 and then migrate over to the Event-Based system in a future release after it's ready for consumption.
@peluja1012 sounds like a good compromise to me. All clear! Thank you!
LGTM! Thank you @FrankHassanabad for the latest changes to simplify the schema
detection_rule_status: {
  all_rules: {
    eql: {
      failed: {
        type: 'long',
        _meta: { description: 'The number of failed rules' },
      },
Real bummer this can't be broken up into smaller files and destructured nicely so it's more easily consumed. It sounds like some telemetry tooling is preventing this? We should open an issue to fix that tooling so these schemas can be better digested.
Do you have more details on this @afharo?
I'm also curious if the existing tooling does anything to 'lock' the file. @FrankHassanabad mentioned in another comment below about how it's very important to not change telemetry keys, and so am wondering if there is any tooling in place for this? Once broken out into separate files, we could always do a server-side hook to lock them in read-only mode perhaps? Same goes with any downstream generated files from these schemas.
type: 'long',
_meta: { description: 'The number of partial failure rules' },
},
top_partial_failure: {
nit: maybe make this plural? Is this the first top partial failure, or a list (it seems to be an array) of top partials? If the latter, it would be more descriptive to be plural for folks building dashboards.
Sounds good. Will make this plural. Also will change top_failed
So you will have the changes:
top_partial_failure -> top_partial_failures
top_failed -> top_failures
failed -> failures
Wasn't able to verify this data on any of these three servers:
Using this query:
Would be nice if there was a 'flush' feature so I could run this locally and not have to wait 24 hrs. Does this exist? It's also not really clear where this data would go if I ran this locally anyway... 🤔 Trying to use the Rosetta Stone doc to understand this all, but it's very hard to grok given how complex telemetry is.
index_duration: {
  max: {
    type: 'float',
    _meta: { description: 'The max duration' },
ux nit: Should these _meta.description values include the metric name too? Not sure how it shows up in the UI/dashboards, but this wouldn't be helpful if you're hovering on a pie chart or something and it says 'The max duration' instead of something like 'The max indexing duration'.
It doesn't show up. I think these are not used anywhere other than here in the code. As far as I know, this doesn't convert to the EDN files or to mappings either.
Ahh, well if it doesn't make it to the cluster schema, I question its value here, no? 🤔
I would talk to someone from Kibana core about its usefulness and usage. I'm not sure where it's used, or whether it's used within tooling somewhere else.
"detection_rule_status": {
  "properties": {
    "all_rules": {
      "properties": {
Oh my gosh, another one of these!? 😅 I hope this one is generated at least...?
It is generated with:
node scripts/telemetry_check.js --fix
Ahh, I see there's a README.md in the root folder (telemetry_collection_xpack/schema/README.md). That appears to be out of date though (it says 2 files when there are 3?), and it doesn't include the script you just mentioned to generate it... it just says:
> The X-Pack related schema for the content that will be nested in stack_stats.kibana.plugins. It is automatically generated by @kbn/telemetry-tools based on the schema property provided by all the registered Usage Collectors via the usageCollection.makeUsageCollector API.
Which I guess does link off to another README from there (src/plugins/usage_collection/README.mdx), and the script appears to be mentioned in that one, so all good. Just coming in at a different angle, so a little confusing here, but all good now! 👍
Those are Kibana core's files.
maxTotalSearchDuration: {
  value: null,
},
gapCount: {
  value: 0,
},
Note: the diff here is that the cardinality agg is going to return 0 (if missing), but the max/min/avg aggs will return null, as providing `missing` here would mess with the actual value.
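That distinction can be made concrete with a small normalizer. This is a hedged sketch of the behavior being described, not the PR's code: `normalizeAggs` is a hypothetical helper, and only the agg names from the snippet above (`gapCount`, `maxTotalSearchDuration`) are taken from the source.

```typescript
// Hedged sketch of why the initial values differ: a cardinality agg over no
// matching documents yields 0, while max/min/avg yield null. Coercing null
// to 0 (e.g. via a `missing` option) would corrupt the real statistics, so
// null is preserved for the max-style metric. Helper name is illustrative.
interface MetricAgg {
  value: number | null;
}

const normalizeAggs = (aggs: {
  gapCount?: MetricAgg;
  maxTotalSearchDuration?: MetricAgg;
}) => ({
  // cardinality: an absent agg and "no docs" both mean zero distinct values
  gapCount: { value: aggs.gapCount?.value ?? 0 },
  // max: keep null so "no data" stays distinguishable from "max was 0"
  maxTotalSearchDuration: { value: aggs.maxTotalSearchDuration?.value ?? null },
});
```

The defaults above match the initializers shown in the snippet: 0 for the cardinality-backed counter, null for the duration metric.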
💚 Build Succeeded
},
},
},
elastic_rules: {
Another note: there is no 'space-awareness' in this telemetry, so if a user installs the prebuilt rules in 10 spaces, this telemetry will say there are some 6k elastic rules installed (6xx prebuilt rules shipped x 10).
Another note to future self: rules are queried directly to determine elastic vs custom rules, and so are capped to 10k across all spaces:
kibana/x-pack/plugins/security_solution/server/usage/queries/get_detection_rules.ts
Lines 49 to 54 in 7c850dd
const query: SavedObjectsCreatePointInTimeFinderOptions = {
  type: 'alert',
  perPage: maxPerPage,
  namespaces: ['*'],
  filter,
};
So if a user has >10k rules between spaces, things could start getting wonky. We should test this, verify functionality, and consider an implementation that takes space-awareness into account. It would also be good to test these queries where the terms clause is a full 10k and see what that does to perf... 😮
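One space-aware direction would be deduplicating by the rule's signature id before counting, since the same prebuilt rule installed in N spaces comes back as N saved objects. This is a speculative sketch: the `RuleSavedObject` shape and the `ruleId` field path are assumptions for illustration, not the PR's implementation.

```typescript
// Speculative sketch: the point-in-time finder queries all namespaces
// (namespaces: ['*']), so the same prebuilt rule can appear once per space.
// Counting distinct signature ids avoids the 10x inflation noted above.
// Field names here are assumed, not taken from the PR.
interface RuleSavedObject {
  namespaces: string[];
  attributes: { params: { ruleId: string } };
}

const countDistinctRules = (rules: RuleSavedObject[]): number => {
  const distinct = new Set<string>();
  for (const rule of rules) {
    // The signature id is stable across spaces for the same prebuilt rule.
    distinct.add(rule.attributes.params.ruleId);
  }
  return distinct.size;
};
```

This only addresses the inflation, not the 10k cap itself; a capped result set would still need paging or per-space queries.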
@@ -108,6 +122,7 @@ export const getRuleMetrics = async ({
   return {
     detection_rule_detail: elasticRuleObjects,
     detection_rule_usage: rulesUsage,
+    detection_rule_status: eventLogMetricsTypeStatus,
   };
 } catch (e) {
   // ignore failure, usage will be zeroed. We use debug mode to not unnecessarily worry users as this will not affect them.
How do you know that a failure occurred in the actual telemetry? On failures are we returning empty values?
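For context, the failure behavior being discussed can be sketched as a wrapper that swallows errors and returns zeroed defaults, logging only at debug level. This is a hedged illustration of the pattern in the diff above, with hypothetical names (`getRuleMetricsSafe`, `fetchMetrics`, `logDebug`), not the actual function.

```typescript
// Hedged sketch of the failure path: on any error the collector returns
// zeroed/empty metrics and only logs at debug level, so a telemetry failure
// never surfaces as a user-facing error. All names here are illustrative.
type RuleMetrics = {
  detection_rule_detail: unknown[];
  detection_rule_usage: Record<string, unknown>;
  detection_rule_status: Record<string, unknown>;
};

const getRuleMetricsSafe = async (
  fetchMetrics: () => Promise<RuleMetrics>,
  logDebug: (msg: string) => void
): Promise<RuleMetrics> => {
  try {
    return await fetchMetrics();
  } catch (e) {
    // Zeroed defaults: downstream dashboards see "no data", not an outage.
    logDebug(`rule metrics collection failed, returning empty values: ${e}`);
    return {
      detection_rule_detail: [],
      detection_rule_usage: {},
      detection_rule_status: {},
    };
  }
};
```

The trade-off raised in the comment stands: with this pattern, a real collection failure is indistinguishable from genuinely empty data unless you inspect the debug logs.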
 * transformed. If it malfunctions or times out then it will not malfunction other
 * parts of telemetry.
 * NOTE: This takes in "ruleResults" to filter against elastic rules and custom rules.
 * If the event log recorded information about which rules were elastic vs. custom this
Yeah, would be nice if we logged more metadata about the rule itself so there's not so much overhead in processing like this.
x-pack/plugins/security_solution/server/usage/queries/utils/get_search_for_all_rules.test.ts (resolved)
A bit to grok at first (especially being new to telemetry), but I zoomed with @FrankHassanabad and we pair-reviewed the whooole thing. Thanks for implementing some of the nits from my feedback! And note, I was not able to verify that any of this telemetry made it to any of the staging servers, but I'm sure we'll be performing some additional validation testing once we get the first builds with this code in, and we can always tweak before the final BC. @FrankHassanabad did verify independently during development that it made it, so we should be all good here.
All that said, thanks for taking the time to review this all with me @FrankHassanabad, and LGTM from the Security Solution side! I appreciate all the tests, and how modular and re-usable you structured this code. 👍 🙌 🚀
Summary
Adds event log telemetry specific for security solution rules.
This adds a new section to the telemetry underneath detectionMetrics called detection_rule_status. (A full JSON sample document was attached in a collapsed section of the original PR description.)
Manual testing
Add some rules and force them into partial failures by pointing them at indexes that don't exist, and force failures by shutting down Kibana for a while and then starting it back up so that Kibana misses some alerts. Then go to Advanced Settings, scroll to the bottom, and click cluster data. You should see data like this after scrolling down for a while:
Checklist
Delete any items that are not applicable to this PR.