Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Security Solution][Detections] Adoption telemetry #71102

Merged
merged 15 commits into from
Jul 13, 2020

Conversation

rylnd
Copy link
Contributor

@rylnd rylnd commented Jul 8, 2020

Summary

This adds a new UsageCollector that collects adoption data for both Detection Rules and ML Jobs within our solution's group.

The metrics being collected are:

  • number of custom detection rules enabled
  • number of custom detection rules disabled (to calculate the total rules they have created)
  • number of pre-built detection rules enabled
  • number of pre-built detection rules disabled (number of rules change in every release version)
  • number of custom ML jobs enabled
  • number of custom ML jobs disabled
  • number of pre-built ML jobs enabled
  • number of pre-built ML jobs disabled

TODO:

  • verify schema
  • backfill unit tests

Checklist

Delete any items that are not applicable to this PR.

For maintainers

rylnd added 6 commits July 7, 2020 21:40
This uses ML and raw ES calls to query our ML Jobs and Rules, and parse
them into a format to be consumed by telemetry.

Still to come:
* initialization
* tests
The service seems to convert colons to underscores, so let's just use an
underscure.
This allows us to test our adherence to the collector API, focusing
particularly on the fetch function.
@rylnd rylnd added Team:SIEM v8.0.0 release_note:skip Skip the PR/issue when compiling release notes v7.9.0 labels Jul 8, 2020
@rylnd rylnd self-assigned this Jul 8, 2020
We're going to have our usage data under one key corresponding to the
app, so this nests the existing data under a 'detections' key while
allowing another fetching function to be plugged into the
main collector under a separate key.
@rylnd rylnd force-pushed the security_adoption_telemetry branch from e73b13c to bbae701 Compare July 9, 2020 00:05
@rylnd
Copy link
Contributor Author

rylnd commented Jul 9, 2020

@elasticmachine merge upstream

filterPath: ['hits.hits._source.alert.enabled', 'hits.hits._source.alert.tags'],
ignoreUnavailable: true,
index,
size: 10000, // elasticsearch index.max_result_window default value
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ever likely to exceed this limit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it doesn't seem likely that a user'd have 10k rules, but this ensures that if they did, we'd tally the first 10k instead of receiving an error and returning 0s.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was used in another collector, but if you think there's risk here I'm happy to remove it!

Copy link
Contributor

@michaelolo24 michaelolo24 Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine. And I hope they don't have more than 10k rules 😅. I was really just curious if there was a way beyond it beyond scroll. I ran into an issue previously where I needed to set this max just to get all of the data I wanted (I think ES has a default max it returns?), but it looks fine.

rylnd added 3 commits July 9, 2020 10:44
* inlines collector options
* inlines schema object
* makes DetectionsUsage an interface instead of a type alias
 Conflicts:
	x-pack/plugins/security_solution/server/plugin.ts
@rylnd rylnd marked this pull request as ready for review July 9, 2020 18:38
@rylnd rylnd requested review from a team as code owners July 9, 2020 18:38
@elasticmachine
Copy link
Contributor

Pinging @elastic/siem (Team:SIEM)

config: {
job_type: 'anomaly_detector',
description:
'SIEM Auditbeat: Looks for unusual destination port activity that could indicate command-and-control, persistence mechanism, or data exfiltration activity (beta)',
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just checking we haven't migrated these detections to the SecuritySolution naming pattern?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took these from live data and the rename hasn't happened yet. I'll be sure to update these appropriately once that happens 👍

);

if (ruleResults.hits?.hits?.length > 0) {
ruleMetrics = ruleResults.hits.hits.map((hit) => ({
Copy link
Contributor

@michaelolo24 michaelolo24 Jul 9, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really an issue, but why not do the tally here rather than iterating twice? Or alternatively, just return ruleResults and then just iterate once in the usageBuilder

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, I'll change these to do a single loop 👍

const moduleJobs = modules.flatMap((module) => module.jobs);
const jobs = await jobServiceProvider(mlCaller).jobsSummary(['siem']);

jobMetrics = jobs.map((job) => ({
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same thing here, but not sure how much of an issue it'll be to loop over all this data twice. I know we have about 60s till the usageCollector times out, but I don't think we'll run into that

We were previously performing two loops over each set of data: one to
format it down to just the data we need, and another to convert that
into usage data. We now perform both steps within a single loop.
const jobs = await jobServiceProvider(mlCaller).jobsSummary(['siem']);

jobsUsage = jobs.reduce((usage, job) => {
const isElastic = moduleJobs.some((moduleJob) => moduleJob.id === job.id);
Copy link
Member

@jgowdyelastic jgowdyelastic Jul 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does SIEM apply a prefix to the module when initially setting it up?
If so, this won't match and you'll have to do a partial match on the last part of job id.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed that this is not currently an issue, but good to know moving forward.

Copy link
Member

@afharo afharo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kibana Telemetry changes LGTM!

Comment on lines 24 to 33
detections: {
detection_rules_custom_enabled: { type: 'number' },
detection_rules_custom_disabled: { type: 'number' },
detection_rules_elastic_enabled: { type: 'number' },
detection_rules_elastic_disabled: { type: 'number' },
ml_jobs_custom_enabled: { type: 'number' },
ml_jobs_custom_disabled: { type: 'number' },
ml_jobs_elastic_enabled: { type: 'number' },
ml_jobs_elastic_disabled: { type: 'number' },
},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT: Can we use type: 'long' instead (or 'float' if that's the intended format)?
This will help us understand the type of number to be indexed 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

NIT 2: Have you considered using a tree structure instead?

    {
      "properties": {
        "detections": {
          "properties": {
            "detection_rules": {
              "custom": { 
                "enabled": { "type": "long" },
                "disabled": { "type": "long" }
              },
              "elastic": { 
                "enabled": { "type": "long" },
                "disabled": { "type": "long" }
              }
            },
            "ml_jobs": {
              "custom": { 
                "enabled": { "type": "long" },
                "disabled": { "type": "long" }
              },
              "elastic": { 
                "enabled": { "type": "long" },
                "disabled": { "type": "long" }
              }
            }
          }
        }
      }
    }

@afharo afharo mentioned this pull request Jul 13, 2020
2 tasks
@rylnd
Copy link
Contributor Author

rylnd commented Jul 13, 2020

@elasticmachine merge upstream

@kibanamachine
Copy link
Contributor

💚 Build Succeeded

Build metrics

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@rylnd rylnd merged commit 1afb0c4 into elastic:master Jul 13, 2020
@rylnd rylnd deleted the security_adoption_telemetry branch July 13, 2020 18:18
rylnd added a commit that referenced this pull request Jul 13, 2020
* style: sort plugin interface

* WIP: UsageCollector for Security Adoption

This uses ML and raw ES calls to query our ML Jobs and Rules, and parse
them into a format to be consumed by telemetry.

Still to come:
* initialization
* tests

* Initialize usage collectors during plugin setup

* Rename usage key

The service seems to convert colons to underscores, so let's just use an
underscure.

* Collector is ready if we have a kibana index

* Refactor collector to generate options in a function

This allows us to test our adherence to the collector API, focusing
particularly on the fetch function.

* Refactor usage collector in anticipation of endpoint data

We're going to have our usage data under one key corresponding to the
app, so this nests the existing data under a 'detections' key while
allowing another fetching function to be plugged into the
main collector under a separate key.

* Update our collector to satisfy telemetry tooling

* inlines collector options
* inlines schema object
* makes DetectionsUsage an interface instead of a type alias

* Extracts telemetry mappings via scripts/telemetry_extract

* Refactor detections usage logic to perform one loop instead of two

We were previously performing two loops over each set of data: one to
format it down to just the data we need, and another to convert that
into usage data. We now perform both steps within a single loop.

* Refactor detections telemetry to be nested

* Extract new nested detections telemetry mappings

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>

Co-authored-by: Elastic Machine <elasticmachine@users.noreply.github.com>
@MindyRS MindyRS added the Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. label Sep 23, 2021
@elasticmachine
Copy link
Contributor

Pinging @elastic/security-solution (Team: SecuritySolution)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release_note:skip Skip the PR/issue when compiling release notes Team: SecuritySolution Security Solutions Team working on SIEM, Endpoint, Timeline, Resolver, etc. Team:SIEM v7.9.0 v8.0.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants