[RFC] Data Source Categorization Fields #958

jamiehynds · 2020-08-27T14:45:22Z

Have you signed the contributor license agreement? ✅
Have you followed the [contributor guidelines](https://github.com/elastic/ecs/blob/master/CONTRIBUTING.md)? ✅
For proposing substantial changes or additions to the schema, have you reviewed the RFC process? ✅
If submitting code/script changes, have you verified all tests pass locally using make test? N/A
If submitting schema/fields updates, have you generated new artifacts by running make and committed those changes? N/A
Is your pull request against master? Unless there is a good reason otherwise, we prefer pull requests against master and will backport as needed. ✅
Have you added an entry to the CHANGELOG.next.md? N/A

ebeahan

LGTM 👍

ebeahan · 2020-08-27T18:10:33Z

rfcs/text/0000-data-source-categorization-fields.md

+<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->
+
+- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->
+- Date: **August 26 2020** <!-- The ECS team sets this date at merge time. This is the date of the latest stage advancement. -->


Will update right before we merge to reflect current date.

ebeahan · 2020-08-27T18:12:07Z

rfcs/text/0000-data-source-categorization-fields.md

+- Web server
+
+## Usage
+Categorization fields in ECS can govern how we categorize these data source, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source.  Expanding the values we support, allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: #845 (comment).


Suggested change

Categorization fields in ECS can govern how we categorize these data source, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source. Expanding the values we support, allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: #845 (comment).

Categorization fields in ECS can govern how we categorize these data source, but only a limited set of event.category values are supported by the schema today. The event categorisation fields are catered to individual events, but don't categorise the data source. Expanding the values we support, allows us to align the user experience from ECS, Ingest Manager and the Elastic Website (elastic.co/integrations). Some additional context here: [#845 (comment)](https://github.com/elastic/ecs/pull/845#issuecomment-651414817).

Looks like the Markdown link got lost in the copy/paste.

andrewstucki · 2020-08-31T15:58:03Z

rfcs/text/0000-data-source-categorization-fields.md

+- productivity
+- proxy
+- queue/message queue
+- security


How wide/all-encompassing are these fields intended to be? It looks like a mixture of pretty narrow and pretty wide categories. For example, would all firewall, audit, edr, ids/ips, threat intelligence, and vulnerability scanner categories also be marked security?

Similar thoughts with things like proxy, application, and cloud.

Good point. We included some generic categories to allow for searching/correlation across these categories, e.g. show me events across all my security data sources, cloud sources, etc. It could also open up the possibility of subcategories, e.g. AWS being cloud, but within AWS, CloudTrail could fall under security.
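
As a rough illustration of the broad-plus-narrow idea in this exchange, narrow categories could roll up to an optional broad parent so that a query for security also returns firewall or edr sources. This is only a sketch: the `PARENT` mapping and both helper functions are hypothetical, and the RFC does not define which narrow categories belong under security or cloud.

```python
from typing import Dict, List, Optional

# Hypothetical roll-up of narrow categories to a broad parent.
# The category names come from this thread; the grouping is an assumption.
PARENT: Dict[str, Optional[str]] = {
    "security": None,
    "cloud": None,
    "firewall": "security",
    "audit": "security",
    "edr": "security",
    "ids/ips": "security",
    "threat intelligence": "security",
    "vulnerability scanner": "security",
}


def matches(category: str, wanted: str) -> bool:
    """Return True if `category` is `wanted` or rolls up to it."""
    current: Optional[str] = category
    while current is not None:
        if current == wanted:
            return True
        # Unknown categories have no parent, which ends the walk.
        current = PARENT.get(current)
    return False


def filter_sources(sources: Dict[str, str], wanted: str) -> List[str]:
    """Sorted names of data sources whose category matches `wanted`."""
    return sorted(name for name, cat in sources.items() if matches(cat, wanted))
```

With this shape, a search for security would pick up a firewall source without the source itself having to carry both values. Whether ECS should model the hierarchy this way, or use separate category and subcategory fields, is left open in the discussion.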

leehinman · 2020-09-01T14:35:45Z

rfcs/text/0000-data-source-categorization-fields.md

+
+The fieldset we use to describe the data source is up for discussion, data_stream.category is a possibility. Here are proposed allowed values:
+
+- apm


Small thing. I suggest we standardize on the capitalization and naming. For example we have an event.category of "iam" but a proposed data_stream.category of "Identity and access management". Also we have an example of "ids" for observer.type and a proposed data_stream.category of "IDS".

leehinman · 2020-09-01T14:36:40Z

rfcs/text/0000-data-source-categorization-fields.md

+Stage 0: Provide a high level summary of the premise of these changes. Briefly describe the nature, purpose, and impact of the changes. ~2-5 sentences.
+-->
+
+Elastic currently supports ingestion of data from 180+ sources, and growing. However, we do not have a coherent way to categorise these sources. This has resulted in a disconnect in how we categorize these sources from the Elastic website, in-product experiences and ECS.


I'm wondering if the allowed values data_stream.category and observer.type should be the same?

Good idea to bring this up.

I'm not sure I would go this direction. I think we should establish a list of allowed values, and make sure sources and pipelines populate based on this predictable list. Otherwise we could get all sorts of arbitrary differences in capitalizations and ways of writing things.
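
A minimal sketch of what populating from a predictable list could look like in an ingest step, assuming lowercase values. The `ALLOWED` set, the `ALIASES` table, and `normalize_category` are hypothetical names for illustration; the RFC never settled on a final list of values.

```python
# Hypothetical allowed values; only a few names from this thread are shown.
ALLOWED = frozenset({"apm", "iam", "ids", "proxy", "security"})

# Hypothetical alias for the long-form spelling mentioned in this thread.
ALIASES = {"identity and access management": "iam"}


def normalize_category(raw: str) -> str:
    """Map a free-form value onto the allowed list, or raise ValueError."""
    # Collapse whitespace and lowercase so "IDS" and " ids " compare equal.
    key = " ".join(raw.split()).lower()
    key = ALIASES.get(key, key)
    if key not in ALLOWED:
        raise ValueError(f"unsupported category: {raw!r}")
    return key
```

Rejecting unknown values, rather than passing them through, is one way to avoid the arbitrary differences in capitalization and spelling raised above.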

jamiehynds · 2020-09-08T14:42:03Z

@mostlyjason would you mind reviewing the list of proposed categories and suggesting any additional categories on the o11y side? Thanks!

jamiehynds · 2020-09-15T16:15:16Z

@cosiomoises @paulewing Would you mind taking a look at the proposed security data source categories? These categories may eventually be used to suggest relevant detection rules based on enabled integrations. Would be great to get your thoughts.

mostlyjason · 2020-09-16T14:41:26Z

@jamiehynds how will these categories be used? Also, how do they relate to the existing categories on our integrations page (https://www.elastic.co/integrations)? I see several new ones added and several missing. Is this intended to be a replacement?

jamiehynds · 2020-09-17T09:08:29Z

@mostlyjason The intent is to provide alignment across the entire user experience from the Elastic web site (integration page), to Elastic in-product experiences (e.g. ingest manager), to index patterns, to ECS. ECS can govern that alignment via these proposed fields.

It also opens up the possibility of aligning detection rules to enabled data sources, e.g. if a user has added a firewall data source, we can suggest appropriate detection rules that relate to firewalls. Maybe there's a similar use case for alerts on the o11y side?

These categories are intended to replace the existing integration categories. We haven't included existing categories such as AWS, Azure and Kubernetes as ECS doesn't use vendor names in the schema.
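
The detection-rule idea mentioned in this comment could be sketched as a lookup from enabled data source categories to rule names. Everything here is an assumption for illustration: the `RULES_BY_CATEGORY` catalogue, the rule names, and `suggest_rules` are hypothetical and do not reflect any existing Elastic API.

```python
from typing import Dict, Iterable, List

# Hypothetical rule catalogue keyed by data source category.
RULES_BY_CATEGORY: Dict[str, List[str]] = {
    "firewall": ["example-firewall-rule-a", "example-firewall-rule-b"],
    "proxy": ["example-proxy-rule"],
}


def suggest_rules(enabled_categories: Iterable[str]) -> List[str]:
    """Return rule names for the enabled categories, without duplicates."""
    seen = set()
    suggestions: List[str] = []
    for category in enabled_categories:
        # Categories with no known rules simply contribute nothing.
        for rule in RULES_BY_CATEGORY.get(category, []):
            if rule not in seen:
                seen.add(rule)
                suggestions.append(rule)
    return suggestions
```

A lookup like this only works if every integration reports its category from the same agreed list, which is the alignment this proposal is after.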

ruflin · 2020-09-22T07:57:45Z

rfcs/text/0000-data-source-categorization-fields.md

@@ -0,0 +1,75 @@
+# 0000: Data Source Categorization Fields


In Ingest Management we went through many iterations on naming, and "data source" also has some history to it. I'm wondering what exactly we are categorizing here. Is it the data itself, which is in data_streams? Do we categorize the data_streams? Do we categorize the source the data is coming from?

Hi @ruflin - the intent here is to categorize the source from where the data is coming from.

ebeahan · 2020-10-19T16:39:42Z

Thanks everyone for the great feedback and discussion.

With this being a stage 0 candidate, the only criterion required for advancement is agreement that the premise has utility and could be an appropriate addition to ECS. Unless there are objections, I propose we capture the shared feedback and concerns in the proposal doc and begin refining and addressing concerns in the subsequent stages.

I've captured this summary of feedback and concerns:

- How broad or narrow do these categories intend to be? Should categories and subcategories be considered? (#958 comment)
- Standardize on capitalization and naming? (#958 comment)
- Relationship between data_stream.category and observer.type? (#958 comment)
- Clarify intended usage of these categories (#958 comment)
- Clarify the intent to categorize the source where the data is from (#958 comment)

@jamiehynds - is there anyone else's feedback we may need to capture at this stage?

jamiehynds · 2020-10-20T09:46:47Z

Thanks @ebeahan. Before we proceed, I'd like to get some insight from @ruflin on previous discussions around categorising of data sources and whether we should include data_stream categorisations here too.

ruflin · 2020-10-21T07:22:09Z

Two questions from my side:

- How does this map to the categories we use on our website and for packages (https://github.com/elastic/package-registry/blob/master/util/package.go#L27)?
- data_stream field: Is there a future where this field will be used in documents that are not part of an Elasticsearch data stream?

webmat

Thanks everyone for the feedback so far.

Thanks as well @ebeahan, I agree we should capture these concerns in the RFC document itself. As we reach a conclusion on some, we can document conclusions in the RFC. The RFC document should stand on its own -- including the concerns & resolutions -- without needing to refer to the PRs themselves too much.

I think the criteria for stage 0 were met a long time ago (this is appropriate in ECS).

With all of the questions in the air at the moment, I suggest we retarget this PR to stage 1. This way we can get closure in this PR, rather than carrying over the discussion to the next PR.

webmat · 2020-10-22T19:39:18Z

rfcs/text/0000-data-source-categorization-fields.md

+# 0000: Data Source Categorization Fields
+<!-- Leave this ID at 0000. The ECS team will assign a unique, contiguous RFC number upon merging the initial stage of this RFC. -->
+
+- Stage: **0 (strawperson)** <!-- Update to reflect target stage. See https://elastic.github.io/ecs/stages.html -->


I suggest we retarget to stage 1, since there's been so much discussion already.

webmat · 2020-10-22T19:45:39Z

rfcs/text/0000-data-source-categorization-fields.md

+
+Elastic currently supports ingestion of data from 180+ sources, and growing. However, we do not have a coherent way to categorise these sources. This has resulted in a disconnect in how we categorize these sources from the Elastic website, in-product experiences and ECS.
+
+The fieldset we use to describe the data source is up for discussion, data_stream.category is a possibility. Here are proposed allowed values:


I might have been the one suggesting data_stream.category as a possibility, a while ago.

But as the data_stream RFC is progressing, I no longer think this is the right approach.

I think the data_stream fields should be only dedicated to the indexing strategy itself, such as "how the index name is created".

I agree that a way of categorizing data sources is needed, but I think we should have this be another field, one that would also make sense in the 7.x monolithic indices. Having an out-of-place data_stream.category field there would not be appropriate.

webmat · 2020-10-22T19:55:14Z

rfcs/text/0000-data-source-categorization-fields.md

+## References
+
+* https://github.com/elastic/ecs/issues/901
+* https://github.com/elastic/ecs/pull/845


Could you add the link provided by @ruflin to the references, please?

Thanks for providing it, Nic 👍

However let's make sure the link stands the test of time, and link via the latest tag, rather than master:

Suggested change

* https://github.com/elastic/ecs/pull/845

* https://github.com/elastic/ecs/pull/845

* https://github.com/elastic/package-registry/blob/v0.12.1/util/package.go#L27

ebeahan · 2021-03-29T17:56:10Z

Discussed with @jamiehynds out-of-band, and we're not moving forward with this effort at this time.

Create 0000-Data-Source-Categorization-Fields.md

90ca1ff

jamiehynds added the RFC label Aug 27, 2020

This was referenced Aug 27, 2020

[RFC] Data source categorization field values #954

Closed

Data Source Categorization Fields #901

Closed

ebeahan changed the title ~~[RFC] 0000 Data Source Categorization Fields~~ [RFC] Data Source Categorization Fields Aug 27, 2020

ebeahan added 2 commits August 27, 2020 13:02

update stage 0 PR reference

2155e4e

lowering case filename for consistency

c49647a

ebeahan previously approved these changes Aug 27, 2020

andrewstucki reviewed Aug 31, 2020

leehinman reviewed Sep 1, 2020

Update 0000-data-source-categorization-fields.md

873512c

jamiehynds dismissed ebeahan’s stale review via 873512c September 17, 2020 08:50

ruflin reviewed Sep 22, 2020

ruflin mentioned this pull request Sep 30, 2020

[RFC] data_stream fields #980

Merged

webmat reviewed Oct 22, 2020

mostlyjason mentioned this pull request Jan 19, 2021

Add streaming category elastic/package-registry#671

Closed

ebeahan closed this Mar 29, 2021
