Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[resource/host] Add semantic convention for IP addresses of a host #203

Merged
merged 15 commits into from
Oct 9, 2023

Conversation

mx-psi
Copy link
Member

@mx-psi mx-psi commented Jul 24, 2023

Adds host.ipv4.addresses and host.ipv6.addresses resource semantic conventions to specify the IPv4 and IPv6 addresses of a host.

Adds host.ip resource attribute for specifying the IPs of a host, ensuring well-recognized formats for IPv4 and IPv6 addresses

Updates #131 (missing host.mac)

Reference implementation on system detector: open-telemetry/opentelemetry-collector-contrib/pull/24450

@mx-psi mx-psi requested review from a team July 24, 2023 14:17
docs/resource/host.md Outdated Show resolved Hide resolved
@AlexanderWert
Copy link
Member

AlexanderWert commented Jul 24, 2023

Do we need the explicit differentiation of ipv4 and ipv6 addresses through different attributes?
What would be the use cases to explicitly differentiating those?

Also, in ECS we have the host.ip (array-typed) field.
Couldn't we just go with that?

cc @ChrsMark

@mx-psi
Copy link
Member Author

mx-psi commented Jul 24, 2023

Do we need the explicit differentiation of ipv4 and ipv6 addresses through different attributes? What would be the use cases to explicitly differentiating those?

We discussed this in the system semantic convention WG as well as in #131 (comment). Personally the main benefit I see is that we make semantics of the attributes more specific; I have been bit by very general attributes like host.name and host.id more than once in the past and want to avoid that generally. What is the benefit of being ambiguous?

Discussed #131 (comment)

Also, in ECS we have the host.ip (array-typed) field. Couldn't we just go with that?

I find it at least possible that host.ip (or host.ipv4/host.ipv6) would be used as a namespace in the future and thus I think we should avoid just host.ip per the "Names SHOULD NOT coincide with namespaces" guidance here

@Oberon00
Copy link
Member

The v4/v6 name split is also inconsistent with our network.* semantic conventions.

Copy link
Member

@Oberon00 Oberon00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want to call out also on the PR that AFAIK we intentionally left out these attributes because we want to delay defining them until after we sort out how to implement mutable/added-after-start resource attributes: open-telemetry/opentelemetry-specification#1298

Personally I think given the long dormancy of that issue and the usefulness of capturing IPs, it may be OK to ignore that. But maybe this would also be a good opportunity to get the discussion there going again?

@mx-psi
Copy link
Member Author

mx-psi commented Jul 24, 2023

The v4/v6 name split is also inconsistent with our network.* semantic conventions.

Would you mind elaborating on which conventions are inconsistent with these and how?

I want to call out also on the PR that AFAIK we intentionally left out these attributes because we want to delay defining them until after we sort out how to implement mutable/added-after-start resource attributes:

I think this issue is important but I don't see how this relates to these attributes specifically. The main way we would add these today would be through an OpenTelemetry Collector processor (there are several processors that can modify the resource of telemetry passing through the Collector), from a brief reading of the issue I don't think the issue covers this case but is rather talking about SDKs

@AlexanderWert
Copy link
Member

We discussed this in the system semantic convention WG as well as in #131 (comment). Personally the main benefit I see is that we make semantics of the attributes more specific; I have been bit by very general attributes like host.name and host.id more than once in the past and want to avoid that generally. What is the benefit of being ambiguous?

One benefit of being "ambiguous" here is that querying and interoperating with the data is easier, i.e. does not require knowledge about the IP version. For example, let's say you want to correlate data by an IP. With the above proposal, before querying data you'd need to parse the value of an IP first to understand which version it is, before being able to query on the right attribute (host.ipv4.addresses vs host.ipv6.addresses).

Attributes with comparable, intentional ambiguity are server.address, client.address, source.address, destination.address, etc. For the above reasons, in those cases the attributes are intentionally defined in a non-prescriptive way so that server.address can be either a FQDN or an IP, or the destination.address an IP or a UNIX socket name.

I find it at least possible that host.ip (or host.ipv4/host.ipv6) would be used as a namespace in the future and thus I think we should avoid just host.ip per the "Names SHOULD NOT coincide with namespaces" guidance here

Interesting, haven't thought about that. Do you have examples of additional attributes that we'd like to have under an host.ip namespace?

@Oberon00
Copy link
Member

Oberon00 commented Jul 25, 2023

The v4/v6 name split is also inconsistent with our network.* semantic conventions.

Would you mind elaborating on which conventions are inconsistent with these and how?

@mx-psi See here: https://github.com/open-telemetry/semantic-conventions/blob/main/docs/general/attributes.md#network-attributes

Especially source.address, destination.address, client.socket.address, server.socket.address.

(just now while listing these, I stumbled upon / created #206)

@mx-psi
Copy link
Member Author

mx-psi commented Jul 25, 2023

One benefit of being "ambiguous" here is that querying and interoperating with the data is easier, i.e. does not require knowledge about the IP version. For example, let's say you want to correlate data by an IP. With the above proposal, before querying data you'd need to parse the value of an IP first to understand which version it is, before being able to query on the right attribute (host.ipv4.addresses vs host.ipv6.addresses).

@AlexanderWert (also @Oberon00) My main goal was to ensure that if someone wants to determine whether an address is IPv4/IPv6 they can unambiguously recover that information.

I think Alex's is a fair point and prior attributes are also a good argument against splitting per version. A compromise would be to have a single attribute but provide strict guidance on what the IP format should be for each version (current wording includes references to "quad-dotted notation" and to RFC 5952, that seems enough to me) so that one can still recover the IP version. Would that be okay with you?

Interesting, haven't thought about that. Do you have examples of additional attributes that we'd like to have under an host.ip namespace?

I don't have anything particular in mind (maybe someday we want to express CIDR blocks? or something about IP headers?), but I wouldn't want my lack of imagination to put us in the future in a spot where we have to contradict the spec guidance when we can easily avoid it.

@ChrsMark
Copy link
Member

Hey @mx-psi, I'm +1 for the one single attribute (host.ip). Regarding the potential additional attributes I'm not sure how possible that would be so I would avoid converting the convenient/compact host.ip to host.ip.addresses. I would just stick with the ECS version of the field :).

@mx-psi
Copy link
Member Author

mx-psi commented Jul 31, 2023

I merged both attributes into host.ip. I still think host.ip.addresses would be slightly better in terms of future-proofing but I understand conciseness is also a factor here. I left the wording regarding formatting of IPv4 and IPv6 addresses.

Do we have any guidance regarding plurals in attribute names? For example, we have process.command_args and in general we use plurals elsewhere. Should we do that here (so that would be host.ips)?

@mx-psi mx-psi changed the title [resource/host] Add semantic conventions for IP addresses of a host [resource/host] Add semantic convention for IP addresses of a host Jul 31, 2023
@MikePaquette
Copy link

Hi Folks,
Mike P. from Elastic jumping in here with a few thoughts:

I merged both attributes into host.ip.

+1 In general, it’s best for existing users of ECS if we maintain compatibility (field name equivalence) with widely deployed ECS fields, like host.ip unless there is a strong reason to do otherwise.

I still think host.ip.addresses would be slightly better in terms of future-proofing but I understand conciseness is also a factor here. I left the wording regarding formatting of IPv4 and IPv6 addresses.

I hope we can avoid all situations where we might add an extension (like .addresses) to an existing ECS field. This provides two benefits to our users 1) a consistent practice where a “leaf” field name is always a leaf field name. (e.g., host.ip, source.ip, client.ip, destination.ip, etc.). - this helps analysts remember and/or deduce field names. 2) It avoids potential field data type mismatches in Elasticsearch where a given field name cannot be an object and a “leaf” field in the same index. During the development of ECS, we made one such error, making host an object (i.e., the leading part of host.ip), when previously Logstash used host as a “leaf” field name. We had to deal with reconciling these mismatches for years in mixed user environments.

Do we have any guidance regarding plurals in attribute names? For example, we have process.command_args and in general we use plurals elsewhere. Should we do that here (so that would be host.ips)?

ECS provides a set of guidelines for field names, which includes the rather obvious “Use singular and plural names properly to reflect the field content. For example, use requests_per_sec rather than request_per_sec.”, but this is not very helpful in this case, as we all know that a host can indeed have multiple IP addresses. ECS has another guideline which is to avoid abbreviations where possible, except when the name used for the concept is widely accepted in the abbreviation form, such as ip field, or field sets like os or geo. In this case, given the widespread acceptance of ip, and the ambiguity of ips (Intrusion Prevention System, anyone?) the familiar ip would win out.

If we anticipate that some systems adopting Otel SemConv will not support a combined IPv4/IPv6 field data type, then we could entertain proposals to add new fields, such as host.ipv4 and host.ipv6, but host.ip should always continue to be populated.

@mx-psi
Copy link
Member Author

mx-psi commented Aug 1, 2023

/easycla

Copy link
Member

@ChrsMark ChrsMark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I have filed open-telemetry/opentelemetry-specification#3641 to further discuss and clarify the pluralization guidelines.

@joaopgrassi
Copy link
Member

joaopgrassi commented Aug 11, 2023

I generally also agree that having attributes separated by version would hurt how one search for it - you would need to either know which version is being used OR do a query with both and see which one pop up with data so having a single seems better to me.

What I'm not so sure is using a singular term for an array, since that goes against our recommendation. Adding an exception just opens up for more interpretations to "what is wildly known" (https://github.com/open-telemetry/opentelemetry-specification/pull/3641/files).

In another hand, we said we would take ECS fields and using something other than host.ip will break it. I guess either way we will have to make a compromise - either compromise OTel naming conventions with plural attributes and go with host.ip or compromise ECS and go with host.ip.addresses or host.ip_addresses or the like. I guess we need to pick the one that causes less problems for both sides.

@mx-psi
Copy link
Member Author

mx-psi commented Aug 14, 2023

If we compromise ECS, I would favor host.ip.addresses instead of host.ip_addresses, since it seems at least conceivable to me that we may want to add further items to a host.ip namespace at some point in the future.

@MikePaquette
Copy link

Sorry for the delay, catching up from some time off

What I'm not so sure is using a singular term for an array, since that goes against our recommendation. Adding an exception just opens up for more interpretations to "what is wildly known" (https://github.com/open-telemetry/opentelemetry-specification/pull/3641/files).

If our focus is on the success of analysts interacting with our data, which I think it is, then we should favor exceptions for widely-known terms over strict-adherence to rules like all array type field names must use plural form.

While strict adherence can have some advantages when it comes to automated validation of fields in a schema, in this case, since English language supports irregular plurals, even a simple automated check such as “if type = array, check name ends in ’s’” will require an exception path for irregular plurals, for example, child-children, mouse-mice, radius-radii, etc., so if we’d need to have exceptions for automated checks anyway, it makes more sense to apply the exception in favor of the consumer of our data.

In another hand, we said we would take ECS fields and using something other than host.ip will break it. I guess either way we will have to make a compromise - either compromise OTel naming conventions with plural attributes and go with host.ip or compromise ECS and go with host.ip.addresses or host.ip_addresses or the like. I guess we need to pick the one that causes less problems for both sides.

From the Elastic side, we are strongly in favor of making the exception to the plural attributes naming convention, and avoiding the breaking change. We think that current and future users will appreciate the consistency.

Member
Author
mx-psi commented 2 weeks ago
If we compromise ECS, I would favor host.ip.addresses instead of host.ip_addresses, since it seems at least conceivable to me that we may want to add further items to a host.ip namespace at some point in the future.

If we can’t agree with the above recommendation to make the exception to the plural attributes naming convention, and we feel we must make a breaking change for the thousands of users already using ECS host.ip, I would again ask us to consider that there are two kinds of breaking changes:

  • A name change only, whereby a new field, with the same semantics is created - e.g., host.ip_addresses - this could be represented as an “alias” for host.ip in environments as they migrate from ECS to the merged schema.
  • A change that creates potential mapping conflicts between Elasticsearch “leaf” and “object” fields - e.g., host.ip.addresses - We have learned from experience that this type of change is particularly troublesome, and difficult to work around, so we would ask strongly to never “extend” existing leaf fields like .ip by adding a .anything.

@mx-psi
Copy link
Member Author

mx-psi commented Sep 13, 2023

I want to discuss this on next week's Specification SIG meeting, the main point of contention seems to be if/how much we should deviate from ECS, and I think a general framework like what is discussed on #181 (comment) would help move this forward. Please attend if you want/can so we can move this foward :)

@ChrsMark
Copy link
Member

So I see 3 options here:

A) we use host.ip as is from ECS
B) we use host.ip_addresses
C) we use host.ip.addresses

Pros/Cons per option

A) host.ip

++

  1. We use what ECS has been using for years now which is proven to work
  2. We don't break ECS users which will help with Otel adoption
  3. That naming makes sense

--

  1. We "violate" the plural VS singular rule but with solid reasoning behind. Since the plural VS singular is mostly a guideline and not super strict rule I guess that we can decide to not follow it in specific use-cases when it actually makes sense. Eventually the type will be correct so it's just about a naming convention.
  2. We don't preserve the host.ip as a namespace for the future.

B) host.ip_addresses

++

  1. We honor the plural VS singular rule

--

  1. We break ECS users but only on naming not the schema
  2. We don't preserve the host.ip as a namespace for the future.

C) host.ip.addresses

++

  1. We honor the plural VS singular rule
  2. We preserve the host.ip as a namespace for the future

--

  1. We break ECS schema completely making the migration plan from ECS to Otel very tricky (see [resource/host] Add semantic convention for IP addresses of a host #203 (comment))

Trade-offs

So based on the above the trade-offs are mostly about:

  1. Breaking ECS or not
  2. Honoring the plural VS singular rule
  3. Making it future proof by preserving the host.ip* namespace

From my perspective :

  • item 3 is sth we can just by-pass since we don't have strong evidence that we would like to add more attributes in the future. So breaking the already used ECS just for a possibility might not be our best option here. Again since ECS host.ip has been widely used so far and it's proven to work it's an extra sign that we can rely on this.
  • item 2 is also sth that we can compromise on since we have solid and good reasoning for doing so. Of course we will be following that rule for every attribute/field we are introducing but this should be mostly a tool/guideline for us and not a strict line. Hence I would be ok with "violating" this in that particular case. We can of course adjust the guideline if needed: https://github.com/open-telemetry/opentelemetry-specification/pull/3641/files
  • item 1 is sth that is quite solid on its own. In general avoiding breaking changes is sth we should care about. When there is the trade-off of breaking ECS or breaking Otel SemConv then things are more tricky and we need to evaluate more aspects. However since the IP field that we are talking about is a new field for Otel then introducing a breaking change for ECS would be sth to carefully consider. Note that host.ip is a widely used field (top 5). A solid example is that Filebeat's users will be affected by this (through the add_host_metadata processor)

So to summarize my personal opinion for this would be to make the compromise and decide on using the host.ip to avoid breaking ECS users and ease the adoption of Otel in the future since the cons 2 and 3 are not super crucial in that particular case. Giving priority to ECS when there is no direct conflict will be sth that will help Otel adoption and ease the ECS-Otel merger.

@mx-psi mx-psi requested a review from a team September 18, 2023 15:26
@Oberon00
Copy link
Member

Oberon00 commented Sep 18, 2023

I don't have a horse in this race, but based on the pros/cons, I it seems logical to suggest:

D) host.inet.addresses

++

  1. We honor the plural vs singular rule
  2. We reserve inet as a namespace for ip-related attributes as a namespace for the future

--

  1. We break ECS users but only on naming not the schema
  2. We use a slightly unusual namespace (inet is used for IP in POSIX/libc (https://www.gnu.org/software/libc/manual/html_node/Internet-Address-Formats.html) other alternatives not equal to ip would also work)

Having said that, I agree that host.ip seems fine, and host.inet or another alternative can then be used later on if we do find we need more ip-related attributes other than the ip address itself.

@ChrsMark
Copy link
Member

ChrsMark commented Sep 19, 2023

Thank's @Oberon00 for chiming in!

Having said that, I agree that host.ip seems fine, and host.inet or another alternative can then be used later on if we do find we need more ip-related attributes other than the ip address itself.

That one sounds good to me! Looks like a hybrid 5th option here that covers 2 of the 3 main trade-offs we are discussing:

++

  1. use host.ip, and be ECS compliant
  2. reserve host.inet.* for potentially additional attributes in the future (backup plan, just in case we meet this need in the future)

--

  1. we break the plural VS singular rule but to my mind this should not be a big deal and we could even revisit this rule as part of the Provide guidance when there is conflict with ECS conventions #329 and fine tune the rule so as to be a helpful guideline and not a strict rule that blocks us from choosing meaningful names (ip looks more convenient compared to ip_addresses for example)

What do the rest of @open-telemetry/semconv-system-approvers think here?

@mx-psi
Copy link
Member Author

mx-psi commented Sep 21, 2023

I haven't publicly thanked @ChrsMark for the summary so thanks again! :) and I just want to state that I also feel like host.ip is the best option given the pros/cons list, and that host.inet could be used in the future as a namespace if more conventions in this space are needed without breaking ECS or our own rules.

@joaopgrassi
Copy link
Member

@open-telemetry/specs-semconv-approvers @open-telemetry/specs-semconv-maintainers We need more reviews on this, given the discussions about ECS and etc.

@AlexanderWert AlexanderWert merged commit 0af7e0e into open-telemetry:main Oct 9, 2023
Copy link
Member

@frzifus frzifus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Damn I guess I am a bit late.

I still think host.ip.addresses would be slightly better in terms of future-proofing but I understand conciseness is also a factor here. I left the wording regarding formatting of IPv4 and IPv6 addresses.

I hope we can avoid all situations where we might add an extension (like .addresses) to an existing ECS field. This provides two benefits to our users 1) a consistent practice where a “leaf” field name is always a leaf field name. (e.g., host.ip, source.ip, client.ip, destination.ip, etc.). - this helps analysts remember and/or deduce field names. 2) It avoids potential field data type mismatches in Elasticsearch where a given field name cannot be an object and a “leaf” field in the same index. During the development of ECS, we made one such error, making host an object (i.e., the leading part of host.ip), when previously Logstash used host as a “leaf” field name. We had to deal with reconciling these mismatches for years in mixed user environments.

While I see this point, I am a bit confused. Isnt the current spec promoting fields like client.address or server.address - I noticed this from the http space open-telemetry/opentelemetry-specification#3163? In that case its clear that we expect a single address. And here / in ECS e.g. destination and server provide an address field. I assume that a spec consistent with its recommendations would also help e.g. analyst too.

@MikePaquette
Copy link

While I see this point, I am a bit confused. Isnt the current spec promoting fields like client.address or server.address - I noticed this from the http space open-telemetry/opentelemetry-specification#3163? In that case its clear that we expect a single address. And here / in ECS e.g. destination and server provide an address field. I assume that a spec consistent with its recommendations would also help e.g. analyst too.

Thanks @frzifus, please let me try to clarify, my point above is not objecting generally to the field names *.address or *.addresses, (Indeed, as you point out, ECS includes both *.ip and *.address fields already.) My objection is to adding to or extending an existing "leaf" field, like any *.ip with an extension of .address or .addresses or any other dot-something, as this will create the downstream field datatype issues I described above. I hope this is clearer.

mx-psi added a commit to open-telemetry/opentelemetry-collector-contrib that referenced this pull request Nov 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Archived in project
Development

Successfully merging this pull request may close these issues.

10 participants