
Standardize process metric reporting #4232

Closed · Tracked by #4227

tigrannajaryan opened this issue on Oct 20, 2021 · 13 comments

Labels: help wanted (Good issue for contributors to OpenTelemetry Service to pick up), Stale

@tigrannajaryan (Member)

The Collector has a hostmetrics receiver which emits process metrics. The Collector also emits its own process metrics.

Today these two sources of process metrics don't follow the same conventions for metric names. We need to come up with standard semantic conventions for emitting process metrics, make them part of the specification (see this issue), and then make sure both the hostmetrics receiver and the Collector's own metrics are emitted according to the defined conventions.

tigrannajaryan added the help wanted (Good issue for contributors to OpenTelemetry Service to pick up) label on Oct 20, 2021
@djaglowski (Member) commented Oct 22, 2021

I've completed an initial diff of the two metric sources. Here are my notes, for those interested. I will continue this task soon by proposing specific changes:

Process metrics collected by hostmetricsreceiver

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process.cpu.time | monotonic sum | s | Total CPU seconds broken down by different states. | `state` SHOULD be one of: `system`, `user`, `wait` |
| process.memory.physical_usage | non-monotonic sum | By | The amount of physical memory in use. | |
| process.memory.virtual_usage | non-monotonic sum | By | Virtual memory size. | |
| process.disk.io | monotonic sum | By | Disk bytes transferred. | `direction` SHOULD be one of: `read`, `write` |
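
For context, these values line up with what the gopsutil library exposes for a process (as far as I know, the hostmetrics process scraper builds on gopsutil). A rough sketch of reading comparable values, assuming the gopsutil v3 API and glossing over platform differences (e.g. IOCounters is not available on every OS):

```go
package main

import (
	"fmt"
	"os"

	"github.com/shirou/gopsutil/v3/process"
)

func main() {
	// Inspect our own process for illustration; the scraper iterates over all PIDs.
	proc, err := process.NewProcess(int32(os.Getpid()))
	if err != nil {
		panic(err)
	}

	// process.cpu.time: CPU seconds broken down by state.
	if times, err := proc.Times(); err == nil {
		fmt.Printf("cpu.time user=%.2fs system=%.2fs wait=%.2fs\n",
			times.User, times.System, times.Iowait)
	}

	// process.memory.physical_usage (RSS) and process.memory.virtual_usage (VMS).
	if mem, err := proc.MemoryInfo(); err == nil {
		fmt.Printf("memory physical=%d By virtual=%d By\n", mem.RSS, mem.VMS)
	}

	// process.disk.io: bytes transferred, broken down by direction.
	if io, err := proc.IOCounters(); err == nil {
		fmt.Printf("disk.io read=%d By write=%d By\n", io.ReadBytes, io.WriteBytes)
	}
}
```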

Process metrics collected for internal telemetry

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process/cpu_seconds | monotonic sum | s | Total CPU user and system time in seconds | |
| process/memory/rss | non-monotonic sum | By | Total physical memory (resident set size) | |
| process/uptime | monotonic sum | s | Uptime of the process. | |
| process/runtime/heap_alloc_bytes | non-monotonic sum | By | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc') | |
| process/runtime/total_alloc_bytes | monotonic sum | By | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc') | |
| process/runtime/total_sys_memory_bytes | non-monotonic sum | By | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys') | |

Notes:

  • process.cpu.time and process/cpu_seconds are nearly equivalent, but:
    • process.cpu.time is broken down by state, and depends on OS
      • linux: system, user, wait
      • windows: system, user
      • other: metric is not collected
    • process/cpu_seconds represents a total
      • total != system + user + wait because there are additional states
  • process.memory.physical_usage is exactly equivalent to process/memory/rss
  • process.memory.virtual_usage has no equivalent in internal telemetry, but would be trivial to add if desired
  • process.disk.io has no equivalent in internal telemetry, but should be easy enough to add if desired
  • process/uptime has no equivalent in hostmetricsreceiver. It is tracked from process start using time.Now(), so should be easy to add if desired
    • might be slightly inaccurate if initial timestamp is captured within hostmetricsreceiver (probably <1s in most cases, so close enough for most use cases)
  • process/runtime/* metrics have no equivalent in hostmetricsreceiver. They are pulled from runtime.MemStats, so should be easy to add if desired (a rough sketch of reading these values, along with uptime, is below)
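
For reference, the values behind the process/runtime/* and process/uptime metrics are straightforward to read in-process. A minimal sketch using only the standard library (the startTime variable here is illustrative and would be captured at process start):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startTime stands in for a timestamp captured as close to process start as possible.
var startTime = time.Now()

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	fmt.Printf("heap_alloc_bytes:       %d\n", ms.HeapAlloc)  // bytes of allocated heap objects
	fmt.Printf("total_alloc_bytes:      %d\n", ms.TotalAlloc) // cumulative bytes allocated for heap objects
	fmt.Printf("total_sys_memory_bytes: %d\n", ms.Sys)        // total bytes of memory obtained from the OS

	// process/uptime: seconds elapsed since the start timestamp was captured.
	fmt.Printf("uptime: %.1fs\n", time.Since(startTime).Seconds())
}
```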

@djaglowski (Member) commented Oct 25, 2021

[Proposed] Unified process metrics

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process.cpu.time | Asynchronous Counter | s | Total CPU seconds broken down by different states. | `state`, if specified, SHOULD be one of: `system`, `user`, `wait`. A process SHOULD be characterized either by data points with no `state` labels, or only by data points with `state` labels. |
| process.memory.physical_usage | Asynchronous UpDownCounter | By | The amount of physical memory in use. | |
| process.memory.virtual_usage | Asynchronous UpDownCounter | By | Virtual memory size. | |
| process.disk.io | Asynchronous Counter | By | Disk bytes transferred. | `direction` SHOULD be one of: `read`, `write` |
| process.uptime | Asynchronous Counter | s | Uptime of the process. | |
| process.runtime.memory.heap.allocated | Asynchronous UpDownCounter | By | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc') | |
| process.runtime.memory.heap.allocated_cumulative | Asynchronous Counter | By | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc') | |
| process.runtime.memory.system | Asynchronous UpDownCounter | By | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys') | |

Notes on proposed changes

  • None of the internal telemetry metrics adhere to OTel's instrument naming requirements.
  • The process/runtime/* metrics all include units in the name, which is against the semantic convention guidelines for units.
  • The process.memory.*_usage metrics are basically encoding a label into the names. The metric semantic conventions specifically document this scenario.
    • Edit: As Tigran has pointed out below, aggregation of physical and virtual memory is not meaningful, so these should be kept as separate metrics.

Detailed proposed changes

  • process.cpu.time
    • Change process/cpu_seconds to process.cpu.time (with no state label)
      • Document that the metric SHOULD either contain data points with no state labels, or only data points with state labels. (i.e. that aggregation of total with "subtotals" yields a non-meaningful sum)
  • Change process/memory/rss to process.memory.physical_usage
  • No changes to process.disk.io
  • Rename process/uptime to process.uptime
  • Rename process/runtime/heap_alloc_bytes to process.runtime.memory.heap.allocated
  • Rename process/runtime/total_alloc_bytes to process.runtime.memory.heap.allocated_cumulative
  • Rename process/runtime/total_sys_memory_bytes to process.runtime.memory.system

Open Questions

In particular, I'd appreciate scrutiny on the heap metric names. I believe they are organized in a logical hierarchy, but they are very verbose, and perhaps unnecessarily so:

  1. Is the process.runtime namespace meaningfully different than just process?
     • process metrics are provided by the OS
     • process.runtime metrics are provided by the Go runtime
  2. Does the process.runtime.memory.heap namespace need to be nested under process.runtime.memory?
     • Consider process.runtime.heap (or just process.heap, depending on the first question)
  3. Is the process.cpu.time metric designed in a valid way? (A sketch follows below.)
     • One use case is interested in the total CPU time, while the other is interested in subtotals.
     • Does it need to be two separate metrics in order to avoid invalid aggregation?
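
To make question 3 concrete, here is a rough sketch of the state-labeled shape of process.cpu.time as an Asynchronous Counter, using the go.opentelemetry.io/otel metric API (the instrument and option names reflect the current API, which may differ from what the Collector uses; cpuSecondsByState is a hypothetical helper):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// cpuSecondsByState is a hypothetical helper returning cumulative CPU seconds
// consumed by this process, keyed by state ("system", "user", "wait").
func cpuSecondsByState() map[string]float64 {
	return map[string]float64{"system": 3.0, "user": 9.1, "wait": 0.4}
}

func registerProcessCPUTime(meter metric.Meter) error {
	// Asynchronous Counter: observed via callback once per collection cycle.
	cpuTime, err := meter.Float64ObservableCounter(
		"process.cpu.time",
		metric.WithUnit("s"),
		metric.WithDescription("Total CPU seconds broken down by different states."),
	)
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		// "Subtotals" shape: only data points that carry a state label.
		// The collapsed (total) shape would instead be a single observation with no attributes.
		for state, seconds := range cpuSecondsByState() {
			o.ObserveFloat64(cpuTime, seconds,
				metric.WithAttributes(attribute.String("state", state)))
		}
		return nil
	}, cpuTime)
	return err
}

func main() {
	// With no SDK configured this registers against a no-op meter; in real use the
	// meter would come from the SDK's MeterProvider.
	_ = registerProcessCPUTime(otel.Meter("example"))
}
```

Either shape is valid with this instrument; the point of the proposal is that a single stream SHOULD NOT mix labeled and unlabeled points, since summing the total with its subtotals yields a non-meaningful result.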

@djaglowski (Member)

I've updated the above proposal w/ a change to process.cpu.time.

@tigrannajaryan (Member, Author)

@djaglowski thanks, I'll take a look and comment.

@tigrannajaryan (Member, Author)

@dmitryax please review this to see how it can affect SignalFx translations, assuming we adopt these changes for the hostmetrics receiver. Do we have a good way to handle this such that it is not a breaking change for our customers?

@tigrannajaryan (Member, Author) commented Oct 26, 2021

> process.cpu.time

I assume we mimic system.cpu.time, right? I don't see that system.cpu.time allows the state attribute to be omitted. Can you provide the rationale for allowing this? Should we also suggest the same for system.cpu.time?

@tigrannajaryan (Member, Author)

> monotonic sum

Please use the instrument terminology defined in the spec. I think Asynchronous Counter is the new term instead of 'monotonic sum' according to the spec.

@tigrannajaryan (Member, Author)

(Meta: it may be worth putting the proposal in a Google doc to make commenting and discussion easier).

@tigrannajaryan (Member, Author)

> process.memory.usage
> Change process.memory.physical_usage to process.memory.usage with total = physical
> Change process.memory.virtual_usage to process.memory.usage with total = virtual

Does this violate the Prometheus rule of thumb about aggregations?

> As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.

Neither sum() nor avg() seem to be meaningful aggregations in this case. I am also not sure the dimension name total reflects the intent.

@djaglowski (Member) commented Oct 26, 2021

> I assume we mimic system.cpu.time, right?

I think process.cpu.time fits the description of a time metric, as described here. Otherwise, it's not necessarily meant to mimic system.cpu.time, though there is obvious similarity and we should align where possible.

> I don't see that system.cpu.time allows the state attribute to be omitted. Can you provide the rationale for allowing this? Should we also suggest the same for system.cpu.time?

I think that our two alternate implementations for process.cpu.time demonstrate a case that could be generalized to all usage metrics, including system.cpu.time. In short, are dimensions ever optional? My proposed design assumes that they may be. By omitting a label, one is essentially collapsing a dimension, indicating that there is no differentiation along that dimension.

If labels are always required, then I think we have two options:

  1. Expect a single label to sometimes represent the entire dimension (e.g. state: total)
  2. Expect that in cases like this, the "same" metric must be split into two metrics (e.g. process.cpu.time and process.cpu.time_by_category)

I don't feel strongly that my initial proposal is the only option, but it seemed to me the best one given the alternatives I was able to identify. I'd definitely appreciate other opinions on this.

@djaglowski (Member) commented Oct 26, 2021

> process.memory.usage
> Change process.memory.physical_usage to process.memory.usage with total = physical
> Change process.memory.virtual_usage to process.memory.usage with total = virtual
>
> Does this violate the Prometheus rule of thumb about aggregations?
>
> > As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.
>
> Neither sum() nor avg() seem to be meaningful aggregations in this case. I am also not sure the dimension name total reflects the intent.

I think you're right. I'll update the proposal to keep them separate. This also has the benefit of preserving all metrics in hostmetricsreceiver.

@tigrannajaryan (Member, Author)

> I think that our two alternate implementations for process.cpu.time demonstrate a case that could be generalized to all usage metrics, including system.cpu.time. In short, are dimensions ever optional? My proposed design assumes that they may be. By omitting a label, one is essentially collapsing a dimension, indicating that there is no differentiation along that dimension.

I understand the proposal and it seems reasonable, but I am not an expert in metrics, so I would like to hear some other opinions. This needs to be posted as an OTel spec proposal anyway, so we should see more feedback there. Let's go with what you suggest, but explicitly ask for feedback on this issue.

@djaglowski (Member)

> This needs to be posted as an OTel spec proposal anyway, so we should see more feedback there. Let's go with what you suggest, but explicitly ask for feedback on this issue.

Ok, will do. Thanks for advising.

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-specification that referenced this issue Jan 25, 2022
ChengJinbao added a commit to ChengJinbao/opentelemetry-specification that referenced this issue Nov 16, 2022
jsuereth pushed a commit to jsuereth/otel-semconv-test that referenced this issue Apr 19, 2023
jsuereth pushed a commit to open-telemetry/semantic-conventions that referenced this issue May 11, 2023
github-actions bot added the Stale label on Oct 29, 2023
github-actions bot closed this as not planned on Dec 15, 2023
joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024