
Standardize process metric reporting #4232

Closed · Tracked by #4227

tigrannajaryan opened this issue on Oct 20, 2021 · 13 comments

Labels: help wanted (Good issue for contributors to OpenTelemetry Service to pick up), Stale

@tigrannajaryan (Member)

The Collector has a hostmetrics receiver which emits process metrics. The Collector also emits its own process metrics.

Today these two sources of process metrics don't follow the same conventions for metric names. We need to come up with standard semantic conventions for emitting process metrics, make them part of the specification (see this issue), and then make sure both the hostmetrics receiver and the Collector's own metrics are emitted according to the defined conventions.

tigrannajaryan added the help wanted (Good issue for contributors to OpenTelemetry Service to pick up) label on Oct 20, 2021
@djaglowski (Member) commented Oct 22, 2021

I've completed an initial diff of the two metric sources. Here are my notes, for those interested. I will continue this task soon by proposing specific changes:

Process metrics collected by hostmetricsreceiver

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process.cpu.time | monotonic sum | s | Total CPU seconds broken down by different states. | `state` SHOULD be one of: `system`, `user`, `wait` |
| process.memory.physical_usage | non-monotonic sum | By | The amount of physical memory in use. | |
| process.memory.virtual_usage | non-monotonic sum | By | Virtual memory size. | |
| process.disk.io | monotonic sum | By | Disk bytes transferred. | `direction` SHOULD be one of: `read`, `write` |
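
For context, these values line up with what the gopsutil library exposes for a process (as far as I know, the hostmetrics process scraper builds on gopsutil). A rough sketch of reading comparable values, assuming the gopsutil v3 API and glossing over platform differences (e.g. IOCounters is not available on every OS):

```go
package main

import (
	"fmt"
	"os"

	"github.com/shirou/gopsutil/v3/process"
)

func main() {
	// Inspect our own process for illustration; the scraper iterates over all PIDs.
	proc, err := process.NewProcess(int32(os.Getpid()))
	if err != nil {
		panic(err)
	}

	// process.cpu.time: CPU seconds broken down by state.
	if times, err := proc.Times(); err == nil {
		fmt.Printf("cpu.time user=%.2fs system=%.2fs wait=%.2fs\n",
			times.User, times.System, times.Iowait)
	}

	// process.memory.physical_usage (RSS) and process.memory.virtual_usage (VMS).
	if mem, err := proc.MemoryInfo(); err == nil {
		fmt.Printf("memory physical=%d By virtual=%d By\n", mem.RSS, mem.VMS)
	}

	// process.disk.io: bytes transferred, broken down by direction.
	if io, err := proc.IOCounters(); err == nil {
		fmt.Printf("disk.io read=%d By write=%d By\n", io.ReadBytes, io.WriteBytes)
	}
}
```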

Process metrics collected for internal telemetry

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process/cpu_seconds | monotonic sum | s | Total CPU user and system time in seconds | |
| process/memory/rss | non-monotonic sum | By | Total physical memory (resident set size) | |
| process/uptime | monotonic sum | s | Uptime of the process. | |
| process/runtime/heap_alloc_bytes | non-monotonic sum | By | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc') | |
| process/runtime/total_alloc_bytes | monotonic sum | By | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc') | |
| process/runtime/total_sys_memory_bytes | non-monotonic sum | By | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys') | |

Notes:

  • process.cpu.time and process/cpu_seconds are nearly equivalent, but:
    • process.cpu.time is broken down by state, and depends on OS
      • linux: system, user, wait
      • windows: system, user
      • other: metric is not collected
    • process/cpu_seconds represents a total
      • total != system + user + wait because there are additional states
  • process.memory.physical_usage is exactly equivalent to process/memory/rss
  • process.memory.virtual_usage has no equivalent in internal telemetry, but would be trivial to add if desired
  • process.disk.io has no equivalent in internal telemetry, but should be easy enough to add if desired
  • process/uptime has no equivalent in hostmetricsreceiver. It is tracked from process start using time.Now(), so should be easy to add if desired
    • might be slightly inaccurate if initial timestamp is captured within hostmetricsreceiver (probably <1s in most cases, so close enough for most use cases)
  • process/runtime/* metrics have no equivalent in hostmetricsreceiver. They are pulled from runtime.MemStats, so should be easy to add if desired (a rough sketch of reading these values, along with uptime, is below)
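
For reference, the values behind the process/runtime/* and process/uptime metrics are straightforward to read in-process. A minimal sketch using only the standard library (the startTime variable here is illustrative and would be captured at process start):

```go
package main

import (
	"fmt"
	"runtime"
	"time"
)

// startTime stands in for a timestamp captured as close to process start as possible.
var startTime = time.Now()

func main() {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	fmt.Printf("heap_alloc_bytes:       %d\n", ms.HeapAlloc)  // bytes of allocated heap objects
	fmt.Printf("total_alloc_bytes:      %d\n", ms.TotalAlloc) // cumulative bytes allocated for heap objects
	fmt.Printf("total_sys_memory_bytes: %d\n", ms.Sys)        // total bytes of memory obtained from the OS

	// process/uptime: seconds elapsed since the start timestamp was captured.
	fmt.Printf("uptime: %.1fs\n", time.Since(startTime).Seconds())
}
```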

@djaglowski (Member) commented Oct 25, 2021

[Proposed] Unified process metrics

| Name | Instrument | Units | Description | Labels |
| --- | --- | --- | --- | --- |
| process.cpu.time | Asynchronous Counter | s | Total CPU seconds broken down by different states. | `state`, if specified, SHOULD be one of: `system`, `user`, `wait`. A process SHOULD be characterized either by data points with no `state` labels, or only by data points with `state` labels. |
| process.memory.physical_usage | Asynchronous UpDownCounter | By | The amount of physical memory in use. | |
| process.memory.virtual_usage | Asynchronous UpDownCounter | By | Virtual memory size. | |
| process.disk.io | Asynchronous Counter | By | Disk bytes transferred. | `direction` SHOULD be one of: `read`, `write` |
| process.uptime | Asynchronous Counter | s | Uptime of the process. | |
| process.runtime.memory.heap.allocated | Asynchronous UpDownCounter | By | Bytes of allocated heap objects (see 'go doc runtime.MemStats.HeapAlloc') | |
| process.runtime.memory.heap.allocated_cumulative | Asynchronous Counter | By | Cumulative bytes allocated for heap objects (see 'go doc runtime.MemStats.TotalAlloc') | |
| process.runtime.memory.system | Asynchronous UpDownCounter | By | Total bytes of memory obtained from the OS (see 'go doc runtime.MemStats.Sys') | |

Notes on proposed changes

  • None of the internal telemetry metrics adhere to OTel's instrument naming requirements.
  • The process/runtime/* metrics all include units in the name, which is against the semantic convention guidelines for units.
  • The process.memory.*_usage metrics are basically encoding a label into the names. The metric semantic conventions specifically document this scenario.
    • Edit: As Tigran has pointed out below, aggregation of physical and virtual memory is not meaningful, so these should be kept as separate metrics.

Detailed proposed changes

  • process.cpu.time
    • Change process/cpu_seconds to process.cpu.time (with no state label)
      • Document that the metric SHOULD either contain data points with no state labels, or only data points with state labels. (i.e. that aggregation of total with "subtotals" yields a non-meaningful sum)
  • Change process/memory/rss to process.memory.physical_usage
  • No changes to process.disk.io
  • Rename process/uptime to process.uptime
  • Rename process/runtime/heap_alloc_bytes to process.runtime.memory.heap.allocated
  • Rename process/runtime/total_alloc_bytes to process.runtime.memory.heap.allocated_cumulative
  • Rename process/runtime/total_sys_memory_bytes to process.runtime.memory.system

Open Questions

In particular, I'd appreciate scrutiny on the heap metric names. I believe they are organized in a logical hierarchy, but they are very verbose, and perhaps unnecessarily so:

  1. Is the process.runtime namespace meaningfully different than just process?
     • process metrics are provided by the OS
     • process.runtime metrics are provided by the Go runtime
  2. Does the process.runtime.memory.heap namespace need to be nested under process.runtime.memory?
     • Consider process.runtime.heap (or just process.heap, depending on the first question)
  3. Is the process.cpu.time metric designed in a valid way? (A sketch follows below.)
     • One use case is interested in the total CPU time, while the other is interested in subtotals.
     • Does it need to be two separate metrics in order to avoid invalid aggregation?
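
To make question 3 concrete, here is a rough sketch of the state-labeled shape of process.cpu.time as an Asynchronous Counter, using the go.opentelemetry.io/otel metric API (the instrument and option names reflect the current API, which may differ from what the Collector uses; cpuSecondsByState is a hypothetical helper):

```go
package main

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// cpuSecondsByState is a hypothetical helper returning cumulative CPU seconds
// consumed by this process, keyed by state ("system", "user", "wait").
func cpuSecondsByState() map[string]float64 {
	return map[string]float64{"system": 3.0, "user": 9.1, "wait": 0.4}
}

func registerProcessCPUTime(meter metric.Meter) error {
	// Asynchronous Counter: observed via callback once per collection cycle.
	cpuTime, err := meter.Float64ObservableCounter(
		"process.cpu.time",
		metric.WithUnit("s"),
		metric.WithDescription("Total CPU seconds broken down by different states."),
	)
	if err != nil {
		return err
	}
	_, err = meter.RegisterCallback(func(_ context.Context, o metric.Observer) error {
		// "Subtotals" shape: only data points that carry a state label.
		// The collapsed (total) shape would instead be a single observation with no attributes.
		for state, seconds := range cpuSecondsByState() {
			o.ObserveFloat64(cpuTime, seconds,
				metric.WithAttributes(attribute.String("state", state)))
		}
		return nil
	}, cpuTime)
	return err
}

func main() {
	// With no SDK configured this registers against a no-op meter; in real use the
	// meter would come from the SDK's MeterProvider.
	_ = registerProcessCPUTime(otel.Meter("example"))
}
```

Either shape is valid with this instrument; the point of the proposal is that a single stream SHOULD NOT mix labeled and unlabeled points, since summing the total with its subtotals yields a non-meaningful result.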

@djaglowski (Member)

I've updated the above proposal w/ a change to process.cpu.time.

@tigrannajaryan (Member, Author)

@djaglowski thanks, I'll take a look and comment.

@tigrannajaryan (Member, Author)

@dmitryax please review this to see how it can affect SignalFx translations, assuming we adopt these changes for the hostmetrics receiver. Do we have a good way to handle this such that it is not a breaking change for our customers?

@tigrannajaryan (Member, Author) commented Oct 26, 2021

> process.cpu.time

I assume we mimic system.cpu.time, right? I don't see that system.cpu.time allows the state attribute to be omitted. Can you provide the rationale for allowing this? Should we also suggest the same for system.cpu.time?

@tigrannajaryan (Member, Author)

> monotonic sum

Please use the instrument terminology defined in the spec. I think Asynchronous Counter is the new term instead of 'monotonic sum' according to the spec.

@tigrannajaryan (Member, Author)

(Meta: it may be worth putting the proposal in a Google doc to make commenting and discussion easier).

@tigrannajaryan (Member, Author)

> process.memory.usage
> Change process.memory.physical_usage to process.memory.usage with total = physical
> Change process.memory.virtual_usage to process.memory.usage with total = virtual

Does this violate the Prometheus rule of thumb about aggregations?

> As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.

Neither sum() nor avg() seem to be meaningful aggregations in this case. I am also not sure the dimension name total reflects the intent.

@djaglowski (Member) commented Oct 26, 2021

> I assume we mimic system.cpu.time, right?

I think process.cpu.time fits the description of a time metric, as described here. Otherwise, it's not necessarily meant to mimic system.cpu.time, though there is obvious similarity and we should align where possible.

> I don't see that system.cpu.time allows the state attribute to be omitted. Can you provide the rationale for allowing this? Should we also suggest the same for system.cpu.time?

I think that our two alternate implementations for process.cpu.time demonstrate a case that could be generalized to all usage metrics, including system.cpu.time. In short, are dimensions ever optional? My proposed design assumes that they may be. By omitting a label, one is essentially collapsing a dimension, indicating that there is no differentiation along that dimension.

If labels are always required, then I think we have two options:

  1. Expect a single label to sometimes represent the entire dimension (e.g. state: total)
  2. Expect that in cases like this, the "same" metric must be split into two metrics (e.g. process.cpu.time and process.cpu.time_by_category)

I don't feel strongly that my initial proposal is the only option, but it seemed to me the best one given the alternatives I was able to identify. I'd definitely appreciate other opinions on this.

@djaglowski (Member) commented Oct 26, 2021

> process.memory.usage
> Change process.memory.physical_usage to process.memory.usage with total = physical
> Change process.memory.virtual_usage to process.memory.usage with total = virtual
>
> Does this violate the Prometheus rule of thumb about aggregations?
>
> > As a rule of thumb, either the sum() or the avg() over all dimensions of a given metric should be meaningful (though not necessarily useful). If it is not meaningful, split the data up into multiple metrics.
>
> Neither sum() nor avg() seem to be meaningful aggregations in this case. I am also not sure the dimension name total reflects the intent.

I think you're right. I'll update the proposal to keep them separate. This also has the benefit of preserving all metrics in hostmetricsreceiver.

@tigrannajaryan (Member, Author)

> I think that our two alternate implementations for process.cpu.time demonstrate a case that could be generalized to all usage metrics, including system.cpu.time. In short, are dimensions ever optional? My proposed design assumes that they may be. By omitting a label, one is essentially collapsing a dimension, indicating that there is no differentiation along that dimension.

I understand the proposal and it seems reasonable, but I am not an expert in metrics, so I would like to hear some other opinions. This needs to be posted as an OTel spec proposal anyway, so we should see more feedback there. Let's go with what you suggest, but explicitly ask for feedback on this issue.

@djaglowski (Member)

> This needs to be posted as an OTel spec proposal anyway, so we should see more feedback there. Let's go with what you suggest, but explicitly ask for feedback on this issue.

Ok, will do. Thanks for advising.

tigrannajaryan pushed a commit to open-telemetry/opentelemetry-specification that referenced this issue Jan 25, 2022
ChengJinbao added a commit to ChengJinbao/opentelemetry-specification that referenced this issue Nov 16, 2022
jsuereth pushed a commit to jsuereth/otel-semconv-test that referenced this issue Apr 19, 2023
jsuereth pushed a commit to open-telemetry/semantic-conventions that referenced this issue May 11, 2023
github-actions bot added the Stale label on Oct 29, 2023
github-actions bot closed this as not planned on Dec 15, 2023
joaopgrassi pushed a commit to dynatrace-oss-contrib/semantic-conventions that referenced this issue Mar 21, 2024