diff --git a/opentelemetry/proto/profiles/v1development/eventprofile.md b/opentelemetry/proto/profiles/v1development/eventprofile.md new file mode 100644 index 00000000..6db88a02 --- /dev/null +++ b/opentelemetry/proto/profiles/v1development/eventprofile.md @@ -0,0 +1,284 @@ +# Proposal: Support Timestamped Profiling Events + +**Author(s):** @felixge in collaboration with @thegreystone, @jbachorik et al. + +**šŸš§ WORK IN PROGRESS šŸš§:** This proposal is still in the early stages. I'm mostly looking for feedback from active participants in the profiling SIG before soliciting feedback from the wider community. + +## Summary + +The OpenTelemetry profiling format is currently designed to record aggregated summaries of events. This proposal suggests to pivot the format toward recording individual events with timestamps while retaining the ability to record aggregated profiling data as well. + +This will enable the following use cases. + +1. Richer analysis of On-CPU as well as Off-CPU data. +2. Powerful visualizations such as as Thread Timelines, Flame Charts, and FlameScopes. +3. Better compatibility with JFR, Go Execution Traces and Linux perf. + +## Motivation + +Most of the current OpenTelemetry profiling format, including the concept of aggregating stack traces, is inherited from pprof. +However, given that we have [decided](https://github.com/open-telemetry/opentelemetry-proto/issues/567#issuecomment-2286565449) that strict pprof compatibility is not a goal for the OpenTelemetry, we are now free to design a more flexible and extensible format that can be used for a wider range of profiling use cases than pprof. + +One use case that recently came up is the [collection of Off-CPU profiling data for the ebpf profiler](https://github.com/open-telemetry/opentelemetry-ebpf-profiler/pull/144). The attendees of the [Sep 5](https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit?tab=t.0#heading=h.lenvx4xd62c6) SIG meeting agreed that aggregated profiling data is not ideal for this use case, as it increases the difficulty to reason about the collected data and to correlate it with application behavior. This is especially true when it comes to the analysis of tail latency problems caused by outliers. So instead aggregation, it is much more useful to record this data as individual events with timestamps as well as additional context such as thread id, trace id and span id. This becomes even more powerful when On-CPU samples are also recorded with timestamps, as it allows users to identify spikes and stalls of CPU activity as well as Off-CPU examplars that explain the cause of the stalls and resulting latency (or lack of throughput). + +In addition to the manual analysis of such data, collecting timestamps also enables powerful visualizations such as [Thread Timelines](https://www.datadoghq.com/blog/continuous-profiler-timeline-view/), [Flame Charts](https://developer.chrome.com/docs/devtools/performance/reference#flame-chart), and [FlameScopes](https://www.brendangregg.com/blog/2018-11-08/flamescope-pattern-recognition.html). See [this document](https://docs.google.com/document/d/1285Rp1pSu6eOTInzPynCiJvDoG-r2cjqxWwTewaTncs/edit?tab=t.0#heading=h.jze26hdz58dy) or [this presentation](https://youtu.be/53UIPZfz-_U?si=x4uYpMtOCIJy8ihY) for more details. + +Last but not least, first-class support for timestamps allows greater compatibility with other profiling tools such as JFR, Go Execution Traces and Linux perf that also record timestamps for profiling events. + +## Design + +This proposal comes with a prelimary sketch for the new format to illustrate the feasibility of supporting timestamped as well as aggregated profiling data within the same format. However, the final design will be subject to further discussion and refinement and the changes can be merged via smaller PRs if needed. + +The goals for the design are to be a strict superset of pprof and a useful subset of JFR, Go Execution Traces and Linux perf. Efficiency is also a goal, but it is secondary to the goal of providing a flexible and extensible format that can be implemented by a wide range of profilers and analysis tools. + +TODO: Finish describing the design from a high level. + +## Examples + +TODO: Add example for Go mutex and goroutine profiles. + +### Aggregated CPU Profile + +Below is a simple example of an aggregated CPU profile that demonstrates how the new format can operate in the same way as pprof. + +```txtpb +event_types: [ + { # index 0 + name_string_index: 1 # cpu_sample + unit_string_index: 2 # ms + default_value: 10 # each Event.count represents 10ms of CPU time - this avoid redundant + } +] +strings: [ + "", # index 0 + "cpu_sample", # index 1 + "ms" # index 2 +] + +events: [ + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 1 # main;serve_request + count: 2 # samples, i.e. 20ms of CPU time + } + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 2 # main;foo + count: 1 # samples, i.e. 10ms of CPU time + }, +] + +# stacks, locations, functions, mappings, etc. omitted for brevity +``` + +### Timestamped CPU Profile + +```txtpb +event_types: [ + { # index 0 + name_string_index: 1 # cpu_sample + unit_string_index: 2 # ms + default_value: 10 # each event represents 10ms of CPU time + clock_id: 0 + } +] +clocks: [ + { # index 0 + frequency : 1_000_000_000 # 1 Ghz clock, i.e. 1 cycle = 1 ns + time_unix_nanos: 1_257_894_000_000_000_000 + } +] +attributes: [ + { # index 0 + key_index: 4 # thread.id + value: {int_value: 1} # 1 + }, + { # index 1 + key_index: 4, # thread.id + value: {int_value: 2} # 2 + } +] +strings: [ + "", # index 0 + "cpu_time", # index 1 + "milliseconds", # index 2 + "unix_epoch", # index 3 + "thread.id" # index 4 +] + +events: [ + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 1 # main;serve_request + time: 10_500_000 # 10.5ms since clock start + attributes: [0] # thread_id: 1 + } + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 2 # main;foo + time: 300_000 # 0.3ms since previous event + attributes: [1] # thread_id: 2 + }, + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 1 # main;serve_request + time: 9_200_000 # 9.2ms since previous event + attributes: [0] # thread_id: 1 + } +] +# stacks, locations, functions, mappings, etc. omitted for brevity +``` + +### Timestamped CPU Profile + Off-CPU Thread State Profile + + +```txtpb +event_types: [ + { # index 0 + name_string_index: 1 # "cpu_sample" + unit_string_index: 2 # "ms" + default_value: 10 # each event represents 10ms of CPU time + clock_id: 0 + } + { # index 1 + name_string_index: 2 # "thread_state" + clock_id: 0 + } +] +clocks: [ + { # index 0 + frequency : 1_000_000_000 # 1 Ghz clock, i.e. 1 cycle = 1 ns + time_unix_nanos: 1_257_894_000_000_000_000 + } +] +attributes: [ + { # index 0 + key_index: 4 # thread.id + value: {int_value: 1} # 1 + }, + { # index 1 + key_index: 4, # thread.id + value: {int_value: 2} # 2 + } + { # index 2 + key_index: 5, # thread.state + value: {key_index: 6} # mutex + }, # index 3 + { + key_index: 5, # thread.state + value: {key_index: 7} # sleep + } +] +strings: [ + "", # index 0 + "cpu_sample", # index 1 + "thread_state", # index 2 + "ms", # index 3 + "thread.id", # index 4 + "thread.state", # index 5 + "mutex", # index 6 + "sleep", # index 7 +] + +events: [ + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 1 # main;serve_request + time: 10_500_000 # 10.5ms since clock start + attributes: [0] # thread_id: 1 + } + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 2 # main;foo + time: 300_000 # 0.3ms since previous event + attributes: [1] # thread_id: 2 + }, + { + event_type_index: 0 # cpu_sample (can be omitted on the wire) + stack_index: 1 # main;serve_request + time: 9_200_000 # 9.2ms since previous event + attributes: [0] # thread_id: 1 + }, + { + event_type_index: 1 # thread_state + stack_index: 1 # main;serve_request;mutex_lock + time: 300_000_000 # 300ms since previous event + duration: 50_000_500 # 50ms + attributes: [0, 2] # thread_id: 1, thread.state: mutex + }, + { + event_type_index: 1 # thread_state + stack_index: 1 # main;foo;sleep + time: 5_000_000 # 5ms since previous event + duration: 1_000_000_500 # 1s + attributes: [1, 3] # thread_id: 2, thread.state: sleep + }, +] +# stack, location, function, mapping, etc. omitted for brevity +``` +### Aggregated Memory Profile + +Below is an example of an aggregated memory profile like those available in Go. + +```txtpb +event_types: [ + { # index 0 + name_string_index: 1 # allocations + unit_string_index: 3 # By (bytes) + } + { # index 1 + name_string_index: 2 # liveheap + unit_string_index: 3 # By (bytes) + } +] +strings: [ + "", # index 0 + "allocations", # index 1 + "liveheap", # index 2 + "By" # index 3 +] + +events: [ + { + event_type_index: 0 # allocations (can be omitted on the wire) + stack_index: 1 # main;new_user + count: 100 # objects allocated + sum: 4_800 # bytes allocated + } + { + event_type_index: 1 # liveheap + stack_index: 1 # main;new_user + count: 12 # objects alive + sum: 576 # bytes alive + } + { + event_type_index: 0 # allocations (can be omitted on the wire) + stack_index: 2 # main;new_post + count: 300 # objects allocated + sum: 76_800 # bytes allocated + } + { + event_type_index: 1 # liveheap + stack_index: 2 # main;new_post + count: 73 # objects alive + sum: 18688 # bytes alive + } +] + +# stacks, locations, functions, mappings, etc. omitted for brevity +``` + +## Semantic Conventions + +The following semantic conventions attributes should be added as part of this proposal: + +* `thread.state`: The state of the thread, e.g. "running", "unscheduled", "sleeping", "mutex". +* `go.goroutine.id`: The id of the goroutine that the event was recorded on. +* `go.goroutine.state`: Like thread.state, but with goroutine specific wait states, e.g. "chansend". +* `go.proc.id`: The id of the go processor that the event was recorded on. + +The following semantic conventions already exist and should be used where applicable: + +* [`system.cpu.logical_number`](https://opentelemetry.io/docs/specs/semconv/system/system-metrics/) can be used to record which CPU an event was recorded on. +* [`thread.id`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/thread/) can be used to record the id of the thread that the event was recorded on. +* [`thread.name`](https://opentelemetry.io/docs/specs/semconv/attributes-registry/thread/) can be used to record the name of the thread that the event was recorded on. \ No newline at end of file diff --git a/opentelemetry/proto/profiles/v1development/eventprofile.proto b/opentelemetry/proto/profiles/v1development/eventprofile.proto new file mode 100644 index 00000000..bff45ccb --- /dev/null +++ b/opentelemetry/proto/profiles/v1development/eventprofile.proto @@ -0,0 +1,230 @@ +// Copyright 2024, OpenTelemetry Authors +// +// Licensed under the Apache License, Version 2.0 (the "License"); +// you may not use this file except in compliance with the License. +// You may obtain a copy of the License at +// +// http://www.apache.org/licenses/LICENSE-2.0 +// +// Unless required by applicable law or agreed to in writing, software +// distributed under the License is distributed on an "AS IS" BASIS, +// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +// See the License for the specific language governing permissions and +// limitations under the License. + +syntax = "proto3"; + +package opentelemetry.proto.profiles.v1development; + +import "opentelemetry/proto/common/v1/common.proto"; + +// Profile holds a collection of events that represent a profile. +message Profile { + // Data + repeated Event events = 1; + + // Dictionary Tables + repeated EventType event_types = 3; // first element is the default event type + repeated Stack stacks = 4; // first element should be an empty stack (no frames) + repeated Location locations = 5; + repeated Function functions = 6; + repeated Mapping mappings = 7; + repeated KeyValue attributes = 8; + repeated Link links = 10; + repeated string strings = 11; // first element SHOULD be "" + repeated Clock clocks = 12; + reserved 13 to 15; // keep some 1-byte encodable field numbers for future use + + // Meta Data + fixed64 time_unix_nano = 16; // The end time of the profile in nanoseconds since the Unix epoch. + int64 duration_nanos = 17; // The duration of the profile in nanoseconds. + repeated uint32 comment_string_index = 18; + // TODO: missing period +} + +// Event holds a single event in a profile. It comes in two flavors: +// +// 1. Aggregate Event: The count is greater than 0. +// - time and duration are optional. Their default value is +// represented by by Profile.time_unix_nano and Profile.duration_nanos. +// +// 1. Individual Event: The count value is 0 (omitted on the wire). +// - The time field must be used and the EventType must have a Clock associated +// with it. +// +// TODO: The above is probably a bit too simplistic. There are 32 combinations +// for the fields time, duration, value, count and sum when considering that +// they can be unset (0) or holding a value (>1). We need to be clear which +// combinations have use cases and which should be considered invalid. +message Event { + // The index of EventType associated with this event. The most frequent event + // type should use index 0 so it can be omitted on the wire. + uint32 event_type_index = 1; + + // The index of the Stack associated with this event. + uint32 stack_index = 2; + + // The index of the Link associated with this event if available. + optional uint32 link_index = 3; + + // The indices of the attributes associated with this event. + repeated uint32 attribute_indices = 4; + + // time represents the timestamp of the event. For events without a duration, + // this is the end time of the event. Time values for the same Clock must be + // delta encoded. This means that the time value of the first Event on the + // wire is the number of clock cycles since the start of the associated Clock. + // For subsequent Event messages for the same Clock, the time value is the + // number of clock cycles since the previous message on the wire. + int64 time = 5; + + // The duration of the event. This is typically used for Off-CPU wait events. + // CPU samples are point in time events that have a value representing their + // magnitude, but a duration of 0. the time period during which the CPU + // time acrued is explicitly known. The unit of the duration is defined by the + // Clock associated with the EventType. When no Clock is associated with the + // the EventType, this should be 0 (omitted). + uint64 duration = 6; + + // The value of the event in the unit specified in the EventType. If there is + // a Clock for the EventType, the value defaults to EventType.default_value + // when it is 0 (omitted). If there is no Clock for the EventType, the value + // defaults to Event.count * EventType.default_value. The latter case allows + // for the efficient aggregation of CPU samples with equal magnitudes. + uint64 value = 7; + + // The number of events that were aggregated into this event. A count value of + // 0 should be used for individual events to reduce message size, but 1 is + // also allowed and considered to be equivalent. A count larger than 1 + // indicates that this event is an aggregate of multiple events. + uint64 count = 8; + + // The the sum the values of the aggregated events. If an event has a count + // and a EventType.default_value, the sum field should be 0, and clients + // should derive it by multiplying the count with the default value. + uint64 sum = 9; + + // TODO: Add more statistics for aggregated events (e.g. min, max, + // stddev, etc) or even histogram sketches? +} + +// Clock describes the clock format used by the EventType. Clocks are expected +// to be monotonic and include the time when a system is suspended. +message Clock { + // The number of clock cycles per second (Hz). To represent a wall clock with + // nanosecond resolution, this would be 1_000_000_000. For a TSC clock, the + // value will depend on the TSC frequency of the CPU. The value of this field + // impacts the interpretation of Event.time and Event.duration. + uint64 frequency = 1; + // The start time of the clock in nanoseconds since the Unix epoch. + // Implementations SHOULD avoid chosing a start time that falls within a leap + // second as the corresponding unix timestamps are ambiguous relative to UTC. + fixed64 time_unix_nano = 2; + // The indices of the attributes associated with this clock. + repeated uint32 attribute_indices = 3; +} + +message EventType { + // The name of the event type. Semantic conventions define the names of common + // profiling events as well as how they map to existing profiling formats such + // as JFR and Go Execution Traces. + uint32 name_string_index = 1; + // The unit of the Event.value and Event.sum field. Follows the format + // described by http://unitsofmeasure.org/ucum.html and MUST be specified + // using the UCUM case sensitive (ā€œc/sā€) variant. The units "By" for bytes and + // "s" for seconds as well as the prefixes "n" (nano), "u" (micro), "m" + // (milli), "k" (kilo), "M" (mega) and "G" (giga) SHOULD be supported by all + // implementations. + uint32 unit_string_index = 2; + // The index of the Clock associated with this event type. The absence of this + // field indiciates that events associated with this event type are expected + // to be aggregates containing a Event.count for the number of events that + // were aggregated into this event and an Event.value representing the sum of + // the values of the aggregated events. + optional uint32 clock_index = 3; + // The default value for the Event.value field. This is used when the value + // field is the same for all events of this type. For example CPU samples may + // typically represent the same value of 10ms. + uint64 default_value = 4; + // A list of attributes that further describe the event type. These attributes + // are described using semantic conventions. + repeated uint32 attribute_incidices = 7; +} + +message Stack { + // A Stack is a list of Location indices. The first element is the leaf frame. + // An empty stack indiciates that an Event did not contain a stack trace. + repeated uint32 location_index = 1; +} + +message Location { + uint32 mapping_index = 1; + uint64 address = 2; + repeated Line line = 3; + bool is_folded = 4; + int32 type_string_index = 5; + repeated uint32 attribute_index = 6; +} + +message Link { + bytes trace_id = 1; + bytes span_id = 2; +} + +message Mapping { + uint64 memory_start = 1; + uint64 memory_limit = 2; + uint64 file_offset = 3; + uint32 filename_string_index = 4; + uint32 build_id_string_index = 5; + BuildIdKind build_id_kind = 6; + repeated uint64 attributes = 7; + bool has_functions = 8; + bool has_filenames = 9; + bool has_line_numbers = 10; + bool has_inline_frames = 11; +} + +message Line { + uint32 function_index = 1; + int64 line = 2; + int64 column = 3; +} + +message Function { + uint64 id = 1; + uint32 name_string_uindex = 2; + uint32 system_name_string_index = 3; + uint32 filename_string_index = 4; + int64 start_line = 5; +} + +// KeyValue is semantically equivalent to opentelemetry.proto.common.v1.KeyValue +// but is defined here in order to allow referencing the Profiles.strings table +// to reduce the size of the serialized data, especially repetitive key strings. +message KeyValue { + uint32 key_string_index = 1; + AnyValue value = 2; +} + +// AnyValue is semantically equivalent to opentelemetry.proto.common.v1.ArrayValue, +// see KeyValue message for more details. +message AnyValue { + oneof value { + uint32 string_index = 1; + bool bool_value = 2; + int64 int_value = 3; + double double_value = 4; + ArrayValue array_value = 5; + KeyValueList kvlist_value = 6; + bytes bytes_value = 7; + } +} + +// ArrayValue is semantically equivalent to opentelemetry.proto.common.v1.ArrayValue, +// see KeyValue message for more details. +message ArrayValue { + repeated AnyValue values = 1; +} + +// TODO: Can we use something like https://github.com/bufbuild/protovalidate?