Introduces Profiling Data Model #237

Open · petethepig wants to merge 17 commits into main

Conversation

Member

@petethepig petethepig commented Sep 7, 2023

The OTel Profiling SIG has been working on this for a while, and we're ready to present our Data Model. So far we've done a number of things to validate it:

We're seeking feedback and hoping to get this approved.


For (a lot) more details, see:

@petethepig petethepig requested a review from a team September 7, 2023 15:47
Contributor

@jsuereth jsuereth left a comment


This looks like a great first step in profiling: stack traces, plus correlation of them to other telemetry types.

Added a few non-blocking comments.

text/profiles/0237-profiles-data-model.md

### Semantic Conventions

TODO: describe things like profile types and units
Contributor

Since this is an OTEP - the only section you need here is describing that you plan to leverage Semantic Conventions to provide "open enums", i.e. a set of known types + enums that can expand over time.

Member Author

✅ Updated the OTEP with some words about how we're planning to leverage Semantic Conventions, and provided a few examples (for profile type and units).


## Open questions

Client implementations are out of scope for this OTEP. At the time of writing this we do have a reference implementation in Go, as well as a working backend and collector, but they are not yet ready for production use. We are also working on a reference implementation in Java. We are looking for contributors to help us with other languages.
Contributor

It looks like you have enough prototypes for us to evaluate for actual Specification inclusion.

I'd recommend that as soon as we start seeing approvals land on this OTEP, we open the DataModel specification markdown with just the data model components.


There are two main ways relationships between messages are represented:
* by embedding a message into another message (standard protobuf way)
* by referencing a message by index (similar to how it's done in pprof)
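
For illustration, here is a minimal Go sketch contrasting the two styles; the Location/Function types are simplified stand-ins, not the actual generated proto code:

```go
package main

import "fmt"

// Embedded style: each Location carries its Function directly.
type FunctionEmbedded struct{ Name string }
type LocationEmbedded struct{ Function FunctionEmbedded }

// Index style (pprof-like): Locations refer to Functions by position
// in a shared lookup table owned by the Profile.
type Function struct{ Name string }
type Location struct{ FunctionIndex uint32 }
type Profile struct {
	Functions []Function // lookup table, referenced by index
	Locations []Location
}

func main() {
	p := Profile{
		Functions: []Function{{Name: "main"}, {Name: "handleRequest"}},
		Locations: []Location{{FunctionIndex: 1}, {FunctionIndex: 1}, {FunctionIndex: 0}},
	}
	// Resolving an index reference is a slice lookup; repeated references
	// to the same Function cost only a small integer each instead of a full copy.
	for _, loc := range p.Locations {
		fmt.Println(p.Functions[loc.FunctionIndex].Name)
	}
}
```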
Contributor

I actually suspect that table-lookup is a technique we MAY want to apply to other signals for significant reduction in size. However, given the state of OTLP-Arrow, it could be a moot point.

IF we see successful implementations of profiling "processors" in the otel collector, I'd suggest we think about adding this capability to other Signals.

Member

We need to be careful with this. In my experiments I have seen cases where, for compressed payloads, string lookup tables result in a regression (bigger compressed payloads).

Member

@tylerbenson tylerbenson left a comment

Great OTEP! Very well reasoned.

@lizthegrey
Member

Promising work so far. I am curious what input the Go pprof team has on this design for the data format, and whether they'd consider native export in this format; if not, why not?

text/profiles/0237-profiles-data-model.md
- function_index: 1
- line:
- function_index: 2
profile_types:
Contributor

I understand this is a simplified example focused on showing how data is linked; however, adding at least type_index and unit_index (which I think aren't optional) would make it clearer to me. Without those, it's not clear what is measured.

Member Author

@petethepig petethepig Sep 8, 2023

✅ Updated the OTEP

Member

@tigrannajaryan tigrannajaryan left a comment

Thank you for putting this together @petethepig

I did a quick pass, but would like to take a more thorough look one more time, so let's keep this open for a few days (also to make sure we have more eyes on this).

text/profiles/0237-profiles-data-model.md
repeated Link links = 11;

// A lookup table of AttributeSets. Other messages refer to AttributeSets in this table by index. The first message must be an empty AttributeSet — this represents a null AttributeSet.
repeated AttributeSet attribute_sets = 12;
Member

It is a bit confusing that we have both attributes and attribute_set. Maybe use a different name or explain the difference explicitly?

text/profiles/0237-profiles-data-model.md
// List of indices referring to AttributeSets in the Profile's attribute set table. Each attribute set corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length. [Optional]
repeated uint32 attribute_set_indices = 12;

// List of values. Each value corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length.
Member

It would be useful to clarify what a value is.

text/profiles/0237-profiles-data-model.md
}

// Represents a single profile type. It implicitly creates a connection between Stacktraces, Links, AttributeSets, values and timestamps. The connection is based on the order of the elements in the corresponding lists. This implicit connection creates an ephemeral structure called Sample. The length of reference lists must be the same. It is acceptable however for timestamps, links and attribute set lists to be empty. It is not acceptable for stacktrace or values lists to be empty.
message ProfileType {
Member

This essentially describes a list of samples, right? Would SampleList be a better name?

Member Author

Yeah, I agree, ProfileType is somewhat confusing. I like SampleList. I'll update the proto and OTEP.

Member

👍

Another possible name is just Samples.

Member Author

Ended up going with SampleList. I like Samples more, but Samples is plural, less specific, and more ambiguous than SampleList. So I'm worried calling it Samples is more confusing.

// The object this entry is loaded from. This can be a filename on
// disk for the main binary and shared libraries, or virtual
// abstractions like "[vdso]". Index into string table
uint32 filename_index = 4;
Member

It would be great to show a benchmark of how much these and other string lookup tables save. If the savings are small it may be useful to just use the string values directly. We typically compress the OTLP payloads anyway, which will very likely make the savings negligible. In my experiments I have seen GZIP/ZSTD-compressed payloads increase in some cases because of using lookup tables, since numeric varints compress worse than duplicate strings.
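
One quick way to sanity-check this locally is to compare the compressed size of the two encodings directly. A rough sketch (not the benchmark asked for above, and the outcome depends heavily on the shape of the data):

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/binary"
	"fmt"
	"strings"
)

func gzipSize(b []byte) int {
	var buf bytes.Buffer
	w := gzip.NewWriter(&buf)
	w.Write(b)
	w.Close()
	return buf.Len()
}

func main() {
	filenames := []string{"/usr/lib/libc.so.6", "/app/bin/server", "[vdso]"}

	// Encoding A: repeat the full string for each of 10k frames.
	var direct bytes.Buffer
	for i := 0; i < 10000; i++ {
		direct.WriteString(filenames[i%len(filenames)])
	}

	// Encoding B: string table once, then a varint index per frame.
	var indexed bytes.Buffer
	indexed.WriteString(strings.Join(filenames, "\x00"))
	tmp := make([]byte, binary.MaxVarintLen64)
	for i := 0; i < 10000; i++ {
		n := binary.PutUvarint(tmp, uint64(i%len(filenames)))
		indexed.Write(tmp[:n])
	}

	fmt.Println("direct strings, gzipped:", gzipSize(direct.Bytes()))
	fmt.Println("lookup table,   gzipped:", gzipSize(indexed.Bytes()))
}
```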

Comment on lines 329 to 330
SYMBOL_FIDELITY_UNSPECIFIED = 0;
SYMBOL_FIDELITY_FULL = 1;
Member

Please add comments to explain these.

Comment on lines +361 to +367
uint32 mapping_index = 1;
// The instruction address for this location, if available. It
// should be within [Mapping.memory_start...Mapping.memory_limit]
// for the corresponding mapping. A non-leaf address may be in the
// middle of a call instruction. It is up to display tools to find
// the beginning of the instruction if necessary.
uint64 address = 2;
Member

At least in Go, this ordering typically results in unused bytes in memory due to struct field alignment (on a 64-bit platform we will have 4 unused bytes here). It helps to sort fields in decreasing field-size order.
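
For illustration, a small Go program showing the padding effect; the field sets are hypothetical and simplified (generated protobuf structs also carry internal bookkeeping fields), so the exact numbers depend on the full message:

```go
package main

import (
	"fmt"
	"unsafe"
)

// Narrow field before a wide one: the 4-byte index forces 4 bytes of
// padding so the uint64 stays 8-byte aligned.
type locationNarrowFirst struct {
	mappingIndex uint32
	address      uint64
	lineIndex    uint32
}

// Largest fields first: the two uint32s pack together and the padding
// between fields disappears.
type locationWideFirst struct {
	address      uint64
	mappingIndex uint32
	lineIndex    uint32
}

func main() {
	fmt.Println(unsafe.Sizeof(locationNarrowFirst{})) // 24 on 64-bit platforms
	fmt.Println(unsafe.Sizeof(locationWideFirst{}))   // 16 on 64-bit platforms
}
```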

text/profiles/0237-profiles-data-model.md

### Benchmarking

[Benchmarking results](https://docs.google.com/spreadsheets/d/1Q-6MlegV8xLYdz5WD5iPxQU2tsfodX1-CDV1WeGzyQ0/edit#gid=0) showed that the `arrays` representation is the most efficient in terms of CPU utilization, memory consumption, and size of the resulting protobuf payload. Some notable benchmark results are showcased below:
Contributor

Could you elaborate on what the benchmark actually did and what CPU utilization and memory consumption are referring to? Did you measure serialization (pprof to protobuf), or were there also some deserialization benchmarks done?

petethepig and others added 3 commits September 8, 2023 13:27
Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
Co-authored-by: Tigran Najaryan <4194920+tigrannajaryan@users.noreply.github.com>
// Source file containing the function. Index into string table
uint32 filename_index = 3;
// Line number in source file.
uint32 start_line = 4;

It would be helpful to clarify whether this points to the function definition or the call site, or whether profilers are free to choose.

// The id of the corresponding profile.Function for this line.
uint32 function_index = 1;
// Line number in source code.
uint32 line = 2;
@morrisonlevi morrisonlevi Sep 11, 2023

I understand shrinking many numbers from 64 bit to 32 bit, but I wanted to double-check this case. I've definitely seen auto-generated code get quite large, for example, but even 32-bit is large for "lines in a file" so it may be okay. Was there any discussion or analysis around this specifically? Maybe some compilers publish limits we can look at?

Member

Are you concerned about 32 bits not being enough or being too large?

@morrisonlevi morrisonlevi Sep 11, 2023

My concern is about potentially not being big enough for all languages. I've heard of programs in the millions of lines of code (always autogenerated code), and so I was curious if there was any discussion or analysis about this. Still, "millions" is not 4 billion, so to be clear I'm not saying a 32-bit line number is a problem, just that we should be cautious about it. It's a nice win if it's defensible, and I think it likely is.

Comment on lines +392 to +399
uint32 name_index = 1;
// Name of the function, as identified by the system.
// For instance, it can be a C++ mangled name. Index into string table
uint32 system_name_index = 2;
// Source file containing the function. Index into string table
uint32 filename_index = 3;
// Line number in source file.
uint32 start_line = 4;
Member

Are all these fields mandatory? Can I omit for example system_name_index? Are 0 values valid indices? Should 0 values be invalid so that the fields can be optional? This is important if using a ProtoBuf implementation that does not support the notion of field "presence" and it is impossible to distinguish between the 0-value of the field and the absence of the field.

Or do you use the fact that the string at 0 index of Profile.string_table is an empty string as a way to deal with this?

Member Author

The zeroth element in the string table is always the empty string ("") so that nil strings can be encoded.
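
A minimal sketch of this convention, using a plain Go slice in place of the generated proto types:

```go
package main

import "fmt"

func main() {
	// Index 0 is reserved for the empty string, so a zero-valued
	// *_index field naturally reads as "no string" / unset.
	stringTable := []string{"", "main.go", "handleRequest"}

	filenameIndex := uint32(1)   // points at "main.go"
	systemNameIndex := uint32(0) // field omitted on the wire; decodes as ""

	fmt.Printf("filename=%q system_name=%q\n",
		stringTable[filenameIndex], stringTable[systemNameIndex])
}
```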

repeated int64 values = 13;

// List of timestamps. Each timestamp corresponds to a Stacktrace in stacktrace_indices list. Length must match stacktrace_indices length.
repeated uint64 timestamps = 14;
Member

Are these nanoseconds since UNIX epoch? If so a fixed64 may be a better choice since I think for the typical current values the varint encoding is larger than fixed 8 bytes and is more expensive to encode/decode.
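
For a quick size check: current UNIX-epoch nanosecond values take 9 bytes as a varint versus a flat 8 bytes as fixed64. A sketch using the standard library's uvarint encoder (protobuf varints use the same scheme):

```go
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

func main() {
	ts := uint64(time.Now().UnixNano()) // roughly 1.7e18 for current dates

	buf := make([]byte, binary.MaxVarintLen64)
	n := binary.PutUvarint(buf, ts)

	fmt.Printf("varint encoding:  %d bytes\n", n) // 9 for current timestamps
	fmt.Printf("fixed64 encoding: %d bytes\n", 8)
}
```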

Comment on lines +194 to +200
fixed64 start_time_unix_nano = 2;

// end_time_unix_nano is the end time of the profile.
// Value is UNIX Epoch time in nanoseconds since 00:00:00 UTC on 1 January 1970.
//
// This field is semantically required and it is expected that end_time >= start_time.
fixed64 end_time_unix_nano = 3;
Member

What is the purpose of these times? Are sample timestamps expected to fall in this range?

@thomasdullien

thomasdullien commented Sep 13, 2023

Hey all,

First off, apologies for the long silence -- I was largely out-of-commission for most of this year due to family/health related issues, and was not involved much, so sorry if I am missing some details.

I'd like to chime in on the 'why not use pprof' discussion. In general, I think that the idea of an industry-wide profiling format (that is both good on the wire and good on disk) is a nice one.

That said, the first thing to be clear about is "what that means (for Go)": profiling serves many different runtimes (often in the same process), and Go is one of them. There are also use-cases that profiling supports that Go's pprof will not support without additional information (calculating inferred spans from high-frequency profiling stack traces being one example).

So the obvious organisational question that crops up: How would the co-evolution of a format that serves many customers work?

On the technical side, there are a number of things that are important to have in a format that pprof does not currently provide (FWIW, the OTel draft does not provide all of them either yet, as we're still in discussions around it):

(Note before I start: I am not an expert in Pprof, so @rsc and @felixge and everybody else please correct me on any points in which I misunderstand the format)

  1. Support for efficient high-fidelity timestamps. This is important for a variety of latency-relevant use cases (inferring spans being one of many). Those should be a first-order citizen (e.g. placing them in "Label" seems like a terrible idea), but optional.
  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)
  3. The BuildID in the Mapping proto needs to be changed to be a sort-of-hash of parts of the binary. There are too many examples out there that abuse the ELF BuildID in a way that kills any (even probabilistic) notions of uniqueness (or presence, see Alpine). My ideal perspective is we standardize the way to calculate it from a given ELF mapping (my suggestion is to hash the ELF header plus a few deterministically-randomly selected pages).
  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).

Lastly, there is the "wire format vs file format" question: The main difference will be "streaming vs bulk data", and it should be kept in mind that some folks want to stream profiling data continuously from machines, perhaps aggregated for 2-3 seconds on each machine. Is the proposal to just send a pprof file blob? Wouldn't the overhead of resending all the mappings every 2-3 seconds be a bad idea? Are there other side effects from file-format-vs-wire-format that we haven't thought about?

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

My (possibly unfounded) worry is that pprof makes a very Google/Go-specific assumption: Stack traces that do not span very large numbers of executables. This is true if you assume a toolchain like Google/Go's that strongly favors a few enormous binaries, it is not true in other scenarios (the most brutal case would be a system where someone uses fine-grained ASLR, but I'll admit that's a very extreme scenario).

There's also an argument to be made that details about the memory mappings are actually pretty uninteresting (aside from "what offset of the original text section was the sample collected"), so they could be optional?

(Also apologies if not all of my worries are addressed by the current Otel draft -- me & my team argued vehemently for the stateful approach, and the shift to non-stateful happened while I was out)

@felixge
Member

felixge commented Sep 14, 2023

@thomasdullien thanks for your comments. Here are some thoughts from my end:

  1. Support for efficient high-fidelity timestamps. This is important for a variety of latency-relevant use cases (inferring spans being one of many). Those should be a first-order citizen (e.g. placing them in "Label" seems like a terrible idea), but optional.

Using labels for timestamps is definitely sub-optimal, and it's definitely a weakness of pprof today. But after gzip compression it's not as terrible as one might expect. See Efficiency section in this PR.

Anyway, based on the discussion above, it doesn't seem out of the question to evolve pprof in this direction.

  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)

👍 This would probably be a very easy change for pprof. I'd guess putting this info on the Mapping would make sense.

  3. The BuildID in the Mapping proto needs to be changed to be a sort-of-hash of parts of the binary. There are too many examples out there that abuse the ELF BuildID in a way that kills any (even probabilistic) notions of uniqueness (or presence, see Alpine). My ideal perspective is we standardize the way to calculate it from a given ELF mapping (my suggestion is to hash the ELF header plus a few deterministically-randomly selected pages).

👍
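
For illustration, here is a rough Go sketch of the kind of deterministic, content-based identifier described in point 3. The specific choices (SHA-256, the first page as the header region, four extra pages chosen by a PRNG seeded from the file size) are assumptions for the sketch, not a standardized scheme:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"math/rand"
	"os"
)

const pageSize = 4096

// buildIDHash derives a deterministic identifier for an executable by
// hashing its header page plus a few pages chosen by a PRNG seeded from
// the file size, so every observer selects the same pages.
func buildIDHash(path string) (string, error) {
	f, err := os.Open(path)
	if err != nil {
		return "", err
	}
	defer f.Close()

	info, err := f.Stat()
	if err != nil {
		return "", err
	}
	size := info.Size()

	h := sha256.New()
	buf := make([]byte, pageSize)

	// ELF header region: first page of the file.
	n, _ := f.ReadAt(buf, 0)
	h.Write(buf[:n])

	// A few deterministically-selected extra pages.
	if pages := size / pageSize; pages > 1 {
		rng := rand.New(rand.NewSource(size))
		for i := 0; i < 4; i++ {
			off := (1 + rng.Int63n(pages-1)) * pageSize
			n, _ := f.ReadAt(buf, off)
			h.Write(buf[:n])
		}
	}
	return hex.EncodeToString(h.Sum(nil)[:16]), nil
}

func main() {
	id, err := buildIDHash("/bin/ls") // any mapped executable
	fmt.Println(id, err)
}
```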

  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).

👍 Yes. But Gzip seems to do a fairly good job "fixing" this problem as well.

Lastly, there is the "wire format vs file format" question: The main difference will be "streaming vs bulk data", and it should be kept in mind that some folks want to stream profiling data continuously from machines, perhaps aggregated for 2-3 seconds on each machine. Is the proposal to just send a pprof file blob? Wouldn't the overhead of resending all the mappings every 2-3 seconds be a bad idea? Are there other side effects from file-format-vs-wire-format that we haven't thought about?

I think this proposal implicitly assumes that clients will typically send profiling data every ~60s, but doesn't mention it anywhere. @petethepig maybe this could be added to the OTEP?

Generally speaking, the larger the profiling interval, the better the efficiency. Using a profiling period of 2-3s would be very inefficient and arguably not practical under the current proposal.

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

That's interesting. I haven't encountered such a situation before. If you think it's a common problem, could you share a bit more about this?

There's also an argument to be made that details about the memory mappings are actually pretty uninteresting (aside from "what offset of the original text section was the sample collected"), so they could be optional?

In my experience pprof tools only use the mapping for displaying profiling info on an instruction level. But that's an optional feature as it only works if the binary is available. So I think clients can certainly omit Mapping details if they want.

petethepig added a commit to petethepig/opentelemetry-proto that referenced this pull request Sep 16, 2023
petethepig added a commit to petethepig/opentelemetry-proto that referenced this pull request Sep 16, 2023
@tedsuo tedsuo added the triaged label Sep 18, 2023
uint32 dropped_attributes_count = 2;
}

// A stacktrace is a sequence of locations. Order of locations goes from callers to callees. Many stacktraces will point to the same locations. The link between stacktraces, attribute sets, links, values and timestamps is implicit and is based on the order of the elements in the corresponding tables in ProfileType message.


IIRC pprof Sample.location_id (and Java StackTraceElement[]) both order from leaf to entry point i.e. callee to caller. We should probably check for consistency here.

uint32 unit_index = 4;

// List of indices referring to Stacktraces in the Profile's stacktrace table.
repeated uint32 stacktrace_indices = 10;


One of many cases where _index values are uint32. This is potentially painful in languages like Java that may deserialize to arrays with smaller index range (because signed ints), making some data inaccessible. Probably a non-issue, as the dominant use case is wire messages that will cap at 4MB total size anyhow, so odds of it ever happening are small. OTOH if the format is used for files too and you build one by aggregating many smaller files (common in pprof) then you potentially wind up with data that can be handled in some languages but not others.


We work with stack trace datasets that may exceed a 32-bit index and require a 64-bit index, especially after aggregation (trillions vs billions). What's the story for huge datasets like this? Would it not be in this format directly?

@tigrannajaryan
Member

It still seems like there would be strong ecosystem reasons to arrange to use a single format. Then runtimes like Go wouldn't have to choose between Pprof and OpenTelemetry. They could just generate the one standard format. That's a win for everyone.

@rsc I think this is a great suggestion and can be definitely a win for everyone. We would need to find the right way to make this happen while staying true to OpenTelemetry's vendor-neutrality mission. This is likely in the purview of Otel's Technical Committee and Governance Committee. It would be great to meet and talk if you are interested.

OpenTelemetry Technical Committee discussed this topic today. Here are the decisions made:

  • Recommend to Profiling WG to perform technical analysis to understand if pprof format can be extended in a way that satisfies OpenTelemetry requirements (e.g. record trace id / span id).
  • Recommend to Profiling WG to add more details to the profiling data format proposal OTEP 237 to make clearer what other existing data formats were considered and analyzed.
  • Technical Committee and Governance Committee will consider if a pprof-based profiling format in OpenTelemetry satisfies the vendor-neutrality stance of OpenTelemetry. Attendees present in the Technical Committee meeting today so far have no objection to move forward with exploring this idea.

@thomasdullien

Quick warning: this post is written in semi-brainfried state. May need to correct myself later. I'll also follow up with a separate post that outlines some concerns I have about the general design we're at now.

Using labels for timestamps is definitely sub-optimal, and it's definitely a weakness of pprof today. But after gzip compression it's not as terrible as one might expect. See Efficiency section in this PR.

Oh, shiny, that's a super valuable PR and discussion. Thanks!!
My aversion to labels-for-timestamps is one of type safety; I think "labels for timestamps" is a hack at best. I'm also sceptical of the "let gzip sort it out" approach; more on this in a separate reply later. There's a political consideration, even: I consider timestamps to be a sufficiently important feature that if it turns out we can't get them into pprof, we probably don't want to hitch our wagon to that train :-)

Anyhow; I'll read the thread you posted in more detail, and mull this over a bit. More in a future reply.

Anyway, based on the discussion above, it doesn't seem out of the question to evolve pprof in this direction.

True, and I think that this will be key.

  2. Support for "location type"/"frame type": If you profile any mixed workloads (e.g. Python calling into native libraries, NodeJS calling into native libraries etc.) you will by necessity require the ability to distinguish between the different runtimes from which you've obtained stack frames. If you look at the kernel commits Meta has been doing, it's pretty clear that they have mixed stack traces for their PHP/HHVM workloads in their tooling (btw, do we have someone from Meta in the discussion?)

👍 This would probably be a very easy change for pprof. I'd guess putting this info on the Mapping would make sense.

I don't know - Mapping pretty clearly references memory areas, and that isn't necessarily applicable for the mixed frame mechanism. Think about bytecode interpreters (for example the Baseline compiler for a JS engine such as V8) that reside somewhere on the heap; using Mapping here to indicate that seems incorrect. It'd seem cleaner to add "type" as a field to the "Location" message?

  4. Having a way of de-duplicating identical stack unwinds would be ideal, as @morrisonlevi mentioned. Particularly when doing high-frequency sampling, it is pretty common to have many samples that end up identical (even more so if you'd ignore the leaf frame of course).

👍 Yes. But Gzip seems to do a fairly good job "fixing" this problem as well.

Fair, but this approach honestly gives me the creeps. I'll discuss this in the follow-on post.

I think this proposal implicitly assumes that clients will typically send profiling data every ~60s, but doesn't mention it anywhere. @petethepig maybe this could be added to the OTEP?

We should definitely document that assumption. My reasoning for strongly preferring a "stream many messages continuously" over "accumulate large memory buffers to compress" is partially driven by trying to push work from the production machines on which the profiler runs to the backend, and by avoiding having to keep large buffers of events...

We've had customers that were pretty continuously mapping / unmapping different ELF files, with thousands of mappings in an address space and ~4GB+ of text, and stack traces spanning many files.

That's interesting. I haven't encountered such a situation before. If you think it's a common problem, could you share a bit more about this?

So this is a FaaS vendor that allows their customers to spin up specialized FaaS (essentially tiny ELFs, I presume) on their servers on the Edge. So if I understand correctly, when the FaaS is invoked, they create the mapping, link it to other areas in the address space, run the FaaS, and (I don't know based on what criteria) unmap again. The net result was thousands of mappings, and lots and lots of executable sections.

So this isn't a common setup, but I don't think we should design a protocol that unnecessarily restricts the design space of people writing their code.

In my experience pprof tools only use the mapping for displaying profiling info on an instruction level. But that's an optional feature as it only works if the binary is available. So I think clients can certainly omit Mapping details if they want.

Well, in the current pprof design, you need the mapping information if you want to do any form of ex-post symbolization, too, as the pprof design does not place an executable identifier into the Location message. So this would force on-prod-machine symbolization, which is another anti-pattern we shouldn't force users of the protocol into.

@thomasdullien

Hey all,

ok, so the following post will be a bit longer, and I hope I don't upset people, but I think I have a number of concerns with the design that we've so far converged onto. The fact that we have converged on something that is sufficiently similar to pprof to debate whether pprof can't do the job reminded me about all the implicit assumptions we've made, and all the things that this will imply for implementations.

So I'll outline my concerns here; forgive me if I am re-hashing things that were discussed previously, and also if this isn't the ideal place to document the concerns.

This will be both lengthy and highly subjective; apologies for that.

Philosophy section

Design philosophy for profiling tools

Profiling tools (if timestamps are included) allow the synthetic generation of fine-grained spans (see @felixge's great description in UC1 here). The upshot of this is that the ability to perform pretty high-frequency sampling on demand has high value; 10000Hz is better than 100Hz in such a scenario. While you won't always need this, you don't want to limit yourself unnecessarily.

So this means that profiling tools should strive to do their jobs with the minimum number of instructions, data movement, etc. - the more lightweight the data collection is, the more profiling can be afforded given a certain budget, and the increased value does not taper off quickly. This is very different from e.g. metrics; you do not gain much benefit from collecting those at a higher frequency. If you make the collection of a stack trace 3x as expensive, you limit your max frequency to 1/3rd of what it could be, depriving users of value.

Design philosophy for a profiling wire format

In my view, a profiling wire format should be designed as to enable the profiling code on the machines-to-be-profiled to be lightweight in the sense of "working in a slim CPU and memory budget". Whenever sensible, work should be pushed to the backend.

The protocol should also be designed in a way that it can be easily implemented by both userspace runtime-specific profilers, and by eBPF-based kernelspace ones, without forcing one of the two into compromising their advantages.

Concrete section

What's good about using eBPF vs. the old perf way of copying large chunks of the stack?

The big advantage of eBPF is that it allows cheap in-kernel aggregation of measurements (including profiling data). The old perf way of doing stack collections when no frame pointers were present was just dumping a large chunk of stack to disk, and then trying to recover the stack later on (this is a bad idea for obvious reasons).

If your profiling can be done by the OS's timer tick just handing control to some code in kernel space that unwinds the stack in kernel space and aggregates data there without having to round-trip to userspace or copy a lot of data into userspace, you have a pretty lean operation. You walk the stack, you hash the relevant bits, you increment a kernel-level counter or alternatively concatenate a timestamp with the hash into a data structure that userspace can later read (ring buffer, eBPF map, whatnot).

Such a construct is always preferable to having to keep the entire stacktrace around each time it happens, and having to copy it into userspace in its entirety to be processed there. Let's say we have 40 layers of stack; the difference between copying a 16-byte identifier and copying 40 stack frames isn't trivial.
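
To make the size argument concrete, here is a rough userspace sketch of the hash-and-count idea; the 8-byte frame addresses and the 16-byte SHA-256 prefix used as the ID are assumptions, and a real eBPF profiler would keep these counters in kernel maps:

```go
package main

import (
	"crypto/sha256"
	"encoding/binary"
	"fmt"
)

type traceID [16]byte

// hashTrace collapses a stack of frame addresses into a 16-byte ID.
func hashTrace(frames []uint64) traceID {
	h := sha256.New()
	var buf [8]byte
	for _, addr := range frames {
		binary.LittleEndian.PutUint64(buf[:], addr)
		h.Write(buf[:])
	}
	var id traceID
	copy(id[:], h.Sum(nil))
	return id
}

func main() {
	counts := map[traceID]uint64{}

	stack := make([]uint64, 40) // ~40 frames * 8 bytes = 320 B per raw trace
	for i := range stack {
		stack[i] = 0x400000 + uint64(i)*0x10
	}

	// Repeated identical unwinds only ever touch a 16-byte key plus a counter.
	for i := 0; i < 1000; i++ {
		counts[hashTrace(stack)]++
	}
	fmt.Println("unique traces:", len(counts), "samples:", counts[hashTrace(stack)])
}
```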

Implications of stateless vs. stateful protocol

Early on in the discussion, we proposed a stateful protocol (e.g. a protocol where the individual machine is allowed to send just the ID of a stack trace vs. the entire stack trace, provided it has sent the entire stack trace previously). The reasons for this design in our infrastructure were:

  1. Reduction of network traffic (important particularly when streaming profiling data out of a cloud deployment, but also for high-frequency sampling of Java processes that feature very deep stacks).
  2. Making it "on average unnecessary" for the profiler to copy around a lot of data; if the profiler can determine that it has in recent memory sent the stack trace with this hash, it can immediately stop copying the entire trace around and just handle the ID.

This means that a kernel-based / eBPF-based profiler can simply increment a counter, and be done with it; or send out an identifier and a timestamp. Both are very cheap, but also imply that you don't have to copy lots of data from kernel space to user space (where you also have to manage it, and then potentially gzip it etc.).

If you move to a stateless protocol where you always have to send the entire stack trace, the profiler itself has to move the entire stack trace around, and also pay for the memory of the stack trace until it has been sent.

This is fine if you're sampling at 20Hz: say the average trace has 40 frames and representing a frame takes 32 bytes, so a trace is ~1k, which is still 20-40 times more than you'd handle if you were just handling an ID. This will have immediate repercussions for the frequency at which you'll be able to sample.

You will not only need to copy more data around, you'll also need to run some LZ over the data thereafter; you'll pay when you send the data out etc.

Implications of "file format" vs "streaming protocol"

pprof is a file format, in the sense that it is designed for a scenario where a machine collects a bunch of profiling data over a certain timeframe, and then writes the data into a pprof file, which is self-contained enough that it can be processed elsewhere.

File formats assume that the recipient is a disk, something that cannot or will not do any processing, and just stores the data for further processing in the future. Designing a file format means you expect the creator of the file to keep all the data around to write it in a coherent and self-contained manner.

A streaming protocol can assume some amount of processing and memory capability on the receiving side, and can hence move some work from the writer of the data to the recipient of the data. If your recipient is smarter, you can be more forgetful; if your recipient is smart and aggregates data from many writers, each individual writer can be less reliable.

Another aspect is that file formats tend to be not designed for a scenario where a machine tries to continuously send out the profiling events without accumulating many of them in memory.

If we look at the existing OTLP Spec, my cursory reading is that the protobufs do not penalize you much for sending data in a too-fine-grained manner (it's not free, obviously, but it seems to be less problematic than using pprof for a handful of samples).

A spec that is designed for "larger packages of data" forces the profiler to accumulate more events in memory, send them out more rarely, and perform de-duplication by LZ'ing over the data once more. This also has repercussions for the frequency at which you'll be able to sample.

Implications for kernel-based eBPF profilers

The current trajectory of the protocol design is highly disadvantageous to kernel-based eBPF profilers. Instead of being able to accumulate data in the kernel where it is cheap, it forces data to be

  1. Collected in the kernel.
  2. Unnecessarily copied into userspace.
  3. Unnecessarily being de-duplicated by running an LZ compressor (gzip or LZ4 or zstd) over it.

This may not matter if you're already in userspace, already in a separate process, and you're already copying a lot of data back and forth between different address spaces; but if you have successfully avoided wasteful copies like that, it's not great. It forces implementers into objectively worse implementations.

At almost every decision, the protocol opts for "let's push any work that the backend may normally be expected to do into the profiler onto the production machine, because we are fearful of assuming the backend is more sophisticated than a dumb disk drive", leading to more computing cycles being spent on the production machines than necessary.

In some sense, we're optimizing the entire protocol for backend simplicity, and we're doing it by pushing the cost of this onto the future user of the protocol. This feels backward to me; in my opinion, we should optimize for minimum cost for the future user of the protocol (as there will be many more users of profiling than backend implementors).

Summary

I would clearly prefer a protocol design that doesn't force profilers to copy a lot of data around unnecessarily, and that prioritizes "can be implemented efficiently on the machine-to-be-profiled" and "uses cycles sparingly as to maximize the theoretical sampling frequency that can be supported by the protocol".

I think by always optimizing for "dumb backend", we converged on pprof-with-some-modifications, as we were subjecting ourselves to the same design constraint ("recipient is a disk, not a server"). I am convinced it's the wrong design constraint.

@felixge
Member

felixge commented Sep 22, 2023

Thanks for your writeup @thomasdullien, that was a very interesting read.

One question I had while reading was whether or not it's possible for eBPF profilers to aggregate ~60s worth of profiling data (stack traces and their frames) in kernel memory? If yes, I don't understand why you'd have to copy this data to user space all the time? You could just deduplicate stack traces and frames in kernel and only send the unique data into user space every once in a while (~60s)?

Another thing I'm curious about is your perception on the cost of unwinding ~40 frames relative to copying ~1KiB between user space and kernel (assuming it's needed)? Unless you're unwinding with frame pointers, I'd expect the unwinding to be perhaps up to 90% of the combined costs. I'm asking because while I love the idea of building a very low overhead profiling protocol, I think we also need to keep the bigger overhead picture in mind during these discussions. Related: How much memory do eBPF profilers use for keeping unwinding tables in memory?

Edit: Unrelated to overhead, but also very important: Using a stateful protocol would make it very hard for the collector to support exporting profiling data in other formats which is at odds with the design goals of OpenTelemetry, see below.

[screenshot attached]

@tigrannajaryan
Member

@thomasdullien you may want to add one more factor to your analysis: the Collector. It is often an intermediary between the producer of the telemetry and the backend.

By design the Collector operates on self-contained messages. If a stateful protocol is used then the Collector receiver has to restore that state and turn the received data into self-contained pdata entries so that internally the processors can operate on them. When sending data out, the exporter has to perform the opposite and encode the self-contained messages into the stateful protocol again to send to the next leg (typically the backend, but it can be another Collector). This is possible but adds to the processing cost.

We went through this with OTLP and Otel Arrow. OTLP operates with self-contained messages. Otel Arrow was proposed later and uses gRPC streaming and is stateful. I think a similar stateful, streaming protocol is possible to add in the future for profiles.

If you want to go this route I would advise you to make a proposal for a streaming, stateful protocol for profiling data in the form of an OTEP with benchmarks showing what its benefits are and what the penalty (if any) for extra processing in the Collector is.

@thomasdullien

@thomasdullien you may want to add one more factor to your analysis: the Collector. It is often an intermediary between the producer of the telemetry and the backend.

Good point, I'll read myself into it and then revert :)


#### Message `Stacktrace`

A stacktrace is a sequence of locations. Order of locations goes from callers to callees. Many stacktraces will point to the same locations. The link between stacktraces, attribute sets, links, values and timestamps is implicit and is based on the order of the elements in the corresponding tables in ProfileType message.


I'm wondering if it's possible to link unique timestamps and attributes to a shared stacktrace? Some of the projects I work on at Microsoft have huge amounts of stacktraces, and we've found it very useful to represent a set of common stacktraces and reference them. Is this possible in the current design?

We do both stack aggregation without time (i.e. flamegraphs) and timelines (i.e. Gantt charts) with threads and other processes on the system using this approach, which is why I'm asking whether both time and attributes could point to a common stacktrace (i.e. Thread 123 in Process 23 had Stacktrace X at Time T).

@beaubelgrave beaubelgrave Oct 5, 2023

It seems you can do this with stacktrace_indices, but the text here seems to imply the ordering is implicit between the stacktraces, attributes, links, values and timestamps. Is this a typo, and should stacktrace_indices, attribute_set_indices, etc. be called out here instead?

#### Message `Mapping`

Describes the mapping from a binary to its original source code. These are stored in a lookup table in a Profile. These are referenced by index from other messages.


Operating systems, like Windows, allow mappings to come and go. Typically, we collect mappings with a "load/unload" event with a timestamp. It seems like in this format, the linking of load/unload times against sample times, to determine the right mapping for a given process, must be done ahead of time while writing the data.


#### `Sample` structure

Sample is an ephemeral structure. It is not explicitly represented as a protobuf message, instead it is represented by stacktraces, links, attribute sets, values and timestamps tables in `ProfileType` message. The connection is based on the order of the elements in the corresponding tables. For example, AttributeSet with index 1 corresponds to a Stacktrace located at index 1 in stacktraces table, and a Value located at index 1 in values table. Together they form a Sample.
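
As a reading aid, here is a small Go sketch of how a consumer might zip those parallel tables back into samples; the type and field names are simplified stand-ins, not the generated proto code:

```go
package main

import "fmt"

// Hypothetical, simplified mirror of the parallel tables in SampleList.
type SampleList struct {
	StacktraceIndices   []uint32
	AttributeSetIndices []uint32 // may be empty
	Values              []int64
	Timestamps          []uint64 // may be empty
}

type Sample struct {
	StacktraceIndex   uint32
	AttributeSetIndex uint32
	Value             int64
	Timestamp         uint64
}

// samples reconstructs the ephemeral Sample structures: element i of every
// non-empty table belongs to the same sample.
func samples(sl SampleList) []Sample {
	out := make([]Sample, len(sl.StacktraceIndices))
	for i := range out {
		out[i] = Sample{StacktraceIndex: sl.StacktraceIndices[i], Value: sl.Values[i]}
		if len(sl.AttributeSetIndices) > 0 {
			out[i].AttributeSetIndex = sl.AttributeSetIndices[i]
		}
		if len(sl.Timestamps) > 0 {
			out[i].Timestamp = sl.Timestamps[i]
		}
	}
	return out
}

func main() {
	sl := SampleList{
		StacktraceIndices:   []uint32{1, 1, 2},
		AttributeSetIndices: []uint32{0, 3, 0},
		Values:              []int64{10, 7, 3},
	}
	fmt.Println(samples(sl))
}
```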


We have cases where we profile events that do not have stacks, instead they have raw data to help support what is going on, such as a context switch, exception, critical error, etc. Is there a place in this format to put a sample with 1) raw binary data and 2) a way to link to the format of this raw binary data (IE: thread_id is at offset 8 in the data)?

@thomasdullien

Hey all,

a longer discussion between @jhalliday, @brancz, @athre0z, me, and some others had ensued on Slack, and the consensus was that it'd be useful to keep this conversation/discussion archived:

Thomas Dullien: with regards to input data whose conversions into the different formats we're testing: Do we have a standardized data set? I was thinking about generating some profiles for a 15-20 minute period on a few reasonably beefy machines running a representative elasticsearch benchmark; that'd give us data that involves both native code (kernel etc.) and java, and a little bit of python
(...)
Jonathan Halliday:
Why 15-20 minutes? The other OTel signals buffer a lot less than that on the nodes before pushing to the collector, for which message size is capped at 4MB. Likewise why a beefy machine? Is a small VM not more representative of typical cloud workloads? I'm not necessarily against either of those choices; I'm against 'standardizing' on them without rationale.
Thomas Dullien:
ah, perhaps both in that case? so 15-20 minutes is relevant because it doesn't only matter how much is buffered locally, but also to estimate the total amount of traffic reduction I think and a "beefy" machine is relevant because we'll need to make sure our protocol works for modern machines, too? like, as core counts increase, so does the amount of profiling data emitted from these machines
but yes, it'll make sense to look at data both from small and large machines and see how it trends
(I recently started profiling on an 80-core altera ARM box, and the data volumes are clearly different there)
Jonathan Halliday:
Right. Whilst there undoubtedly are 80 core workloads in the cloud, the greater likelihood is that hardware being split into e.g. 20x 4core VMs, so you get more, smaller streams into the collector instead. Likewise the total traffic flow over time depends how you buffer at source - you can compress better if you send fewer, larger messages so an hour of data sent as 1 message will use less bandwidth than 60 messages with 1 minute of data. However, when troubleshooting a perf issue in prod I don't have 60 minutes to waste waiting for telemetry data, I need it RIGHT NOW and that's not even taking into account the problem of memory pressure on node, or the collector if all the nodes are sending it huge messages at once.
I also wonder how much sampling is from all cores - running JVM workloads we tend to be sampling a thread at random on each occasion, not stopping the world to sample all simultaneously, because that's painful from a latency spike perspective. With that model the problems of splitting 80 core hardware into 20 VMs are actually worse - you're then running 20 samples per interval as each VM samples separately, instead of 1 from the 80 core box, and you can't aggregate them even at the collector, because they are potentially for different workloads and users. So in some cases perhaps the largest monolith is not the biggest optimization issue to watch out for.
I'm all in favor of benchmarking a range of scenarios, but I think some of those you're modelling feel closer to edge cases and I'd prefer a protocol optimized for the more dominant mainstream cases, so yes, please try a range of values for relevant parameters and if you happen to base it on data about e.g. the distribution of core counts your customer population actually deploys so we can empirically determine what a typical use case actually is, then so much the better.
Thomas Dullien:
so the protocol needs to work for people that compute
which in my mind includes people like fastly or facebook
so the idea that you'll split an 80 core machine into 20 4-core vms isn't a reality, vertical scaling exists and is used for good reasons
computing will also further move to more cores, not fewer
and if you design a protocol that cannot work properly on a reasonably normal on-prem server, I don't think that's good design work
sampling from all cores is the norm for bpf-based profilers, you're not stopping the world to do so, the cores can generate their own timer ticks iirc
btw, this brings me to another worry: With the existing 4 megabyte message size limit, and a goal of 1 minute aggregation times, a 64-core or 80-core box will already now require the 1-minute aggregation to be split into smaller pieces. And as you split the aggregations smaller, the current protocol's overhead for re-sending the mappings will go up.
as for the issue of buffering on the server: My argument has been the entire time that we should buffer much less, not more, so I think we're in agreement that nobody wants to wait 60 minutes to get their data.
anyhow, what I'll do is: I'll wrap my head around the data we have so far in the benchmarks
then try to obtain representative data
should be able to update everybody on progress on the next meeting
perhaps also in the github case (?)
Jonathan Halliday:
Yes, vertical scaling is a thing and needs to be supported, but not necessarily optimized for. Here in CNCF land it's not the dominant thing. Cloud native workloads tend towards horizontally scaled design. Hardware is adding more cores, the workloads on it ironically using fewer per deployment as fragmentation into microservices gains more mindshare. I recall spending a lot of time making JavaEE app servers scale, especially with GC on large heaps, whereas now the effort is around making Quarkus fit in ever smaller VMs...
I'm inclined to agree with you on less buffering. Targeting something around 1-2minutes feels about right, though less may be better. I'll also note that part of the existing proposal includes allowing for the original data (e.g. JFR file) to be included in the message, so the actual space available for the other fields may be substantially less than 4MB. Realistically I think the practical answer is probably sampling less, not raising message size. There is a limit to how much utility you get from the additional samples and at some point it's outweighed by the data handling costs. I'm just not yet clear on where that point is. If enough of our users start abandoning JFR for eBPF because they want all the threads instead of a random one, I guess I'll have my answer.
Thomas Dullien:
so I think I disagree. K8s clusters tend toward bigger machines, and microservices do not imply VMs; VMs are disproportionately heavily used in the Java space because the JVM and K8s/docker played poorly together in the past.
So for anyone that will do whole-system profiling, supporting large machines is a must.
The JFR/eBPF juxtaposition is not really applicable. JFR is a JVM-world construct, without a spec outside of a java-centric reference implementation; I am not aware of anyone outside of the Java ecosystem really using it (? -- I may be wrong). People that work on Rust, Go, NodeJS, C/C++ etc. really don't.
The precise number of samples you need can be calculated pretty easily; there's good math for how many samples you need to estimate a flamechart with acceptable accuracy. Reducing the number of samples then increases the latency that users will have to tolerate until their profiling results are useful. In general, the utility curve for samples is kinda U-shaped: You want enough samples to generate flamecharts in reasonable time for your services, then there is a long trough where additional samples don't provide much value, and then utility goes up again as your sampling rate is so high that it can be used to perform latency analysis (this requires fine-grained timestamps on samples though).
All this said: Perhaps it'd have been a good idea to -- before embarking on protocol design -- flesh out the user personas we think the protocol needs to support 🙂. My view is clearly informed by the infrastructures I have seen/worked with:People running reasonable-sized K8s clusters, possibly even large K8s clusters, with heterogeneous workloads, many different runtimes, many different services, and the desire to not have to do per-service instrumentation.
Jonathan Halliday:
Hmm. Very interesting and useful discussion.
I'm not sure I agree with you on the cluster sizes, but perhaps we're just working from different data. Hardware used to build them is certainly trending larger as chip design is, but the isolation unit, which is the scope for profiling, is trending smaller. Doesn't really matter if it's VMs in the vmware/KVM sense, containers, or some lightweight isolation (firecracker, gvisor etc), or even language runtimes like the JVM, there tend to be several of them on the physical node and profiling of each is independent because the workloads they are running are. Our users' Java workloads are increasingly K8s (actually OpenShift) and for new build apps more than a handful of cores per isolation unit is unusual. There are legacy workloads shifted to the cloud with a 'just wrap it in a VM' approach and those can be bigger, but the dominant logical deployment unit size is not as big as it was a few years ago, even as hardware has grown in the meanwhile.
Whilst JFR is very much a Java thing, the design pattern of picking a thread at random to observe is decidedly not. Until eBPF came along it was pretty common for sampling profilers, because when all the kernel gives you is a SIGPROF sized hammer, then everything must be a randomly chosen nail.
The optimal number of samples you need is indeed rooted in math, but also somewhat influenced by practical economics. Plus what you're getting is not quite as straightforward to determine in all use cases, because for some cases you can e.g. stitch together samples from different instances of the same app. If you deploy three replicas, do you want to sample from all of them at the same time? With sufficient fidelity that you can reason about individual instance performance, or only just enough that you can work from the aggregate? From one randomly chosen one in some form of rotation? There is a side discussion to be had on dynamic configuration and the ability to push profiler config changes to node that probably need to wait until OpAMP matures, but stepping up the sampling on a suspect node is definitely an interesting direction to reduce costs.
Users wanting to avoid having to do per-service instrumentation is missing the point. They do want per-service data and extracting it cleanly from whole-system samples is a pain (and in multi-tenant environments often completely inappropriate). Users will already be deploying the OTel agent on a per-service basis for other telemetry signal types. That's the logical deployment model for profiling too. There is no additional problem to using it that way, beyond what they have already accepted when adopting it for other signal types, whereas switching to a whole-system model or mix and matching it with the per-service model is a pain. The aggregation point is at the collector, not the originating physical hardware, because you want to aggregate things that are logically related, not that just happen to be living on the same silicon.
Beau Belgrave:
This is indeed an interesting discussion, maybe I missed it, but are we really only interested in continuous profiling and aggregations? We (and I assume most folks in production systems) have periods of time where we want to take a single profile and mark it as "bad" or "anomalous", and due to the size of the machines running, it is not cost efficient to always be collecting this data.
I'd like to ensure something like the above is possible with the protocol, and I agree that 4MB will be very limiting for these scenarios, especially on large machines that are doing system wide profiling (That's what we do for every machine).
Joel Höner:
Assuming that a “service” would typically correspond to a container, the per-service deployment model that you (Jonathan) are proposing is quite fundamentally incompatible with any profiler built around eBPF. Any kernel instrumentation with eBPF is always system-wide (per VM) and needs root permissions on the host system. Even if that weren't the case: we consider the user experience where you just deploy the profiling agent across your fleet and then get profiling for your entire workload one of the major upsides of continuous profiling in general: it gets rid of the classic “oh, surely this service is the one that needs optimization” fallacy and gives you an unbiased overview of what is burning cycles without requiring any prior assumptions. (edited)
Thomas Dullien:
Agreed on this being an interesting and useful discussion 🙂 (I sometimes think we should perhaps have these discussions in a Github case, because Slack will not be discoverable, and when people try to understand the why of the spec, it'll be useful for them to have the context).
I think the claim that "users want per-service data" is not as clear-cut. Companies like Spotify run regular efficiency weeks where they seek out the heavy hitters in their infrastructure and then optimize, and if you look at the things GWP did at Google (and Facebook's fleet-wide profiler does): there's a tipping point where your heaviest libraries will be heavier than your heaviest application. And unless you can do system-wide profiling and have both the ability to slice per service and to aggregate in total, you won't see your heaviest libraries. The biggest impacts at Google were in optimizing common libraries. It is also exceedingly common to find "unknown unknowns" wasting your CPU time. Without joking, we have a customer that runs an enormous cloud footprint and intends to haggle with the cloud provider about clawing back money for CPU time spent by the cloud provider's in-VM agents.
And to some extent this loops back to what I said earlier about "perhaps we should've defined our user personas". We didn't, and that leads to us having extremely divergent viewpoints of what the protocol needs to provide...
Fleet-wide whole-system profiling is pretty clearly something that is here to stay (and will become more important now that BOLT is upstream in LLVM etc.). Parca, Pyroscope, and us are all seeing a good bit of adoption, and if you factor in GWP and Facebook, continuous whole-system profiling is pretty important. So designing / standardizing a protocol should not prescribe a particular implementation; my worry is that we're standardizing on something because it happens to be convenient to implement for some existing tools while locking in certain design decisions for the profiler that simply don't serve a large fraction of the profiling use-case. Extracting per-service profiles from whole-system samples has also not been a pain for us (a sample is a sample) ... (edited)
Austin Parker:
this is a fascinating discussion that I would strongly suggest get recorded in a GitHub issue for posterity :)
Frederic Branczyk:
I guess unsurprisingly, I mostly agree with Thomas, since Parca is also mostly a whole-system profiler. A few more datapoints: when I was running OpenShift telemetry (all OpenShift clusters send a decent amount of data about clusters and cluster health back), there were a lot of clusters with very large nodes, so the point about small nodes being the primary way is simply factually false. I'm not allowed to share numbers, but large nodes are very common.
Second, extracting per service samples from whole system profiling is not a protocol concern, the storage/query engine should handle this. Inserting service semantics is adding storage/query engine semantics into the protocol. (edited)
Thomas Dullien:
At least Google-internally, even when I was leaving in 2018 the trend was toward big nodes on which to run Borg (read: K8s, even if it's not quite the same) workloads. The reason why people that run DCs choose big nodes is per-core power efficiency; the container orchestrators then shuffle workloads around between machines. The trend toward large nodes accelerated noticeably between 2011 (when I joined Goog) and 2018 when I left. From what I know of other companies that have a section of their infrastructure on-prem, the setup and trends seem similar (FB, Crowdstrike, Walmart, Spotify etc.)
Jonathan Halliday:
Bigger nodes yes, but smaller pods. To fit with the OTel architecture, the instrumentation runs as part of the guest workload inside the pod, not as part of the platform. The profiler is deployed via the workload's language SDK, not with the kubelet. The profiler will never see the entire physical hardware. At most it sees a VM/container (almost) the same size as the hardware, in the case that a single pod occupies the whole node. More usually there will be N heterogeneous profilers, one per pod, perhaps for different languages, and never, ever seeing one another's data. It's unacceptable to break the isolation even in single-tenant clusters, as the other guests may be for different apps with e.g. regulatory or security requirements. Profiles are always per-pod. If they are whole-system, then the profiler must be intimately connected with the hypervisor/container so it can label, at per-thread granularity, which pods the capture relates to, so that the backend can break them out in a way that makes sense for users who want to (or should only be permitted to) inspect their own workloads. That's a very different engineering and maintenance challenge to bundling the profiler with the OTel language SDK.
Frederic Branczyk:
I don’t understand why that has to influence the wire protocol, though. For what it's worth, the last part is exactly what we do in Parca Agent: we attach Kubernetes metadata and labels to profiling data. It's not easy, but it's certainly possible and actively being done.
We have highly regulated customers doing multi tenancy with labeling and/or routing of data on top of this.
On an entirely separate note, and maybe this is just an OTel requirement that we have to specify a proto, but it appears that most of the OTel project is gravitating towards defining things in Apache Arrow, while the current proposal is in proto. Why not start out with the thing that appears to be the future of OTel?
Thomas Dullien:
Is there an official statement that "all of otel is always app-specific"? Because it's news to me, and then the question is why us, or parca, or pyroscope, or anyone that supports perf record is even participating?
and fwiw, yes, we label on a per-pod granularity what pod the traces belong to
(...)
btw, jon, could you explain a bit more how whole-system profiling breaks isolation? Because it's not like any service sees any data for any other service?
(perhaps I am confused by stuff like https://github.com/open-telemetry/opentelemetry-ebpf, which clearly gathers whole-system data)
anyhow, stepping back for the moment: Is the argument "the protocol doesn't need to support 64 cores"? 🙂
Jonathan Halliday:
Profiling is per isolation unit (in k8s terms, per-pod) because how and what to observe is a per-deployment choice, independent of what platform that deployment goes to. Observation is not something the platform provides for things deployed to it. It provides isolation (as a core feature, not a bug) and that's it. I should be able to move my deployment from bare metal to VM to public cloud and back to private cloud, all without ever breaking my obs instrumentation or having to integrate it with the platform. The platform, particularly on public cloud, should not be observing my workload, in much the same way there should not be a camera in my hotel room unless I put it there myself.
Sure it's possible to make other technical architectural (or societal/cultural) choices, but those are the ones current cloud design has come up with and changing them is beyond the scope of our effort here. In time perhaps the expectation will shift to having built-in profiling be an important feature of the platform, but that's not where we are today, or even where we're trying to get to.
What we have currently is a user expectation that they deploy the OTel agent with their app and it gives them observability data, regardless of where they deploy that bundle to. A profiling standard developed under OTel, which in turn is under the CNCF, should fit the established architectural model of those organizations. That is also why fairly early on we ruled out the model where profiling data is held at point of origin until needed, and possibly even then queried in-situ, rather than being pushed eagerly to a backend. It's a perfectly viable architecture with some attractive qualities, particularly in bandwidth saving, but it's alien to the cloud model of 'things are stateless by default' and the OTel model of 'data is periodically shunted from the agent to the collector'. Making a pod stateful just for observability data is not worth the pain of the mismatch it creates. Making the collector stateful just for the profiling wire protocol is likewise not worth the pain. Breaking isolation, or platform independence of deployments, is not worth the pain. If we want to ever ship something, trying not to swim against the tide may make life easier.
Thomas Dullien:
FWIW, I think there's a lot of "citation required" in what you are writing. I'll respond in more depth after lunch 🙂.
Jonathan Halliday:
yeah, I should prefix every other sentence of it with 'It is my opinion that....', but I'd need a keyboard macro to automate it or I'd get very tired of typing 🙂 Some of it is also deliberately provoking debate, because we could use a bit more of that, and likewise provoking others to prove me wrong with hard data, because that's even better. If it's demonstrated empirically that e.g. a stateful protocol would save ${x}% bandwidth, I'm still not necessarily going to agree it's worth the pain, but I'm going to be very happy indeed to know what the value of x actually is.
Thomas Dullien:
Awesome, thanks for that clarification.
I don't think the case that "profiling is per pod" is clear-cut. I can deploy system-wide profilers like ours into my K8s cluster without problem, and I can have these K8s clusters live on bare metal, a VM, public cloud, private cloud etc. -- and the benefit is that the person running the cluster gets full observability for all workloads. There is zero breaking of obs instrumentation or special integration with the platform, and no breaking of inter-app isolation. In general, there are three "user personas" for profiling data: (1) SREs troubleshooting, (2) app developers optimizing their own app, and (3) platform/performance teams managing the overall efficiency of the fleet. For both persona (1) and persona (3), fleet-wide system-level profiling is more valuable than per-app profiling, and these are important user personas.
Google has done fleet-wide profiling since 2011, Facebook a little later, and commercial products have seen adoption since 2021, so I am not sure how you intend to provide any data about what the wider computing community expects with regards to profiling. There are strong reasons for fleet-wide, system-wide profiling as soon as your footprint exceeds a certain size, and large benefits on both the economic and the ecological level.
The established architectural model of CNCF involves a service mesh; if Cilium or any eBPF-based service mesh fits the CNCF model, so does a cluster-wide profiler.
When we discuss stateful vs. stateless, we really need to come up with a clear definition of what we even mean by that. APM traces aren't stateless in OTel if we define stateless as "the data doesn't need to be reassembled from multiple sources to be useful". Perhaps a better way to frame what we're discussing is "send full inline stack traces vs. send fingerprints of stack traces"; that more accurately reflects the actual question.
I think in the broader sense, I'd like to decouple the question of "full inline stack traces vs. fingerprints" from the broader discussion of faults in the current protocol spec. When I pointed out that the current spec won't be able to support 1-minute aggregation of data on a beefy host, and that the current protocol spec will add much more overhead once you have to fragment messages more, I didn't necessarily mean this as a strong argument for fingerprints, but rather a "oh wow, this protocol needs more eyes, more effort, better-documented and wider benchmarks, otherwise we run the risk of standardizing a dog" (or perhaps not a dog, but a protocol with pitfalls and bad assumptions/corner cases that could've been avoided). (edited)
Joel Höner:
It’s also perhaps notable at this point that profiling traces for native applications will always inherently be stateful in the sense that you’ll typically not be able to assign any meaningful symbol information until they are ingested and processed in the observability system. The alternative requires shipping all executables with debug information, which is very inefficient in terms of disk and bandwidth use.
Thomas Dullien:
I think we really need to define what we even mean with "stateful" 🙂
because we're using that word, and I am not sure whether we all agree on its meaning.
Jonathan Halliday:
Stateful to me means that to do processing on the data in the same way it does for other signals, the collector needs to hold (in memory, since it doesn't necessarily have a disk) more than one message's worth of data. Or to put it another way, it can't necessarily handle one message without knowledge of prior messages or reference to external data. With that comes a larger footprint and headaches with crash/restart and load balancing, none of which the collector currently worries about and all of which add undesirable engineering effort or deployment cost.
The bit we haven't really settled is: what processing should the collector be expected to be able to do? If it's pure pass-through, it doesn't care whether the stream tunneling through it is stateful or stateless, it just passes it along. If it's adding metadata, likewise it probably doesn't need to care much. If it's trying to do really strict security filtering, it cares a lot. Personally I think the sweet spot is 'can annotate or filter based on OTel metadata such as KeyValues, per sample', which is why I'm in favor of an API that exposes and handles the Sample as an annotated entity (which is not necessarily the same as the way the proto encodes it) but doesn't expose the trace data in that sample. As such, I don't much care whether the trace in that sample is symbolized or otherwise resolved via some previously sent hash lookup table; that encoding isn't exposed to the collector as pdata, so it can't tell and won't care if extra information was required to resolve it.
In keeping with my previous 'don't deviate from the existing architectural choices' position, it's worth noting this ignores my own advice. For other OTel signal types, the collector has near 100% visibility into the data, at least to whatever extent e.g. string pattern matching is sufficient. This gives it a lot of expressive power to manipulate the data. I'm willing to take the deviation on the basis that keeping strictly in line with the existing model would require a lot more work and potentially use a lot more resources at runtime, with no apparent use case beyond 'because that's just how it's done'. Even then, as noted, without symbolization you're limited to matching on some things that aren't really meaningfully matchable, much like trying to filter compressed strings without decompressing them.
Where this is going to break is that if you have a backend that is non-stateful, but the data flowing into the collector is stateful, then the collector needs to buffer and transform the stream before forwarding it. Likewise in the other direction, but I anticipate the dominant case is using the OTel protocol between originating node and collector, then a foreign export protocol to a backend that doesn't speak native OTel yet. For that you'll need the collector to host an exporter that can act as a pseudo-backend to reassemble the stream, which brings its internal pipeline back to being stateful overall even if its transformation stages aren't. At least that means the problem is limited to just the users wanting to do the transform, not all users of the collector.

This is the current state of the discussion. The really important thing I agree on is that we need to define precisely what computation the collector needs to perform on the data as it comes in. I also think we should replace the word "stateful" in our discussion with "self-contained messages", because it is more precise.

As next steps, for the meeting on Thursday, I will provide some data on the implications of both the gRPC message size limit and the desire to be self-contained, because I see a lot of problems arising from precisely this interaction.

@thomasdullien

thomasdullien commented Nov 2, 2023

Joint work with @christos68k and @athre0z.
Hey all,

as promised, a bit more data/revisiting of the discussion around self-contained messages vs. non-self-contained messages.

The goal of this post is to document the exact trade-offs involved in a self-contained vs. a non-self-contained protocol design.

Bottom line up front (BLUF) - details below

  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).
  • We will exceed gRPC default message sizes, possibly dramatically (32x+ ? - to be confirmed) at the extremes. While the default message size can be adjusted, this has implications for the sizing and performance of the Otel collector, and we should ask what they're comfy with.

Before we get into the meat of things, I noticed a small design issue with the current proposal:

How to represent two versions of the same Java (or other HLL) code in the current proposal?

The current proposal does not have an obvious way of distinguishing between two versions of the same HLL code. For native code, the mappings contain a build ID, and according to our discussions, this build ID should be a hash of some parts of the .text section for executables, so this will distinguish different versions of the same native code. For HLL code, there is no .text section in memory, and AFAICT the only way to distinguish two versions of the same Java method in memory is by hashing the bytecode (which is what we do). Reflecting the hash of the bytecode in the pprof-style proto design isn't trivial though -- one could use the Mapping fields, but this would mean that you end up creating a separate Mapping per Java method that is hit, and Mappings have to be re-sent on each collection interval. We should discuss how to address this design issue, and where to best add extra information that allows distinguishing two different versions of the same HLL code.
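For concreteness, here is a minimal sketch of one possible approach (my assumption, not part of the proposal): derive a build-ID-like identifier for HLL code by hashing the method bytecode, analogous to hashing parts of the .text section for native executables.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// buildIDForBytecode derives a stable identifier for a chunk of HLL bytecode,
// analogous to hashing parts of the .text section for native executables.
// Two versions of the same Java method yield different IDs because their
// bytecode differs. Where this value would live in the proto (Mapping or a
// new field) is exactly the open design question above.
func buildIDForBytecode(bytecode []byte) string {
	sum := sha256.Sum256(bytecode)
	// Truncated for compactness; the exact length is another open choice.
	return hex.EncodeToString(sum[:8])
}

func main() {
	v1 := []byte{0x2a, 0xb4, 0x00, 0x02, 0xac}       // hypothetical bytecode, version 1
	v2 := []byte{0x2a, 0xb4, 0x00, 0x02, 0x60, 0xac} // hypothetical bytecode, version 2
	fmt.Println(buildIDForBytecode(v1), buildIDForBytecode(v2))
}
```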

The test machine and workload

I am worried about us designing a protocol that can't deal with a real-world server running a whole-system profiler, so for this test setup we are running tests on an Altera ARM server with 80 cores. This generates a fair bit of data, but there are 224-core machines out there, so this isn't even near the top of the line, and we can expect a top-end server to generate 2.5x+ more data. Workload-wise, we tested running an ElasticSearch benchmark (esrally with http_logs) locally, and also a run of some multicore workloads from the Renaissance JVM benchmark. We focused on the JVM because in our experience JVM workloads generate more profiling data than most other workloads -- the stacks are deeper than in many other languages; the function names, file names, and class names are longer; and all of the work of processing this data has to happen on the production machine.

What we did for testing

We did three things:

  1. Converting our existing profiling data into the pprof format (work in progress). Once this is done, we will also be able to break out more statistics about which fields in the protobuffer contribute how much, which will help us hopefully identify areas for improvement. I hope to have this working by the meeting tonight, if not, tomorrow.

  2. Hacking a self-contained version of our existing protocol, comparing traffic both compressed and uncompressed (thank you Christos, this has been invaluable). The protocol spec is described here: https://docs.google.com/document/d/1oh5UmV5ElQXIKG6TX0jvvkM--tRUoWrtZ7LXlJUNF1s/edit

  3. Estimating the additional gains we'd obtain for a "non-self-contained protocol which separates out the leaf frame".

Exceeding the default gRPC message size limit (possibly by 32x)

It's almost certain that the current design will exceed the gRPC message size limit. Considering uncompressed self-contained messages, our preliminary testing showed ~3 megs of uncompressed data for a 5-second interval on our 80-core machine running at approximately 60% load. This implies that a 1-minute interval may reach 60 megs, and if you factor in load and higher core counts, 128 megs could well be within reach. Caveat: This is our internal self-contained protocol, not the current proposal. I am still in the process of converting our own profiling data into the proposed format to get a better estimate for the average size of a stack trace for our workload in that format, and expect to have that later today or tomorrow, but before even adding the traces, the size-per-trace is ~100 bytes, so there's about 100 bytes of overhead per trace without even the trace data.
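For context, the limit being discussed is gRPC's 4 MiB default receive size. It can be raised on both ends of the connection, at the cost of allowing correspondingly larger allocations on the receiver. A minimal sketch using plain gRPC-Go options (the endpoint name is a placeholder; this is not collector configuration):

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

const maxMsgBytes = 128 * 1024 * 1024 // 128 MiB, the upper end discussed above

func main() {
	// Server side (e.g. a receiver): raise the per-message receive limit
	// from the 4 MiB default.
	_ = grpc.NewServer(grpc.MaxRecvMsgSize(maxMsgBytes))

	// Client side (e.g. an exporting agent): raise the per-call limits to match.
	conn, err := grpc.Dial("collector.example:4317",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultCallOptions(
			grpc.MaxCallSendMsgSize(maxMsgBytes),
			grpc.MaxCallRecvMsgSize(maxMsgBytes),
		),
	)
	if err == nil {
		defer conn.Close()
	}
}
```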

Network traffic, compressed and uncompressed, empirical results

In our testing, the difference in network traffic after compression is a factor of approximately 2.3-2.4x -- for 100 megs of profiling traffic in the non-self-contained protocol, the self-contained protocol generates 230-240 megs after compression does its magic.

How much magic is the compressor doing here?

time="2023-11-02T15:08:08.546674827+01:00" level=warning msg="NON-SELF-CONTAINED  OUT: 32837007  270815603 8.25"
time="2023-11-02T15:08:08.546716588+01:00" level=warning msg="SELF-CONTAINED OUT: 77053361 1983014544 25.74"

The compressor does a 25x reduction here: we're performing compression on 1.9 gigs of data to end up with 77 megs of traffic. FWIW, this will also be reflected on the backend: you'll need to process vastly more data; the ratio on both the backend and the individual hosts is about 7.3x.

Further reductions from separating out leaf frames

The above non-self-contained design hashes the entire stack trace. Most of the variation in the stack traces is in the leaf frame, so the above can be further improved by separating the leaf frame out of the trace and hashing "everything but the leaf frame". This reduces the number of traces to be sent to ~0.67x in this workload. This means that for ~67 megs of profiling traffic, the self-contained protocol will send about 230-240 megs, meaning the likely network traffic overhead is about 3.5x post compression.

The ratio of uncompressed data to process grows to about 10x (it's 7.3x before, divided by the 0.68, yielding about 10.7x).
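A minimal sketch of the leaf-frame split described above, using hypothetical types (the actual protocol is specified in the linked document): hash everything except the leaf frame, so that traces differing only at the leaf share a prefix hash and the prefix frames need not be re-sent.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// Frame is a hypothetical stand-in for a stack frame reference.
type Frame struct {
	FileID uint64
	Addr   uint64
}

// splitTrace hashes everything but the leaf frame, so traces that differ
// only near the leaves share a prefix hash.
func splitTrace(frames []Frame) (prefixHash [32]byte, leaf Frame) {
	if len(frames) == 0 {
		return prefixHash, Frame{}
	}
	h := sha256.New()
	for _, f := range frames[:len(frames)-1] {
		fmt.Fprintf(h, "%x:%x;", f.FileID, f.Addr) // hash.Hash implements io.Writer
	}
	copy(prefixHash[:], h.Sum(nil))
	return prefixHash, frames[len(frames)-1]
}

func main() {
	trace := []Frame{{1, 0x10}, {1, 0x20}, {2, 0x30}}
	hash, leaf := splitTrace(trace)
	fmt.Printf("prefix=%x leaf=%+v\n", hash[:8], leaf)
}
```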

Summary of the data:

  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).
  • We will exceed gRPC default message sizes, possibly dramatically (32x+ ? - to be confirmed) at the extremes. While the default message size can be adjusted, this has implications for the sizing and performance of the Otel collector.

Next step

  • Estimate the average-cost-per-trace in the proposed format once the conversion is complete to derive more accurate estimates for the expected gRPC message size
  • Do more fine-grained accounting of "where in the proposed messages is all our space spent"

@thomasdullien

Ok, I think I converted data from our example workload to pprof. I took a few shortcuts (e.g. skipped inlined frames, which account for about 10% of all frames), and I end up with an average cost-per-trace/cost-per-sample of 180-200 bytes.

When sampling at 20Hz per core, we see somewhere between a 1:2 and a 1:4 reduction in traces vs. events, e.g. for 100 sampling events you get between 25 and 50 different stack traces (often differing only near the leaves).

Note: This data does not attach any labels, metadata, etc. to any samples so it may be underestimating the real size by a bit. I'd wager that 200 bytes will be near the norm for Java workloads, with things like heterogeneity of the workload and some parts of the code adding variance. I suspect it'll be rare to end up with more than 400 bytes per trace, unless someone begins to attach stuff like transaction IDs to samples.

This means our message size will be approximately as follows:

| Cores | Max samples per minute | Max msg size (200 bytes p.s.) | Max msg size (400 bytes p.s.) | Discounted by 0.6 (load) | Discounted by 0.5 (dedup) |
|------:|-----------------------:|------------------------------:|------------------------------:|-------------------------:|--------------------------:|
| 64    | 76800                  | 14.65 mbytes                  | 29.3 mbytes                   | 17.6 mbytes              | 8.7 mbytes                |
| 80    | 96000                  | 18.31 mbytes                  | 36.62 mbytes                  | 21.97 mbytes             | 10.99 mbytes              |
| 128   | 153600                 | 29.3 mbytes                   | 58.6 mbytes                   | 35.16 mbytes             | 17.57 mbytes              |
| 256   | 307200                 | 58.6 mbytes                   | 117.18 mbytes                 | 70.31 mbytes             | 35.15 mbytes              |

This means - as discussed in the meeting - it will be possible (though very unlikely) for a single message to reach 100+ megs on a large machine; it is likely that we will routinely exceed 8-10 megabytes on reasonably common machine configurations.
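The table follows directly from cores × 20 Hz × 60 s × bytes-per-sample (the sizes line up with MiB, i.e. 2^20 bytes), with the load and dedup discounts applied on top. A tiny sketch that reproduces the 400-bytes-per-sample column and the final discounted column, with the constants taken from the assumptions above:

```go
package main

import "fmt"

func main() {
	const (
		hz             = 20  // samples per core per second
		intervalSec    = 60  // reporting interval in seconds
		bytesPerSample = 400 // upper-end estimate from the pprof conversion
	)
	const (
		loadFactor  = 0.6 // machine not fully busy
		dedupFactor = 0.5 // repeated traces within the interval
	)
	for _, cores := range []int{64, 80, 128, 256} {
		samples := cores * hz * intervalSec
		raw := float64(samples*bytesPerSample) / (1 << 20) // MiB
		fmt.Printf("%3d cores: %6d samples/min, %7.2f MiB raw, %6.2f MiB discounted\n",
			cores, samples, raw, raw*loadFactor*dedupFactor)
	}
}
```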

I think a really crucial thing to understand as a next step is: what is the memory overhead of converting a gRPC proto into an in-memory representation?

My big worry is that big messages of this type can easily lead to a lot of memory pressure for collectors, and my expectation of a good protocol design is that a single collector can easily service hundreds of machines being profiled (the protocol shown by Christos above easily dealt with hundreds to a thousand-plus nodes and tens of thousands of cores on, IIRC, 2-3 collector processes -- I'd have to dig up the details though).

@tigrannajaryan
Member

  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).

@thomasdullien these numbers show the performance difference between stateless and stateful formats described in this doc, right?

@christos68k
Member

christos68k commented Nov 29, 2023

  • Network traffic overhead for self-contained messages is estimated to be about 3.5x, after compression (and neglecting any cost of compression and decompression).
  • The uncompressed data size is about 7.3x-11x, which means each host on which we are profiling needs to compress 7.3x-11x more data (and the backend needs to do the decompression, too).

@thomasdullien these numbers show the performance difference between stateless and stateful formats described in this doc, right?

Correct (we made some small changes to the benchmarked protocols, namely removing some fields we no longer use, such as SourceIDs; I updated the document to reflect these changes).

@felixge
Member

felixge commented Nov 30, 2023

@thomasdullien thanks for the analysis in your two comments above. I spent some time reviewing this today, and here are my summarized conclusions (I hope to discuss this in the SIG meeting starting now).

  • No code, no data, not reproducible: I know it's difficult for y'all to provide this because your profiler is proprietary, but it's really difficult to verify the claims without being able to reproduce them 😢.
  • 3.5x more efficient after compression: You've acknowledged that this is using your own ad-hoc stateless format, but I looked at it, and I don't understand the design choices for it. Why didn't you just flush your client state every 60s? This should lead to a data stream that can be split into 60s chunks of self-contained data and would seem to produce a much fairer comparison that's also simpler for you to implement?
  • pprof claims: I struggled to compare this to the numbers in your first comment. E.g. how big was the pprof data after compression when running the same workload from your first comment?

To be clear: I would take a 3.5x increase in efficiency on the wire very seriously when it comes to the stateful vs stateless discussions. So I wouldn't mind being convinced of it :).

@thomasdullien

The data used for the 2nd comment (the pprof stuff) are 1-minute intervals of profiling data in JSON format that I converted to the pprof format by hacking around in the existing benchmarking code. I have attached an example JSON file here (about 400k zstd-compressed for 1 minute of data). I have 1 hour worth of these files for the testing.

I'll have to dig to find the go code I hacked to convert it, give me a day or two - it's been a few weeks.

elasticsearch-server-profiles-minute-1.json.zst.gz

@felixge
Member

felixge commented Dec 1, 2023

Thanks @thomasdullien, this is great. About the code for the pprof stuff: Don't worry too much. You clarified in the SIG meeting that the pprof experiment was mostly for discussing the gRPC message size issue, rather than comparing stateful against stateless in general, and I'm fully convinced of that argument – we need to investigate this problem further.

What I'm really looking forward to is seeing the results of terminating the stateful protocol every 60s (flushing the client caches) and comparing the data size this produces against the normal operations of the stateful protocol.

jsuereth pushed a commit that referenced this pull request Feb 23, 2024
This is the second version of the Profiling Data Model OTEP. After [we've
gotten feedback from the greater OTel
community](#237) we went
back to the drawing board and came up with a new version of the data
model. The main difference between the two versions is that the new
version is more similar to the original pprof format, which makes it
easier to understand and implement. It also has better performance
characteristics. We've also incorporated a lot of the feedback we've
gotten on the first PR into this OTEP.

Some minor details about the data model are still being discussed and
will be fleshed out in future OTEPs. We intend to finalize these
details after doing experiments with early versions of working client +
collector + backend implementations and getting feedback from the
community. The goal of this OTEP is to provide a solid foundation for
these experiments.

So far we've done a number of things to validate it:
* we've written a new profiles proto described in this OTEP
* we've documented decisions made along the way in a [decision
log](https://github.com/open-telemetry/opentelemetry-proto-profile/blob/main/opentelemetry/proto/profiles/v1/decision-log.md)
* we've done benchmarking to refine the data representation (see
Benchmarking section in a [collector
PR](petethepig/opentelemetry-collector#1))

* diff between original pprof and the new proto:
[link](open-telemetry/opentelemetry-proto-profile@2cf711b...petethepig:opentelemetry-proto:pprof-experiments#diff-9cb689ea05ecfd2edffc39869eca3282a3f2f45a8e1aa21624b452fa5362d1d2)

We're seeking feedback and hoping to get this approved. 

---

For (a lot) more details, see:
* [OTel Profiling SIG Meeting
Notes](https://docs.google.com/document/d/19UqPPPlGE83N37MhS93uRlxsP1_wGxQ33Qv6CDHaEp0/edit)

---------

Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
Co-authored-by: Christos Kalkanis <christos.kalkanis@elastic.co>
Co-authored-by: Felix Geisendörfer <felix@felixge.de>
Co-authored-by: Reiley Yang <reyang@microsoft.com>