Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add OpenTelemetry performance benchmark spec #748

Merged
merged 19 commits into from
Nov 10, 2020

Conversation

ThomsonTan
Copy link
Contributor

Documented the performance benchmark requirement for language libraries. The most important metrics like throughput/CPU/memory are included.

@ThomsonTan ThomsonTan requested a review from a team July 28, 2020 14:54

### Number of outstanding events in local cache

If the remote end-point is unavailable for OpenTelemetry exporter, the library needs cache 1M events before dropping them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this need to be a hard-coded number? I would expect this to be a configuration option with some reasonable default.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This does not seem like a performance benchmark, but a requirement of the sdk?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tigrannajaryan I'll remove the hard-coded numbers which could be hard to apply to languages and sustain over time.

@dyladan this spec is supposed to recommend SDKs to implement some common performance benchmark (like the basic metrics listed here) which the users could easily get (by running the benchmark program locally).


### CPU time

Under extreme workload, the library should not take more than 5% of total CPU time for event tracing, or it should not take more than 1 full logical core in multi-core system.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To guarantee this may require the library to monitor its own CPU usage and begin throttling, sampling or otherwise limits its own resource consumption.
Otherwise it is pretty easy to to write an application that does nothing but make OpenTelemetry calls and the library will consume close to 100% of CPU.


### Memory consumption

Under extreme workload, the peak working set increase caused by tracing should not exceed 100MB.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to the number of items I would expect this to be configurable if it is a limit that is honored by the library. However, it is not clear if we want to put a limitation on the total memory usage. Unlike the number of events the total memory usage by tracing is much more difficult to calculate accurately.

It is also not clear what the interaction between the limits on the number of events and on the memory usage is. Should the library begin dropping when either of these limits is hit? If that is the intent then it would be useful to call it out explicitly. This likely belongs to a functional specification of the libraries, not necessarily to the performance spec that this PR describes.

I think it will be better to have a separate document for functional requirements for libraries in the specification and refer to it as necessary from the performance specification document.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree that providing numbers does not seem like a cross-language issue - different languages have way too different performance characteristics. How about providing guidelines on what metrics should be looked at without targets? Also we can add tips / tricks about high performance tracing - for example, and this is my opinion, but if others agree, observability is an area that deserves possibly gross microoptimizations since users don't want to pay cost for observability, they want their money to go into serving users. This file seems like a good avenue for providing observability best practices.

@reyang reyang added release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:miscellaneous For issues that don't match any other spec label labels Jul 29, 2020
@iNikem
Copy link
Contributor

iNikem commented Jul 30, 2020

As @tigrannajaryan mentioned, do we want to measure overhead during production runtime and somehow react to it? Or these are guidelines for performance testing that happens development time?


### Number of events could be produced per second

For application uses with C/C++/Java library, it can produce 10K events without dropping any event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why special mention of c/cpp/java?


### Number of events could be produced per second

For application uses with C/C++/Java library, it can produce 10K events without dropping any event.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit vague.. Events could be dropped by the Exporter based on the specific exporter needs/design. Is this about all the exporters? or only for the otlp exporter?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please clarify what makes one event. Is it a span? is it a span with 10 attributes? or span with 50 attributes? or span with 0 attributes? Depending on the implementation, perf can vary, and unless explicitly documented here. everyone end up with their own choice of event.


### Number of outstanding events in local cache

If the remote end-point is unavailable for OpenTelemetry exporter, the library needs cache 1M events before dropping them.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is no mention of local cache in any specs so far. Its upto individual exporters to deal with local storage, if any. Or is this specifically about the OTLP exporter?


## Benchmark Report

The implementation language libraries need add the typical result of above benchmarks in the release page.
Copy link
Member

@cijothomas cijothomas Jul 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If this can describe the type of the application along with the spans expected from it - then every language can implement the same type of application. Otherwise there be no consistency.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

@cijothomas cijothomas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments. The specs are bit vague now.

@dyladan
Copy link
Member

dyladan commented Aug 4, 2020

It can be difficult to measure exactly how much CPU and memory overhead is caused by tracing, and it can be wildly dependent on how the tracing is used. As mentioned by @tigrannajaryan, an application which generates 1-2 spans per request will have a much different tracing overhead than an application which generates tens or hundreds of spans per request. In order to put hard limits on CPU and memory overhead, too much is required to be known about the system being traced. In my opinion, it would be more useful to define requirements for a sample application, then using that sample application define throughput requirements as a percentage increase.

For example, you may require a sample app which generates 10 spans for each request it receives, each of which has 10 attributes with randomly generated data. You could then require that a load test on this particular sample application be able to handle 98% of the throughput of the same application without tracing enabled. This would provide a clear path to SIGs on how to implement and interpret a benchmark.

FWIW, I think it will be difficult or impossible to come up with a set of restrictions which makes sense to apply to all language SIGs.

@yurishkuro
Copy link
Member

At one of Zipkin workshops (around 2016) Spoons from Lightstep gave a presentation on the methodology of measuring tracing overhead under varying conditions of the application's own behavior. Perhaps @tedsuo can ask him to share that presentation.

@yurishkuro
Copy link
Member

Also, Lightstep published a tool for benchmarking https://lightstep.com/blog/understanding-tracer-performance-with-lightstep-benchmarks/

@tigrannajaryan
Copy link
Member

@ThomsonTan to make progress on this I think it is important to clarify what the intent of this PR is.

Is it to define performance requirements for OpenTelemetry SDK implementations or to define how to perform performance measurements in a consistent manner across different OpenTelemetry SDKs?

If the goal is to define performance requirements I think we can discuss 2 distincts parts:

  1. Behavioral performance requirements, such as what happens when the application continuously generates spans as fast as it can, but the SDK is unable to send them to the configured destination. Is it OK to crash, OOM in this case or certain failsafe behavior is prescribed, such as queuing of spans up to certain number of spans, and then dropping the newly generated one's (or the old ones). There is a number of things that a well-behaved production-ready library needs to do and arguably we could make the expected behavior part of the specification.

  2. Requirement to consume resources within specified limits. In theory the limits can be part of the spec, however in practice I find it a bit problematic, because the resource consumption highly depends on the language and on the machine/environment where the benchmarking is performed. I think it is undesirable to make this part of the specification.

If the goal is to define the guidelines for performing SDK benchmarking then we need to define the list of scenarios about which we care the most. To me the important scenarios would be several that are within the normal operational modes (e.g. generating 1000 spans per second, sending via OTLP to a server which is fast enough not to block the SDK) plus some other scenarios which are abnormal (e.g. generating at 10x higher rate then the SDK is capable of exporting).

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Aug 26, 2020
@ThomsonTan
Copy link
Contributor Author

@tigrannajaryan as discussed in previous meeting, it is hard to define a consistent performance requirement across languages and hardware, the goal here is asking SDKs to provide a consistent benchmark program to the users to get a sense of what performance he/she could expect from a specific SDK. So we'd like to define performance measurement in consistent manner across different OpenTelemetry SDKs, much like the Benchmarks we have in OpenTelemetry .NET SDK.

This spec is meant to define the list of scenarios we and the users care most, I'll incorporate the scenario you listed to the spec.

@tigrannajaryan
Copy link
Member

@ThomsonTan sounds good, please ping me when this is ready for another review round.

@github-actions github-actions bot removed the Stale label Sep 2, 2020
@ThomsonTan ThomsonTan requested a review from a team September 8, 2020 08:41
@ThomsonTan
Copy link
Contributor Author

@ThomsonTan sounds good, please ping me when this is ready for another review round.

@tigrannajaryan I updated the doc based the feedback here, could you please help review again?

Copy link
Member

@tigrannajaryan tigrannajaryan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ThomsonTan

I left a few comments and my assumption is that we are aiming to have comparable performance measurement results across languages. If that is not a goal then my comments may not be applicable.


### Create Span

Number of spans which could be created in 1 second per logical core and over all
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is sufficient to measure this per one logical core. If we do want to have multicore measurements we have to specify how many cores to use so that results are comparable.

I think it is also necessary to specify some information about the composition of the Span, e.g. how many attributes, of what type and length. This can affect performance significantly.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that when we go from 1 core to N cores, then we don't actually want to compare raw performance anymore. We want to compare scalability. If we have 1 core performance established, then it is our unit of work and moving to N cores we want to see the curve.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, measure N cores will shows the scalability based on 1 core, also there are big.LITTLE architecture so performance over all cores is not a *N over a single core.


### CPU Usage

With given number of events throughput specified by user, e.g. 1000 spans per
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we don't use the word "event" here since it is an actual and different concept at OpenTelemetry (Span Events) which can be confusing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. I wonder if 1000 spans per second are too small number for fast languages. For example Collector can do receive/batch/export of 10k spans per sec using less than 10% of one CPU and this is in Go. I expect C++ or Rust to be even more efficient so the CPU usage may be too small for reliable measurement.
    On the other hand I am worried if we specify too large a number slow languages may not be able to sustain it.
    If this turns out to be the case we may need to measure CPU usage at more than one rate to get useful results.

  2. We should probably also specify the duration of the test over which to measure the CPU usage.

  3. We need to specify which processors and exporters are used. If exporter actually sends data then we need reasonable guarantees that the receiving end does not influence the performance of SDK (if for example it does not receive quicjly enough). Ideally all tests should use the same high-performance mock backend that receives and drops the data.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Increased the measure throughput to 10k which makes more sense on modern hardware. Add minimum measure during as 1 min. Also added SimpleProcessor and OTLP exporter to the test pipeline so different implementations could be compared apple to apple.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to notice, in Java there is a HUGE performance difference between pipelines with SimpleProcessor and BatchSpanProcessor. If we want to measure production-like setup, I would argue against using SimpleProcessor

### Memory Usage

Measure dynamic memory comsumption, e.g. heap, for above scenario as memory cost
at given event throughput.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to CPU usage memory usage depends on processors, exporters used, Span composition, the backend (slow backend will cause memory buildup, etc). This needs to be in the specs otherwise SDKs will implement these differently and we will not have comparable results.


### OTLP Exporting Latency

Measure and report the latency of sending events to a local OTLP collector via
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to define latency between which moments of time we measure. Is this from the moment when the Span is started and until it is received by the Collector? There are multiple measurement points possible which will have different latencies. Also, if we involve the Collector here we need to tell what config to use for the Collector since that can affect receiving performance and thus affect the latency.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The latency part was intended to measure how much the application could be blocked by the logging/trace SDK. It may not be a good metric and I removed it for now. Let me know for any suggestions.

Copy link
Contributor

@jkwatson jkwatson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not sure exactly how we're going to use this, but it seems like an ok starting point.

@tigrannajaryan
Copy link
Member

I believe this is an additive change so I am changing the label to "allowed-for-ga" rather than "required". (We can merge it whenever it is ready but I want to make sure it is not seen as a blocker for trace spec freeze).

@tigrannajaryan tigrannajaryan added release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs and removed release:required-for-ga Must be resolved before GA release, or nice to have before GA labels Oct 15, 2020
@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Oct 22, 2020
@iNikem
Copy link
Contributor

iNikem commented Oct 22, 2020

@ThomsonTan Can you rebase this PR please?

@open-telemetry/technical-committee Any objections to merging this PR?

@ThomsonTan
Copy link
Contributor Author

@ThomsonTan Can you rebase this PR please?

@open-telemetry/technical-committee Any objections to merging this PR?

@iNikem rebased the PR to the latest, let me know if there is anything I need follow up to get it merged.

@github-actions github-actions bot removed the Stale label Oct 23, 2020
@github-actions
Copy link

github-actions bot commented Nov 1, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 1, 2020
@iNikem
Copy link
Contributor

iNikem commented Nov 1, 2020

@open-telemetry/technical-committee Can we merge this PR?

@github-actions github-actions bot removed the Stale label Nov 2, 2020
@andrewhsu andrewhsu mentioned this pull request Nov 2, 2020
7 tasks
@github-actions
Copy link

github-actions bot commented Nov 9, 2020

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Nov 9, 2020
@iNikem
Copy link
Contributor

iNikem commented Nov 9, 2020

@open-telemetry/technical-committee Can we merge this PR?

@github-actions github-actions bot removed the Stale label Nov 10, 2020
@tigrannajaryan
Copy link
Member

Merging, there do not seem to be any other objections and has been approved for a while.

@tigrannajaryan tigrannajaryan merged commit a48fb7f into open-telemetry:master Nov 10, 2020
carlosalberto pushed a commit to carlosalberto/opentelemetry-specification that referenced this pull request Oct 31, 2024
Documented the performance benchmark requirement for language libraries. The most important metrics like throughput/CPU/memory are included.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:allowed-for-ga Editorial changes that can still be added before GA since they don't require action by SIGs spec:miscellaneous For issues that don't match any other spec label
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants