Add OpenTelemetry performance benchmark spec #748
Conversation
### Number of outstanding events in local cache

If the remote endpoint for the OpenTelemetry exporter is unavailable, the library needs to cache 1M events before dropping them.
Does this need to be a hard-coded number? I would expect this to be a configuration option with some reasonable default.
This does not seem like a performance benchmark, but rather a requirement of the SDK?
@tigrannajaryan I'll remove the hard-coded numbers, which could be hard to apply across languages and to sustain over time.
@dyladan this spec is supposed to recommend that SDKs implement some common performance benchmarks (like the basic metrics listed here) which users could easily get (by running the benchmark program locally).
### CPU time

Under extreme workload, the library should not take more than 5% of total CPU time for event tracing, or it should not take more than 1 full logical core on a multi-core system.
To guarantee this may require the library to monitor its own CPU usage and begin throttling, sampling, or otherwise limiting its own resource consumption.
Otherwise it is pretty easy to write an application that does nothing but make OpenTelemetry calls, and the library will consume close to 100% of the CPU.
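(Purely as an illustration of the sampling option mentioned above, not something the spec mandates: a minimal sketch using the OpenTelemetry Java SDK's ratio-based sampler, where the 10% ratio is an arbitrary example value.)

```java
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.samplers.Sampler;

public final class ThrottledTracingExample {
  public static void main(String[] args) {
    // Illustrative only: bound tracing overhead by recording a fraction of traces.
    // The 0.10 ratio is an arbitrary example value, not a number from this spec.
    SdkTracerProvider tracerProvider = SdkTracerProvider.builder()
        .setSampler(Sampler.traceIdRatioBased(0.10)) // keep ~10% of traces
        .build();

    // ... register tracerProvider and instrument the application as usual ...

    tracerProvider.shutdown();
  }
}
```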
### Memory consumption

Under extreme workload, the peak working set increase caused by tracing should not exceed 100MB.
Similarly to the number of items, I would expect this to be configurable if it is a limit that is honored by the library. However, it is not clear whether we want to put a limitation on the total memory usage. Unlike the number of events, the total memory usage of tracing is much more difficult to calculate accurately.
It is also not clear what the interaction between the limits on the number of events and on the memory usage is. Should the library begin dropping when either of these limits is hit? If that is the intent then it would be useful to call it out explicitly. This likely belongs to a functional specification of the libraries, not necessarily to the performance spec that this PR describes.
I think it will be better to have a separate document for functional requirements for libraries in the specification and refer to it as necessary from the performance specification document.
Agree that providing hard numbers does not seem workable as a cross-language requirement - different languages have far too different performance characteristics. How about providing guidelines on what metrics should be looked at, without targets? We could also add tips and tricks about high-performance tracing - for example (and this is my opinion, but others may agree) observability is an area that arguably deserves even gross micro-optimizations, since users don't want to pay a cost for observability; they want their money to go into serving their own users. This file seems like a good avenue for providing observability best practices.
As @tigrannajaryan mentioned, do we want to measure overhead during production runtime and somehow react to it? Or are these guidelines for performance testing that happens at development time?
### Number of events could be produced per second

For an application using the C/C++/Java library, it can produce 10K events without dropping any event.
Why the special mention of C/C++/Java?
### Number of events could be produced per second

For an application using the C/C++/Java library, it can produce 10K events without dropping any event.
This is a bit vague... Events could be dropped by the exporter based on the specific exporter's needs/design. Is this about all exporters, or only the OTLP exporter?
Also, please clarify what makes one event. Is it a span? Is it a span with 10 attributes? Or a span with 50 attributes? Or a span with 0 attributes? Depending on the implementation, perf can vary, and unless it is explicitly documented here, everyone will end up with their own choice of event.
### Number of outstanding events in local cache

If the remote endpoint for the OpenTelemetry exporter is unavailable, the library needs to cache 1M events before dropping them.
There is no mention of a local cache in any specs so far. It's up to individual exporters to deal with local storage, if any. Or is this specifically about the OTLP exporter?
## Benchmark Report

The implementation language libraries need to add the typical results of the above benchmarks to the release page.
If this can describe the type of the application along with the spans expected from it, then every language can implement the same type of application. Otherwise there will be no consistency.
Left comments. The spec is a bit vague right now.
It can be difficult to measure exactly how much CPU and memory overhead is caused by tracing, and it can be wildly dependent on how the tracing is used. As mentioned by @tigrannajaryan, an application which generates 1-2 spans per request will have a much different tracing overhead than an application which generates tens or hundreds of spans per request. In order to put hard limits on CPU and memory overhead, too much is required to be known about the system being traced.

In my opinion, it would be more useful to define requirements for a sample application, then using that sample application define throughput requirements as a percentage increase. For example, you may require a sample app which generates 10 spans for each request it receives, each of which has 10 attributes with randomly generated data. You could then require that a load test on this particular sample application be able to handle 98% of the throughput of the same application without tracing enabled. This would provide a clear path to SIGs on how to implement and interpret a benchmark.

FWIW, I think it will be difficult or impossible to come up with a set of restrictions which makes sense to apply to all language SIGs.
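(To make the proposed methodology concrete, here is a rough sketch of such a sample workload using the OpenTelemetry Java API; the class name and the 10-spans-per-request / 10-attributes-per-span figures are just the illustrative values from the comment above, not requirements.)

```java
import io.opentelemetry.api.OpenTelemetry;
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import java.util.UUID;

public final class SampleWorkload {
  private final Tracer tracer;

  SampleWorkload(OpenTelemetry openTelemetry) {
    this.tracer = openTelemetry.getTracer("benchmark-sample-app");
  }

  /** Simulates one incoming request: 10 spans, each with 10 randomly generated attributes. */
  void handleRequest() {
    for (int i = 0; i < 10; i++) {
      Span span = tracer.spanBuilder("operation-" + i).startSpan();
      for (int a = 0; a < 10; a++) {
        span.setAttribute("attr." + a, UUID.randomUUID().toString()); // random payload, arbitrary shape
      }
      span.end();
    }
  }
}
```

The throughput comparison would then run a load test against this handler once with a real processor/exporter pipeline configured and once with a no-op OpenTelemetry instance, and compare requests per second.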
At one of the Zipkin workshops (around 2016), Spoons from Lightstep gave a presentation on the methodology of measuring tracing overhead under varying conditions of the application's own behavior. Perhaps @tedsuo can ask him to share that presentation.
Also, Lightstep published a tool for benchmarking: https://lightstep.com/blog/understanding-tracer-performance-with-lightstep-benchmarks/
@ThomsonTan to make progress on this I think it is important to clarify what the intent of this PR is. Is it to define performance requirements for OpenTelemetry SDK implementations, or to define how to perform performance measurements in a consistent manner across different OpenTelemetry SDKs? If the goal is to define performance requirements I think we can discuss 2 distinct parts:
If the goal is to define the guidelines for performing SDK benchmarking then we need to define the list of scenarios about which we care the most. To me the important scenarios would be several that are within the normal operational modes (e.g. generating 1000 spans per second, sending via OTLP to a server which is fast enough not to block the SDK) plus some other scenarios which are abnormal (e.g. generating at a 10x higher rate than the SDK is capable of exporting).
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@tigrannajaryan as discussed in the previous meeting, it is hard to define a consistent performance requirement across languages and hardware. The goal here is to ask SDKs to provide a consistent benchmark program so users can get a sense of what performance they could expect from a specific SDK. So we'd like to define performance measurement in a consistent manner across different OpenTelemetry SDKs, much like the Benchmarks we have in the OpenTelemetry .NET SDK. This spec is meant to define the list of scenarios we and the users care about most; I'll incorporate the scenarios you listed into the spec.
@ThomsonTan sounds good, please ping me when this is ready for another review round.
@tigrannajaryan I updated the doc based on the feedback here, could you please help review it again?
Thanks @ThomsonTan
I left a few comments and my assumption is that we are aiming to have comparable performance measurement results across languages. If that is not a goal then my comments may not be applicable.
### Create Span

Number of spans which could be created in 1 second per logical core and over all …
I think it is sufficient to measure this per one logical core. If we do want to have multicore measurements we have to specify how many cores to use so that results are comparable.
I think it is also necessary to specify some information about the composition of the Span, e.g. how many attributes, of what type and length. This can affect performance significantly.
I think that when we go from 1 core to N cores, then we don't actually want to compare raw performance anymore. We want to compare scalability. If we have 1 core performance established, then it is our unit of work and moving to N cores we want to see the curve.
Yes, measuring N cores will show the scalability relative to 1 core. Also, there are big.LITTLE architectures, so performance over all cores is not simply N× that of a single core.
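(As one possible shape for this measurement, a single-threaded loop like the sketch below, using the OpenTelemetry Java SDK, exercises one logical core and fixes the span composition; the 10-second window and 10 string attributes are arbitrary assumptions, and a real benchmark would more likely use a harness such as JMH.)

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.sdk.trace.SdkTracerProvider;

public final class CreateSpanBenchmark {
  public static void main(String[] args) {
    // No processor/exporter attached: this isolates span creation cost on a single thread.
    SdkTracerProvider provider = SdkTracerProvider.builder().build();
    Tracer tracer = provider.get("create-span-benchmark");

    long spans = 0;
    long end = System.nanoTime() + 10_000_000_000L; // 10-second window (arbitrary)
    while (System.nanoTime() < end) {
      Span span = tracer.spanBuilder("benchmark-span").startSpan();
      for (int a = 0; a < 10; a++) {
        span.setAttribute("attr." + a, "value-" + a); // fixed composition: 10 string attributes
      }
      span.end();
      spans++;
    }
    System.out.println("spans created per second (single core): " + spans / 10);
    provider.shutdown();
  }
}
```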
### CPU Usage

With a given event throughput specified by the user, e.g. 1000 spans per second …
I suggest we don't use the word "event" here since it is an actual and different concept in OpenTelemetry (Span Events), which can be confusing.
- I wonder if 1000 spans per second is too small a number for fast languages. For example, the Collector can receive/batch/export 10k spans per second using less than 10% of one CPU, and that is in Go. I expect C++ or Rust to be even more efficient, so the CPU usage may be too small for reliable measurement. On the other hand, I am worried that if we specify too large a number, slow languages may not be able to sustain it. If this turns out to be the case we may need to measure CPU usage at more than one rate to get useful results.
- We should probably also specify the duration of the test over which to measure the CPU usage.
- We need to specify which processors and exporters are used. If the exporter actually sends data then we need reasonable guarantees that the receiving end does not influence the performance of the SDK (for example, if it does not receive quickly enough). Ideally all tests should use the same high-performance mock backend that receives and drops the data.
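(On the last point, one cheap way to keep the receiving end from ever back-pressuring the SDK is to benchmark against an exporter that accepts and immediately discards everything; note this is only a stand-in for the mock-backend idea above, since it also removes serialization and network cost. A sketch against the Java SDK's SpanExporter interface:)

```java
import io.opentelemetry.sdk.common.CompletableResultCode;
import io.opentelemetry.sdk.trace.data.SpanData;
import io.opentelemetry.sdk.trace.export.SpanExporter;
import java.util.Collection;

/** Illustrative exporter that accepts and discards all spans, so the backend can never slow the SDK down. */
public final class DroppingSpanExporter implements SpanExporter {
  @Override
  public CompletableResultCode export(Collection<SpanData> spans) {
    return CompletableResultCode.ofSuccess(); // drop the batch immediately
  }

  @Override
  public CompletableResultCode flush() {
    return CompletableResultCode.ofSuccess();
  }

  @Override
  public CompletableResultCode shutdown() {
    return CompletableResultCode.ofSuccess();
  }
}
```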
Increased the measured throughput to 10k, which makes more sense on modern hardware. Added a minimum measurement duration of 1 min. Also added SimpleProcessor and the OTLP exporter to the test pipeline so different implementations can be compared apples to apples.
Just to note, in Java there is a HUGE performance difference between pipelines with SimpleProcessor and BatchSpanProcessor. If we want to measure a production-like setup, I would argue against using SimpleProcessor.
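(For reference, the two pipelines being contrasted would be wired up roughly as below in the Java SDK, where the processor classes are named SimpleSpanProcessor and BatchSpanProcessor; the collector endpoint and default batching settings are illustrative assumptions, not values mandated here.)

```java
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.BatchSpanProcessor;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;

public final class PipelineSetup {
  public static void main(String[] args) {
    // Pipeline A: exports each span on the calling thread when it ends - simplest, but heavy per-span cost.
    SdkTracerProvider simplePipeline = SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(newExporter()))
        .build();

    // Pipeline B: buffers spans and exports them in batches from a background thread - the production-like setup.
    SdkTracerProvider batchPipeline = SdkTracerProvider.builder()
        .addSpanProcessor(BatchSpanProcessor.builder(newExporter()).build())
        .build();

    // ... drive the same workload through each pipeline and compare CPU usage ...

    simplePipeline.shutdown();
    batchPipeline.shutdown();
  }

  private static OtlpGrpcSpanExporter newExporter() {
    return OtlpGrpcSpanExporter.builder()
        .setEndpoint("http://localhost:4317") // assumed local collector endpoint
        .build();
  }
}
```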
### Memory Usage

Measure dynamic memory consumption, e.g. heap, for the above scenario as the memory cost at a given event throughput.
Similarly to CPU usage, memory usage depends on the processors and exporters used, the Span composition, and the backend (a slow backend will cause memory buildup, etc.). This needs to be in the spec, otherwise SDKs will implement these differently and we will not have comparable results.
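(As a very rough sketch of what "heap consumption at a given throughput" could mean in practice: sample the used heap while the workload runs and report the peak. The one-minute window and 100 ms sampling interval below are arbitrary assumptions, and a real benchmark would more likely read GC/allocation metrics from the runtime.)

```java
/** Very rough illustration: sample heap usage while the span-generating workload runs at a fixed rate. */
public final class HeapSampler {
  public static void main(String[] args) throws InterruptedException {
    Runtime rt = Runtime.getRuntime();
    long peakUsedBytes = 0;
    long end = System.currentTimeMillis() + 60_000; // assumed 1-minute measurement window
    while (System.currentTimeMillis() < end) {
      // ... the span-generating workload runs on other threads at the target throughput ...
      long used = rt.totalMemory() - rt.freeMemory();
      peakUsedBytes = Math.max(peakUsedBytes, used);
      Thread.sleep(100); // sample every 100 ms (arbitrary)
    }
    System.out.println("peak heap used (bytes): " + peakUsedBytes);
  }
}
```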
### OTLP Exporting Latency

Measure and report the latency of sending events to a local OTLP collector via …
We need to define latency between which moments of time we measure. Is this from the moment when the Span is started and until it is received by the Collector? There are multiple measurement points possible which will have different latencies. Also, if we involve the Collector here we need to tell what config to use for the Collector since that can affect receiving performance and thus affect the latency.
The latency part was intended to measure how much the application could be blocked by the logging/tracing SDK. It may not be a good metric, and I removed it for now. Let me know if you have any suggestions.
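(If the intent was "how long can the SDK block the application thread", one hedged way to capture it is to time the calls the application itself makes, e.g. span.end() with a synchronous processor, as in the sketch below; the endpoint, iteration count, and percentile are illustrative assumptions, not values from this spec.)

```java
import io.opentelemetry.api.trace.Span;
import io.opentelemetry.api.trace.Tracer;
import io.opentelemetry.exporter.otlp.trace.OtlpGrpcSpanExporter;
import io.opentelemetry.sdk.trace.SdkTracerProvider;
import io.opentelemetry.sdk.trace.export.SimpleSpanProcessor;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public final class EndLatencyProbe {
  public static void main(String[] args) {
    SdkTracerProvider provider = SdkTracerProvider.builder()
        .addSpanProcessor(SimpleSpanProcessor.create(
            OtlpGrpcSpanExporter.builder().setEndpoint("http://localhost:4317").build())) // assumed local collector
        .build();
    Tracer tracer = provider.get("latency-probe");

    List<Long> endNanos = new ArrayList<>();
    for (int i = 0; i < 10_000; i++) {
      Span span = tracer.spanBuilder("probe").startSpan();
      long t0 = System.nanoTime();
      span.end(); // measure how long the application thread spends inside end()
      endNanos.add(System.nanoTime() - t0);
    }
    Collections.sort(endNanos);
    System.out.println("p99 span.end() latency (ns): " + endNanos.get((int) (endNanos.size() * 0.99)));
    provider.shutdown();
  }
}
```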
I'm still not sure exactly how we're going to use this, but it seems like an ok starting point.
I believe this is an additive change so I am changing the label to "allowed-for-ga" rather than "required". (We can merge it whenever it is ready but I want to make sure it is not seen as a blocker for trace spec freeze.)
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@ThomsonTan Can you rebase this PR please? @open-telemetry/technical-committee Any objections to merging this PR?
@iNikem rebased the PR to the latest; let me know if there is anything I need to follow up on to get it merged.
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@open-telemetry/technical-committee Can we merge this PR?
This PR was marked stale due to lack of activity. It will be closed in 7 days.
@open-telemetry/technical-committee Can we merge this PR?
Merging; there do not seem to be any other objections and it has been approved for a while.
Documented the performance benchmark requirements for language libraries. The most important metrics, like throughput/CPU/memory, are included.