Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more histogram types #226

Closed
wants to merge 6 commits into from
Closed

Conversation

yzhuge
Copy link
Contributor

@yzhuge yzhuge commented Oct 26, 2020

This is a PR for open-telemetry/opentelemetry-specification#982. Main changes

  • Add linear_bounds, exponential_bounds, and log_linear_bounds types. The new types are treated as compression methods for the existing explicit_bounds. ie. Methods to compress a sequence of doubles.

Terminology note: For exponential histograms, "base" is used for exponent base, consistent with standard math terminology. "Reference" is used for the multiplier on exponential scale, consistent with common usage in log scale unit such as decibel.

The proposed protocol is closely related to custom protocol of DDSketch (https://github.com/DataDog/sketches-java/blob/master/src/main/proto/DDSketch.proto). A DDsketch using the logarithm method can be represented as reference=1, base=gamma, index_offset=contiguousBinIndexOffset. A DDSketch using the "fast" option (https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/ddsketch/mapping/BitwiseLinearlyInterpolatedMapping.java) can be represented as reference=1, base=2, num_linear_subbuckets=2^ numSignificantBinaryDigits. DDSketches using log approximation methods such as quadratic or cubic methods have to use the explicit bound encoding.

The rationale for supporting logarithm and log linear format is:

  • Logarithm format provides memory optimized option, at higher cpu cost. Log/exponential function is well understood and easy to implement.
  • Log linear provides cpu optimized option, at higher memory cost. Linear subbuckets in exponential bucket is easy to describe and implement. Multiple sources produce log linear histogram, example: hdrHistogram, CirllHisto, DDSketch "fast" option.
  • The two options cover the two ends of the cpu-memory trade off.

The rationale for not supporting quadratic and cubic log approximation is:

  • There are many quadratic and cubic approximation methods for log. None could be considered "canonical". Additional research is needed on the optimal parameters on quadratic and cubic approximation that minimizes relative error, and how to represent such parameters in a protocol.
  • There is no formal mathematical description of approximation formula used in histograms like DDSketch and DynaHist (https://github.com/dynatrace-oss/dynahist), making it difficult to understand or evaluate the methods, or to implement a message decoder correctly on another platform or language (today we have to reverse engineer the source code to get the formula).
  • Including these formats in OTel standards would require all message receivers (typically metric storage backends) to be able to decompress the bounds to explicit bounds, or natively store the compressed form. This adds complexity and cost.
  • User demand on fine tuning cpu-memory trade off is yet to be seen.

Open issues:

  • Protocol now uses "repeated fixed64 bucket_counts" for counters in buckets. Each counter always costs 8 bytes. Histograms often have counters repeating hundreds or thousands of times. A "varint" counter may be better. In most cases, a counter will need fewer than 4 bytes. A previous PR (Use fixed64 and sfixed64 for the sums and counts #214) changed counters from uint64 to fixed64. For counters occurring a few times, the space cost may not matter, for histogram counter list, we may need to reconsider.

  • There is talk about adding min and max to histogram. One argument for adding them is: Consider the case where we have fooHisto metric tracking histogram of "foo", and a fooSummary metric tracking max of "foo", we could get results like
    percentile(fooHistogram, 100%) = 1000
    max(fooSummary) = 900
    In the result above, the returned 100% percentile is higher than the max. In fact, this could occur even before we reach 100%, in percentiles like 99.9%. If we have min/max included in histogram, the percentile function can internally clamp 100% to max, and 0% to min, returning more reasonable results.

@yzhuge yzhuge requested review from a team October 26, 2020 18:57
@linux-foundation-easycla
Copy link

linux-foundation-easycla bot commented Oct 26, 2020

CLA Check
The committers are authorized under a signed CLA.

@CharlesMasson
Copy link

CharlesMasson commented Nov 6, 2020

Regarding the polynomial interpolations, we'd like to put forward arguments in favor of supporting them:

  • There are many quadratic and cubic approximation methods for log. None could be considered "canonical". Additional research is needed on the optimal parameters on quadratic and cubic approximation that minimizes relative error, and how to represent such parameters in a protocol.

I think that the polynomial interpolations (between powers of 2) that minimize the memory footprint while ensuring the relative accuracy guarantee of the quantiles can be naturally favored.

  • There is no formal mathematical description of approximation formula used in histograms like DDSketch and DynaHist (https://github.com/dynatrace-oss/dynahist), making it difficult to understand or evaluate the methods, or to implement a message decoder correctly on another platform or language (today we have to reverse engineer the source code to get the formula).

See the rationale for the coefficients of the cubic interpolation in DDSketch. The method for getting the optimal coefficients of the quadratic interpolation is similar.

  • Including these formats in OTel standards would require all message receivers (typically metric storage backends) to be able to decompress the bounds to explicit bounds, or natively store the compressed form. This adds complexity and cost.

Probably true to some extent, although we are already doing it for log-linear and the mapping logic can be reduced to a few self-contained lines of codes.

  • User demand on fine tuning cpu-memory trade off is yet to be seen.

The quadratic and especially the cubic interpolations memory footprints are comparable to the optimal logarithmic mapping (8% and 1% overhead respectively), while being like the log-linear mapping considerably faster than the logarithmic mapping (typically around 3 or 4 times faster to record input values in the sketch), so we could refute it's really a trade-off.

.gitignore Outdated Show resolved Hide resolved
Comment on lines 507 to 514
// When bounds have a pattern, a more efficient encoding method may be used.
// Only one method should be defined. For performance reason, "oneof" is not used.
// When more than one methods are defined, the receiver will use the first one found using this order:
// explicit_bounds, linear_bounds, exponential_bounds, log_linear_bounds
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider to move this closer to explicit bounds, no need to keep the ids ordered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new commit added to group all "bound" types together.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a bit of trouble w/ the use of oneof in some places and not here. It's wire compatible, I believe, to say:

oneof {
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}

if so, then given the savings implied by these special bounds, perhaps we can afford to move the three new bounds into a oneof?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ideally, we want:
oneof {
ExplicitBounds explicit_bounds = 8
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
repeated double explicit_bounds = 7
}
But " repeated double explicit_bounds = 7" is already at the top level. Though we could remove it at the cost of breaking compatibility (I heard compatibility is nice to have, but not required)

The bigger question is performance. @bogdandrutu said that oneof performance is so horrible that we should avoid it if we can.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu We have a few good questions to answer here, particularly about the oneof. I don't think we should bother moving field 7 or breaking compatibility, though it's sort of like keeping the worst of both worlds. 😞

One of the questions that I found troublesome when converting to or from Prometheus is that the "le" bounds contradict the [bounds[i-1], bounds[i]) statements here.
(I still have to catch up on the youtube video above...)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

field order is a pain. We have these options:

  1. Append new fields at the end, keeping field number order, but breaking logical grouping of fields. (my 1st version)
  2. Insert new fields to their logical groups, breaking field number order. (my 2nd version, responding to Bogdan's comment)
  3. Group fields logically and update field number in sequential order, breaking wire backward compatibility

So no option is ideal. Option 1 and 2 have no impact once compiled, it's just for human readers. So changing between 1 and 2 is easy. We need to be careful on choosing 3.

As for "le" vs. "ge" bounds, I remember we basically punted on it, saying "we could add an optional boolean flag later". The bigger issue is how will a message receiver handles it. If a receiver is natively "le", it is difficult to convert incoming "ge" buckets to "le", and vise versa. The good news is that in practice, ge vs le usually has little impact on "continuous" number set, where we are concerned on the area under a histogram curve, between two bucket bounds. As long as "area" is concerned, inclusiveness/exclusiveness at a bound is irrelevant. Even for discrete data set, le/ge will only introduce errors no more than the width of a bucket (if population count is mistakenly included to a neighboring bucket, it would introduce an error no wider than the bucket width for quantile calculation). So I think we can continue to punt the le/ge question.

I think we are close to approval on this PR. There are only minor things like field order and "oneof". @bogdandrutu I'd like to hear from you and get this PR resolved soon.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote for a oneof. Option 2 is fine with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that to keep backward compatibility, we can only have
repeated double explicit_bounds = 7;
oneof bound_type {
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}

explicit_bounds is a preexisting field. Although existing single field can be included in a new oneof without breaking backward compatibility, existing "repeated" cannot. In fact, "repeated" cannot be oneof member at all.

So with oneof, we still leave out explicit_bounds. If we don't mind breaking compatibility, we can do:
oneof bound_type {
ExplicitBounds explicit_bounds = 8;
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
repeated double explicit_bounds = 1;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with your first suggestion, the backward-compatible one:

repeated double explicit_bounds = 7;
oneof bound_type {
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}

If we're going to break the protocol in any way, I'd fix this however. If we decide to break the protocol because of labels vs attributes, maybe?

@yzhuge
Copy link
Contributor Author

yzhuge commented Nov 7, 2020

I still think that quadratic and cubic forms are not quite ready to be included in the standard at this moment. To be standard ready, I think we have a check list to do:

  • Have some math expert to verify the optimal nature of the coefficients. The comment in DDSketch (https://www.javadoc.io/static/com.datadoghq/sketches-java/0.4.1/com/datadoghq/sketch/ddsketch/mapping/CubicallyInterpolatedMapping.html) explains the choice of the coefficients. The math there is most likely correct, but it is beyond my math skill to confirm.

  • Decide if OTel protocol will use hardwired coefficients not included in transport protocol for quadratic and cubic formulas. Personally, I like the hardwired approach. If we have the optimal formula, why use anything else? However, this does reduce flexibility. By looking at DynaHist (https://github.com/dynatrace-oss/dynahist) code, I cannot tell if they use the same quadratic coefficients or not. If all known formula use the optimal coefficients, then it won't be an issue, otherwise, we need to choose between simplicity and flexibility.

  • Do performance testing on relative cpu cost of pure log, log-linear, log-quadratic, and log-cubic methods. Of particular interest is quadratic vs. cubic. If cubic isn't much more expensive than quadratic in cpu, we may consider not supporting quadratic.

  • My reading is that quadratic and cubic are also subbucketing methods within an exponential bucket. Really, the difference among linear, quadratic, and cubic is the degree of polynomial used. So I think that quadratic and cubic forms can be encoded by specifying a "number of subbucket" parameter only (assuming hardwired coefficients), like a
    "message LogQuadraticBounds { ExponentialBounds exponential_bounds = 1; uint32 num_of_quadratic_subbuckets = 2; }" message type. Things like "gamma", relativeError can be derived from number of subbuckets. This is just my guess now. It needs confirmation from DDSketch authors.

  • Need stand alone doc on valueToIndex() and indexToValue() functions (likely in pseudo code form), and description of the optimal coefficients used. It should have enough details for a reader to implement a producer/encoder or consumer/decoder in an arbitrary language or platform. After all, this is what "standard" is all about (runs everywhere). For log-linear, such functions are relatively simple. They are described in metrics.proto file comment. For quadratic and cubic, I think separate doc will be needed, if only for the purpose of easy update on the doc. I don't think DDSketch source code, or comment in code is enough. The formula is intertwined with the rest of the code. Abstracting the formula is not always easy. For example, DDSketch LogLikeIndexMapping.value(int index) method returns value at a given bucket index. But it returns the "logarithmic middle" of the bucket, not the start of the bucket. Such subtlety can easily be missed if somebody is writing a Python implementation to produce/encode, or consume/decode protobuf messages. Another example is that it took me a long time to figure out what correctingFactor() is in the code. A standard implementer shouldn't have to "decrypt" source code in one implementation to write another. All implementations should be derived from an authoritative doc.

Even with all these technical items above resolved, how the OTel community will think of the requirement for all metric receivers to decode log-quadratic and log-cubic is still open. And all producers SDKs should (but not must) produce the new formats. It's still an open business question.

As this protocol stands today, it does not preclude future extension to add quadratic and/or cubic forms. I think we should go one step at a time and do the simple form (log and log-linear) first. I suggest that the quadratic and cubic issue be taken out of this PR discussion, and leave as research and potential future items, though I am not sure on the venue to continue the exploration (an OTel issue?).

@yzhuge
Copy link
Contributor Author

yzhuge commented Nov 10, 2020

@jmacd can you review this PR?

@jmacd
Copy link
Contributor

jmacd commented Nov 12, 2020

At today's meeting Theo S. (@postwait) mentioned this video about Prometheus histogram support: https://www.youtube.com/watch?v=HG7uzON-IDM

I'd like to watch this before commenting further here, but generally I'm excited that we seem to have agreed on the shape of these new bounds options.

I'm not sure I've followed what protocol structures are needed and what additional code will be needed in all the place that may have to interpret histogram bounds, if we are to add one of these other interpolation techniques. @yzhuge or @CharlesMasson would you spell it out for us? Thanks.

@yzhuge
Copy link
Contributor Author

yzhuge commented Nov 14, 2020

@jmacd I watched the youtube video. The proposed new histogram format for Prometheus has exponential bounds. So it can be efficiently encoded in the proposed protobuf format here.
As for quadratic and cubic approximation methods, as I described in my previous comment, more research is needed before we nail down on whether/when/how to add them. In fact, I think this is good material for a paper or two.

Copy link
Member

@aabmass aabmass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the protocol support a sparse encoding of counts as well?

@yzhuge
Copy link
Contributor Author

yzhuge commented Nov 30, 2020

Should the protocol support a sparse encoding of counts as well?

Generally, telemetry such as response time spans a range continuously (no or few empty buckets from min to max). So the buckets end up being dense. So I think we should just start from the simple case of dense counter arrays.
If we want to save space, a lower hanging fruit is to use protobuf varint to encode the counters. Because each counter rarely goes up to the full 8 byte integer, varint encoding should reduce encoded memory size and it is trivial to specify in protobuf (In fact, it is the default for "repeat" integers).
As discussed in a meeting, we chose to use fixed 64bit int counter for now because it is consistent with other counters, and it is optimized for cpu (in languages with direct memory access, like C/C++, serialization/deserialization of fix size integer array is a trivial memcpy()).

@postwait
Copy link

My advice would to use sparsely encoded bin/count tuples.

Of course, my advice is also to fix on a universal set of bins so that users are left with an absolute mess they cannot effective use for calculations. It is unreasonable to expect even the best of data scientists to cope well with the statistical error and bias introduced in the re-binning required to translate from one set of arbitrary bins to another. Making this protocol format flexible is a great engineering approach that ignores the math problems that persist underneath everything and isn't in the best interest of end-users.

@githomin
Copy link

githomin commented Dec 2, 2020

Can we also add min, max, and sum fields? (Possibly also count and avg as short-cuts.) People care about those values and they are the most likely to be imprecise. It's an edge case, but sum can have large relative errors if calculated from both the positive and negative buckets and their sum is near zero.

@jmacd
Copy link
Contributor

jmacd commented Dec 2, 2020

Can we also add min, max, and sum fields?

+1

There are standing requests to include min and max, definitely. The existing protos include sum and count (and should be kept). I'm not inclined to include avg.

@jdmontana
Copy link

We (Google) have identified a scheme for exponentially bucketed histograms that ensures that any two histograms belonging to that scheme can be rebucketed and merged in a way that boundaries of the merged histogram will align with the old histogram and all values will be accurately assigned to the correct buckets in the new histogram. I am hoping that we will be able to include the parameters required for this alternate approach as one of the bucketing types. The parameters required are:

  1. minimum_base: The smallest exponent base that can be used for any histogram
  2. compression: A power-of-two level of compression at which buckets from the minimum_base are merged together to form larger buckets
  3. index_offset: Equivalent to traditional exponential bucketer in this proposal

Knowing minimum_base and compression, you can derive the traditional exponential base as:

base = minimum_base^2^compression

To harmonize with the proposed exponential distribution parameters which allow differentiating negative and positive buckets, compression and index_offset could be split into negative_compression/positive_compression and negative_index_offset/positive_indexx_offset.

Although this is also an exponential bucketer, the addition of the minimum_base and compression would give the monitoring backend additional information about whether/how two histograms with exponential bucketers can be merged without losing accuracy, and with deterministic loss of precision. Monitoring backends that do not support the additional alignment features associated with these parameters and do not plan to use them directly could easily just convert to traditional exponential parameters during export, so there is no risk of backend compatibility issues.

This is outside the scope of this PR, but we will be sharing an algorithm and open source implementation for dynamically (re-)bucketing histograms under the aligned histogram scheme I described above so that it is possible to record sample distributions with bounded memory, unlimited range, and guaranteed tradeoffs between range and precision. Our algorithm is somewhat similar in principle to DDSketch, but instead of using fixed precision and allowing unbounded memory and range (or using fixed precision and allowing unbounded memory but limited range, as in one of the alternative bounded-memory DDSketch implementations), our algorithm bounds memory and allows precision to fall as the range grows to cover new observations. Our motivation for using this instead of alternatives such as DDSketch is that we have metrics in shared libraries that are widely used by internal users that have modestly high cardinality, so converting these metrics to a histogram scheme like DDSketch that allows unlimited number of buckets isn't feasible, and the bounded-memory approach of discarding lower buckets to bound memory also does not work for us because these buckets may include SLI's. Our proposed approach allows us to continue using the same number of buckets as in the low-preciion fixed-bucket histograms to cover the same range in the worst case, while providing much better precision for the vast majority of cases where observed values tend to be clustered into a much smaller range. This is an unambiguous improvement with few downsides over the low-precision fixed-bucketed histograms, which allows it to be dropped in as a replacement without risking memory regressions. We expect that some other folks might be facing the same constraints that are preventing us from switching to using DDSketch, and could find this approach useful.

@yzhuge
Copy link
Contributor Author

yzhuge commented Dec 2, 2020

Given two exponential histograms with base1 and base2, and identical "reference", if
base2 = base1 ^ N
and N is an integer, we can do a N to 1 bucket merge in histogram1 to make its bounds align with histogram2. And if you restrict N to be power of 2, any 2 histograms from a minimal base family can be merged without artifacts.

Theoretically, N can be computed given base1 and 2. But because of inaccuracy in floating point math, we cannot always accurately determine if N is an integer. As a work around, a processor may snap it to an integer when it is “close enough”. This way a backend can merge histograms transported via standard exponential format.

I suggest starting from standard exponential form, and leave additional parameters to future work.

One drawback of computing base from
base = minimum_base^ (2^compression)
is additional error. Given the same minimal base and compression, systems with different software and hardware may produce slightly different base. Alternatively, the protocol may include base, minimal base, and compression. But that adds more complexity and source of inconsistency.

Exponential bound calculation using a given base already faces the math variation problem (so far we pretend it does not exist). Any inaccuracy in “base” itself will be magnified by bound calculations. So I hesitate to introduce more complexity and source of computational error.

@jdmontana
Copy link

I suggest starting from standard exponential form, and leave additional parameters to future work.

This seems reasonable to me - as you mentioned, in spite of floating point imprecision, we can determine whether traditional exponential bucketing parameters are "close enough" to aligning for now. Thanks for your feedback!

@githomin
Copy link

githomin commented Dec 4, 2020

Can we also add min, max, and sum fields?

+1

There are standing requests to include min and max, definitely. The existing protos include sum and count (and should be kept). I'm not inclined to include avg.

Ah I had missed the existing sum and count.

Can we add min and max at positions 6 and 7 and bump the rest of the numbers up by 2?

@yzhuge
Copy link
Contributor Author

yzhuge commented Dec 4, 2020

@githomin we discussed min/max in meetings. The decision was to handle it in another PR, to keep each PR focused. Min/max goes beyond histogram, some other structures may need min/max too. It needs some holistic consideration.

As for "bump the rest of the numbers up by 2", our approach is to favor backward compatibility, keeping existing fields' number unchanged. There is no ideal option here. See #226 (comment)

Comment on lines +581 to +587
// if (i >= 0) {
// exponent = i / num_of_linear_subbuckets;
// subbucketNumber = i % num_of_linear_subbuckets;
// } else {
// exponent = (i - num_of_linear_subbuckets + 1) / num_of_linear_subbuckets; // Round down toward -infinity
// subbucketNumber = i - exponent * num_of_linear_subbuckets;
// }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It takes a while to comprehend and it makes me think a comment would help. I suggest:

// Set exponent and subbucketNumber according to i. For index (i >= 0) we are setting
// a (bound >= 1), and for (i < 0) we are setting a (0 < bound < 1).
// When i == 0: exponent == 0 and subbucketNumber == 0
// When i == -1: exponent == -1 and subbucketNumber == (num_of_linear_subbuckets - 1)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On line 586: A comment like

// exponent is negative, same as (i + abs(exponent) * num_of_linear_subbuckets)

I'm not sure that helps, but it may have helped me.

Comment on lines 524 to 526
// ExponentialBounds. Bounds are on log scale.
// The bound sequence is stitched together using negative bounds, zero bound, and positive bounds.
// The formula to generate explicit bounds here are for demonstration purpose only.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a bit more explanation will help the reader into this description. This might have helped me:

// We use only integer-value exponents.
// Exponent 0 maps to value == 1.
// Exponents > maps to values > 1.
// Exponents < 0 map to values > 0 and < 1.
// No exponent maps to zero, therefore we treat it as a special bucket of its own.
// A "sequence" of integer exponents is described by two parameters, described 
// by `index_offset_for_` and `num_of_bounds_for_`-prefixed fields. A sequence
// thus describes a range of values > 0.
// This encoding consists of two sequences, one describing positive values and
// one describing negative values, plus an optional zero-valued bucket.

Comment on lines 507 to 514
// When bounds have a pattern, a more efficient encoding method may be used.
// Only one method should be defined. For performance reason, "oneof" is not used.
// When more than one methods are defined, the receiver will use the first one found using this order:
// explicit_bounds, linear_bounds, exponential_bounds, log_linear_bounds
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with your first suggestion, the backward-compatible one:

repeated double explicit_bounds = 7;
oneof bound_type {
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}

If we're going to break the protocol in any way, I'd fix this however. If we decide to break the protocol because of labels vs attributes, maybe?

@@ -503,9 +500,107 @@ message DoubleHistogramDataPoint {
// a boolean value which decides what type of intervals to use.
repeated double explicit_bounds = 7;

// When bounds have a pattern, a more efficient encoding method may be used.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There may be a technicality that we need to add above. I'm interested in the zero "bucket" of an exponential bounds. If there are no negative values in the distribution, but there are zeros, how will it be represented? Do we need to insert an unused 0 boundary into the sequence so that the corresponding explicit bounds are (-Inf, 0), [0, x) and the 0th bucket goes unused?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you need set the has_counter_for_zero flag in exponential format so that the bound sequence starts from 0.

@jmacd
Copy link
Contributor

jmacd commented Dec 4, 2020

@yzhuge ?I'm ready to approve, but I think I'd prefer the oneof approach.

@open-telemetry/specs-metrics-approvers please review!

@jmacd jmacd requested a review from justinfoote December 4, 2020 23:37
// If the message receiver natively stores exponential histograms, it will not need to generate explicit bounds.
// It will only need to record the exponential bound parameters.
message ExponentialBounds {
double reference = 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May we added that if reference is omitted, it's value may be presumed to equal 1?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment accepted.

Base automatically changed from master to main January 28, 2021 18:32
@yzhuge
Copy link
Contributor Author

yzhuge commented Jan 29, 2021

@oertl A good example is worth a thousand words.
To simplify the example a bit, we can switch units to ms, so that the data set range is [1, 1M], AND all numbers are integers. I have seen such cases multiple times in the real world. It is annoying to see empty buckets near the lower end. For 1.05 base, there are about 14 buckets from 1 to 2. But because input is integer, you will see like 13 empty buckets. Similarly, between 2 and 3, 3 and 4, there will be empty buckets, though increasingly fewer. So if we can tell the histogram the absolute resolution of the input data, or absolute resolution acceptable in its percentile output, we can reduce the density of buckets at the lower end.
This PR defines explicit, linear, exponential, and log-linear bound sequence types. Potentially, in the future, we could allow a type that is the concatenation of multiple types. The use case here could be answered with a concatenation of a linear series and an exponential series. For now, we haven’t even got simple exponential type in, so I’d defer such additions to the future.

@oertl
Copy link

oertl commented Jan 30, 2021

Another concern I have regarding ExponentialBounds: The parameter index_offset_for_positive_numbers is kind of redundant. For example, index_offset_for_positive_numbers=-3, base=2, reference=1 is essentially the same as index_offset_for_positive_numbers=0, base=2, reference=0.125. Is there a reason for this redundancy?

@yzhuge
Copy link
Contributor Author

yzhuge commented Feb 1, 2021

@oertl yes, there is a reason. Mathematically, we can set reference to the smallest bound and declare its offset being always zero (therefore omitted from the protocol). But "double" may not be able to exactly represent the small value. In contrast, "reference" is typically chosen at some "special" number such as 1 that can be exactly represented. Thus the protocol chose to use a reference and an index_offset_for_positive_numbers.
Another reason is to have a shared reference for both negative and positive numbers.

@oertl
Copy link

oertl commented Feb 1, 2021

@yzhuge I do not follow your argumentation. The implicit definition of the smallest bound as reference * power(base, index_offset_for_positive_numbers) will not be more accurate. In fact, the value of the smallest bound will even be platform dependent, since the power function will give slightly different results on different platforms. This platform dependency can hardly be avoided without specifying the numerical multiplication and the numerical evaluation of the power function in detail. This is not feasible in practice.

When defining the boundaries implicitly (as with ExponentialBounds or as I have suggested for the alternative approach) one must be aware that the boundaries will look slightly different on different platforms. This phenomenon can even lead to inconsistent data. Imagine you keep track of the maximum value which is close to a bucket boundary. Then you send the histogram data together with the maximum value to a different platform. Due to numerical variations of the bucket boundaries it is possible that the maximum value suddenly belongs to an empty bucket. Therefore, you have to be tolerant when interpreting the histogram data on a different platform. For example, DynaHist, which also keeps track of the minimum and maximum values, fixes such possible inconsistencies during deserialization.

@yzhuge
Copy link
Contributor Author

yzhuge commented Feb 1, 2021

@oertl I meant that if we use the smallest bound as reference, any inaccuracy on it will be amplified toward the bigger end. Thus defining a (typically) precise reference is preferred.

@oertl
Copy link

oertl commented Feb 2, 2021

@yzhuge I still don't get it. The only "advantage" I see in having two parameters reference and index_offset_for_positive_numbers is the possibility to select the boundary that is exactly represented by reference. All other implicitly defined boundaries will be subject to numerical inaccuracies that are even platform dependent.

By the way, the problem of numerical inaccuracies also exists for LinearBounds, where it is not possible to choose the reference. There the reference (called offset) is always the first boundary.

@yzhuge
Copy link
Contributor Author

yzhuge commented Feb 2, 2021

@oertl In exponential series, bound = reference * (base ^ index)
Any relative inaccuracy in reference will be propagated too all bounds, in addition to the inaccuracy in base, and base^index calculation. So if reference can be exactly represented by double, it removes one source of error. While a histogram producer can choose an exact number like 1 as reference, the smallest bound depends on input data and cannot always be exactly represented. Additionally, explicitly defining a reference in the protocol allows:

  • shared reference for negative and positive numbers.
  • easily and accurately determine if two histograms share same reference and base, therefore can be merged without artifact.

In linear series, bound = start + width * index
Typically, width and start are chosen with numbers that can be exactly represented by double, such as integers. So there will not be computational errors until there is not enough significant digits to represent the numbers. For non-integers, care should be taken to choose number that can exactly map to binary float point numbers, to avoid errors.

In exponential series, unless base is integer, it is hard to make all bounds exactly representable by double. So we need to be sure to at least represent the key parameters exactly.

@oertl
Copy link

oertl commented Feb 3, 2021

@yzhuge Now it becomes clearer since you mentioned that the first boundary will be data dependent. I think the specification for ExponentialBounds mixes discretization and data compaction that should be separated.

  • Discretization: Any bound_type should only be responsible for defining the discretization of the range (-inf, inf) by specifying N boundaries that split the range (-inf, inf) into N+1 buckets. These boundaries can be either explicitly defined (compare explicit_bounds) or implicitly using some formula dependent on the bound_type. The boundaries must be data independent to allow accurate merging of different records.

  • Data compaction: The default histogram data representation is an array of length N+1. However, many buckets will be empty in practice. Therefore, data compaction can be used to avoid the transmission of all N+1 frequency values. One strategy is to trim all zeros from both ends of the frequency array, and report the remaining frequencies together with an index offset and trimmed array size. However, many other compaction strategies are thinkable. For example, one could also report the list of (index/frequency)-pairs of non-empty buckets. So the trimming approach is just one of many possible compaction strategies.

I believe that discretization and data compaction are orthogonal to each other. Therefore, the histogram specification should allow choosing the discretization and the compaction strategy independently.

Given that, I have following concerns regarding ExponentialBounds as currently proposed:

  • The discretization of (-inf, inf) is not well defined. The boundaries are defined by the formula reference * (base ^ index), but there are no limits on index. As far as I understood, the data-dependent parameters index_offset_for_positive_numbers and num_of_bounds_for_positive_numbers describe the data compaction using the trimming approach but they are not relevant for the discretization itself.
  • ExponentialBounds uses a compaction strategy that should be defined outside of ExponentialBounds, because the same compaction strategy is useful for any other bound_type including explicit_bounds.

@jmacd
Copy link
Contributor

jmacd commented Feb 9, 2021

I've read the above, and I begin to fear that very few readers will make it to the bottom of this thread. I am not sure we can make effective progress by continuing the discussion in this PR.

@yzhuge Thank you for your effort here and for working with @CharlesMasson, @githomin and the DDSketch authors. Thank you @postwait and @oertl for joining this discussion; you've both brought up new considerations.

Let me summarize where we stand.

What do users want?

I think the unifying goal across all of our histogram efforts is that users should not have to configure histogram boundaries. Some will say our goal is high resolution or low error ratio, but I see these as secondary goals.

Min and Max

There is a standing proposal to introduce Min and Max fields to the histogram data point.

ExplicitBounds

The existing ExplicitBounds field will be moved inside the oneof. This has been agreed to as the first step toward supporting histogram options in a forward-looking way.

This format address both discretization and compaction, using the @oertl's terminology. These histograms allow for infinite-width buckets and prevent zero-width buckets, two features making them difficult to convert into the proposed Exponential representation of this PR.

There is also @postwait's concern: no histogram format is suitable for merging arbitrary histogram boundaries. Therefore we believe this format is only suitable in a few cases:

  1. Where users will dictate histogram buckets explicitly, in which case they are responsible for coordination of thouse bounds across their infrastructure.
  2. Where legacy formats are converted into OTLP, such as the default Prometheus boundaries [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10].

ExponentialBounds parameters

As @oertl notes, there is a concern about the use of "data-dependent parameters", again bringing up @postwait's concern:

my advice is also to fix on a universal set of bins

The idea here is that we are all worse off for choosing arbitrary parameters, no matter which histogram representation is used.

Even if we were to fix recommended parameters for use in an ExponentialBounds strategy, it appears to be most useful when used with a few hundred buckets. Exponential bounds are not a good approach for use with a small number of buckets, simply because it's difficult to find fixed, universal parameters for small histograms.

Should we separate Discretization from Compaction?

I do not believe we can satisfy all use-cases with a single histogram representation. Perhaps treating discretization and compaction separately will help, as @oertl suggests. Let's look at that.

The Prometheus default bounds, quoted above, can be thought of as a discretization of the number line with 12 buckets, 10 finite and 2 infinite. The coordinates of those buckets can be mapped exactly into the Circllhist discretization (a case of the log-linear representation proposed by @yzhuge). The finite Prometheus histogram boundaries cover 320 Circlhist buckets (using its "universal" N=10/B=90 parameters), but it's not not very sensible to represent such coarse boundaries using those paraemters.

It seems that more than one set of parameters is desireable, but arbitrary parameters are not. A few approaches have been mentioned, one by @jdmontana in this thread, and the other one UDDSketch. The overall theme of these developments can be illustrated using the Prometheus example in the paragraph above.

The Prometheus boundaries can be exactly discretized using a B=10/N=90 Circlhist scheme (covering 320 buckets), and they can also be exactly discretized using a B=10/N=18 scheme (covering 64 buckets). These two schemes can be merged into the lower-resolution form exactly, without errors and bias (I think?). It begins to look like we should specify a small number of parameter values that should be used for exponential bucketing schemes, at the very least.

Next step?

I think we should:

  1. Move forward with moving ExplicitBounds into the oneof
  2. Move forward with adding Min and Max fields to the histogram data points
  3. Create an OTEP document to clearly define the exponential histogram that addresses the concerns raised in this thread. It doesn't have to solve the problems, just acknowledge them, and that will allow the community to give a wider review to this data type. (Sorry, @yzhuge, this PR thread is effectively too long to read now!)

@oertl Do you have ideas about how to separate discretization from compaction that avoid the problems with merging arbitrary parameters discussed here? @jdmontana does Google have any news to report on this topic?

@oertl
Copy link

oertl commented Feb 10, 2021

@jmacd Regarding merging, I can imagine 3 different approaches:

  1. Allow merging only if the bound_type and the discretization are equal. For explicit bounds, this means that all boundaries must be equal. For implicit bounds this means that the bound_type and the corresponding discretization parameters must be equal. The downside is that it is not possible to switch to more coarse discretizations during aggregation. This option is too conservative to support approaches that collapse buckets during aggregation (collapsing stores by DDSketch, UDDSketch, or the approach mentioned by @jdmontana). Also, it is difficult to change the discretization of long running time series without breaking aggregates over historical data.
  2. Allow merging of a histogram with some fine discretization and a histogram with a coarser discretization into a histogram having the coarser discretization provided that the boundaries of the coarse discretization are a subset of the boundaries of the fine discretization. For explicit bounds this would mean that it is possible to merge for example a histogram with boundaries {0.5, 1, 2 ,4} with {0.5, 2}. The result would be a histogram with boundaries {0.5, 2}. I guess this approach is also sufficient to support the approach mentioned by @jdmontana where the boundaries are defined by b(i) = reference * minimum_base^(i * 2^compression). Obviously, the boundaries for compression=3 are a subset of compression=2. Also the collapsing approaches of DDSketch or UDDSketch would be supported. If the boundaries are defined by b(i) = reference * base^i with parameters i_min and i_max limiting the index i by i_min <= i <= i_max, the collapsing corresponds to increasing i_min or decreasing i_max. Clearly, the boundaries after collapsing are a subset of the boundaries before.
  3. Allow merging of histograms with completely different discretizations. Of course, this is some kind of lossy, if the boundaries are not a subset as before. However, this allows to change the discretization at any time while still being able to build aggregates over historical data. This approach requires the definition of a point reconstruction strategy. For example, if the count of bucket [5,8] is 3, one could reconstruct 3 points at the midpoint 6.5. A second approach is to distribute the 3 points evenly over the interval, and reconstruct them as 5.5, 6.5, 7.5. The reconstructed points can then be added to the other histogram as ordinary values. This kind of merging sounds expensive. However, it can be implemented quite efficiently with a time complexity scaling with the number of buckets rather than with the number of points. This approach is implemented by DynaHist to also support merging of histograms which do not have the same discretization. However, this general merging approach requires the specification of the point reconstruction strategy. Here again multiple methods are thinkable.

@yzhuge
Copy link
Contributor Author

yzhuge commented Feb 11, 2021

@jmacd I think it is a good idea to have a dedicated doc on the histogram data type, stating things like what is accepted, what is recommended, rationale, limitations, future work, etc. The doc will become part of the spec. It will be the guide for people working on histograms, be it in the front end, back end, or the transport protocol.
Questions:

  • Where would be a good place to insert such a doc?
  • How do we do discussions while drafting the doc? Using PRs on the doc itself has the risk of repeating this PR's discussion. Other ways like an "issue" thread?

@githomin
Copy link

I think we should keep in mind the goal of providing histograms with relative accuracy guarantees where users don't have to set any parameters. I think ExponentialBounds in its simplest form with decent defaults achieves this. The issues brought up about floating point accuracy are less important when we think about them in terms of relative accuracy. E.g, if a point is so close to a bucket boundary that machine precision can determine which bucket they're in, it doesn't matter for the user as relative accuracy is maintained by either bucket.

Merging won't be an issue for users that stick to the defaults, and if users choose to merge histograms of different types with different parameters, they should expect the results to be lossy.

(I'm happy to move these comments to the new doc once it has a home.)

@postwait
Copy link

@githomin I have been thinking about this a lot (about 8 years) and believe the currently proposed approach does not meet this goal. Conversion between different different binning methodologies (or boundaries) in almost all cases (all proposed) introduces error and given that all of this telemetry can forward (and thus convert more than once) the error becomes unbounded. It's a tragic failing in my opinion that should be strictly solved herein. I believe there are too many opinions here that will not give ground to accomplish this goal.

We have a solution to this, but is seems to compete with agendas here. I'm hoping I can make a stronger case soon.

@yzhuge
Copy link
Contributor Author

yzhuge commented Feb 12, 2021

If OTel standard puts restrictions on exponential histogram parameters, for example, base must be 1.01, 1.02, 1.05, 1.1, etc. I am afraid senders not meeting the restriction will simply use explicit bounds, which makes things worse in that it is still “non standard”, it cost more space, and it does not give the receiver information like “this is a base 1.03 exponential histogram”. Disallowing explicit bounds is probably out of the question.

This is not to say that we cannot recommend certain parameters. My recommendation is to use bases that support 2 for 1 merging. A series of bases are recommended, where the next is the square of the previous. This is the same approach from UDDSketch and the Google scheme from @jdmontana. I will further pin the series on base 2 log scale. The general formula of recommended base is:
base = 2 ^ (2 ^ N)
where N is an integer from -Infinity to +Infinity.

  • When N=0, base=2. This is the basic case.
  • When N < 0, each base2 bucket is subdivided into 2^N log scale sub-buckets.
  • When N > 0, every 2^N base2 buckets are merged.

The advantages of pinning the series on base2 are in mapping a “double” to a bucket.

  • Base2 exponent can be quickly extracted from a “double” number.
  • N > 0 cases can be quickly computed from base2 exponent
  • N < 0 cases become a bucket sub-division problem. Different platforms may produce slightly different subbucketing results due to float point computational error. The main base2 bucketing comes from integer math on “double” exponent, with no error. So all platforms will agree on main bucketing. Inconsistency is limited to sub-bucketing.

The main disadvantage is of course not supporting an arbitrary base. Base can only increase by square, or decrease by square root. For universally mergeable histogram, it is a reasonable trade off.

Basically, my opinion is to “recommend”, but not “require” parameters.

@jmacd
Copy link
Contributor

jmacd commented Feb 16, 2021

Not to derail this conversation any further, but there is parallel work in the Prometheus community: https://docs.google.com/document/d/1cLNv3aufPZb3fNfaJgdaRBZsInZKKIHo9E6HinJVbpM/edit#heading=h.2ywoyp13c0bx

@beorn7
Copy link

beorn7 commented Feb 16, 2021

Note, though, that the linked document is about adding better histogram specifically to Prometheus, quite explicitly using merely a PoC exposition format until there is confidence how the final solution will look like so that the final treatment thereof in the exposition format will fit that solution optimally.

This PR seems to have a very different goal, mainly to be able to map many different existing histogram representations into the proposed format.

@jmacd
Copy link
Contributor

jmacd commented Feb 24, 2021

@yzhuge I think the best way to carry this forward would be through an OTEP document in the https://github.com/open-telemetry/oteps repository.

There is a new issue to discuss how to make room for multiple bucketing schemes to co-exist, see
#259 (comment).

Thank you all for this discussion. I believe @yzhuge the best we can do is close this PR and start over, it's been very constructive. I'm keen on your remarks here: #226 (comment)

Thanks!

@oertl
Copy link

oertl commented Feb 24, 2021

@yzhuge Worth to mention, that N >= 0 will probably not be the standard case, because the relative error would be in the order of 100% or greater. For N < 0 many of the advantages (fast mapping, no floating-point errors) of your proposed base-2 bucketing do not apply.

@yzhuge
Copy link
Contributor Author

yzhuge commented Mar 4, 2021

started working on an OTEP. Will close this PR and link to OTEP once I have EP link

@yzhuge
Copy link
Contributor Author

yzhuge commented Mar 8, 2021

Work continues on open-telemetry/oteps#149

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.