Add more histogram types #226

Closed · wants to merge 6 commits · Changes from 4 commits

103 changes: 99 additions & 4 deletions opentelemetry/proto/metrics/v1/metrics.proto
@@ -188,7 +188,7 @@ message IntSum {
// as a sum of all reported measurements over a time interval.
message DoubleSum {
repeated DoubleDataPoint data_points = 1;

// aggregation_temporality describes if the aggregator reports delta changes
// since last report time, or cumulative changes since a fixed start time.
AggregationTemporality aggregation_temporality = 2;
@@ -484,9 +484,6 @@ message DoubleHistogramDataPoint {
// Otherwise all option fields and "buckets" field must be omitted in which case the
// distribution of values in the histogram is unknown and only the total count and sum are known.

// explicit_bounds is the only supported bucket option currently.
// TODO: Add more bucket options.

// explicit_bounds specifies buckets with explicitly defined bounds for values.
// The bucket boundaries are described by "bounds" field.
//
@@ -503,9 +500,107 @@ message DoubleHistogramDataPoint {
// a boolean value which decides what type of intervals to use.
repeated double explicit_bounds = 7;

// When bounds have a pattern, a more efficient encoding method may be used.
Contributor:

There may be a technicality that we need to add above. I'm interested in the zero "bucket" of an exponential bounds. If there are no negative values in the distribution, but there are zeros, how will it be represented? Do we need to insert an unused 0 boundary into the sequence so that the corresponding explicit bounds are (-Inf, 0), [0, x) and the 0th bucket goes unused?

Contributor Author:

Yes, you need to set the has_counter_for_zero flag in the exponential format so that the bound sequence starts from 0.

// Only one method should be defined. For performance reasons, "oneof" is not used.
// When more than one method is defined, the receiver will use the first one found, in this order:
// explicit_bounds, linear_bounds, exponential_bounds, log_linear_bounds
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;

Member:

Consider moving this closer to explicit_bounds; no need to keep the ids ordered.

Contributor Author:

new commit added to group all "bound" types together.

Contributor:

I'm having a bit of trouble w/ the use of oneof in some places and not here. It's wire compatible, I believe, to say:

oneof bound_type {
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}

if so, then given the savings implied by these special bounds, perhaps we can afford to move the three new bounds into a oneof?

Contributor Author:

Ideally, we want:

oneof bound_type {
  ExplicitBounds explicit_bounds = 8;
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
  repeated double explicit_bounds = 7;
}

But "repeated double explicit_bounds = 7" is already at the top level. Though we could remove it at the cost of breaking compatibility (I heard compatibility is nice to have, but not required).

The bigger question is performance. @bogdandrutu said that oneof performance is so horrible that we should avoid it if we can.

Contributor:

@bogdandrutu We have a few good questions to answer here, particularly about the oneof. I don't think we should bother moving field 7 or breaking compatibility, though it's sort of like keeping the worst of both worlds. 😞

One of the questions that I found troublesome when converting to or from Prometheus is that the "le" bounds contradict the [bounds[i-1], bounds[i]) statements here.
(I still have to catch up on the youtube video above...)

Contributor Author:

Field order is a pain. We have these options:

  1. Append new fields at the end, keeping field number order but breaking the logical grouping of fields. (my 1st version)
  2. Insert new fields into their logical groups, breaking field number order. (my 2nd version, responding to Bogdan's comment)
  3. Group fields logically and renumber fields sequentially, breaking wire backward compatibility.

So no option is ideal. Options 1 and 2 have no impact once compiled; the difference matters only to human readers, so changing between 1 and 2 is easy. We need to be careful about choosing 3.

As for "le" vs. "ge" bounds, I remember we basically punted on it, saying "we could add an optional boolean flag later". The bigger issue is how a message receiver will handle it. If a receiver is natively "le", it is difficult to convert incoming "ge" buckets to "le", and vice versa. The good news is that in practice, ge vs. le usually has little impact on a "continuous" number set, where we care about the area under the histogram curve between two bucket bounds. As far as "area" is concerned, inclusiveness/exclusiveness at a bound is irrelevant. Even for a discrete data set, le/ge introduces errors no larger than the width of a bucket (if a population count is mistakenly included in a neighboring bucket, it introduces an error no wider than that bucket for quantile calculation). So I think we can continue to punt the le/ge question.

I think we are close to approval on this PR. There are only minor things like field order and "oneof". @bogdandrutu I'd like to hear from you and get this PR resolved soon.

Contributor:

I vote for a oneof. Option 2 is fine with me.

Contributor Author:

Note that to keep backward compatibility, we can only have:

repeated double explicit_bounds = 7;
oneof bound_type {
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}

explicit_bounds is a preexisting field. Although an existing singular field can be included in a new oneof without breaking backward compatibility, an existing "repeated" field cannot. In fact, a "repeated" field cannot be a oneof member at all.

So with oneof, we still leave out explicit_bounds. If we don't mind breaking compatibility, we can do:

oneof bound_type {
  ExplicitBounds explicit_bounds = 8;
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
  repeated double explicit_bounds = 1;
}

Contributor:

I'm OK with your first suggestion, the backward-compatible one:

repeated double explicit_bounds = 7;
oneof bound_type {
  LinearBounds linear_bounds = 9;
  ExponentialBounds exponential_bounds = 10;
  LogLinearBounds log_linear_bounds = 11;
}

If we're going to break the protocol in any way, I'd fix this, however. If we decide to break the protocol because of labels vs. attributes, maybe?

// (Optional) List of exemplars collected from
// measurements that were used to form the data point
repeated DoubleExemplar exemplars = 8;

// LinearBounds. Bounds can be generated via
// for (int i = 0; i < num_of_bounds; i++)
// boundList.add(offset + width * i);
message LinearBounds {
double offset = 1;
double width = 2;
uint32 num_of_bounds = 3;
}
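The generation loop described in the comment above can be written out as a short runnable sketch (a minimal illustration; the function name `linear_bounds` is mine, the field names come from the LinearBounds message):

```python
def linear_bounds(offset: float, width: float, num_of_bounds: int) -> list[float]:
    """Expand a LinearBounds message into an explicit bound list.

    Mirrors the pseudocode above: boundList.add(offset + width * i).
    """
    return [offset + width * i for i in range(num_of_bounds)]
```

For example, `linear_bounds(0.0, 10.0, 5)` yields `[0.0, 10.0, 20.0, 30.0, 40.0]`.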

// ExponentialBounds. Bounds are on log scale.
// The bound sequence is stitched together using negative bounds, zero bound, and positive bounds.
// The formula to generate explicit bounds here are for demonstration purpose only.
Contributor:

I think a bit more explanation will help the reader into this description. This might have helped me:

// We use only integer-valued exponents.
// Exponent 0 maps to value == 1.
// Exponents > 0 map to values > 1.
// Exponents < 0 map to values > 0 and < 1.
// No exponent maps to zero, therefore we treat it as a special bucket of its own.
// A "sequence" of integer exponents is described by two parameters, carried in
// the `index_offset_for_`- and `num_of_bounds_for_`-prefixed fields. A sequence
// thus describes a range of values > 0.
// This encoding consists of two sequences, one describing positive values and
// one describing negative values, plus an optional zero-valued bucket.

// If the message receiver natively stores exponential histograms, it will not need to generate explicit bounds.
// It will only need to record the exponential bound parameters.
message ExponentialBounds {
Member:

This histogram bound type represents buckets of data points with finer granularity close to a central point, and less granularity as the bucket range gets farther from that central point. Like this:

                                    0
|________________|________|____|__|_|_|__|____|________|________________|

I find it surprising, though, that this central point is hardcoded to zero. This is likely useful for values that are clustered close to zero, and that may be positive or negative. But what about values that cluster around some very large number?
Instead of zero, I think I would have expected this inflection point to be something like the median of the data.

From the message perspective, I think the only thing required to support this would be to add an offset field. Then the bound would be computed like:

reference * power(base, exponent) + offset

I don't know what implications this has for downstream aggregators of these histograms. Would it be problematic to merge histograms whose center values were different?
Like this:

                  100
                   |
|________|____|__|_|_|__|____|________|

                             110
                              |
           |________|____|__|_|_|__|____|________|

Contributor Author (@yzhuge, Dec 19, 2020):

Applying a linear offset to an exponential series is possible.
For the range of 50 to 150, a series with offset of 100 looks like:

                  100
                   |
|________|____|__|_|_|__|____|________|
50                                   150

The same range, with offset 0, looks like:

|___|____|_____|_____|_______|________|
50                                   150

Which form would be more generally useful? I think 0 offset is. Allowing variable offset would add more complexity. If there is a use case, a user can always apply the offset before inserting data to the histogram. For example, insert(x - 100) to achieve the same effect. Of course, on reading the histogram, the user needs to add 100 back.
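The shift-before-insert workaround described here can be sketched as a thin wrapper (entirely hypothetical; `ShiftedHistogram` and the `record`/`bounds` methods are illustrative names, standing in for whatever API the underlying histogram implementation exposes):

```python
class ShiftedHistogram:
    """Hypothetical wrapper: center a zero-centered exponential histogram
    at `center` by shifting values on insert and shifting back on read."""

    def __init__(self, histogram, center: float):
        self.histogram = histogram  # any underlying histogram with record()/bounds()
        self.center = center

    def record(self, value: float) -> None:
        # The insert(x - 100) idea from the comment above, generalized.
        self.histogram.record(value - self.center)

    def bounds(self) -> list[float]:
        # On reading, add the center back to every bucket bound.
        return [b + self.center for b in self.histogram.bounds()]
```

This keeps the wire format symmetric around zero while letting a user center the buckets wherever their data clusters.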

As you suggested, arbitrary offset would make merge difficult, because bounds are not aligned.

Now onto positive and negative numbers. For simplicity, the protocol here requires the same base and reference for positive and negative series (DDSketch also does this). Thus the bound space is always symmetrical like:

                                    0
|________________|________|____|__|_|_|__|____|________|________________|

NOT like

                                                   0
|________________________|________________|________|_|__|____|________|________________|

The separate positive and negative sections define index ranges where the bounds are actually present in the infinite logical range. Example:

                                    0
|________________|________|____|__|_|_|__|____|________|________________|
                   |---------|        |---------------------------------------------|
                  -100     -40        +5                                          +320
                 negative range       positive range

For negative numbers and positive numbers of the same absolute value, the bucket granularity is identical (symmetrical with reflection point at zero). But we allow asymmetrical value ranges of negative and positive numbers.
Combining the two ranges into one is not a good idea because there are an infinite number of indexes between 0 and 1 (assuming reference of 1) for infinite precision numbers. For finite precision numbers such as 64 bit double, there is still a large number of indexes. For a base of 2, the number is around 1024. Including all these empty buckets in the series is not good.
Theoretically, we could apply a linear offset to shift all input numbers to the positive range, for bounds like

 |___|____|_____|_____|_______|________|
-50                                   150

In some cases this may be preferred (like temperature in Celsius or Fahrenheit). But for the common cases we care about, like response time or bank balance, which all have a natural zero point, symmetric buckets around 0 are preferred. Again, for special cases like temperature, users can shift before inserting into the histogram, or use a more scientific unit like Kelvin.
And there is also a practical reason for not shifting the range: the range is often unknown in streaming data, and range shifting adds complexity to histogram management.

double reference = 1;
Contributor:

May we add that if reference is omitted, its value may be presumed to equal 1?

Contributor Author:

comment accepted.

double base = 2;

// The bound sequence starts with negative values generated via:
// for (int i = num_of_bounds_for_negative_numbers - 1; i >= 0; i--)
// int exponent = i + index_offset_for_negative_numbers;
// boundList.add( -1 * reference * power(base, exponent));
//
// Example of reference=1, base=2, index_offset_for_negative_numbers=-2, num_of_bounds_for_negative_numbers=6
// bounds: -8 -4 -2 -1 -.5 -.25
// exponent: 3 2 1 0 -1 -2
// Note that the loop on i goes down, so that negative numbers with larger absolute values
// (but smaller arithmetic values) are added first.

sint32 index_offset_for_negative_numbers = 3;
Member:

We've included several fields in the ExponentialBounds message to ensure that the bucket sizing can be asymmetric (for example, so we can have more granularity for positive numbers than for negative numbers). Like this:

                                                   0
|________________________|________________|________|_|__|____|________|________________|

Is there a compelling reason to support asymmetric bounds?
It would simplify the message if we simply made the exponential bounds symmetric around their central point.

Contributor Author:

see previous reply

uint32 num_of_bounds_for_negative_numbers = 4;

// If has_counter_for_zero is true, add a bound of zero using:
// boundList.add(0);
bool has_counter_for_zero = 5;

// Positive bounds follow the zero bound. Positive bounds are generated via:
// for (int i = 0; i < num_of_bounds_for_positive_numbers; i++)
// int exponent = i + index_offset_for_positive_numbers;
// boundList.add( reference * power(base, exponent));
//
// Example of reference=1, base=2, index_offset_for_positive_numbers=-3, num_of_bounds_for_positive_numbers=8
// bound: .125 .25 .5 1 2 4 8 16
// exponent: -3 -2 -1 0 1 2 3 4

sint32 index_offset_for_positive_numbers = 6;
uint32 num_of_bounds_for_positive_numbers = 7;
}
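As a sanity check on the two generation loops and the zero flag, the whole ExponentialBounds expansion can be sketched as follows (a hypothetical helper; parameter names abbreviate the proto field names above):

```python
def exponential_bounds(reference: float, base: float,
                       neg_offset: int, neg_count: int,
                       has_zero: bool,
                       pos_offset: int, pos_count: int) -> list[float]:
    """Expand an ExponentialBounds message into an ascending explicit bound list:
    negative bounds, then the optional zero bound, then positive bounds."""
    bounds = []
    # Negative bounds: loop i downward so that negative numbers with larger
    # absolute values (smaller arithmetic values) are added first.
    for i in range(neg_count - 1, -1, -1):
        exponent = i + neg_offset
        bounds.append(-1 * reference * base ** exponent)
    if has_zero:
        bounds.append(0.0)
    # Positive bounds follow the zero bound, in ascending order.
    for i in range(pos_count):
        exponent = i + pos_offset
        bounds.append(reference * base ** exponent)
    return bounds
```

Running it on the two examples from the comments above reproduces them: the negative example (reference=1, base=2, offset=-2, count=6) gives -8, -4, -2, -1, -.5, -.25, and the positive example (offset=-3, count=8) gives .125, .25, .5, 1, 2, 4, 8, 16.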

// In LogLinearBounds, each exponential bucket is divided linearly into multiple subbuckets.
// LogLinearBounds is an extension of ExponentialBounds. Similar to ExponentialBounds,
// the bound sequence is stitched together using negative bounds, zero bound, and positive bounds.
//
// Example of LogLinearBounds with num_of_linear_subbuckets=4 on ExponentialBounds of
// reference=1, base=2, index_offset_for_positive_numbers=-4, num_of_bounds_for_positive_numbers=15,
//
// bound: .5 .625 .75 .875 1 1.25 1.5 1.75 2 2.5 3 3.5 4 5 6
// index: -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10
//
// Note: Index need not start or end on multiples of num_of_linear_subbuckets.
//
// Positive bounds are computed as:
// for (int count = 0; count < num_of_bounds_for_positive_numbers; count++)
// int i = count + index_offset_for_positive_numbers;
// int exponent;
// int subbucketNumber; // In the range of [0, num_of_linear_subbuckets - 1]
// if (i >= 0) {
// exponent = i / num_of_linear_subbuckets;
// subbucketNumber = i % num_of_linear_subbuckets;
// } else {
// exponent = (i - num_of_linear_subbuckets + 1) / num_of_linear_subbuckets; // Round down toward -infinity
// subbucketNumber = i - exponent * num_of_linear_subbuckets;
// }
Comment on lines +604 to +610

Contributor:

It takes a while to comprehend and it makes me think a comment would help. I suggest:

// Set exponent and subbucketNumber according to i. For index (i >= 0) we are setting
// a (bound >= 1), and for (i < 0) we are setting a (0 < bound < 1).
// When i == 0: exponent == 0 and subbucketNumber == 0
// When i == -1: exponent == -1 and subbucketNumber == (num_of_linear_subbuckets - 1)

Contributor:

On line 586: A comment like

// exponent is negative, same as (i + abs(exponent) * num_of_linear_subbuckets)

I'm not sure that helps, but it may have helped me.

// // Compute the exponential bucket.
// double bucketStart = reference * power(base, exponent);
// double bucketWidth = bucketStart * (base - 1);
// // Add a fraction of bucketWidth to bucketStart to get the bound.
// boundList.add( bucketStart + bucketWidth * ((double)subbucketNumber / num_of_linear_subbuckets));
//
// For negative bounds, the formula is similar, except that the index loop will go down and the bound will be converted
// to a negative number.
//
// If has_counter_for_zero is true, add a zero bound between negative and positive bounds.
// There are no subbuckets between last negative bucket and 0, or between 0 and first positive bucket.

message LogLinearBounds {
ExponentialBounds exponential_bounds = 1;
uint32 num_of_linear_subbuckets = 2;
}
}
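The positive-bound pseudocode above can be condensed into a runnable sketch (illustrative only; the function name is mine). Note that Python's `divmod` floors toward negative infinity, so it covers both the (i >= 0) branch and the round-down (i < 0) branch of the pseudocode in one step:

```python
def log_linear_positive_bounds(reference: float, base: float,
                               index_offset: int, num_of_bounds: int,
                               num_of_linear_subbuckets: int) -> list[float]:
    """Expand the positive side of LogLinearBounds into explicit bounds."""
    bounds = []
    for count in range(num_of_bounds):
        i = count + index_offset
        # Floor division/modulo replaces the two-branch exponent/subbucket
        # computation in the pseudocode above.
        exponent, subbucket = divmod(i, num_of_linear_subbuckets)
        bucket_start = reference * base ** exponent
        bucket_width = bucket_start * (base - 1)
        # Add a fraction of the exponential bucket's width to its start.
        bounds.append(bucket_start +
                      bucket_width * subbucket / num_of_linear_subbuckets)
    return bounds
```

With the example parameters from the comment above (reference=1, base=2, index_offset_for_positive_numbers=-4, num_of_bounds_for_positive_numbers=15, num_of_linear_subbuckets=4), this reproduces the listed bounds .5, .625, .75, .875, 1, 1.25, 1.5, 1.75, 2, 2.5, 3, 3.5, 4, 5, 6.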

// A representation of an exemplar, which is a sample input measurement.