-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more histogram types #226
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -188,7 +188,7 @@ message IntSum { | |
// as a sum of all reported measurements over a time interval. | ||
message DoubleSum { | ||
repeated DoubleDataPoint data_points = 1; | ||
|
||
// aggregation_temporality describes if the aggregator reports delta changes | ||
// since last report time, or cumulative changes since a fixed start time. | ||
AggregationTemporality aggregation_temporality = 2; | ||
|
@@ -484,9 +484,6 @@ message DoubleHistogramDataPoint { | |
// Otherwise all option fields and "buckets" field must be omitted in which case the | ||
// distribution of values in the histogram is unknown and only the total count and sum are known. | ||
|
||
// explicit_bounds is the only supported bucket option currently. | ||
// TODO: Add more bucket options. | ||
|
||
// explicit_bounds specifies buckets with explicitly defined bounds for values. | ||
// The bucket boundaries are described by "bounds" field. | ||
// | ||
|
@@ -503,9 +500,130 @@ message DoubleHistogramDataPoint { | |
// a boolean value which decides what type of intervals to use. | ||
repeated double explicit_bounds = 7; | ||
|
||
// Multiple ways to define the bound sequence is supported. While explicit_bounds is one of the ways, | ||
// for backward compatibility reason, it is not included in "oneof bound_type". | ||
// Only one of explicit_bounds or bound_type should be present. When both are present, bound_type is ignored. | ||
oneof bound_type { | ||
LinearBounds linear_bounds = 9; | ||
ExponentialBounds exponential_bounds = 10; | ||
LogLinearBounds log_linear_bounds = 11; | ||
} | ||
// (Optional) List of exemplars collected from | ||
// measurements that were used to form the data point | ||
repeated DoubleExemplar exemplars = 8; | ||
|
||
// LinearBounds. Bounds can be generated via | ||
// for (int i = 0; i < num_of_bounds; i++) | ||
// boundList.add(offset + width * i); | ||
message LinearBounds { | ||
double offset = 1; | ||
double width = 2; | ||
uint32 num_of_bounds = 3; | ||
} | ||
|
||
// ExponentialBounds. Bounds are on log scale. | ||
// The bound sequence is stitched together using negative bounds, zero bound, and positive bounds. | ||
// The general rules is: bound = reference * power(base, exponent) | ||
// Reference and base are constants. Exponent is integer index of histogram buckets. | ||
// Reference defaults to 1. Base should be greater than 1. | ||
// Negative index is allowed. It results in a bound between 0 and the reference. | ||
// Two sections are specified, one for positive numbers, one for negative numbers. | ||
// A special field is used to specify if a bound of zero exists. | ||
// The formula to generate explicit bounds here are for demonstration purpose only. | ||
// If the message receiver natively stores exponential histograms, it will not need to generate explicit bounds. | ||
// It will only need to record the exponential bound parameters. | ||
message ExponentialBounds { | ||
double reference = 1; // If not present, 1 is assumed. | ||
double base = 2; | ||
|
||
// The bound sequence starts with negative values generated via: | ||
// for (int i = num_of_bounds_for_negative_numbers - 1; i >= 0; i--) | ||
// int exponent = i + index_offset_for_negative_numbers; | ||
// boundList.add( -1 * reference * power(base, exponent)); | ||
// | ||
// Example of reference=1, base=2, index_offset_for_negative_numbers=-2, num_of_bounds_for_negative_numbers=6 | ||
// bounds: -8 -4 -2 -1 -.5 -.25 | ||
// exponent: 3 2 1 0 -1 -2 | ||
// Note that the loop on i goes down, so that negative numbers with larger absolute values | ||
// (but smaller arithmetic values) are added first. | ||
|
||
sint32 index_offset_for_negative_numbers = 3; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've included several fields in the ExponentialBounds message to ensure that the bucket sizing can be asymmetical. (for example, so we can have more granularity for positive numbers than for negative numbers). Like this:
Is there a compelling reason to support asymmetric bounds? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see previous reply |
||
uint32 num_of_bounds_for_negative_numbers = 4; | ||
|
||
// If has_counter_for_zero is true, add a bound of zero using: | ||
// boundList.add(0); | ||
bool has_counter_for_zero = 5; | ||
|
||
// Positive bounds follow the zero bound. Positive bounds are generated via: | ||
// for (int i = 0; i < num_of_bounds_for_positive_numbers; i++) | ||
// int exponent = i + index_offset_for_positive_numbers; | ||
// boundList.add( reference * power(base, exponent)); | ||
// | ||
// Example of reference=1, base=2, index_offset_for_positive_numbers=-3, num_of_bounds_for_positive_numbers=8 | ||
// bound: .125 .25 .5 1 2 4 8 16 | ||
// exponent: -3 -2 -1 0 1 2 3 4 | ||
|
||
sint32 index_offset_for_positive_numbers = 6; | ||
uint32 num_of_bounds_for_positive_numbers = 7; | ||
} | ||
|
||
// In LogLinearBounds, each exponential bucket is divided linearly into multiple subbuckets. | ||
// LogLinearBounds is an extension of ExponentialBounds. Similar to ExponentialBounds, | ||
// the bound sequence is stitched together using negative bounds, zero bound, and positive bounds. | ||
// | ||
// Example of LogLinearBounds with num_of_linear_subbuckets=4 on ExponentialBounds of | ||
// reference=1, base=2, index_offset_for_positive_numbers=-4, num_of_bounds_for_positive_numbers=15, | ||
// | ||
// bound: .5 .625 .75 .875 1 1.25 1.5 1.75 2 2.5 3 3.5 4 5 6 | ||
// index: -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 | ||
// | ||
// Note: Index need not start or end on multiples of num_of_linear_subbuckets. | ||
// | ||
// Positive bounds are computed as: | ||
// for (int count = 0; count < num_of_bounds_for_positive_numbers; count++) | ||
// int i = count + index_offset_for_positive_numbers; | ||
// int exponent; | ||
// int subbucketNumber; // In the range of [0, num_of_linear_subbuckets - 1] | ||
// | ||
// // i is the bucket index. We need to derive "major" bucket number (ie. exponent) and "minor" bucket number | ||
// // (ie. subbucketNumber) from it. An example of num_of_linear_subbuckets=4: | ||
// // index: -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 | ||
// // exponent: -2 -2 -2 -2 -1 -1 -1 -1 0 0 0 0 1 1 1 1 2 | ||
// // subbucketNumber: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 | ||
// // At index 0, exponent=0, subbucketNumber=0. | ||
// // Exponent increases by 1 on every num_of_linear_subbuckets indexes. | ||
// // subbucketNumber cycles from 0 to num_of_linear_subbuckets - 1. | ||
// // For index >= 0, simple division and mod will do the work. | ||
// // For negative indexes, we need some special logic to make division round toward -infinity. | ||
// // In most platforms (such as Java), integer division on negative numbers round toward 0. | ||
// // Toward 0: -5 / 4 = -1.25 = -1 | ||
// // Toward -infinity: -5 / 4 = -1.25 = -2 | ||
// // Similarly, some special logic is needed on subbucketNumber to make it cycle in 0, 1, 2 ... order | ||
// // in the negative index range. | ||
// | ||
// if (i >= 0) { | ||
// exponent = i / num_of_linear_subbuckets; | ||
// subbucketNumber = i % num_of_linear_subbuckets; | ||
// } else { | ||
// exponent = (i - num_of_linear_subbuckets + 1) / num_of_linear_subbuckets; // Round down toward -infinity | ||
// subbucketNumber = i - exponent * num_of_linear_subbuckets; | ||
// } | ||
Comment on lines
+604
to
+610
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It takes a while to comprehend and it makes me think a comment would help. I suggest:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On line 586: A comment like
I'm not sure that helps, but it may have helped me. |
||
// // Compute the exponential bucket. | ||
// double bucketStart = reference * power(base, exponent); | ||
// double bucketWidth = bucketStart * (base - 1); | ||
// // Add a fraction of bucketWidth to bucketStart to get the bound. | ||
// boundList.add( bucketStart + bucketWidth * ((double)subbucketNumber / num_of_linear_subbuckets)); | ||
// | ||
// For negative bounds, the formula is similar, except that the index loop will go down and the bound will be converted | ||
// to a negative number. | ||
// | ||
// If has_counter_for_zero is true, add a zero bound between negative and positive bounds. | ||
// There are no subbuckets between last negative bucket and 0, or between 0 and first positive bucket. | ||
|
||
message LogLinearBounds { | ||
ExponentialBounds exponential_bounds = 1; | ||
uint32 num_of_linear_subbuckets = 2; | ||
} | ||
} | ||
|
||
// A representation of an exemplar, which is a sample input int measurement. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This histogram bound type represents buckets of data points with finer granularity close to a central point, and less granularity as the bucket range gets farther from that central point. Like this:
I find it surprising, though, that this central point is hardcoded to zero. This is likely useful for values that are clustered close to zero, and that may be positive or negative. But what about values that cluster around some very large number?
Instead of zero, I think I would have expected this inflection point to be something like the median of the data.
From the message perspective, I think the only thing required to support this would be to add an
offset
field. Then the bound would be computed like:I don't know what implications this has for downstream aggregators of these histograms. Would it be problematic to merge histrogams whose center values were different?
Like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying a linear offset to an exponential series is possible.
For the range of 50 to 150, a series with offset of 100 looks like:
The same range, with offset 0, looks like:
Which form would be more generally useful? I think 0 offset is. Allowing variable offset would add more complexity. If there is a use case, a user can always apply the offset before inserting data to the histogram. For example, insert(x - 100) to achieve the same effect. Of course, on reading the histogram, the user needs to add 100 back.
As you suggested, arbitrary offset would make merge difficult, because bounds are not aligned.
Now onto positive and negative numbers. For simplicity, the protocol here requires the same base and reference for positive and negative series (DDSketch also does this). Thus the bound space is always symmetrical like:
NOT like
The separate positive and negative sections define index ranges where the bounds are actually present in the infinite logical range. Example:
For negative numbers and positive numbers of the same absolute value, the bucket granularity is identical (symmetrical with reflection point at zero). But we allow asymmetrical value ranges of negative and positive numbers.
Combining the two ranges into one is not a good idea because there are an infinite number of indexes between 0 and 1 (assuming reference of 1) for infinite precision numbers. For finite precision numbers such as 64 bit double, there is still a large number of indexes. For a base of 2, the number is around 1024. Including all these empty buckets in the series is not good.
Theoretically, we could apply a linear offset to shift all input numbers to the positive range, for bounds like
Though in some cases, this may be preferred (like temperature in Celsius or Fahrenheit) . But for the common cases we care, like response time, bank balance, who all have natural zero point, symmetric buckets around 0 are preferred. Again, for special cases like temperature, users can shift before inserting to histogram, or use a more scientific unit like Kelvin.
And there is also a practical reason for not shifting the range, because the range is often unknown in streaming data. Range shifting adds complexity to histogram management.