-
Notifications
You must be signed in to change notification settings - Fork 256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more histogram types #226
Changes from 4 commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -188,7 +188,7 @@ message IntSum { | |
// as a sum of all reported measurements over a time interval. | ||
message DoubleSum { | ||
repeated DoubleDataPoint data_points = 1; | ||
|
||
// aggregation_temporality describes if the aggregator reports delta changes | ||
// since last report time, or cumulative changes since a fixed start time. | ||
AggregationTemporality aggregation_temporality = 2; | ||
|
@@ -484,9 +484,6 @@ message DoubleHistogramDataPoint { | |
// Otherwise all option fields and "buckets" field must be omitted in which case the | ||
// distribution of values in the histogram is unknown and only the total count and sum are known. | ||
|
||
// explicit_bounds is the only supported bucket option currently. | ||
// TODO: Add more bucket options. | ||
|
||
// explicit_bounds specifies buckets with explicitly defined bounds for values. | ||
// The bucket boundaries are described by "bounds" field. | ||
// | ||
|
@@ -503,9 +500,107 @@ message DoubleHistogramDataPoint { | |
// a boolean value which decides what type of intervals to use. | ||
repeated double explicit_bounds = 7; | ||
|
||
// When bounds have a pattern, a more efficient encoding method may be used. | ||
// Only one method should be defined. For performance reason, "oneof" is not used. | ||
// When more than one methods are defined, the receiver will use the first one found using this order: | ||
// explicit_bounds, linear_bounds, exponential_bounds, log_linear_bounds | ||
LinearBounds linear_bounds = 9; | ||
ExponentialBounds exponential_bounds = 10; | ||
LogLinearBounds log_linear_bounds = 11; | ||
|
||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Consider to move this closer to explicit bounds, no need to keep the ids ordered. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. new commit added to group all "bound" types together. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm having a bit of trouble w/ the use of
if so, then given the savings implied by these special bounds, perhaps we can afford to move the three new bounds into a There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ideally, we want: The bigger question is performance. @bogdandrutu said that oneof performance is so horrible that we should avoid it if we can. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. @bogdandrutu We have a few good questions to answer here, particularly about the One of the questions that I found troublesome when converting to or from Prometheus is that the "le" bounds contradict the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. field order is a pain. We have these options:
So no option is ideal. Option 1 and 2 have no impact once compiled, it's just for human readers. So changing between 1 and 2 is easy. We need to be careful on choosing 3. As for "le" vs. "ge" bounds, I remember we basically punted on it, saying "we could add an optional boolean flag later". The bigger issue is how will a message receiver handles it. If a receiver is natively "le", it is difficult to convert incoming "ge" buckets to "le", and vise versa. The good news is that in practice, ge vs le usually has little impact on "continuous" number set, where we are concerned on the area under a histogram curve, between two bucket bounds. As long as "area" is concerned, inclusiveness/exclusiveness at a bound is irrelevant. Even for discrete data set, le/ge will only introduce errors no more than the width of a bucket (if population count is mistakenly included to a neighboring bucket, it would introduce an error no wider than the bucket width for quantile calculation). So I think we can continue to punt the le/ge question. I think we are close to approval on this PR. There are only minor things like field order and "oneof". @bogdandrutu I'd like to hear from you and get this PR resolved soon. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I vote for a oneof. Option 2 is fine with me. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Note that to keep backward compatibility, we can only have explicit_bounds is a preexisting field. Although existing single field can be included in a new oneof without breaking backward compatibility, existing "repeated" cannot. In fact, "repeated" cannot be oneof member at all. So with oneof, we still leave out explicit_bounds. If we don't mind breaking compatibility, we can do: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm OK with your first suggestion, the backward-compatible one:
If we're going to break the protocol in any way, I'd fix this however. If we decide to break the protocol because of labels vs attributes, maybe? |
||
// (Optional) List of exemplars collected from | ||
// measurements that were used to form the data point | ||
repeated DoubleExemplar exemplars = 8; | ||
|
||
// LinearBounds. Bounds can be generated via | ||
// for (int i = 0; i < num_of_bounds; i++) | ||
// boundList.add(offset + width * i); | ||
message LinearBounds { | ||
double offset = 1; | ||
double width = 2; | ||
uint32 num_of_bounds = 3; | ||
} | ||
|
||
// ExponentialBounds. Bounds are on log scale. | ||
// The bound sequence is stitched together using negative bounds, zero bound, and positive bounds. | ||
// The formula to generate explicit bounds here are for demonstration purpose only. | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I think a bit more explanation will help the reader into this description. This might have helped me:
|
||
// If the message receiver natively stores exponential histograms, it will not need to generate explicit bounds. | ||
// It will only need to record the exponential bound parameters. | ||
message ExponentialBounds { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This histogram bound type represents buckets of data points with finer granularity close to a central point, and less granularity as the bucket range gets farther from that central point. Like this:
I find it surprising, though, that this central point is hardcoded to zero. This is likely useful for values that are clustered close to zero, and that may be positive or negative. But what about values that cluster around some very large number? From the message perspective, I think the only thing required to support this would be to add an
I don't know what implications this has for downstream aggregators of these histograms. Would it be problematic to merge histrogams whose center values were different?
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Applying a linear offset to an exponential series is possible.
The same range, with offset 0, looks like:
Which form would be more generally useful? I think 0 offset is. Allowing variable offset would add more complexity. If there is a use case, a user can always apply the offset before inserting data to the histogram. For example, insert(x - 100) to achieve the same effect. Of course, on reading the histogram, the user needs to add 100 back. As you suggested, arbitrary offset would make merge difficult, because bounds are not aligned. Now onto positive and negative numbers. For simplicity, the protocol here requires the same base and reference for positive and negative series (DDSketch also does this). Thus the bound space is always symmetrical like:
NOT like
The separate positive and negative sections define index ranges where the bounds are actually present in the infinite logical range. Example:
For negative numbers and positive numbers of the same absolute value, the bucket granularity is identical (symmetrical with reflection point at zero). But we allow asymmetrical value ranges of negative and positive numbers.
Though in some cases, this may be preferred (like temperature in Celsius or Fahrenheit) . But for the common cases we care, like response time, bank balance, who all have natural zero point, symmetric buckets around 0 are preferred. Again, for special cases like temperature, users can shift before inserting to histogram, or use a more scientific unit like Kelvin. |
||
double reference = 1; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. May we added that if There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. comment accepted. |
||
double base = 2; | ||
|
||
// The bound sequence starts with negative values generated via: | ||
// for (int i = num_of_bounds_for_negative_numbers - 1; i >= 0; i--) | ||
// int exponent = i + index_offset_for_negative_numbers; | ||
// boundList.add( -1 * reference * power(base, exponent)); | ||
// | ||
// Example of reference=1, base=2, index_offset_for_negative_numbers=-2, num_of_bounds_for_negative_numbers=6 | ||
// bounds: -8 -4 -2 -1 -.5 -.25 | ||
// exponent: 3 2 1 0 -1 -2 | ||
// Note that the loop on i goes down, so that negative numbers with larger absolute values | ||
// (but smaller arithmetic values) are added first. | ||
|
||
sint32 index_offset_for_negative_numbers = 3; | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. We've included several fields in the ExponentialBounds message to ensure that the bucket sizing can be asymmetical. (for example, so we can have more granularity for positive numbers than for negative numbers). Like this:
Is there a compelling reason to support asymmetric bounds? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. see previous reply |
||
uint32 num_of_bounds_for_negative_numbers = 4; | ||
|
||
// If has_counter_for_zero is true, add a bound of zero using: | ||
// boundList.add(0); | ||
bool has_counter_for_zero = 5; | ||
|
||
// Positive bounds follow the zero bound. Positive bounds are generated via: | ||
// for (int i = 0; i < num_of_bounds_for_positive_numbers; i++) | ||
// int exponent = i + index_offset_for_positive_numbers; | ||
// boundList.add( reference * power(base, exponent)); | ||
// | ||
// Example of reference=1, base=2, index_offset_for_positive_numbers=-3, num_of_bounds_for_positive_numbers=8 | ||
// bound: .125 .25 .5 1 2 4 8 16 | ||
// exponent: -3 -2 -1 0 1 2 3 4 | ||
|
||
sint32 index_offset_for_positive_numbers = 6; | ||
uint32 num_of_bounds_for_positive_numbers = 7; | ||
} | ||
|
||
// In LogLinearBounds, each exponential bucket is divided linearly into multiple subbuckets. | ||
// LogLinearBounds is an extension of ExponentialBounds. Similar to ExponentialBounds, | ||
// the bound sequence is stitched together using negative bounds, zero bound, and positive bounds. | ||
// | ||
// Example of LogLinearBounds with num_of_linear_subbuckets=4 on ExponentialBounds of | ||
// reference=1, base=2, index_offset_for_positive_numbers=-4, num_of_bounds_for_positive_numbers=15, | ||
// | ||
// bound: .5 .625 .75 .875 1 1.25 1.5 1.75 2 2.5 3 3.5 4 5 6 | ||
// index: -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 | ||
// | ||
// Note: Index need not start or end on multiples of num_of_linear_subbuckets. | ||
// | ||
// Positive bounds are computed as: | ||
// for (int count = 0; count < num_of_bounds_for_positive_numbers; count++) | ||
// int i = count + index_offset_for_positive_numbers; | ||
// int exponent; | ||
// int subbucketNumber; // In the range of [0, num_of_linear_subbuckets - 1] | ||
// if (i >= 0) { | ||
// exponent = i / num_of_linear_subbuckets; | ||
// subbucketNumber = i % num_of_linear_subbuckets; | ||
// } else { | ||
// exponent = (i - num_of_linear_subbuckets + 1) / num_of_linear_subbuckets; // Round down toward -infinity | ||
// subbucketNumber = i - exponent * num_of_linear_subbuckets; | ||
// } | ||
Comment on lines
+604
to
+610
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It takes a while to comprehend and it makes me think a comment would help. I suggest:
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. On line 586: A comment like
I'm not sure that helps, but it may have helped me. |
||
// // Compute the exponential bucket. | ||
// double bucketStart = reference * power(base, exponent); | ||
// double bucketWidth = bucketStart * (base - 1); | ||
// // Add a fraction of bucketWidth to bucketStart to get the bound. | ||
// boundList.add( bucketStart + bucketWidth * ((double)subbucketNumber / num_of_linear_subbuckets)); | ||
// | ||
// For negative bounds, the formula is similar, except that the index loop will go down and the bound will be converted | ||
// to a negative number. | ||
// | ||
// If has_counter_for_zero is true, add a zero bound between negative and positive bounds. | ||
// There are no subbuckets between last negative bucket and 0, or between 0 and first positive bucket. | ||
|
||
message LogLinearBounds { | ||
ExponentialBounds exponential_bounds = 1; | ||
uint32 num_of_linear_subbuckets = 2; | ||
} | ||
} | ||
|
||
// A representation of an exemplar, which is a sample input int measurement. | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a technicality that we need to add above. I'm interested in the zero "bucket" of an exponential bounds. If there are no negative values in the distribution, but there are zeros, how will it be represented? Do we need to insert an unused 0 boundary into the sequence so that the corresponding explicit bounds are
(-Inf, 0), [0, x)
and the 0th bucket goes unused?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you need set the has_counter_for_zero flag in exponential format so that the bound sequence starts from 0.