Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add more histogram types #226
Add more histogram types #226
Changes from 3 commits
0b3cc20
6c68c14
dd8b15e
4723b59
70bbaf7
7135d0b
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There may be a technicality that we need to add above. I'm interested in the zero "bucket" of an exponential bounds. If there are no negative values in the distribution, but there are zeros, how will it be represented? Do we need to insert an unused 0 boundary into the sequence so that the corresponding explicit bounds are
(-Inf, 0), [0, x)
and the 0th bucket goes unused?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, you need set the has_counter_for_zero flag in exponential format so that the bound sequence starts from 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider to move this closer to explicit bounds, no need to keep the ids ordered.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
new commit added to group all "bound" types together.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm having a bit of trouble w/ the use of
oneof
in some places and not here. It's wire compatible, I believe, to say:if so, then given the savings implied by these special bounds, perhaps we can afford to move the three new bounds into a
oneof
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ideally, we want:
oneof {
ExplicitBounds explicit_bounds = 8
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
repeated double explicit_bounds = 7
}
But " repeated double explicit_bounds = 7" is already at the top level. Though we could remove it at the cost of breaking compatibility (I heard compatibility is nice to have, but not required)
The bigger question is performance. @bogdandrutu said that oneof performance is so horrible that we should avoid it if we can.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bogdandrutu We have a few good questions to answer here, particularly about the
oneof
. I don't think we should bother moving field 7 or breaking compatibility, though it's sort of like keeping the worst of both worlds. 😞One of the questions that I found troublesome when converting to or from Prometheus is that the "le" bounds contradict the
[bounds[i-1], bounds[i])
statements here.(I still have to catch up on the youtube video above...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
field order is a pain. We have these options:
So no option is ideal. Option 1 and 2 have no impact once compiled, it's just for human readers. So changing between 1 and 2 is easy. We need to be careful on choosing 3.
As for "le" vs. "ge" bounds, I remember we basically punted on it, saying "we could add an optional boolean flag later". The bigger issue is how will a message receiver handles it. If a receiver is natively "le", it is difficult to convert incoming "ge" buckets to "le", and vise versa. The good news is that in practice, ge vs le usually has little impact on "continuous" number set, where we are concerned on the area under a histogram curve, between two bucket bounds. As long as "area" is concerned, inclusiveness/exclusiveness at a bound is irrelevant. Even for discrete data set, le/ge will only introduce errors no more than the width of a bucket (if population count is mistakenly included to a neighboring bucket, it would introduce an error no wider than the bucket width for quantile calculation). So I think we can continue to punt the le/ge question.
I think we are close to approval on this PR. There are only minor things like field order and "oneof". @bogdandrutu I'd like to hear from you and get this PR resolved soon.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I vote for a oneof. Option 2 is fine with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that to keep backward compatibility, we can only have
repeated double explicit_bounds = 7;
oneof bound_type {
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}
explicit_bounds is a preexisting field. Although existing single field can be included in a new oneof without breaking backward compatibility, existing "repeated" cannot. In fact, "repeated" cannot be oneof member at all.
So with oneof, we still leave out explicit_bounds. If we don't mind breaking compatibility, we can do:
oneof bound_type {
ExplicitBounds explicit_bounds = 8;
LinearBounds linear_bounds = 9;
ExponentialBounds exponential_bounds = 10;
LogLinearBounds log_linear_bounds = 11;
}
message ExplicitBounds {
repeated double explicit_bounds = 1;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with your first suggestion, the backward-compatible one:
If we're going to break the protocol in any way, I'd fix this however. If we decide to break the protocol because of labels vs attributes, maybe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a bit more explanation will help the reader into this description. This might have helped me:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This histogram bound type represents buckets of data points with finer granularity close to a central point, and less granularity as the bucket range gets farther from that central point. Like this:
I find it surprising, though, that this central point is hardcoded to zero. This is likely useful for values that are clustered close to zero, and that may be positive or negative. But what about values that cluster around some very large number?
Instead of zero, I think I would have expected this inflection point to be something like the median of the data.
From the message perspective, I think the only thing required to support this would be to add an
offset
field. Then the bound would be computed like:I don't know what implications this has for downstream aggregators of these histograms. Would it be problematic to merge histrogams whose center values were different?
Like this:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Applying a linear offset to an exponential series is possible.
For the range of 50 to 150, a series with offset of 100 looks like:
The same range, with offset 0, looks like:
Which form would be more generally useful? I think 0 offset is. Allowing variable offset would add more complexity. If there is a use case, a user can always apply the offset before inserting data to the histogram. For example, insert(x - 100) to achieve the same effect. Of course, on reading the histogram, the user needs to add 100 back.
As you suggested, arbitrary offset would make merge difficult, because bounds are not aligned.
Now onto positive and negative numbers. For simplicity, the protocol here requires the same base and reference for positive and negative series (DDSketch also does this). Thus the bound space is always symmetrical like:
NOT like
The separate positive and negative sections define index ranges where the bounds are actually present in the infinite logical range. Example:
For negative numbers and positive numbers of the same absolute value, the bucket granularity is identical (symmetrical with reflection point at zero). But we allow asymmetrical value ranges of negative and positive numbers.
Combining the two ranges into one is not a good idea because there are an infinite number of indexes between 0 and 1 (assuming reference of 1) for infinite precision numbers. For finite precision numbers such as 64 bit double, there is still a large number of indexes. For a base of 2, the number is around 1024. Including all these empty buckets in the series is not good.
Theoretically, we could apply a linear offset to shift all input numbers to the positive range, for bounds like
Though in some cases, this may be preferred (like temperature in Celsius or Fahrenheit) . But for the common cases we care, like response time, bank balance, who all have natural zero point, symmetric buckets around 0 are preferred. Again, for special cases like temperature, users can shift before inserting to histogram, or use a more scientific unit like Kelvin.
And there is also a practical reason for not shifting the range, because the range is often unknown in streaming data. Range shifting adds complexity to histogram management.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May we added that if
reference
is omitted, it's value may be presumed to equal 1?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment accepted.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've included several fields in the ExponentialBounds message to ensure that the bucket sizing can be asymmetical. (for example, so we can have more granularity for positive numbers than for negative numbers). Like this:
Is there a compelling reason to support asymmetric bounds?
It would simplify the message if we simply made the exponential bounds symmetric around their central point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see previous reply
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It takes a while to comprehend and it makes me think a comment would help. I suggest:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On line 586: A comment like
I'm not sure that helps, but it may have helped me.