Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816
Conversation
There exists a situation where the Block/Chunk is malformed in some way. This creates a start/stop range that causes the uint64 subtraction 'ids/offset - start' to wrap around. This in turn produces an out-of-bounds lower-bound slice, which causes a panic.
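To make the failure mode concrete, here is a minimal, self-contained Go sketch of the wraparound, using the offsets from the example in the Changes section below (that example wraps in 32 bits and so shows 4291672367; loadSeries works with uint64 ids, so the wrapped value is near 2^64, but the effect on the slice bound is the same):

```go
package main

import "fmt"

func main() {
	// b stands in for the byte range fetched from object storage.
	b := make([]byte, 1047153)

	// offset is smaller than start, so the unsigned subtraction cannot go
	// negative and instead wraps around to an enormous value.
	var offset, start uint64 = 150844334, 154139263
	wrapped := offset - start
	fmt.Println(wrapped) // far larger than len(b)

	// Without a guard, using the wrapped value as a lower bound panics:
	//   _ = b[wrapped:] // panic: slice bounds out of range
	// A check like the one this PR adds avoids the panic:
	if wrapped > uint64(len(b)) {
		fmt.Println("offset out of range; block data is likely malformed")
		return
	}
	_ = b[wrapped:]
}
```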
Nice catch 👍
I get the reasoning here, avoiding panics and crashes, but I wonder if we can move this check into readChunkRange itself. Or check in the object storage providers' GetRange method what the logic is if someone asks for a range that does not exist (:
@@ -1394,6 +1394,11 @@ func (r *bucketIndexReader) loadSeries(ctx context.Context, ids []uint64, start,
	r.stats.seriesFetchedSizeSum += int(end - start)

	for _, id := range ids {
		if id-start > uint64(len(b)) {
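For readers following the diff, here is a rough, self-contained sketch of the idea behind this check; the helper name, the error-returning shape, and the message are assumptions for illustration, not the exact code in this PR:

```go
package store

import "fmt"

// checkSeriesOffsets (hypothetical) rejects series ids that fall outside the
// fetched byte range, instead of letting b[id-start:] wrap around and panic.
// b is the raw range returned by readIndexRange and start is the offset it
// was fetched from.
func checkSeriesOffsets(b []byte, ids []uint64, start uint64) error {
	for _, id := range ids {
		if id < start || id-start > uint64(len(b)) {
			return fmt.Errorf("series id %d outside fetched range [%d, %d)",
				id, start, start+uint64(len(b)))
		}
	}
	return nil
}
```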
Hm, are we sure this is the correct place to validate against a malformed index?
If we are asking to fetch start -> end-start
from r.block.readIndexRange(ctx, int64(start), int64(end-start)),
I think this method should FAIL if there is no such range in the object storage. So I believe we should fix that method instead of patching every small usage of it in thousands of places like this. (:
Also, what is the root cause again? You have some block with a malformed index, right? I wonder in what way it is malformed.
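To make the suggestion concrete, here is a hedged sketch of what validating once inside the fetch path might look like; the wrapper name and signature are assumptions for illustration, not Thanos's actual API:

```go
package store

import "fmt"

// readRangeChecked (hypothetical) wraps a range fetch and validates the
// result centrally, so callers never index past the returned buffer.
// fetch stands in for the object-storage GetRange call.
func readRangeChecked(fetch func(off, length int64) ([]byte, error), off, length int64) ([]byte, error) {
	b, err := fetch(off, length)
	if err != nil {
		return nil, err
	}
	// If object storage silently returns fewer bytes than requested, e.g.
	// because the range runs past the end of the object, fail loudly here
	// instead of panicking later at a call site.
	if int64(len(b)) < length {
		return nil, fmt.Errorf("short read: requested %d bytes at offset %d, got %d", length, off, len(b))
	}
	return b, nil
}
```

Centralizing the check this way would let every reader (loadSeries, loadChunks, and so on) benefit without per-call-site guards, which is the design point being argued above.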
This is also what I was thinking. This is a good stopgap for the problem, but not a good fix; it might not even be the correct fix at all.
The root cause looks to be that some blocks have a malformed index. I should have the results of a thanos bucket verify
at some point today or tomorrow.
As a sort of update: I upgraded to 0.3 but the error persists, so I bumped back down. I'm going to find out which blocks are bad right now so I can compare how many bytes we were expecting to fetch vs. how many we got.
Changes
Add a conditional to check whether subtracting start from an id or offset results in a value that is bigger than the bounds of the bucketIndexRange/readChunkRange slice. The reason for this is that blocks with bad data will create a wrapped-around integer.
Using readChunkRange as an example, a good block will produce a valid range, something like:
A bad block will produce something like:
As you can see from the first line: offset - start (150844334 - 154139263) = -3294929, which, because it is an unsigned int, wraps around and instead becomes 4291672367, which is out of bounds for a block capacity of 1047153.
Verification
We have an issue right now with some historical data we have stored in buckets; it is documented in #785. We will need to write some proper bucket tests for this.