Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

Closed
wants to merge 3 commits into from
Closed

Conversation

jalev
Copy link
Contributor

@jalev jalev commented Feb 6, 2019

Changes

Add a conditional to check whether or not the summation of an id or offset with the start time does not result in a number that is bigger than the bounds of the bucketIndexRange/readChunkRange slice. The reason for this is because blocks with bad data will create a wrapped-around integer.

Using the readChunkRange as an example, a good block will produce a range that is good. Something like:

Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391849520, lowerBound: 0, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391849760, lowerBound: 240, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850005, lowerBound: 485, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850307, lowerBound: 787, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850562, lowerBound: 1042, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850795, lowerBound: 1275, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851053, lowerBound: 1533, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851296, lowerBound: 1776, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851540, lowerBound: 2020, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851791, lowerBound: 2271, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852069, lowerBound: 2549, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852350, lowerBound: 2830, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852634, lowerBound: 3114, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852876, lowerBound: 3356, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853127, lowerBound: 3607, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853367, lowerBound: 3847, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853614, lowerBound: 4094, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853678, lowerBound: 4158, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853758, lowerBound: 4238, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853831, lowerBound: 4311, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853922, lowerBound: 4402, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854100, lowerBound: 4580, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854259, lowerBound: 4739, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854429, lowerBound: 4909, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854613, lowerBound: 5093, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854669, lowerBound: 5149, blockCapacity: 200000, blockLength: 72520
....

A bad block will produce something like:

Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 150844334, lowerBound: 4291672367, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 153567825, lowerBound: 4294395858, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 145437857, offsets: 142139769, lowerBound: 4291669208, blockCapacity: 1600000, blockLength: 1046559
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 145437857, offsets: 138987621, lowerBound: 4288517060, blockCapacity: 1600000, blockLength: 1046559
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 95992829, offsets: 94499912, lowerBound: 4293474379, blockCapacity: 1600000, blockLength: 1000750
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 90331848, offsets: 84061971, lowerBound: 4288697419, blockCapacity: 1600000, blockLength: 999845
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 152590538, lowerBound: 4293418571, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 301518029, offsets: 298164590, lowerBound: 4291613857, blockCapacity: 1600000, blockLength: 1050729
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 143073110, offsets: 141515948, lowerBound: 4293410134, blockCapacity: 1600000, blockLength: 1047153
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 143073110, offsets: 136620994, lowerBound: 4288515180, blockCapacity: 1600000, blockLength: 1047153

As you can see from the first line: offset - start (150844334 - 154139263) = -3294929, which because it's an unsigned int, wraps around and instead creates the value of 4291672367, which is out of bounds of a block capacity of 1047153.

Verification

We have an issue right now with some historical data we have stored in buckets. This issue is documented in #785 . Will need to write some proper bucket tests for it.

jalev added 3 commits February 6, 2019 17:28
There exists a situation where the Block/Chunk is malformed in some way.
This creates a start/stop range that results in the uint64 subtraction
of 'ids/offset - start' to wrap around. This in turn causes an out of
bounds lower-bound slice, which causes a panic.
Copy link
Member

@FUSAKLA FUSAKLA left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch 👍

Copy link
Member

@bwplotka bwplotka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get this reason to avoid panics and crashes, but wonder if we can move this check to readChunkRange itself. Or check in object storage providers' Get_range method what is the logic if someone asks getRange for range that does not exists (:

@@ -1394,6 +1394,11 @@ func (r *bucketIndexReader) loadSeries(ctx context.Context, ids []uint64, start,
r.stats.seriesFetchedSizeSum += int(end - start)

for _, id := range ids {

if id-start > uint64(len(b)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm are we should this is the correct place to validate against malformed index?

If we are asking to fetch start -> end-start from r.block.readIndexRange(ctx, int64(start), int64(end-start)) I think this method should FAIL if there is no such range in the object storage. So I believe we should fix that method instead of every small usages of those in thousands of places like this. (:

Also what is the root cause again? Because you have some block that has malformed index right? Wonder in what way malformed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is also what I was thinking. This is a good stop to the problem but not a good fix. It might not even be the correct fix at all.

The root cause looks to be that some blocks have a malformed index. I should have the results of a thanos bucket verify at some point today or tomorrow.

@jalev
Copy link
Contributor Author

jalev commented Feb 11, 2019

As a sort of update, I updated to 0.3 but error persists, so I bumped back down.

I'm going to find out what blocks are bad right now so I can do a comparison on how many bytes we were supposed to be expecting vs how much we got

@jalev jalev closed this Feb 20, 2019
@jalev jalev deleted the slice-fix branch February 20, 2019 14:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants