Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

jalev · 2019-02-06T17:44:58Z

Changes

Add a conditional to check whether or not the summation of an id or offset with the start time does not result in a number that is bigger than the bounds of the bucketIndexRange/readChunkRange slice. The reason for this is because blocks with bad data will create a wrapped-around integer.

Using the readChunkRange as an example, a good block will produce a range that is good. Something like:

Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391849520, lowerBound: 0, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391849760, lowerBound: 240, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850005, lowerBound: 485, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850307, lowerBound: 787, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850562, lowerBound: 1042, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391850795, lowerBound: 1275, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851053, lowerBound: 1533, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851296, lowerBound: 1776, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851540, lowerBound: 2020, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391851791, lowerBound: 2271, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852069, lowerBound: 2549, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852350, lowerBound: 2830, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852634, lowerBound: 3114, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391852876, lowerBound: 3356, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853127, lowerBound: 3607, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853367, lowerBound: 3847, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853614, lowerBound: 4094, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853678, lowerBound: 4158, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853758, lowerBound: 4238, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853831, lowerBound: 4311, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391853922, lowerBound: 4402, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854100, lowerBound: 4580, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854259, lowerBound: 4739, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854429, lowerBound: 4909, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854613, lowerBound: 5093, blockCapacity: 200000, blockLength: 72520
Feb 06 14:32:40 ip-172-22-12-55 thanos-store[31056]: Successful block. start: 391849520, offsets: 391854669, lowerBound: 5149, blockCapacity: 200000, blockLength: 72520
....

A bad block will produce something like:

Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 150844334, lowerBound: 4291672367, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 153567825, lowerBound: 4294395858, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 145437857, offsets: 142139769, lowerBound: 4291669208, blockCapacity: 1600000, blockLength: 1046559
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 145437857, offsets: 138987621, lowerBound: 4288517060, blockCapacity: 1600000, blockLength: 1046559
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 95992829, offsets: 94499912, lowerBound: 4293474379, blockCapacity: 1600000, blockLength: 1000750
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 90331848, offsets: 84061971, lowerBound: 4288697419, blockCapacity: 1600000, blockLength: 999845
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 154139263, offsets: 152590538, lowerBound: 4293418571, blockCapacity: 1600000, blockLength: 1034363
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 301518029, offsets: 298164590, lowerBound: 4291613857, blockCapacity: 1600000, blockLength: 1050729
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 143073110, offsets: 141515948, lowerBound: 4293410134, blockCapacity: 1600000, blockLength: 1047153
Feb 06 14:45:39 ip-172-22-12-55 thanos-store[31206]: Compare offsets from one to the other for block. start: 143073110, offsets: 136620994, lowerBound: 4288515180, blockCapacity: 1600000, blockLength: 1047153

As you can see from the first line: offset - start (150844334 - 154139263) = -3294929, which because it's an unsigned int, wraps around and instead creates the value of 4291672367, which is out of bounds of a block capacity of 1047153.

Verification

We have an issue right now with some historical data we have stored in buckets. This issue is documented in #785 . Will need to write some proper bucket tests for it.

There exists a situation where the Block/Chunk is malformed in some way. This creates a start/stop range that results in the uint64 subtraction of 'ids/offset - start' to wrap around. This in turn causes an out of bounds lower-bound slice, which causes a panic.

FUSAKLA

Nice catch 👍

bwplotka

I get this reason to avoid panics and crashes, but wonder if we can move this check to readChunkRange itself. Or check in object storage providers' Get_range method what is the logic if someone asks getRange for range that does not exists (:

bwplotka · 2019-02-07T10:24:29Z

pkg/store/bucket.go

@@ -1394,6 +1394,11 @@ func (r *bucketIndexReader) loadSeries(ctx context.Context, ids []uint64, start,
 	r.stats.seriesFetchedSizeSum += int(end - start)

 	for _, id := range ids {
+
+		if id-start > uint64(len(b)) {


Hm are we should this is the correct place to validate against malformed index?

If we are asking to fetch start -> end-start from r.block.readIndexRange(ctx, int64(start), int64(end-start)) I think this method should FAIL if there is no such range in the object storage. So I believe we should fix that method instead of every small usages of those in thousands of places like this. (:

Also what is the root cause again? Because you have some block that has malformed index right? Wonder in what way malformed.

This is also what I was thinking. This is a good stop to the problem but not a good fix. It might not even be the correct fix at all.

The root cause looks to be that some blocks have a malformed index. I should have the results of a thanos bucket verify at some point today or tomorrow.

jalev · 2019-02-11T15:30:44Z

As a sort of update, I updated to 0.3 but error persists, so I bumped back down.

I'm going to find out what blocks are bad right now so I can do a comparison on how many bytes we were supposed to be expecting vs how much we got

jalev added 3 commits February 6, 2019 17:28

Merge branch 'master' of github.com:transferwise/thanos into slice-fix

ca02bb5

Removes unnecessary if statement from pkg/store/bucket.go

e46edf4

FUSAKLA approved these changes Feb 6, 2019

View reviewed changes

bwplotka requested changes Feb 7, 2019

View reviewed changes

GiedriusS mentioned this pull request Feb 10, 2019

Compactor Retention Issue and Resource Usage #824

Closed

bwplotka mentioned this pull request Feb 11, 2019

store: crashing after upgrade to 0.3.0 #829

Closed

jalev closed this Feb 20, 2019

jalev deleted the slice-fix branch February 20, 2019 14:39

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

jalev commented Feb 6, 2019 •

edited

Loading

FUSAKLA left a comment

bwplotka left a comment

bwplotka Feb 7, 2019

jalev Feb 7, 2019

jalev commented Feb 11, 2019

Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

Store: Handles out-of-range lower-bound indexes for loadChunks and loadSeries #816

Conversation

jalev commented Feb 6, 2019 • edited Loading

Changes

Verification

FUSAKLA left a comment

Choose a reason for hiding this comment

bwplotka left a comment

Choose a reason for hiding this comment

bwplotka Feb 7, 2019

Choose a reason for hiding this comment

jalev Feb 7, 2019

Choose a reason for hiding this comment

jalev commented Feb 11, 2019

jalev commented Feb 6, 2019 •

edited

Loading