
Thanos Downsampling Checksum Mismatch #5944

Closed
RohitKochhar opened this issue Dec 7, 2022 · 5 comments

@RohitKochhar
Contributor

Thanos, Prometheus and Golang version used: v0.25.1 / NA

Object Storage Provider: AWS S3

What happened:

When Thanos Compact encounters a block with a checksum mismatch during compaction, an automated program of ours marks the block no-compact and restarts Compact, which lets the remaining backlog of blocks be compacted without issue.

However, we are now seeing Compact hit blocks with checksum mismatches during downsampling, which fails with the following error:

downsampling to 5 min: downsample block {BLOCK_ID} to window 300000: get chunk XXX, series XXXXX: checksum mismatch expected:40464b23, actual:2d8ed7bd \n first pass of downsampling failed

What you expected to happen:

Ideally, there would be a marker, added with thanos tools bucket mark, that specifies no-downsample. Alternatively, if a block is marked no-compact, Compact could infer that it should not be downsampled either, which would prevent Compact from failing.
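A minimal Go sketch of the check proposed above, applied before downsampling: skip any block that carries an explicit no-downsample marker and, optionally, any block that carries a no-compact marker. The block type, the function names, and the marker name no-downsample-mark.json are assumptions for illustration only; no-compact-mark.json corresponds to the existing Thanos no-compaction marker file.

```go
package main

import "fmt"

const (
	noCompactMark    = "no-compact-mark.json"    // existing Thanos no-compaction marker
	noDownsampleMark = "no-downsample-mark.json" // assumed name for the proposed marker
)

// block is a hypothetical, stripped-down view of a block in the bucket:
// its ID plus the set of marker files found alongside its meta.json.
type block struct {
	ID      string
	Markers map[string]bool
}

// skipDownsample reports whether downsampling should leave b alone.
// An explicit no-downsample marker always skips the block; when
// inferFromNoCompact is true, a no-compact marker skips it as well.
func skipDownsample(b block, inferFromNoCompact bool) bool {
	if b.Markers[noDownsampleMark] {
		return true
	}
	return inferFromNoCompact && b.Markers[noCompactMark]
}

// filterForDownsampling returns only the blocks that should be downsampled.
func filterForDownsampling(blocks []block, inferFromNoCompact bool) []block {
	var out []block
	for _, b := range blocks {
		if !skipDownsample(b, inferFromNoCompact) {
			out = append(out, b)
		}
	}
	return out
}

func main() {
	// Placeholder IDs; real blocks are identified by ULIDs.
	blocks := []block{
		{ID: "A", Markers: map[string]bool{}},
		{ID: "B", Markers: map[string]bool{noCompactMark: true}},
		{ID: "C", Markers: map[string]bool{noDownsampleMark: true}},
	}
	for _, b := range filterForDownsampling(blocks, true) {
		fmt.Println(b.ID) // with inference enabled, only A remains
	}
}
```

Keeping inference behind a switch reflects that the two options in the proposal differ: a dedicated marker is explicit, while inferring from no-compact would also skip blocks that were excluded from compaction for unrelated reasons.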

How to reproduce it (as minimally and precisely as possible):

Reproducing this can be difficult, but it appears that a block with a checksum mismatch must first be marked no-compact, after which Thanos Compact attempts to downsample it without having compacted it.

@yeya24
Contributor

yeya24 commented Dec 7, 2022

> Ideally, there would be a mark that could be added using thanos tools mark that specifies no-downsample, or if the block is marked no-compaction it would be inferred that it should be no-downsample to prevent compact from failing.

This sounds like a good idea!

@RohitKochhar
Contributor Author

@yeya24 Thanks! Do you have any recommendations as to how we can avoid this issue currently?

The only workaround I can think of is to edit the block's metadata.json to specify downsample_resolution: 0, but I don't want an additional program changing block metadata, out of fear that it may clash with the other Thanos components.

@yeya24
Contributor

yeya24 commented Dec 7, 2022

@RohitKochhar Editing metadata directly sounds dangerous. I don't think it would be hard to add the new marker you mentioned; checking the metadata before downsampling should be enough.
If you want to skip these blocks for now, that is probably not doable. I recommend disabling downsampling entirely until the tool is available.

@RohitKochhar
Contributor Author

@yeya24 Thanks so much for your feedback.

I will open a PR in the coming days to add this marker.

@RohitKochhar
Contributor Author

This change was implemented and is part of v0.30.0.
