-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SafeDelete faster for large blocks; TSDB block repair case #1056
Conversation
When we have label sets that are not in the correct order, fixing that changes the order of the series in the index. So the index must be rewritten in that new order. This makes this repair tool take up a bunch more memory, but produces blocks that verify correctly.
The directory name must be the block ID name exactly to verify. A temp directory or random name will not work here.
Pointer/reference logic error was eliminating all chunks for a series in a given TSDB block that wasn't the first chunk. Chunks are now referenced correctly via pointers.
If we have just downloaded a TSDB block for repair, we create a brand new block that is fixed. The original block can be used as the data we upload in SafeDelete() to save time/bandwidth rather than re-download it from the storage bucket.
Some linters catch errors.Errorf() as its not really part of the errors package.
Where duplicate series means two series with an identical label set. This code just drops any duplicate series that is encountered after the first. No merging or anything fancy.
We're comparing items by pointers, using Go's range variables is misleading here and we need not fall into the same trap.
This prevents us from having to re-implement label sorting.
54b0722
to
8225c79
Compare
8225c79
to
0a01898
Compare
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TSDB code handily rewrites the block metadata even though we don't write or change read block here. This nukes the extra metadata Thanos uses. Write it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is also a bug that you are trying to fix? Perhaps we should add it to the changelog and/or the PR messsage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bug, this is part of the initial patches to be able to reuse downloaded blocks. The TSDB library isn't aware of the additional meta data Thanos provides and was removing it. Therefore, when a previously downloaded block is reused it would have incomplete metadata without this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is being changed now: prometheus-junkyard/tsdb#637 so this won't be needed TBH (once we use latest TSDB).
Also this is true what @GiedriusS said - splitting PRs into focused changes helps a lot. However this change relies on this fix. I would be more explicit in description. This will help us to understand this early on (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add TODO to update TSDB and remove this as it will be not needed in newest TSDB. For now we need this, thanks for fixing this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo added, thanks!
// Build a new TSDB block. | ||
for _, s := range series { | ||
if labels.Compare(lastSet, s.lset) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The series are sorted based on their labelSet. What happens if two labelSets are identical? For one they will be consecutive in the slice. So if the labelSet we have now is the same as the previous labelSet we have a duplicate series which the TSDB code will produce an error on. So we just drop that series.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment on the implied behavior here. I believe you when you say you've seen this behavior, but it worries me, as something went terribly wrong wherever these blocks were produced. We should at least document that the heuristic here is that on collision we keep the first series found and drop consecutive ones.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry one more thing. As this shouldn't generally happen, let's add a metric so we can monitor when it does happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The repair process batch job only which makes metrics less useful. I had some similar comments in #964
Perhaps I should leave a comment if that were to change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes let's in that case put a TODO in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO added.
pkg/verifier/safe_delete.go
Outdated
tempdir, err := ioutil.TempDir("", "safe-delete") | ||
if err != nil { | ||
return err | ||
if tempdir == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we already have the block, don't download it again. Blocks can be large. 30GiB or close to it in my environment.
I've tested this several times on my own TSDB blocks. I've been wondering how to get better tests integrated into the code as unit tests, but I cannot make these TSDB blocks of mine public. But I've repaired several hundred GiB of data with this. |
pkg/verifier/safe_delete.go
Outdated
// source bucket. It returns error if block dir already exists in backup | ||
// bucket (blocks should be immutable) or any of the operation fails. If | ||
// the block has already been downloaded it must exist in tempdir, else it | ||
// will be downloaded to be copied to the backup bucket. tempdir set to the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer not to hide this behavior in tempdir
values. Maybe "safe delete modes" (as in "force download", "try local copy") or a bool (as in just "forceDownload") at the very least that makes it very explicit what's going to happen.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SafeDelete()
doesn't know the temporary directory where a previous function may have downloaded the block, so we need to pass in a path somehow. Perhaps rename this argument something like useExisting
? Which would mean we use the existing block if found under that path, otherwise a new temporary directory is created and the block is downloaded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry I meant let's pass two parameters to explicitly describe the intention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WIP
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, committed a patch for makes this explicit. I've tested the repair process with my collection of broken blocks I've started keeping and had the behavior expected. (No repeated downloads.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had thanos-compact cut a 450GiB TSDB block today which I referenced in #1152 which is part of the motivation for this.
pkg/verifier/safe_delete.go
Outdated
|
||
if _, err := os.Stat(dir); err != nil { | ||
// TSDB block not already present, download it | ||
err = os.Mkdir(dir, 0755) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should only be done if os.IsNotExist(err)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed, thanks!
pkg/verifier/safe_delete.go
Outdated
return errors.Wrap(err, "download from source") | ||
} | ||
} else { | ||
level.Info(logger).Log("msg", "using previously downloaded tsdb block", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In what kind of situation would we end up in this path? And if the answer is a (half or maybe even fully) finished previous download, then how do we prevent that the directory is there but the content is not?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My blocks (and I'm sure others) easily grow above 30GiB when compacted. So if I need to repair block A, Thanos will download block A, create a repaired block B, upload block B. So at this point, we've created, successfully, a new block from A and that block B now passes the VerifyIndex()
test.
At this point we call SafeDelete()
which downloads block A, uploads block A to the backup bucket, then deletes block A from the main bucket.
This process can be time consuming (and perhaps expensive). So if blocks are immutable, and the repair process verifies that A is a legitimate block (if malformed) we should be able to skip the redundant download of A and backup the existing downloaded copy.
// Build a new TSDB block. | ||
for _, s := range series { | ||
if labels.Compare(lastSet, s.lset) == 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add a comment on the implied behavior here. I believe you when you say you've seen this behavior, but it worries me, as something went terribly wrong wherever these blocks were produced. We should at least document that the heuristic here is that on collision we keep the first series found and drop consecutive ones.
pkg/verifier/safe_delete.go
Outdated
if err := os.RemoveAll(tempdir); err != nil { | ||
level.Warn(logger).Log("msg", "failed to delete dir", "dir", tempdir, "err", err) | ||
|
||
if _, err := os.Stat(dir); os.IsNotExist(err) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're swallowing any error that is not "IsNotExist" here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should just do here os.Mkdir
and ignore the error if it is saying that the directory already exists. Otherwise, we will have racy behaviour here (of course, there's barely any chance that someone will run two of these at the same time but still)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, good suggestions!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added some additional logic here to handle any other possible error from os.Stat()
.
The pre-existence of the directory triggers the code path where we can avoid having to re-download a possibly large TSDB block again. Races with the os.Mkdir()
are not anticipated as if the temporary directory isn't supplied ioutil.TempDir()
is used to generate a value. When we do supply tempdir
in pkg/verifier/index_issue.go
is it also generated with ioutil.Tempdir()
:
tmpdir, err := ioutil.TempDir("", fmt.Sprintf("index-issue-block-%s-", id))
When we want to SafeDelete an already downloaded TSDB block, be very explicit about that use case.
pkg/verifier/safe_delete.go
Outdated
// previously downloaded block found in tempdir to avoid repeated downloads. | ||
// If tempdir is an empty string a unique temporary directory is created for | ||
// scratch space, otherwise the given directory is used. | ||
func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID, useExisting bool, tempdir string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry for nitpicking on this, but can we name this useExistingTempdir
? I realize that's long, but we'll understand in 3 months what this means that way :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a problem. Fixed.
pkg/verifier/safe_delete.go
Outdated
if err := block.Download(ctx, logger, bkt, id, dir); err != nil { | ||
return errors.Wrap(err, "download from source") | ||
if useExisting { | ||
if _, err := os.Stat(filepath.Join(dir, "meta.json")); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check really sufficient? Shouldn't we check that all the downloaded content is equal to what is in object storage, ideally by their hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There don't appear to be tools for this in the existing code base. So this seems like a bit of scope creep. Glad to add additional checks here, but these features sound like another GitHub Issue/PR.
In fact, why isn't there a checksum in meta.json? It doesn't have to be cryptographically secure, but it would serve to verify that the block was intact. As we move from "metric data is ephemeral" to "metric data is important long term" that would be quite handy.
pkg/verifier/safe_delete.go
Outdated
tempdir, err := ioutil.TempDir("", "safe-delete") | ||
if err != nil { | ||
return err | ||
if tempdir == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I wasn't clear, I would expect that when useExisting
is set, that tempDir
must be set, so I would expect an error from this function if that is not the case, and this if statement to be combined with the parts below, I think that'll also make the code a lot more easy to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, refactored this a bit. It will return an error if you ask it to use an existing tempdir and do not provide one, and the logic is a little simpler.
Be extra explicit about using existing tempdirs. Added note about future improvements for verifying if we have an intact tsdb block.
cluster_test.go:129 fails on me again... |
gossip has been removed, if you rebase I think those flakes should go away :) |
Sorry for the silence, I'm working off the backlog that accumulated over kubecon, once rebased and CI is green I'm happy to give this another pass |
No problem. I know there are a lot of PRs as well. Does rebasing help you move this forward? |
The test that kept flaking is gone due to the rebase, that's all :) |
Updates on getting this merged? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, but as this is easy to miss something I'd be good if we can get a second approval from @bwplotka or @GiedriusS
pkg/block/index.go
Outdated
if labels.Compare(lastSet, s.lset) == 0 { | ||
level.Warn(logger).Log("msg", | ||
"dropping duplicate series in tsdb block found", | ||
"labelset", fmt.Sprintf("%s", s.lset), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should use s.lset.String()
here. The new meta linter will complain about this 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is also a bug that you are trying to fix? Perhaps we should add it to the changelog and/or the PR messsage.
pkg/verifier/safe_delete.go
Outdated
// previously downloaded block found in tempdir to avoid repeated downloads. | ||
// If tempdir is an empty string a unique temporary directory is created for | ||
// scratch space, otherwise the given directory is used. | ||
func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID, useExistingTempdir bool, tempdir string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Couldn't we just get rid of useExistingTempdir
here and use tempdir
to control how this function works? It seems like we are making this unnecessarily complex. Perhaps I am missing something.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I felt like giving a particular value of tempdir
special meaning was not obvious enough. I think I prefer this over tempdir being empty. I do agree that below the if doesn't need the tempdir == ""
case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you feel strongly about this @GiedriusS I'm ok with the other behavior, just feel I will have forgotten the implicit behavior of the empty string in 3 months, hence I wanted it to be more explicit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I feel strongly about this - boolean in random place of the functions IMO smells bad. Same with 2 flags controlling same thing in very strict way.Essentially it does not make sense to set useExistingTempdir = true
and tempdir = ""
so we need to validate that etc. Interface is broader then needed.
Also if userExistingTempdir = true and tempdir is specified you literally don't use bkt
so it's even worse in terms of readability and surprises ):
I would propose:
BackupAndDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID)
as it was.BackupAndDeleteFromLocal(ctx context.Context, logger log.Logger, bdir string, backupBkt objstore.Bucket)
This will ensure separation of concerns AND in fact BackupAndDelete
can use BackupAndDeleteFromLocal
internally. Right now what I can see is 2 function baked in one function (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I like @bwplotka's suggestion best as well.
pkg/verifier/safe_delete.go
Outdated
err = os.Mkdir(dir, 0755) | ||
if err != nil { | ||
return err | ||
if useExistingTempdir && tempdir == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like they both can be set or unset which proves my point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I touched up the comments for this function again so that they better describe the behavior of the current code. No code changes though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I proposed different approach - splitting this into 2 function which will solve this issue (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is now committed. The function signatures didn't quite work out as we would have liked, but I totally aggree, this feels much, much nicer.
The tempdir value is ignored if useExistingTempdir is false.
Updates? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jjneely Thanks for this!
I think it makes a lot of sense, but propose a split in SafeDelete
to ensure readability and address @GiedriusS concerns (: Let me know if that makes sense.
Otherwise LGTM!
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is being changed now: prometheus-junkyard/tsdb#637 so this won't be needed TBH (once we use latest TSDB).
Also this is true what @GiedriusS said - splitting PRs into focused changes helps a lot. However this change relies on this fix. I would be more explicit in description. This will help us to understand this early on (:
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also missing trailing period in comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
// Build a new TSDB block. | ||
for _, s := range series { | ||
// The TSDB library will throw an error if we add a series with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove double spaces in this comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Old habits die hard...fixed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I ask where is this habit from? Sorry for my ignorance - super curious (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lol... Search google for "double space after period" you'll learn more than you bargained for. Like, my age. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL never heard of this either. But actually the rule could apply here since code formatting is using monospaced fonts mostly :)
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add TODO to update TSDB and remove this as it will be not needed in newest TSDB. For now we need this, thanks for fixing this 👍
pkg/verifier/safe_delete.go
Outdated
// previously downloaded block found in tempdir to avoid repeated downloads. | ||
// If tempdir is an empty string a unique temporary directory is created for | ||
// scratch space, otherwise the given directory is used. | ||
func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID, useExistingTempdir bool, tempdir string) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I feel strongly about this - boolean in random place of the functions IMO smells bad. Same with 2 flags controlling same thing in very strict way.Essentially it does not make sense to set useExistingTempdir = true
and tempdir = ""
so we need to validate that etc. Interface is broader then needed.
Also if userExistingTempdir = true and tempdir is specified you literally don't use bkt
so it's even worse in terms of readability and surprises ):
I would propose:
BackupAndDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID)
as it was.BackupAndDeleteFromLocal(ctx context.Context, logger log.Logger, bdir string, backupBkt objstore.Bucket)
This will ensure separation of concerns AND in fact BackupAndDelete
can use BackupAndDeleteFromLocal
internally. Right now what I can see is 2 function baked in one function (:
pkg/verifier/safe_delete.go
Outdated
err = os.Mkdir(dir, 0755) | ||
if err != nil { | ||
return err | ||
if useExistingTempdir && tempdir == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I proposed different approach - splitting this into 2 function which will solve this issue (:
And sorry for massive delay in review! |
Co-Authored-By: Bartek Płotka <bwplotka@gmail.com>
Co-Authored-By: Bartek Płotka <bwplotka@gmail.com>
Co-Authored-By: Bartek Płotka <bwplotka@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Happy to put LGTM once we address the https://github.com/improbable-eng/thanos/pull/1056/files#r296908203 (:
pkg/block/index.go
Outdated
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir. | ||
// TODO: This is not needed in newer TSDB code. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR number would be nice for easier trackngm but not a blocker (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added
// Build a new TSDB block. | ||
for _, s := range series { | ||
// The TSDB library will throw an error if we add a series with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can I ask where is this habit from? Sorry for my ignorance - super curious (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One more nit and LGTM, thanks for this!
pkg/verifier/safe_delete.go
Outdated
return errors.Errorf("%s dir seems to exists in backup bucket. Remove this block manually if you are sure it is safe to do", id) | ||
} | ||
|
||
// Now we call backupAndDeleteDownloaded() to handle the work. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this comment is not needed.
pkg/verifier/safe_delete.go
Outdated
// backupAndDeleteDownloaded is a helper function that uploads a TSDB block | ||
// found on disk to the backup bucket and, on success, removes the TSDB block | ||
// from the source bucket. | ||
func backupAndDeleteDownloaded(ctx context.Context, logger log.Logger, bdir string, bkt, backupBkt objstore.Bucket, id ulid.ULID) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for being picky but there is some improvement we can make here.
I like BackupAndDelete
vs BackupAndDeleteDownloaded
, but it's not clear what's the difference between backupAndDeleteDownloaded
and BackupAndDeleteDownloaded
is. Sure they are heavily commented, but comment should have less priority than naming. I think here the proper naming and small possible signature of function would do the work.
So what about:
func backupAndDeleteDownloaded(ctx context.Context, logger log.Logger, bdir string, bkt, backupBkt objstore.Bucket, id ulid.ULID) error { | |
func backupDownloaded(ctx context.Context, logger log.Logger, bdir string, bktToRemove, backupBkt objstore.Bucket, id ulid.ULID) error { |
and then if no error do on the caller in both BackupAndDeleteDownloaded
and BackupAndDelete
just.
// Block uploaded, so we are ok to remove from src bucket.
if err := block.Delete(ctx, bkt, id); err != nil {
return errors.Wrap(err, "delete from source")
}
This will make the signature of BackupAndDeleteDownloaded
and backupDowloaded
smaller (thus reducing complexity) as you can drop bkt
argument.
What do you think? (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adjusted as requested, and went through another round of testing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! Thanks 👍
I think netlify issue is because of not rebased PR - nothing changed for netlify here so merging. |
Awesome, thanks! |
This makes SafeDelete faster and less error prone when dealing with large blocks. This prevents SafeDelete from re-downloading a block that was just downloaded.
This PR also addresses a TSDB block repair case when duplicate series are detected.
Changes
Verification
I run these patches in production, and have repaired quite a few hundred gig of blocks with these.