-
Notifications
You must be signed in to change notification settings - Fork 2.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SafeDelete faster for large blocks; TSDB block repair case #1056
Changes from 22 commits
f0dfe27
f09d3f5
230bc7b
3f0d6cd
ed78bde
7fcf793
2f179ff
9ca0c05
fd572aa
24173d3
0d39253
5a7fd5d
bfd44f1
d1274ef
0a01898
bf62c5c
e296f8f
768ec93
c46ff1c
b625a95
380672b
5599139
e6ad6a0
649df4d
9e26cfd
31f7dd6
f6f3443
37d495a
b852c34
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -495,12 +495,16 @@ func Repair(logger log.Logger, dir string, id ulid.ULID, source metadata.SourceT | |
resmeta.Stats = tsdb.BlockStats{} // reset stats | ||
resmeta.Thanos.Source = source // update source | ||
|
||
if err := rewrite(indexr, chunkr, indexw, chunkw, &resmeta, ignoreChkFns); err != nil { | ||
if err := rewrite(logger, indexr, chunkr, indexw, chunkw, &resmeta, ignoreChkFns); err != nil { | ||
return resid, errors.Wrap(err, "rewrite block") | ||
} | ||
if err := metadata.Write(logger, resdir, &resmeta); err != nil { | ||
return resid, err | ||
} | ||
// TSDB may rewrite metadata in bdir | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Also missing trailing period in comment. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
if err := metadata.Write(logger, bdir, meta); err != nil { | ||
return resid, err | ||
} | ||
return resid, nil | ||
} | ||
|
||
|
@@ -595,6 +599,7 @@ type seriesRepair struct { | |
// rewrite writes all data from the readers back into the writers while cleaning | ||
// up mis-ordered and duplicated chunks. | ||
func rewrite( | ||
logger log.Logger, | ||
indexr tsdb.IndexReader, chunkr tsdb.ChunkReader, | ||
indexw tsdb.IndexWriter, chunkw tsdb.ChunkWriter, | ||
meta *metadata.Meta, | ||
|
@@ -665,8 +670,22 @@ func rewrite( | |
return labels.Compare(series[i].lset, series[j].lset) < 0 | ||
}) | ||
|
||
lastSet := labels.Labels{} | ||
// Build a new TSDB block. | ||
for _, s := range series { | ||
// The TSDB library will throw an error if we add a series with | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can we remove double spaces in this comment? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Old habits die hard...fixed. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can I ask where is this habit from? Sorry for my ignorance - super curious (: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Lol... Search google for "double space after period" you'll learn more than you bargained for. Like, my age. :-) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TIL never heard of this either. But actually the rule could apply here since code formatting is using monospaced fonts mostly :) |
||
// identical labels as the last series. This means that we have | ||
// discovered a duplicate time series in the old block. We drop | ||
// all duplicate series preserving the first one. | ||
// TODO: Add metric to count dropped series if repair becomes a daemon | ||
// rather than a batch job. | ||
if labels.Compare(lastSet, s.lset) == 0 { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The series are sorted based on their labelSet. What happens if two labelSets are identical? For one they will be consecutive in the slice. So if the labelSet we have now is the same as the previous labelSet we have a duplicate series which the TSDB code will produce an error on. So we just drop that series. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Let's add a comment on the implied behavior here. I believe you when you say you've seen this behavior, but it worries me, as something went terribly wrong wherever these blocks were produced. We should at least document that the heuristic here is that on collision we keep the first series found and drop consecutive ones. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. Thank you! There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Sorry one more thing. As this shouldn't generally happen, let's add a metric so we can monitor when it does happen. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The repair process batch job only which makes metrics less useful. I had some similar comments in #964 Perhaps I should leave a comment if that were to change? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes let's in that case put a TODO in there There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. TODO added. |
||
level.Warn(logger).Log("msg", | ||
"dropping duplicate series in tsdb block found", | ||
"labelset", s.lset.String(), | ||
) | ||
continue | ||
} | ||
if err := chunkw.WriteChunks(s.chks...); err != nil { | ||
return errors.Wrap(err, "write chunks") | ||
} | ||
|
@@ -691,6 +710,7 @@ func rewrite( | |
} | ||
postings.Add(i, s.lset) | ||
i++ | ||
lastSet = s.lset | ||
} | ||
|
||
s := make([]string, 0, 256) | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,7 @@ package verifier | |
|
||
import ( | ||
"context" | ||
"fmt" | ||
"io/ioutil" | ||
"os" | ||
"path/filepath" | ||
|
@@ -14,10 +15,13 @@ import ( | |
"github.com/pkg/errors" | ||
) | ||
|
||
// SafeDelete moves block to backup bucket and if succeeded, removes it from source bucket. | ||
// It returns error if block dir already exists in backup bucket (blocks should be immutable) or any | ||
// of the operation fails. | ||
func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID) error { | ||
// SafeDelete moves a TSDB block to a backup bucket and, on success, removes | ||
jjneely marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// it from the source bucket. It returns error if block dir already exists in | ||
jjneely marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// the backup bucket (blocks should be immutable) or if any of the operations | ||
// fail. When useExistingTempdir is true do not download the block but use the | ||
jjneely marked this conversation as resolved.
Show resolved
Hide resolved
|
||
// previously downloaded block found in tempdir to avoid repeated downloads. | ||
// The tempdir value is ignored unless useExistingTempdir is set to true. | ||
func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, backupBkt objstore.Bucket, id ulid.ULID, useExistingTempdir bool, tempdir string) error { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Couldn't we just get rid of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I felt like giving a particular value of There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If you feel strongly about this @GiedriusS I'm ok with the other behavior, just feel I will have forgotten the implicit behavior of the empty string in 3 months, hence I wanted it to be more explicit. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Hm, I feel strongly about this - boolean in random place of the functions IMO smells bad. Same with 2 flags controlling same thing in very strict way.Essentially it does not make sense to set Also if userExistingTempdir = true and tempdir is specified you literally don't use I would propose:
This will ensure separation of concerns AND in fact There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yeah I like @bwplotka's suggestion best as well. |
||
foundDir := false | ||
err := backupBkt.Iter(ctx, id.String(), func(name string) error { | ||
foundDir = true | ||
|
@@ -31,23 +35,48 @@ func SafeDelete(ctx context.Context, logger log.Logger, bkt objstore.Bucket, bac | |
return errors.Errorf("%s dir seems to exists in backup bucket. Remove this block manually if you are sure it is safe to do", id) | ||
} | ||
|
||
tempdir, err := ioutil.TempDir("", "safe-delete") | ||
if err != nil { | ||
return err | ||
} | ||
dir := filepath.Join(tempdir, id.String()) | ||
err = os.Mkdir(dir, 0755) | ||
if err != nil { | ||
return err | ||
if useExistingTempdir && tempdir == "" { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It seems like they both can be set or unset which proves my point. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I touched up the comments for this function again so that they better describe the behavior of the current code. No code changes though. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I proposed different approach - splitting this into 2 function which will solve this issue (: There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is now committed. The function signatures didn't quite work out as we would have liked, but I totally aggree, this feels much, much nicer. |
||
return errors.New("instructed to use existing tempdir but no tempdir provided") | ||
} | ||
defer func() { | ||
if err := os.RemoveAll(tempdir); err != nil { | ||
level.Warn(logger).Log("msg", "failed to delete dir", "dir", tempdir, "err", err) | ||
|
||
var dir string | ||
if useExistingTempdir { | ||
dir = filepath.Join(tempdir, id.String()) | ||
if _, err := os.Stat(filepath.Join(dir, "meta.json")); err != nil { | ||
// If there is any error stat'ing meta.json inside the TSDB block | ||
// then declare the existing block as bad and refuse to upload it. | ||
// TODO: Make this check more robust. | ||
return errors.Wrap(err, "existing tsdb block is invalid") | ||
} | ||
level.Info(logger).Log("msg", "using previously downloaded tsdb block", | ||
"id", id.String()) | ||
} else { | ||
// tempdir unspecified, create one | ||
tempdir, err = ioutil.TempDir("", fmt.Sprintf("safe-delete-%s", id.String())) | ||
if err != nil { | ||
return err | ||
} | ||
}() | ||
// We manage this tempdir so we clean it up on exit. | ||
defer func() { | ||
if err := os.RemoveAll(tempdir); err != nil { | ||
level.Warn(logger).Log("msg", "failed to delete dir", "dir", tempdir, "err", err) | ||
} | ||
}() | ||
|
||
if err := block.Download(ctx, logger, bkt, id, dir); err != nil { | ||
return errors.Wrap(err, "download from source") | ||
dir = filepath.Join(tempdir, id.String()) | ||
if _, err := os.Stat(dir); os.IsNotExist(err) { | ||
// TSDB block not already present, create it. | ||
err = os.Mkdir(dir, 0755) | ||
if err != nil { | ||
return errors.Wrap(err, "creating tsdb block tempdir") | ||
} | ||
} else if err != nil { | ||
// Error calling Stat() and something is really wrong. | ||
return errors.Wrap(err, "stat call on directory") | ||
} | ||
if err := block.Download(ctx, logger, bkt, id, dir); err != nil { | ||
return errors.Wrap(err, "download from source") | ||
} | ||
} | ||
|
||
if err := block.Upload(ctx, logger, backupBkt, dir); err != nil { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The TSDB code handily rewrites the block metadata even though we don't write or change read block here. This nukes the extra metadata Thanos uses. Write it back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is also a bug that you are trying to fix? Perhaps we should add it to the changelog and/or the PR messsage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not a bug, this is part of the initial patches to be able to reuse downloaded blocks. The TSDB library isn't aware of the additional meta data Thanos provides and was removing it. Therefore, when a previously downloaded block is reused it would have incomplete metadata without this change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this is being changed now: prometheus-junkyard/tsdb#637 so this won't be needed TBH (once we use latest TSDB).
Also this is true what @GiedriusS said - splitting PRs into focused changes helps a lot. However this change relies on this fix. I would be more explicit in description. This will help us to understand this early on (:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's add TODO to update TSDB and remove this as it will be not needed in newest TSDB. For now we need this, thanks for fixing this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Todo added, thanks!