pkg/receive: remove flushed WAL #1654
Merged
I feel like it would be nice to have a test verifying that this is not putting us at risk of data loss.
This commit ensures that we delete the WAL after it has been flushed to a block. Flushing the WAL simply creates a block but does not remove the WAL directory or its contents. This means that once the DB is re-opened, new samples are added to the same WAL.

Flushing the WAL again does not result in blocks with overlapping time ranges because the flushing logic guards against this (https://github.com/prometheus/prometheus/blob/master/tsdb/db.go#L300). Nevertheless, we should delete the WAL after flushing it to ensure that flushed samples are not needlessly re-processed. Also, once multi-TSDB support is added, holding old samples in the WAL could cause problems.

Signed-off-by: Lucas Servén Marín <lserven@gmail.com>
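For readers skimming the thread, the idea in this commit message can be sketched roughly as follows. This is a minimal illustration using only the Go standard library, not the code from the PR: `flushFunc`, `flushAndRemoveWAL`, `demo`, and the fake `block` directory are hypothetical names, and the only layout assumption is that the WAL lives in a `wal` subdirectory of the data directory.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// flushFunc is a hypothetical stand-in for whatever writes the WAL
// contents of the database in dir out as a block.
type flushFunc func(dir string) error

// flushAndRemoveWAL flushes the database in dir and, only if the flush
// succeeded, deletes the "wal" subdirectory so that already-flushed
// samples are not replayed the next time the database is opened.
func flushAndRemoveWAL(dir string, flush flushFunc) error {
	if err := flush(dir); err != nil {
		// Keep the WAL on failure: it may be the only copy of the samples.
		return fmt.Errorf("flush WAL: %w", err)
	}
	if err := os.RemoveAll(filepath.Join(dir, "wal")); err != nil {
		return fmt.Errorf("remove flushed WAL: %w", err)
	}
	return nil
}

// demo builds a throwaway directory with a fake WAL segment, runs
// flushAndRemoveWAL with a fake flush, and reports what is left on disk.
func demo() (blockExists, walExists bool, err error) {
	dir, err := os.MkdirTemp("", "receive-sketch")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	walDir := filepath.Join(dir, "wal")
	if err := os.MkdirAll(walDir, 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(walDir, "00000000"), []byte("samples"), 0o644); err != nil {
		return false, false, err
	}

	// Fake flush: pretend a block was written by creating a directory.
	fakeFlush := func(d string) error {
		return os.MkdirAll(filepath.Join(d, "block"), 0o755)
	}
	if err := flushAndRemoveWAL(dir, fakeFlush); err != nil {
		return false, false, err
	}

	_, blockErr := os.Stat(filepath.Join(dir, "block"))
	_, walErr := os.Stat(walDir)
	return blockErr == nil, walErr == nil, nil
}

func main() {
	block, wal, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("block exists: %v, wal exists: %v\n", block, wal)
}
```

The ordering is the point of the sketch: the WAL is removed only after the flush reports success, which is the property the data-loss concern above is about.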
89ab7f6 to 8c896a5
@brancz ack, added a test to ensure that opening a DB, adding samples, flushing, and then querying returns the same samples.
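The shape of such a round-trip check might look like the sketch below. It is not the test added in this PR: `toyDB`, `sample`, and `roundTrip` are hypothetical, and the in-memory store only imitates the append, flush, and query steps so the example stays self-contained.

```go
package main

import (
	"fmt"
	"reflect"
	"sort"
)

// sample is a single timestamped value.
type sample struct {
	T int64
	V float64
}

// toyDB is a hypothetical in-memory stand-in for a TSDB: samples are
// appended to a WAL and moved into a block when flushed.
type toyDB struct {
	wal   []sample
	block []sample
}

// Append records a sample in the WAL.
func (db *toyDB) Append(s sample) { db.wal = append(db.wal, s) }

// Flush copies the WAL contents into the block, then empties the WAL.
func (db *toyDB) Flush() {
	db.block = append(db.block, db.wal...)
	db.wal = nil
}

// Query returns every sample, flushed or not, ordered by timestamp.
func (db *toyDB) Query() []sample {
	out := append(append([]sample{}, db.block...), db.wal...)
	sort.Slice(out, func(i, j int) bool { return out[i].T < out[j].T })
	return out
}

// roundTrip appends the input (expected to be sorted by timestamp),
// flushes, and reports whether a query returns exactly the input and
// how many samples are still left in the WAL.
func roundTrip(in []sample) (same bool, walLen int) {
	db := &toyDB{}
	for _, s := range in {
		db.Append(s)
	}
	db.Flush()
	return reflect.DeepEqual(db.Query(), in), len(db.wal)
}

func main() {
	same, walLen := roundTrip([]sample{{1, 1.5}, {2, 2.5}, {3, 3.5}})
	fmt.Printf("same samples: %v, samples left in WAL: %d\n", same, walLen)
}
```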
brancz approved these changes on Oct 16, 2019
Very nice! 👍
GiedriusS pushed a commit that referenced this pull request on Oct 28, 2019
Could we release v0.8.2 to ship this? 🤔
squat added a commit to squat/thanos that referenced this pull request on Nov 5, 2019
Every time thanos receive is started, it has to replay the WAL three times, namely:
1. open the TSDB;
2. close the TSDB; open the ReadOnly TSDB and Flush; and
3. open the TSDB

These WAL replays can take a very long time if the WAL has lots of data. With the fix from thanos-io#1654, the third time will be instantaneous because the WAL will be empty. That still leaves two potentially long WAL replays. We can cut this down to just one long replay if we do the following operations instead:
1. with a closed TSDB, open the ReadOnly TSDB and Flush; and
2. open the TSDB

Now, the second step will be a fast replay because the WAL is empty, leaving just one potentially expensive WAL replay.

This commit eliminates explicit opening of the writable TSDB during startup, and instead automatically re-opens it after flushing the read-only TSDB.

Signed-off-by: Lucas Servén Marín <lserven@gmail.com>
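The replay-counting argument in this commit message can be checked with a small model. The sketch below is only an illustration under stated assumptions: `node`, `open`, `flush`, `oldStartup`, and `newStartup` are hypothetical names, replay cost is taken to be the number of WAL records, and flushing is assumed to remove the WAL as in #1654.

```go
package main

import "fmt"

// node models only what matters for the argument: how many records sit
// in the WAL and how many records have been replayed so far.
type node struct {
	walRecords int
	replayed   int
}

// open models opening the writable TSDB, which replays the whole WAL.
func (n *node) open() { n.replayed += n.walRecords }

// flush models opening the read-only TSDB and flushing: the WAL is
// replayed once, written out as a block, and then removed.
func (n *node) flush() {
	n.replayed += n.walRecords
	n.walRecords = 0
}

// oldStartup follows the three-step sequence and returns the total
// number of replayed records.
func oldStartup(wal int) int {
	n := &node{walRecords: wal}
	n.open()  // 1. open the TSDB
	n.flush() // 2. close the TSDB; open the ReadOnly TSDB and Flush
	n.open()  // 3. open the TSDB (the WAL is empty by now)
	return n.replayed
}

// newStartup follows the two-step sequence and returns the total
// number of replayed records.
func newStartup(wal int) int {
	n := &node{walRecords: wal}
	n.flush() // 1. with a closed TSDB, open the ReadOnly TSDB and Flush
	n.open()  // 2. open the TSDB (the WAL is empty by now)
	return n.replayed
}

func main() {
	fmt.Println("old startup replays:", oldStartup(1000)) // prints 2000
	fmt.Println("new startup replays:", newStartup(1000)) // prints 1000
}
```

In this model the old sequence replays a 1000-record WAL twice and the new sequence replays it once, matching the "two long replays" versus "one long replay" counts above.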
brancz pushed a commit that referenced this pull request on Nov 7, 2019
Verification
Ran thanos receive locally and ensured that after several starts and stops, blocks are created but the WAL is empty.
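A manual check like this could also be scripted. The following is a rough sketch rather than anything from the PR: `walIsEmpty` and `demo` are hypothetical helpers, and the sketch assumes the WAL lives in a `wal` subdirectory of the data directory and treats a missing directory as empty.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// walIsEmpty reports whether dataDir has no "wal" subdirectory, or one
// that contains no entries.
func walIsEmpty(dataDir string) (bool, error) {
	entries, err := os.ReadDir(filepath.Join(dataDir, "wal"))
	if errors.Is(err, fs.ErrNotExist) {
		return true, nil
	}
	if err != nil {
		return false, err
	}
	return len(entries) == 0, nil
}

// demo creates a throwaway data directory with one fake WAL segment and
// checks it before and after the WAL directory is removed.
func demo() (emptyBefore, emptyAfter bool, err error) {
	dir, err := os.MkdirTemp("", "wal-check")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)

	walDir := filepath.Join(dir, "wal")
	if err := os.MkdirAll(walDir, 0o755); err != nil {
		return false, false, err
	}
	if err := os.WriteFile(filepath.Join(walDir, "00000000"), []byte("x"), 0o644); err != nil {
		return false, false, err
	}

	before, err := walIsEmpty(dir)
	if err != nil {
		return false, false, err
	}
	if err := os.RemoveAll(walDir); err != nil {
		return false, false, err
	}
	after, err := walIsEmpty(dir)
	if err != nil {
		return false, false, err
	}
	return before, after, nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("WAL empty before removal: %v, after removal: %v\n", before, after)
}
```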
cc @bwplotka @brancz @krasi-georgiev