pkg/receive: remove flushed WAL #1654

squat · 2019-10-15T22:34:55Z

This commit ensures that we delete the WAL after it has been flushed to
a block. Flushing the WAL simply creates a block but does not remove the
WAL directory or its contents. This means that once the DB is re-opened,
new samples are added to the same WAL. Flushing the WAL again does not
result in blocks with overlapping time ranges because the flushing logic
guards against this
(https://github.com/prometheus/prometheus/blob/master/tsdb/db.go#L300).
Nevertheless, we should delete the WAL after flushing it to ensure that
flushed samples are not needlessly re-processed. Also, once multi-TSDB
support is added, holding old samples in the WAL could cause problems.

Signed-off-by: Lucas Servén Marín <lserven@gmail.com>

Verification

Ran `thanos receive` locally and verified that, after several starts and stops, blocks are created but the WAL is empty.

cc @bwplotka @brancz @krasi-georgiev

brancz · 2019-10-16T06:26:36Z

I feel like it would be nice to have a test verifying that this is not putting us at risk of data loss.

squat · 2019-10-16T08:29:40Z

@brancz ack, added a test to ensure that opening a DB, adding samples, flushing, and then querying returns the same samples.
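For readers who want the shape of such a round-trip check without the real TSDB, here is a rough sketch against a toy in-memory store. Everything in it (`toyDB`, `sample`, `add`, `flush`, `query`, `equal`) is hypothetical and exists only for illustration; it is not the test added in this PR and does not exercise any Thanos or Prometheus code.

```go
package main

import "fmt"

// sample is a single timestamp/value pair.
type sample struct {
	t int64
	v float64
}

// toyDB is a hypothetical in-memory stand-in for a TSDB: samples are
// appended to wal and moved into blocks on flush.
type toyDB struct {
	wal    []sample
	blocks [][]sample
}

// add appends one sample to the WAL.
func (db *toyDB) add(s sample) { db.wal = append(db.wal, s) }

// flush copies the WAL contents into a new block and then drops the WAL.
// Flushing an empty WAL is a no-op.
func (db *toyDB) flush() {
	if len(db.wal) == 0 {
		return
	}
	block := append([]sample(nil), db.wal...)
	db.blocks = append(db.blocks, block)
	db.wal = nil
}

// query returns every sample, from the blocks first and then from the WAL.
func (db *toyDB) query() []sample {
	var out []sample
	for _, b := range db.blocks {
		out = append(out, b...)
	}
	return append(out, db.wal...)
}

// equal reports whether two sample slices hold the same samples in order.
func equal(a, b []sample) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

func main() {
	db := &toyDB{}
	want := []sample{{1, 0.5}, {2, 1.5}, {3, 2.5}}
	for _, s := range want {
		db.add(s)
	}
	db.flush()
	got := db.query()
	fmt.Println("samples preserved:", equal(got, want), "WAL length:", len(db.wal))
}
```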

brancz · 2019-10-16T09:31:51Z

Very nice! 👍

metalmatze · 2019-11-05T18:07:55Z

Could we release v0.8.2 to ship this? 🤔

Every time thanos receive is started, it has to replay the WAL three times, namely:

1. open the TSDB;
2. close the TSDB; open the ReadOnly TSDB and Flush; and
3. open the TSDB.

These WAL replays can take a very long time if the WAL has lots of data. With the fix from thanos-io#1654, the third time will be instantaneous because the WAL will be empty. That still leaves two potentially long WAL replays. We can cut this down to just one long replay if we do the following operations instead:

1. with a closed TSDB, open the ReadOnly TSDB and Flush; and
2. open the TSDB.

Now, the second step will be a fast replay because the WAL is empty, leaving just one potentially expensive WAL replay. This commit eliminates explicit opening of the writable TSDB during startup, and instead automatically re-opens it after flushing the read-only TSDB.

Signed-off-by: Lucas Servén Marín <lserven@gmail.com>
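The two startup sequences above can be compared with a small counting model. This is only a sketch under stated assumptions: `toyTSDB`, `open`, `flushReadOnly`, `oldStartup`, and `newStartup` are hypothetical names, a replay is treated as "long" whenever the WAL is non-empty, and the flush is assumed to remove the WAL as in #1654.

```go
package main

import "fmt"

// toyTSDB models just enough state to count WAL replays. It is a
// hypothetical illustration, not the real TSDB.
type toyTSDB struct {
	walSamples  int // samples currently held in the WAL
	longReplays int // opens that had to replay a non-empty WAL
}

// open replays the WAL; the replay counts as long if the WAL is non-empty.
func (db *toyTSDB) open() {
	if db.walSamples > 0 {
		db.longReplays++
	}
}

// flushReadOnly opens the read-only DB (one replay), flushes the WAL to a
// block, and then removes the WAL, as assumed from #1654.
func (db *toyTSDB) flushReadOnly() {
	db.open()
	db.walSamples = 0
}

// oldStartup follows the original three-step sequence.
func oldStartup(walSamples int) int {
	db := &toyTSDB{walSamples: walSamples}
	db.open()          // 1. open the TSDB
	db.flushReadOnly() // 2. close the TSDB; open the ReadOnly TSDB and Flush
	db.open()          // 3. open the TSDB (WAL is now empty)
	return db.longReplays
}

// newStartup follows the reduced two-step sequence.
func newStartup(walSamples int) int {
	db := &toyTSDB{walSamples: walSamples}
	db.flushReadOnly() // 1. with a closed TSDB, open the ReadOnly TSDB and Flush
	db.open()          // 2. open the TSDB (WAL is now empty)
	return db.longReplays
}

func main() {
	fmt.Println("old:", oldStartup(1000), "new:", newStartup(1000))
}
```

With a non-empty WAL the model reports two long replays for the old sequence and one for the new sequence, matching the counts given in the commit message.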

squat mentioned this pull request Oct 16, 2019

receive: can not load WAL data when restart #1624

Closed

squat force-pushed the deleteflushedwal branch from 89ab7f6 to 8c896a5 Compare October 16, 2019 08:28

brancz approved these changes Oct 16, 2019

brancz merged commit 48a8fb6 into thanos-io:master Oct 16, 2019

squat deleted the deleteflushedwal branch October 16, 2019 09:32

squat mentioned this pull request Nov 5, 2019

cmd/thanos/receive: reduce WAL replays at startup #1721

Merged
