This repository has been archived by the owner on Aug 13, 2019. It is now read-only.

re-add the missing prometheus_tsdb_wal_corruptions_total #473

Merged
8 changes: 7 additions & 1 deletion head.go
@@ -89,6 +89,7 @@ type headMetrics struct {
maxTime prometheus.GaugeFunc
samplesAppended prometheus.Counter
walTruncateDuration prometheus.Summary
walCorruptionsTotal prometheus.Counter
headTruncateFail prometheus.Counter
headTruncateTotal prometheus.Counter
checkpointDeleteFail prometheus.Counter
@@ -152,6 +153,10 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
Name: "prometheus_tsdb_wal_truncate_duration_seconds",
Help: "Duration of WAL truncation.",
})
m.walCorruptionsTotal = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_wal_corruptions_total",
Help: "Total number of WAL corruptions.",
})
m.samplesAppended = prometheus.NewCounter(prometheus.CounterOpts{
Name: "prometheus_tsdb_head_samples_appended_total",
Help: "Total number of appended samples.",
@@ -195,6 +200,7 @@ func newHeadMetrics(h *Head, r prometheus.Registerer) *headMetrics {
m.maxTime,
m.gcDuration,
m.walTruncateDuration,
m.walCorruptionsTotal,
m.samplesAppended,
m.headTruncateFail,
m.headTruncateTotal,
@@ -480,10 +486,10 @@ func (h *Head) Init(minValidTime int64) error {
return nil
}
level.Warn(h.logger).Log("msg", "encountered WAL error, attempting repair", "err", err)
h.metrics.walCorruptionsTotal.Inc()
Contributor @codesome, Dec 15, 2018:

I believe we detect WAL corruptions when we call loadWAL. So do we also need to increment it for this: https://github.com/prometheus/tsdb/blob/9e51d56e08958f22f55daf26795ee477def7797e/head.go#L471-L473

And also maybe a small test for that?

Contributor Author @krasi-georgiev, Dec 17, 2018:

When this happens, Prometheus will exit, so why would we increment there?

Contributor:

If there is no way to recover this Inc() info after we return, then there is no need to add it here.

Contributor:

Maybe add the metrics directly to wal.Repair() so we can know when there is a corruption and whether or not it has been repaired?

Contributor Author @krasi-georgiev, Dec 18, 2018:

I don't think it makes much difference where we place it, but it's not a bad idea.

How would we know if the corruption has been repaired?

if err := h.wal.Repair(err); err != nil {
return errors.Wrap(err, "repair corrupted WAL")
}

return nil
}

3 changes: 3 additions & 0 deletions head_test.go
@@ -22,6 +22,7 @@ import (
"sort"
"testing"

prom_testutil "github.com/prometheus/client_golang/prometheus/testutil"
"github.com/prometheus/tsdb/chunkenc"
"github.com/prometheus/tsdb/chunks"
"github.com/prometheus/tsdb/index"
@@ -927,7 +928,9 @@ func TestWalRepair(t *testing.T) {

h, err := NewHead(nil, nil, w, 1)
testutil.Ok(t, err)
testutil.Equals(t, 0.0, prom_testutil.ToFloat64(h.metrics.walCorruptionsTotal))
testutil.Ok(t, h.Init(math.MinInt64))
testutil.Equals(t, 1.0, prom_testutil.ToFloat64(h.metrics.walCorruptionsTotal))

sr, err := wal.NewSegmentsReader(dir)
testutil.Ok(t, err)