receive: can not load WAL data when restart #1624
Comments
On master, receive will ensure on startup that it starts with a clean state: it loads the WAL, flushes it to a block, and then uploads that block. That's why you're seeing it be loaded multiple times. cc @squat
@GuyCheung the WAL gets replayed and written to disk in the following cases:
Are you seeing an error where the WAL data is not successfully getting loaded?
@brancz @squat sorry for the late reply. I didn't see any critical errors, except for some "unknown series" error logs. If it loads multiple times, is there an exact number of how many times? It seems like an infinite loop, always loading.
@GuyCheung the WAL will be reloaded twice on startup (only one of the two times actually plays back data) and twice every two hours. During startup:
... then every two hours:
Are you seeing it more often than that?
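For readers following along, here is a minimal Go sketch of the startup sequence described in the two comments above: replay the WAL into an in-memory head, flush it to a block, upload it, and only then start serving. All function names and the data directory are hypothetical illustrations, not the actual Thanos or Prometheus TSDB APIs.

```go
package main

import "log"

// Hypothetical stand-ins for the real steps; the actual receive code uses the
// Prometheus TSDB packages for all of this.
type head struct{}

func replayWAL(dir string) *head       { log.Println("replaying WAL from", dir); return &head{} }
func flushToBlock(h *head, dir string) { log.Println("flushing head to a block in", dir) }
func uploadBlocks(dir string)          { log.Println("uploading blocks from", dir) }
func serve(dir string)                 { log.Println("serving remote-write, appending to the WAL in", dir) }

func main() {
	dataDir := "/var/thanos/receive" // assumed data directory, adjust to your setup

	// First pass: replay the existing WAL so pending samples are not lost,
	// then flush them to an immutable block and upload it.
	h := replayWAL(dataDir)
	flushToBlock(h, dataDir)
	uploadBlocks(dataDir)

	// Second pass: the TSDB is opened again for serving, which replays the
	// (now mostly empty) WAL a second time. A similar flush cycle then
	// repeats roughly every two hours while running.
	serve(dataDir)
}
```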
@squat thanks for your information! I think I know why it keeps loading... the process had been killed before my questions. Updated information: I have changed the Thanos version to
In my mind, the memory usage should not be much bigger than the data folder on disk, but I found the process took about 47G of memory (out of 48G total), while the WAL data folder on disk was about 22G. The process took about 47G of memory when it was killed:
The receiver data folder size is like below:
I tried to find an answer on the internet, but unfortunately I didn't find any useful information; one opinion said that Prometheus 2 cannot limit its memory, but I'm not sure about this.
I think the WAL was replayed three times during startup, and each replay seemed to take 10 minutes. Is this expected? Is there something we can do to reduce the replay time? The key points of my logs are below:
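To put numbers like the 22G WAL versus 47G RSS above side by side, a small self-contained Go helper such as the following can report directory sizes. The data-directory path is an assumption and should be pointed at your actual receive data folder; the `wal` subdirectory follows the standard Prometheus TSDB layout.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirSize returns the total size, in bytes, of all regular files under root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	dataDir := "/var/thanos/receive"        // assumed receive data directory, adjust as needed
	walDir := filepath.Join(dataDir, "wal") // standard Prometheus TSDB layout keeps the WAL here

	for _, d := range []string{dataDir, walDir} {
		size, err := dirSize(d)
		if err != nil {
			fmt.Fprintln(os.Stderr, d, "error:", err)
			continue
		}
		fmt.Printf("%-40s %.2f GiB\n", d, float64(size)/(1<<30))
	}
}
```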
@GuyCheung, this huge growth in the WAL looks to me to be the same issue that #1654 fixes. Every time the receive component triggers a flush, the WAL is replayed and written to a block, but the WAL is never deleted. Only new samples are actually written to the block, but ALL samples are processed. This means that the receive component has to do way more work than necessary. If the receive component has correctly shut down and cleaned the WAL, there should be zero replay cost when starting back up.
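The behaviour described above can be sketched as follows; the type and method names are hypothetical and only illustrate the flush-then-truncate idea behind #1654, not the real Prometheus TSDB API.

```go
package main

// tsdbLike is a stand-in for the receive component's TSDB; the method names
// are hypothetical and are not the real Prometheus TSDB API.
type tsdbLike struct{ headMaxt int64 }

func (db *tsdbLike) headMaxTime() int64                { return db.headMaxt }
func (db *tsdbLike) flushHeadToBlock(maxt int64) error { return nil }
func (db *tsdbLike) truncateWAL(maxt int64) error      { return nil }

// flushAndTruncate sketches the behaviour #1654 introduces: after the head is
// flushed to a block, the WAL data that block covers is truncated, so a clean
// shutdown leaves nothing to replay on the next start.
func flushAndTruncate(db *tsdbLike) error {
	maxt := db.headMaxTime()
	if err := db.flushHeadToBlock(maxt); err != nil {
		return err
	}
	// Before the fix this step was missing, so every flush re-read ALL WAL
	// samples even though only the new ones ended up in the block.
	return db.truncateWAL(maxt)
}

func main() { _ = flushAndTruncate(&tsdbLike{}) }
```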
@GuyCheung now that #1654 has merged, could you test again using the image built from the commit?
@squat I have tried the new version whose commit id is , but I got OOM on hosts with 128GB of memory... and the data folder is only about 6GB. Could you help look into this? I shared the pprof heap file with you: https://drive.google.com/file/d/1iKqfMD9brOXbt7mLqJCX689AhRPuzJ_N/view?usp=sharing
@squat I also tried to test the receive component with this patch: same story, but with a much smaller amount of data. A ~80 MB generated WAL (a single segment) can't be replayed; the container got OOM killed with a 4GB memory limit.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Does this continue to occur for you?
We have a similar problem running Thanos in a Kubernetes cluster. When a pod gets restarted, the receiver is OOM killed. The StatefulSet looks similar to the following example: https://github.com/thanos-io/kube-thanos/blob/master/examples/all/manifests/thanos-receive-statefulSet.yaml The problem occurs with Thanos
This entire code path is simply leveraging Prometheus TSDB packages. Do we see similar spikes in memory when Prometheus restarts and replays a large WAL?
We also see that thanos-receive v0.10.1 spikes in memory usage on start and gets into an OOM loop.
I am also seeing huge spikes in memory after a restart: 9GB becomes 93GB. I have more details in #2107 (which I closed as a duplicate after finding I had missed this issue), but it sounds pretty similar to what everyone else is seeing.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
This still happens with Thanos v0.11.0
This can largely be attributed to the WAL replay cost of the Prometheus TSDB, which thanos-receive makes use of. This is already being improved in Prometheus itself; once merged there, it will trickle down to here.
@brancz - can you point us to some existing Prometheus PR?
One example: prometheus/prometheus#6679. There are more ideas in the pipeline once that work is completed.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Not necessarily a general solution (as it just pushes the problem out further), but the latest version of the Prometheus TSDB should at least cut this problem in half: prometheus/prometheus#7098
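For context on why that change helps: once head chunks are memory-mapped to disk, a restart can load them directly, and only samples newer than the last m-mapped chunk need to be re-appended from the WAL. The sketch below only illustrates that idea with placeholder functions and an assumed data directory; it is not the actual Prometheus implementation.

```go
package main

import "fmt"

type sample struct {
	t int64
	v float64
}

// loadMmappedChunks stands in for reading the chunks_head files back into
// memory; it returns the maximum timestamp covered by those chunks.
func loadMmappedChunks(dir string) int64 { return 0 }

// replayNewerThan stands in for WAL replay that only re-appends samples newer
// than the given timestamp, since older ones are already in m-mapped chunks.
func replayNewerThan(walDir string, minValidTime int64) []sample { return nil }

func main() {
	dataDir := "/var/thanos/receive" // assumed data directory
	maxt := loadMmappedChunks(dataDir + "/chunks_head")
	reappended := replayNewerThan(dataDir+"/wal", maxt)
	fmt.Printf("re-appended %d samples from the WAL\n", len(reappended))
}
```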
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Same problem for v0.12.2: restart receive, then OOM.
We are having a similar problem with our setup. We have Thanos Receive collecting metrics from 8 clusters; we run the pods on 64GB memory VMs, and the receiver gets OOM killed at least once a day, and we lose the data for at least 2 hours. Is there a way to circumvent this scenario?
Still valid for v0.17.2: after restarting the receiver component, it gets OOM killed.
Same problem for v0.18.0: restart receive, then OOM.
Hello 👋 Looks like there was no activity on this issue for the last two months.
still happening
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
This is still happening on the latest release
Still happening on 0.24.0
Still happening on 0.25.1
Still happening on 0.29.0
Still happening on 0.31.0. Why is this not being looked into? :(
still happening on 0.31.0
still happening on 0.37.2
Thanos, Prometheus and Golang version used:
Thanos: self-built from thanos master: a09a4b9
Golang: go version go1.12.7 darwin/amd64
Prometheus: 2.13.0
thanos build command:
GOOS=linux GOARCH=amd64 go build -o thanos ./cmd/thanos
What happened:
I'm trying to use thanos receive. It was running as expected for hours before I restarted it.
After the restart, thanos receive tries to load WAL data again and again.
What you expected to happen:
load WAL data and listen & receive new data.
How to reproduce it (as minimally and precisely as possible):
I'm not sure whether there is a logic issue when receive starts. Thanos 0.7 can restart successfully, but the master code cannot.
Full logs to relevant components:
Environment:
Kernel (uname -a): 4.15.0