receive: can not load WAL data when restart #1624
Comments
On master, receive will ensure on startup that it starts with a clean state: it loads the WAL, flushes it to a block, and then uploads that block. That's why you're seeing it be loaded multiple times. cc @squat
@GuyCheung the WAL gets replayed and written to disk in the following cases:
Are you seeing an error where the WAL data is not successfully getting loaded?
@brancz @squat sorry for the late reply. I didn't see any critical errors, except for some "unknown series" error logs. If it loads multiple times, is there an exact number of how many times? It seems like an infinite loop, always loading.
@GuyCheung the WAL will be reloaded twice on startup (only one of the two times actually plays back data) and twice every two hours. During startup:
... then every two hours:
Are you seeing it more often than that?
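For readers following along, here is a minimal Go sketch of the startup sequence described in the two comments above: replay the WAL into an in-memory head, flush it to a block, upload it, and only then start serving. All function names and the data directory are hypothetical illustrations, not the actual Thanos or Prometheus TSDB APIs.

```go
package main

import "log"

// Hypothetical stand-ins for the real steps; the actual receive code uses the
// Prometheus TSDB packages for all of this.
type head struct{}

func replayWAL(dir string) *head       { log.Println("replaying WAL from", dir); return &head{} }
func flushToBlock(h *head, dir string) { log.Println("flushing head to a block in", dir) }
func uploadBlocks(dir string)          { log.Println("uploading blocks from", dir) }
func serve(dir string)                 { log.Println("serving remote-write, appending to the WAL in", dir) }

func main() {
	dataDir := "/var/thanos/receive" // assumed data directory, adjust to your setup

	// First pass: replay the existing WAL so pending samples are not lost,
	// then flush them to an immutable block and upload it.
	h := replayWAL(dataDir)
	flushToBlock(h, dataDir)
	uploadBlocks(dataDir)

	// Second pass: the TSDB is opened again for serving, which replays the
	// (now mostly empty) WAL a second time. A similar flush cycle then
	// repeats roughly every two hours while running.
	serve(dataDir)
}
```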
@squat thanks for your information! I think I know why it keeps loading... the process had been killed before my questions. Updated information: I have changed the Thanos version to
In my mind, the memory usage should not be much bigger than the data folder on disk, but I found the process took about 47G of memory (out of 48G total), while the WAL data folder on disk was about 22G. The process took about 47G of memory when it was killed:
The receiver data folder size is like below:
I tried to find an answer on the internet, but unfortunately I didn't find any useful information; one opinion said that Prometheus 2 cannot limit its memory, but I'm not sure about this.
I think the WAL was replayed three times during startup, and each replay seemed to take 10 minutes. Is this expected? Is there something we can do to reduce the replay time? The key points of my logs are below:
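To put numbers like the 22G WAL versus 47G RSS above side by side, a small self-contained Go helper such as the following can report directory sizes. The data-directory path is an assumption and should be pointed at your actual receive data folder; the `wal` subdirectory follows the standard Prometheus TSDB layout.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// dirSize returns the total size, in bytes, of all regular files under root.
func dirSize(root string) (int64, error) {
	var total int64
	err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
		if err != nil {
			return err
		}
		if info.Mode().IsRegular() {
			total += info.Size()
		}
		return nil
	})
	return total, err
}

func main() {
	dataDir := "/var/thanos/receive"        // assumed receive data directory, adjust as needed
	walDir := filepath.Join(dataDir, "wal") // standard Prometheus TSDB layout keeps the WAL here

	for _, d := range []string{dataDir, walDir} {
		size, err := dirSize(d)
		if err != nil {
			fmt.Fprintln(os.Stderr, d, "error:", err)
			continue
		}
		fmt.Printf("%-40s %.2f GiB\n", d, float64(size)/(1<<30))
	}
}
```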
@GuyCheung, this huge growth in the WAL looks to me to be the same issue that #1654 fixes. Every time the receive component triggers a flush, the WAL is replayed and written to a block, but the WAL is never deleted. Only new samples are actually written to the block, but ALL samples are processed. This means that the receive component has to do way more work than necessary. If the receive component has correctly shut down and cleaned the WAL, there should be zero replay cost when starting back up.
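The behaviour described above can be sketched as follows; the type and method names are hypothetical and only illustrate the flush-then-truncate idea behind #1654, not the real Prometheus TSDB API.

```go
package main

// tsdbLike is a stand-in for the receive component's TSDB; the method names
// are hypothetical and are not the real Prometheus TSDB API.
type tsdbLike struct{ headMaxt int64 }

func (db *tsdbLike) headMaxTime() int64                { return db.headMaxt }
func (db *tsdbLike) flushHeadToBlock(maxt int64) error { return nil }
func (db *tsdbLike) truncateWAL(maxt int64) error      { return nil }

// flushAndTruncate sketches the behaviour #1654 introduces: after the head is
// flushed to a block, the WAL data that block covers is truncated, so a clean
// shutdown leaves nothing to replay on the next start.
func flushAndTruncate(db *tsdbLike) error {
	maxt := db.headMaxTime()
	if err := db.flushHeadToBlock(maxt); err != nil {
		return err
	}
	// Before the fix this step was missing, so every flush re-read ALL WAL
	// samples even though only the new ones ended up in the block.
	return db.truncateWAL(maxt)
}

func main() { _ = flushAndTruncate(&tsdbLike{}) }
```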
@GuyCheung now that #1654 has merged, could you test again using the image built from the commit?
@squat I have tried the new version whose commit id is , but I got OOM on hosts with 128GB of memory... and the data folder is only about 6GB. Could you help look into this? I shared the pprof heap file with you: https://drive.google.com/file/d/1iKqfMD9brOXbt7mLqJCX689AhRPuzJ_N/view?usp=sharing
@squat I also tried to test the receive component with this patch: same story, but with a much smaller amount of data. A ~80 MB generated WAL (a single segment) can't be replayed; the container got OOM killed with a 4GB memory limit.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
Does this continue to occur for you?
We have a similar problem running Thanos in a Kubernetes cluster. When a pod gets restarted, the receiver is OOM killed. The StatefulSet looks similar to the following example: https://github.com/thanos-io/kube-thanos/blob/master/examples/all/manifests/thanos-receive-statefulSet.yaml The problem occurs with Thanos
This entire code path is simply leveraging Prometheus TSDB packages. Do we see similar spikes in memory when Prometheus restarts and replays a large WAL?
We also see that thanos-receive v0.10.1 spikes in memory usage on start and gets into an OOM loop.
I am also seeing huge spikes in memory after a restart: 9GB becomes 93GB. I have more details in #2107 (which I closed as a duplicate after finding I had missed this issue), but it sounds pretty similar to what everyone else is seeing.
This issue/PR has been automatically marked as stale because it has not had recent activity. Please comment on status otherwise the issue will be closed in a week. Thank you for your contributions.
This still happens with Thanos v0.11.0
This can largely be attributed to the WAL replay cost of the Prometheus TSDB, which thanos-receive makes use of. This is already being improved in Prometheus itself; once merged there, it will trickle down to here.
@brancz - can you point us to some existing Prometheus PR?
One example: prometheus/prometheus#6679. There are more ideas in the pipeline once that work is completed.
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Not necessarily a general solution (as it just pushes the problem out further), but the latest version of the Prometheus TSDB should at least cut this problem in half: prometheus/prometheus#7098
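For context on why that change helps: once head chunks are memory-mapped to disk, a restart can load them directly, and only samples newer than the last m-mapped chunk need to be re-appended from the WAL. The sketch below only illustrates that idea with placeholder functions and an assumed data directory; it is not the actual Prometheus implementation.

```go
package main

import "fmt"

type sample struct {
	t int64
	v float64
}

// loadMmappedChunks stands in for reading the chunks_head files back into
// memory; it returns the maximum timestamp covered by those chunks.
func loadMmappedChunks(dir string) int64 { return 0 }

// replayNewerThan stands in for WAL replay that only re-appends samples newer
// than the given timestamp, since older ones are already in m-mapped chunks.
func replayNewerThan(walDir string, minValidTime int64) []sample { return nil }

func main() {
	dataDir := "/var/thanos/receive" // assumed data directory
	maxt := loadMmappedChunks(dataDir + "/chunks_head")
	reappended := replayNewerThan(dataDir+"/wal", maxt)
	fmt.Printf("re-appended %d samples from the WAL\n", len(reappended))
}
```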
Hello 👋 Looks like there was no activity on this issue for the last 30 days.
Closing for now as promised, let us know if you need this to be reopened! 🤗
Same problem for v0.12.2: restart receive, then OOM.
We are having a similar problem with our setup. We have Thanos Receive collecting metrics from 8 clusters; we run the pods on 64GB memory VMs, and the receiver gets OOM killed at least once a day, and we lose the data for at least 2 hours. Is there a way to circumvent this scenario?
Still valid for v0.17.2: after restarting the receiver component, it gets OOM killed.
Same problem for v0.18.0: restart receive, then OOM.
Hello 👋 Looks like there was no activity on this issue for the last two months.
still happening
Hello 👋 Looks like there was no activity on this issue for the last two months.
Closing for now as promised, let us know if you need this to be reopened! 🤗
This is still happening on the latest release
Still happening on 0.24.0
Still happening on 0.25.1
Still happening on 0.29.0
Still happening on 0.31.0. Why is this not being looked into? :(
still happening on 0.31.0
still happening on 0.37.2
Thanos, Prometheus and Golang version used:
Thanos: self-built from thanos master: a09a4b9
Golang: go version go1.12.7 darwin/amd64
Prometheus: 2.13.0
thanos build command:
GOOS=linux GOARCH=amd64 go build -o thanos ./cmd/thanos
What happened:
I'm trying to use thanos receive. It was running as expected for hours before I restarted it.
After the restart, thanos receive tries to load WAL data again and again.
What you expected to happen:
load WAL data and listen & receive new data.
How to reproduce it (as minimally and precisely as possible):
I'm not sure whether there is a logic issue when receive starts. Thanos 0.7 can restart successfully, but the master code cannot.
Full logs to relevant components:
Environment:
Kernel (uname -a): 4.15.0