test_restarts_under_load failing #1343
Comments
I tried to reproduce your observations and got similar results, but to me it looks more like a problem with locks in Postgres rather than a client issue. From what I observed, an iteration usually makes some progress within about a second, or makes no progress at all for 10-20 seconds (which is why the test fails). So I tried to collect various useful information whenever an iteration stops making progress. At 19:40:37 a new iteration made no progress (transactions were stuck); ps ax showed this:

A psql query against pg_stat_activity responded only at 19:40:51, when the iteration had already finished, and reported this:

Not sure whether pg_stat_activity tries to acquire some common lock, or Postgres is simply unresponsive when processing transactions and new connections.
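For reference, a minimal sketch of this kind of probe: when an iteration stalls, dump pg_stat_activity with a short statement_timeout so that the probe itself cannot hang for long. The connection string and the use of psycopg2 are assumptions for illustration, not details taken from the test harness.

```python
import psycopg2

def dump_activity(connstr: str) -> None:
    # Open a separate diagnostic connection; autocommit so SET takes effect
    # immediately and the probe never holds a transaction open.
    conn = psycopg2.connect(connstr)
    conn.autocommit = True
    with conn.cursor() as cur:
        # Give up quickly if the server cannot even answer a catalog query.
        cur.execute("SET statement_timeout = '2s'")
        cur.execute("""
            SELECT pid, state, wait_event_type, wait_event, left(query, 60)
            FROM pg_stat_activity
            ORDER BY pid
        """)
        for row in cur.fetchall():
            print(row)
    conn.close()
```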
I tried to dig deeper. I think I know where the bug is now, but I'm not entirely sure yet. The issue is still not reliably reproducible, but I managed to get a backtrace of the stuck backend:

It seems that this backend is blocked waiting for a shared buffer, so I increased shared_buffers to 512MB and tried to reproduce the issue again. After that the issue takes longer to reproduce, but the test now fails around iteration 25 and everything hangs for a substantial time (no progress for at least several minutes). That is convenient for debugging, and I collected backtraces of all Postgres processes while everything was stuck. All processes are waiting for something, and it is one of:
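A rough sketch of how backtraces of all Postgres processes can be collected while everything is stuck is shown below; it assumes gdb and pgrep are available on the machine and is not necessarily the exact procedure used in this investigation.

```python
import subprocess

def backtrace_all_postgres() -> None:
    # Find every running postgres process (backends, background workers, etc.).
    pids = subprocess.run(
        ["pgrep", "postgres"], capture_output=True, text=True, check=True
    ).stdout.split()
    for pid in pids:
        print(f"=== backtrace of pid {pid} ===")
        # -batch makes gdb run the given commands and exit without prompting.
        result = subprocess.run(
            ["gdb", "-batch", "-p", pid, "-ex", "thread apply all bt"],
            capture_output=True, text=True,
        )
        print(result.stdout)
```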
After one minute in the stuck state, the pageserver errors out with:

The most obvious explanation is that WAL gets stuck on the safekeeper and recent records are not streamed to the pageserver; there is likely a bug somewhere near send_wal.rs.
I tried bisecting the issue and it got me to fa8a6c0, i.e. the issue appeared when we disabled debug logs in the safekeeper (which improved safekeeper performance by 1.5-2x in local tests). Without the logs the issue also reproduces on commits from January, so it's not a recent bug.

I added a metric for this lag. It looks like the pageserver isn't able to keep up with the safekeepers, and the LSN gap between them keeps increasing. After some time (probably when shared_buffers fills up) the compute starts making requests which call

Same metrics with shared_buffers=1MB: this time the pageserver also falls behind, but

The situation where transactions hang for 10+ seconds (because the compute needs to wait for the pageserver to catch up) doesn't look good; I wonder if backpressure is supposed to solve this? With shared_buffers=1MB the problem doesn't look too bad, because everything continues to work normally after 10 seconds. With shared_buffers=512MB the situation is worse, because requests time out after 1 minute and that error aborts the transaction. Not sure what the next step for this issue is.
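As a side note on how such an LSN gap can be quantified: Postgres prints LSNs as two hex halves "hi/lo", and the numeric value is (hi << 32) | lo, so the lag in bytes is just the difference of the two values. The example LSNs below are made up, not taken from the metrics above.

```python
def lsn_to_int(lsn: str) -> int:
    # "2/8F000000" -> (0x2 << 32) | 0x8F000000
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def lsn_lag_bytes(safekeeper_lsn: str, pageserver_lsn: str) -> int:
    """Bytes of WAL the pageserver still has to catch up on."""
    return lsn_to_int(safekeeper_lsn) - lsn_to_int(pageserver_lsn)

print(lsn_lag_bytes("2/8F000000", "2/10000000"))  # roughly 2.1 GB of lag
```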
Nice investigation. Looks like we should harden back pressure settings. CC @knizhnik
We'll harden the settings once we see a test failure again.
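For context, here is a purely conceptual sketch of what back pressure means in this discussion, under the assumption that it works by throttling the writer when the pageserver falls too far behind; the names and the threshold below are invented for illustration and are not the actual settings or implementation.

```python
import time

# Assumed threshold for illustration only, not a real setting in this repo.
MAX_APPLY_LAG_BYTES = 64 * 1024 * 1024

def wait_for_catchup(get_write_lsn, get_applied_lsn) -> None:
    """Pause the writing backend while the pageserver lag exceeds the threshold,
    instead of letting the gap grow until requests start timing out."""
    while get_write_lsn() - get_applied_lsn() > MAX_APPLY_LAG_BYTES:
        time.sleep(0.1)
```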
Test failed on main; the logs show nothing suspicious, but it seems that the pageserver is lagging behind:
From #1280 (comment)
After poking at it, I see that sometimes during the whole iteration period (10s) some backend(s) never manage to finish their transaction. The walproposer looks healthy; it seems like a client problem, as if the client just never sends the next query. E.g.:

So almost everyone waits for 2403614, and that backend did the first update and is now waiting for the client to perform the second one.
Not sure how that's possible.
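A hedged sketch of how the "almost everyone waits for one pid" picture can be confirmed directly: pg_blocking_pids() (available since Postgres 9.6) lists, for each backend, the pids it is waiting on for a lock. In the scenario above the blocker itself would show an empty list and a ClientRead wait event, matching a backend that is idle in transaction waiting for its client. The connection string is an assumption.

```python
import psycopg2

with psycopg2.connect("dbname=postgres") as conn:
    with conn.cursor() as cur:
        cur.execute("""
            SELECT pid, pg_blocking_pids(pid) AS blocked_by,
                   state, wait_event, left(query, 60) AS query
            FROM pg_stat_activity
            WHERE backend_type = 'client backend'
            ORDER BY pid
        """)
        for row in cur.fetchall():
            print(row)
```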