Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stopping a PostgreSQL container immediately after starting it kills PostgreSQL with SIGKILL, possibly corrupting the database #1207

Open
martinvonwittich opened this issue Mar 5, 2024 · 4 comments

Comments

@martinvonwittich
Copy link

We are running a PostgreSQL docker container in production for our ERP, and have had multiple instances of corruption in the last few months.

The most recent corruption started with the following message "invalid record length" immediately after a restart:

2024-02-29 12:44:27.403	2024-02-29 11:44:27.402 UTC [7731] FATAL:  sorry, too many clients already
2024-02-29 12:44:28.505	2024-02-29 11:44:28.505 UTC [7732] FATAL:  sorry, too many clients already
2024-02-29 12:44:28.699	2024-02-29 11:44:28.699 UTC [7733] FATAL:  sorry, too many clients already
2024-02-29 12:44:29.912	2024-02-29 11:44:29.912 UTC [7734] FATAL:  sorry, too many clients already
2024-02-29 12:44:30.843	2024-02-29 11:44:30.843 UTC [7735] FATAL:  sorry, too many clients already
2024-02-29 12:44:33.862	2024-02-29 11:44:33.862 UTC [7740] FATAL:  sorry, too many clients already
2024-02-29 12:44:35.004	2024-02-29 11:44:35.003 UTC [7741] FATAL:  sorry, too many clients already
2024-02-29 12:44:37.264	2024-02-29 11:44:37.264 UTC [7746] FATAL:  sorry, too many clients already
2024-02-29 12:44:37.952	PostgreSQL Database directory appears to contain a database; Skipping initialization
2024-02-29 12:44:37.983	2024-02-29 11:44:37.983 UTC [1] LOG:  starting PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-02-29 12:44:37.983	2024-02-29 11:44:37.983 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-02-29 12:44:37.983	2024-02-29 11:44:37.983 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-02-29 12:44:37.987	2024-02-29 11:44:37.987 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-02-29 12:44:37.996	2024-02-29 11:44:37.995 UTC [29] LOG:  database system was interrupted; last known up at 2024-02-29 11:44:08 UTC
2024-02-29 12:44:38.572	2024-02-29 11:44:38.572 UTC [29] LOG:  database system was not properly shut down; automatic recovery in progress
2024-02-29 12:44:38.576	2024-02-29 11:44:38.576 UTC [29] LOG:  redo starts at 24/68195310
2024-02-29 12:44:38.580	2024-02-29 11:44:38.580 UTC [29] LOG:  invalid record length at 24/6834FC90: wanted 24, got 0
2024-02-29 12:44:38.580	2024-02-29 11:44:38.580 UTC [29] LOG:  redo done at 24/6834FC68 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2024-02-29 12:44:38.591	2024-02-29 11:44:38.591 UTC [27] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-02-29 12:44:38.674	2024-02-29 11:44:38.674 UTC [1] LOG:  database system is ready to accept connections
2024-02-29 12:44:40.319	2024-02-29 11:44:40.319 UTC [1] LOG:  received fast shutdown request
2024-02-29 12:44:40.321	2024-02-29 11:44:40.321 UTC [1] LOG:  aborting any active transactions
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [572] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [575] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [574] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [577] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [562] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [582] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [563] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [576] FATAL:  terminating connection due to administrator command
2024-02-29 12:44:40.332	2024-02-29 11:44:40.321 UTC [569] FATAL:  terminating connection due to administrator command

followed by more serious errors:

2024-02-29 12:45:53.223	2024-02-29 11:45:53.223 UTC [71] ERROR:  index "mail_message_pkey" contains unexpected zero page at block 2029
2024-02-29 12:45:53.223	2024-02-29 11:45:53.223 UTC [71] HINT:  Please REINDEX it.
2024-02-29 12:45:53.223	2024-02-29 11:45:53.223 UTC [71] STATEMENT:   SELECT msg.res_id FROM mail_message msg
2024-02-29 12:45:53.223		                                 RIGHT JOIN mail_notification rel
2024-02-29 12:45:53.223		                                 ON rel.mail_message_id = msg.id AND rel.res_partner_id = 42832 AND (rel.is_read = false OR rel.is_read IS NULL)
2024-02-29 12:45:53.223		                                 WHERE msg.model = 'mail.channel' AND msg.res_id in (1) AND msg.message_type != 'user_notification'
2024-02-29 12:45:53.972	2024-02-29 11:45:53.972 UTC [118] LOG:  request to flush past end of generated WAL; request 24/6843DB78, current position 24/683AC100
2024-02-29 12:45:53.972	2024-02-29 11:45:53.972 UTC [118] CONTEXT:  writing block 0 of relation base/186211/274960

and finally a PostgreSQL container unable to start due to a corrupted WAL:

2024-02-29 12:48:58.044	PostgreSQL Database directory appears to contain a database; Skipping initialization
2024-02-29 12:48:58.070	2024-02-29 11:48:58.069 UTC [1] LOG:  starting PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-02-29 12:48:58.070	2024-02-29 11:48:58.070 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-02-29 12:48:58.070	2024-02-29 11:48:58.070 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-02-29 12:48:58.074	2024-02-29 11:48:58.074 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-02-29 12:48:58.082	2024-02-29 11:48:58.082 UTC [29] LOG:  database system was interrupted while in recovery at 2024-02-29 11:48:51 UTC
2024-02-29 12:48:58.082	2024-02-29 11:48:58.082 UTC [29] HINT:  This probably means that some data is corrupted and you will have to use the last backup for recovery.
2024-02-29 12:48:58.467	2024-02-29 11:48:58.466 UTC [30] FATAL:  the database system is starting up
2024-02-29 12:48:58.687	2024-02-29 11:48:58.687 UTC [29] LOG:  database system was not properly shut down; automatic recovery in progress
2024-02-29 12:48:58.691	2024-02-29 11:48:58.691 UTC [29] LOG:  redo starts at 24/68351E78
2024-02-29 12:48:58.693	2024-02-29 11:48:58.693 UTC [29] PANIC:  invalid lp
2024-02-29 12:48:58.693	2024-02-29 11:48:58.693 UTC [29] CONTEXT:  WAL redo at 24/68413FC8 for Heap/HOT_UPDATE: off 7 xmax 18320045 flags 0x60 ; new off 12 xmax 0; blkref #0: rel 1663/186211/189171, blk 49
2024-02-29 12:48:59.169	2024-02-29 11:48:59.169 UTC [31] FATAL:  the database system is not yet accepting connections
2024-02-29 12:48:59.169	2024-02-29 11:48:59.169 UTC [31] DETAIL:  Consistent recovery state has not been yet reached.
2024-02-29 12:48:59.314	2024-02-29 11:48:59.314 UTC [1] LOG:  startup process (PID 29) was terminated by signal 6: Aborted
2024-02-29 12:48:59.314	2024-02-29 11:48:59.314 UTC [1] LOG:  terminating any other active server processes
2024-02-29 12:48:59.315	2024-02-29 11:48:59.315 UTC [1] LOG:  shutting down due to startup process failure
2024-02-29 12:48:59.334	2024-02-29 11:48:59.334 UTC [1] LOG:  database system is shut down

We were able to recover from the corruption by using pg_resetwal, dumping the database to SQL and restoring into a clean database from the dump, but the multiple corruption incidents got me curious, and I started to investigate if I could reproduce the error.

From the initial error it is obvious that PostgreSQL wasn't properly shut down, but probably killed with SIGKILL, because the log doesn't indicate that the instance running at 2024-02-29 12:44:37.264 was shut down before the instance at 2024-02-29 12:44:37.983 started:

2024-02-29 12:44:37.264	2024-02-29 11:44:37.264 UTC [7746] FATAL:  sorry, too many clients already
2024-02-29 12:44:37.952	PostgreSQL Database directory appears to contain a database; Skipping initialization
2024-02-29 12:44:37.983	2024-02-29 11:44:37.983 UTC [1] LOG:  starting PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
To compare it with a clean restart, where the old instance reports that it is being properly shut down:

2024-02-29 03:00:11.265	2024-02-29 02:00:11.264 UTC [51784] FATAL:  terminating connection due to administrator command
2024-02-29 03:00:11.268	2024-02-29 02:00:11.268 UTC [51805] FATAL:  terminating connection due to administrator command
2024-02-29 03:00:11.276	2024-02-29 02:00:11.276 UTC [1] LOG:  background worker "logical replication launcher" (PID 32) exited with exit code 1
2024-02-29 03:00:11.285	2024-02-29 02:00:11.285 UTC [51804] FATAL:  terminating connection due to administrator command
2024-02-29 03:00:11.302	2024-02-29 02:00:11.302 UTC [27] LOG:  shutting down
2024-02-29 03:00:11.304	2024-02-29 02:00:11.304 UTC [27] LOG:  checkpoint starting: shutdown immediate
2024-02-29 03:00:11.363	2024-02-29 02:00:11.363 UTC [1] LOG:  database system is shut down
2024-02-29 03:00:19.513	PostgreSQL Database directory appears to contain a database; Skipping initialization
2024-02-29 03:00:19.568	2024-02-29 02:00:19.567 UTC [1] LOG:  starting PostgreSQL 15.4 (Debian 15.4-2.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-02-29 03:00:19.568	2024-02-29 02:00:19.568 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-02-29 03:00:19.568	2024-02-29 02:00:19.568 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-02-29 03:00:19.572	2024-02-29 02:00:19.572 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-02-29 03:00:19.584	2024-02-29 02:00:19.580 UTC [29] LOG:  database system was shut down at 2024-02-29 02:00:11 UTC
2024-02-29 03:00:19.594	2024-02-29 02:00:19.594 UTC [1] LOG:  database system is ready to accept connections

At the time where the corruption occurred, a team member issued a command to restart the docker container (we have since figured out that the team restarted PostgreSQL very often due to a misunderstanding that it would be necessary when deploying schema changes), so I went ahead to see if I could reproduce the problem just by restarting the container. I created an empty PostgreSQL container as documented on https://hub.docker.com/_/postgres under "start a postgres instance":

martin.mein-iserv.de ~ # docker run --name some-postgres -e POSTGRES_PASSWORD=mysecretpassword -d postgres
cf5a99834d8d4674bfddc7695f5aa97b1304bcc6817e6dad09479aa1a95beb59
And then I restarted the container in a loop:

martin.mein-iserv.de ~ # for i in {1..5}; do date -Iseconds; time docker stop some-postgres >/dev/null; echo; date -Iseconds; time docker start some-postgres >/dev/null; echo; done
2024-03-05T15:44:44+01:00
docker stop some-postgres > /dev/null  0,04s user 0,03s system 22% cpu 0,307 total

2024-03-05T15:44:44+01:00
docker start some-postgres > /dev/null  0,03s user 0,02s system 14% cpu 0,365 total

2024-03-05T15:44:45+01:00
docker stop some-postgres > /dev/null  0,03s user 0,02s system 0% cpu 10,274 total

2024-03-05T15:44:55+01:00
docker start some-postgres > /dev/null  0,03s user 0,03s system 18% cpu 0,291 total

2024-03-05T15:44:55+01:00
docker stop some-postgres > /dev/null  0,03s user 0,02s system 14% cpu 0,329 total

2024-03-05T15:44:56+01:00
docker start some-postgres > /dev/null  0,02s user 0,03s system 17% cpu 0,315 total

2024-03-05T15:44:56+01:00
docker stop some-postgres > /dev/null  0,02s user 0,02s system 0% cpu 10,257 total

2024-03-05T15:45:06+01:00
docker start some-postgres > /dev/null  0,03s user 0,02s system 13% cpu 0,348 total

2024-03-05T15:45:07+01:00
docker stop some-postgres > /dev/null  0,02s user 0,02s system 0% cpu 10,274 total

2024-03-05T15:45:17+01:00
docker start some-postgres > /dev/null  0,04s user 0,03s system 18% cpu 0,354 total

It's immediately obvious that something is going wrong here - sometimes docker stop hangs for 10 seconds, which is suspiciously similar to Docker's default --stop-timeout:

man docker-stop:

       --stop-timeout
         Timeout (in seconds) to stop a container, or -1 to disable timeout.

       The --stop-timeout flag sets the number of seconds to wait for the container
         to stop after sending the pre-defined (see --stop-signal) system call signal.
         If the container does not exit after the timeout elapses, it is forcibly killed
         with a SIGKILL signal.

       If --stop-timeout is set to -1, no timeout is applied, and the daemon will
         wait indefinitely for the container to exit.

       The default is determined by the daemon, and 10 seconds for Linux containers,
         and 30 seconds for Windows containers.

(Sidenote: the issue is very timing-sensitive. Without the date call, every docker stop after the first one was affected, and doing anything more complicated like calling docker logs between a docker start and docker stop makes the issue unreproducible.)

The log confirms that e.g. the docker stop run at 2024-03-05T15:44:45+01:00 did not shut down PostgreSQL cleanly:

[...]
PostgreSQL Database directory appears to contain a database; Skipping initialization

2024-03-05 14:44:45.412 UTC [1] LOG:  starting PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-03-05 14:44:45.412 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-05 14:44:45.412 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-03-05 14:44:45.416 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-05 14:44:45.421 UTC [30] LOG:  database system was shut down at 2024-03-05 14:44:44 UTC
2024-03-05 14:44:45.429 UTC [1] LOG:  database system is ready to accept connections

PostgreSQL Database directory appears to contain a database; Skipping initialization

2024-03-05 14:44:55.980 UTC [1] LOG:  starting PostgreSQL 16.2 (Debian 16.2-1.pgdg120+2) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-03-05 14:44:55.980 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2024-03-05 14:44:55.980 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2024-03-05 14:44:55.984 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-03-05 14:44:55.989 UTC [1] LOG:  received fast shutdown request
2024-03-05 14:44:55.989 UTC [29] LOG:  database system was interrupted; last known up at 2024-03-05 14:44:45 UTC
2024-03-05 14:44:56.048 UTC [29] LOG:  database system was not properly shut down; automatic recovery in progress
2024-03-05 14:44:56.049 UTC [29] LOG:  invalid record length at 0/15268D8: expected at least 24, got 0
2024-03-05 14:44:56.049 UTC [29] LOG:  redo is not required
2024-03-05 14:44:56.052 UTC [27] LOG:  checkpoint starting: end-of-recovery immediate wait
2024-03-05 14:44:56.062 UTC [27] LOG:  checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.003 s, sync=0.002 s, total=0.011 s; sync files=2, longest=0.001 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/15268D8, redo lsn=0/15268D8
2024-03-05 14:44:56.065 UTC [27] LOG:  shutting down
2024-03-05 14:44:56.067 UTC [27] LOG:  checkpoint starting: shutdown immediate
2024-03-05 14:44:56.073 UTC [27] LOG:  checkpoint complete: wrote 0 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.001 s, total=0.008 s; sync files=0, longest=0.000 s, average=0.000 s; distance=0 kB, estimate=0 kB; lsn=0/1526950, redo lsn=0/1526950
2024-03-05 14:44:56.080 UTC [1] LOG:  database system is shut down
[...]

Apparently sometimes the SIGTERM that docker stop sends initially seems to go unnoticed by the container, leading docker stop to send a SIGKILL 10 seconds later. Digging down with execsnoop, it seems that the underlying problem is the fact that the docker stop happens during the container startup phase before the actual postgres binary is even started:

Details

16:31:36 date             968413 953994   0 /usr/bin/date -Iseconds
16:31:36 docker           968414 953994   0 /usr/bin/docker start some-postgres
16:31:36 ifupdown-hotplu  968423 968130   0
16:31:36 ifquery          968426 968425   0 /sbin/ifquery --allow hotplug -l veth1d2dfb4
16:31:36 bridge-network-  968421 968130   0 /lib/udev/bridge-network-interface
16:31:36 bridge-network-  968422 968135   0 /lib/udev/bridge-network-interface
16:31:36 ifupdown-hotplu  968424 968135   0 /lib/udev/ifupdown-hotplug
16:31:36 ifquery          968428 968427   0 /sbin/ifquery --allow hotplug -l veth0fae70f
16:31:36 systemd-sysctl   968429 968130   0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/veth1d2dfb4 --prefix=/net/ipv4/neigh/veth1d2dfb4 --prefix=/net/ipv6/conf/veth1d2dfb4 --prefix=/net/ipv6/neigh/veth1d2dfb4
16:31:36 systemd-sysctl   968430 968135   0 /lib/systemd/systemd-sysctl --prefix=/net/ipv4/conf/veth0fae70f --prefix=/net/ipv4/neigh/veth0fae70f --prefix=/net/ipv6/conf/veth0fae70f --prefix=/net/ipv6/neigh/veth0fae70f
16:31:36 containerd-shim  968431 1549     0
16:31:36 containerd-shim  968439 968431   0 /usr/bin/containerd-shim-runc-v2 -namespace moby -id e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4 -address /run/containerd/containerd.sock
16:31:36 runc             968448 968439   0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4/log.json --log-format json --systemd-cgroup create --bundle /run/containerd/io.containerd.runtime.v2.task/moby/e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4 --pid-file /run/containerd/io.containerd.runtime.v2.task/moby/e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4/init.pid e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4
16:31:36 exe              968457 968448   0 /proc/self/exe init
16:31:36 exe              968466 968448   0 /proc/3052/exe -exec-root=/var/run/docker e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4 beb8ec28ab26
16:31:36 exe              968474 3052     0 /proc/self/exe /var/run/docker/netns/241fd6979fc9 all false
16:31:36 ifupdown-hotplu  968482 968135   0 /lib/udev/ifupdown-hotplug
16:31:36 runc             968483 968439   0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4/log.json --log-format json --systemd-cgroup start e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4
16:31:36 docker-entrypoi  968459 968439   0 /usr/local/bin/docker-entrypoint.sh postgres
16:31:36 bash             968459 968439   0 /usr/bin/bash /usr/local/bin/docker-entrypoint.sh postgres
16:31:36 id               968491 968459   0 /usr/bin/id -u
16:31:36 mkdir            968492 968459   0 /usr/bin/mkdir -p /var/lib/postgresql/data
16:31:36 date             968493 953994   0 /usr/bin/date -Iseconds
16:31:36 chmod            968495 968459   0 /usr/bin/chmod 00700 /var/lib/postgresql/data
16:31:36 docker           968494 953994   0 /usr/bin/docker stop some-postgres
16:31:36 mkdir            968496 968459   0 /usr/bin/mkdir -p /var/run/postgresql
16:31:36 chmod            968497 968459   0 /usr/bin/chmod 03775 /var/run/postgresql
16:31:36 find             968498 968459   0 /usr/bin/find /var/lib/postgresql/data ! -user postgres -exec chown postgres {} +
16:31:36 find             968499 968459   0 /usr/bin/find /var/run/postgresql ! -user postgres -exec chown postgres {} +
16:31:36 id               968505 968459   0 /usr/bin/id -u
16:31:36 gosu             968459 968439   0 /usr/local/bin/gosu postgres /usr/local/bin/docker-entrypoint.sh postgres
16:31:36 docker-entrypoi  968459 968439   0 /usr/local/bin/docker-entrypoint.sh postgres
16:31:36 bash             968459 968439   0 /usr/bin/bash /usr/local/bin/docker-entrypoint.sh postgres
16:31:36 id               968513 968459   0 /usr/bin/id -u
16:31:36 mkdir            968514 968459   0 /usr/bin/mkdir -p /var/lib/postgresql/data
16:31:36 chmod            968515 968459   0 /usr/bin/chmod 00700 /var/lib/postgresql/data
16:31:36 mkdir            968516 968459   0 /usr/bin/mkdir -p /var/run/postgresql
16:31:36 runc             968518 968439   0 /usr/bin/runc --root /var/run/docker/runtime-runc/moby --log /run/containerd/io.containerd.runtime.v2.task/moby/e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4/log.json --log-format json --systemd-cgroup kill e24c90a798805e9fe8c1d2a00017fb966fffd2cbff97d281e7dab595e80020b4 2
16:31:36 chmod            968517 968459   0 /usr/bin/chmod 03775 /var/run/postgresql
16:31:36 id               968519 968459   0 /usr/bin/id -u
16:31:36 cat              968528 968459   0 /usr/bin/cat
16:31:36 postgres         968459 968439   0 /usr/lib/postgresql/16/bin/postgres

I was unable to reproduce the issue with a non-containerized PostgreSQL installation on Debian 11 bullseye with a similar systemctl stop postgresql@13-main.service; systemctl start postgresql@13-main.service loop, so this issue seems to affect only the Docker version of PostgreSQL.

@tianon
Copy link
Member

tianon commented Mar 5, 2024

It sounds like #714 (and thus #763) might be related here. 😅

("Clean" shutdown is something we've had a number of discussions about in this repository, and something not everyone can agree on a good definition of due to wildly differing use cases and usage patterns. 🙈)

@tianon
Copy link
Member

tianon commented Mar 5, 2024

You probably also want to set a much more generous --stop-timeout on your PostgreSQL containers -- I usually use 120, but even that might not be enough (as discussed in #714). You'll likely also need to raise systemd's default TimeoutStopSec value to make sure your containers get their full stop timeout if the system or service shuts down for any reason, like Docker upgrades.

@martinvonwittich
Copy link
Author

martinvonwittich commented Mar 5, 2024

You probably also want to set a much more generous --stop-timeout on your PostgreSQL containers

I'm pretty sure increasing the --stop-timeout wouldn't help here at all - due to the fact that the main postgres process isn't even running yet when triggering the issue with docker stop, the SIGINT apparently goes nowhere useful (the entrypoint script seems to ignore it?). Docker will wait fruitlessly for PostgreSQL to stop, while PostgreSQL doesn't know that it's supposed to stop in the first place. Raising --stop-timeout will just cause Docker to wait longer before sending SIGKILL.

We've already decided to migrate our PostgreSQL instance out of Docker to ensure we won't get hit by this again.

@tianon
Copy link
Member

tianon commented Mar 6, 2024

Ah right -- I guess we could implement some trap logic in there that tries to mimic "Smart Shutdown" on SIGTERM and "Fast Shutdown" on SIGINT, but I think it'd be easier / more likely to be "generally correct" to just assume stopping the container during initialization is a "failed to initialize" situation and exit 1 (more compelling with better PGDATA handling in that case, as discussed/described in #159).

There's not really something uncontroversially "sane" we could do if we get a request to stop/kill during that initialization process. 🤔

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants