
PAF with PostgreSQL 12 & no recovery.conf #156

Closed
ghost opened this issue Oct 31, 2019 · 14 comments

@ghost

ghost commented Oct 31, 2019

Hi,

I'm looking for information on how to use PostgreSQL 12 with PAF/Pacemaker. PostgreSQL 12 no longer uses recovery.conf, which makes PAF's recovery_template option seem pointless. How can PAF be configured to work with PostgreSQL 12, or is this not yet possible?

@ghost
Author

ghost commented Oct 31, 2019

I got it working with this modification to the resource agent script:

--- pgsqlms.orig	2019-10-31 17:40:25.522235840 +0000
+++ pgsqlms	2019-10-31 22:00:12.065127673 +0000
@@ -86,6 +86,7 @@
 # numeric pgsql versions
 my $PGVERNUM;
 my $PGVER_10  = 100000;
+my $PGVER_12  = 120000;
 
 # Run a query using psql.
 #
@@ -576,6 +577,22 @@
     };
 }
 
+sub _create_standby_signal {
+    my $fh;
+    my (undef, undef, $uid, $gid) = getpwnam ($system_user);
+    my $standby_signal_file = "$datadir/standby.signal";
+    ocf_log ('debug', '_create_standby_signal: create file "%s"', $standby_signal_file);
+    unless (open ($fh, '>', $standby_signal_file)) {
+        ocf_exit_reason ('Could not open file "%s": %s', $standby_signal_file, $!);
+        exit $OCF_ERR_CONFIGURED;
+    }
+    close $fh;
+    unless (chown $uid, $gid, $standby_signal_file) {
+        ocf_exit_reason ('Could not set owner of "%s"', $standby_signal_file);
+        exit $OCF_ERR_CONFIGURED;
+    };
+}
+
 # Parse and return various informations about the local PostgreSQL instance as
 # reported by its controldata file.
 #
@@ -1308,49 +1325,11 @@
         exit $OCF_ERR_ARGS;
     }
 
-    # check recovery template
-    if ( ! -f $recovery_tpl ) {
-        ocf_exit_reason( 'Recovery template file "%s" does not exist',
-            $recovery_tpl );
-        exit $OCF_ERR_ARGS;
-    }
-
-    # check content of the recovery template file
-    unless ( open( $fh, '<', $recovery_tpl ) ) {
-        ocf_exit_reason( 'Could not open file "%s": %s', $recovery_tpl, $! );
-        exit $OCF_ERR_ARGS;
-    }
-    @content = <$fh>;
-    close $fh;
-
     unless ( looks_like_number($maxlag) ) {
         ocf_exit_reason( 'maxlag is not a number: "%s"', $maxlag );
         exit $OCF_ERR_INSTALLED;
     }
 
-    unless ( grep /^\s*standby_mode\s*=\s*'?on'?\s*$/, @content ) {
-        ocf_exit_reason(
-            'Recovery template file must contain "standby_mode = on"' );
-        exit $OCF_ERR_ARGS;
-    }
-
-    unless ( grep /^\s*recovery_target_timeline\s*=\s*'?latest'?\s*$/, @content ) {
-        ocf_exit_reason(
-            "Recovery template file must contain \"recovery_target_timeline = 'latest'\""
-        );
-        exit $OCF_ERR_ARGS;
-    }
-
-    unless (
-        grep /^\s*primary_conninfo\s*=.*['\s]application_name=$nodename['\s]/,
-        @content
-    ) {
-        ocf_exit_reason(
-            'Recovery template file must contain in primary_conninfo parameter "application_name=%s"',
-            $nodename );
-        exit $OCF_ERR_ARGS;
-    }
-
     # check system user
     unless ( defined getpwnam $system_user ) {
         ocf_exit_reason( 'System user "%s" does not exist', $system_user );
@@ -1380,6 +1359,46 @@
         exit $OCF_ERR_INSTALLED;
     }
 
+    if ($PGVERNUM < $PGVER_12) {
+        # check recovery template
+        if ( ! -f $recovery_tpl ) {
+            ocf_exit_reason( 'Recovery template file "%s" does not exist',
+                $recovery_tpl );
+            exit $OCF_ERR_ARGS;
+        }
+
+        # check content of the recovery template file
+        unless ( open( $fh, '<', $recovery_tpl ) ) {
+            ocf_exit_reason( 'Could not open file "%s": %s', $recovery_tpl, $! );
+            exit $OCF_ERR_ARGS;
+        }
+        @content = <$fh>;
+        close $fh;
+
+        unless ( grep /^\s*standby_mode\s*=\s*'?on'?\s*$/, @content ) {
+            ocf_exit_reason(
+                'Recovery template file must contain "standby_mode = on"' );
+            exit $OCF_ERR_ARGS;
+        }
+
+        unless ( grep /^\s*recovery_target_timeline\s*=\s*'?latest'?\s*$/, @content ) {
+            ocf_exit_reason(
+                "Recovery template file must contain \"recovery_target_timeline = 'latest'\""
+            );
+            exit $OCF_ERR_ARGS;
+        }
+
+        unless (
+            grep /^\s*primary_conninfo\s*=.*['\s]application_name=$nodename['\s]/,
+            @content
+        ) {
+            ocf_exit_reason(
+                'Recovery template file must contain in primary_conninfo parameter "application_name=%s"',
+                $nodename );
+            exit $OCF_ERR_ARGS;
+        }
+    }
+
     $PGWALDUMP = "$bindir/pg_xlogdump" if $PGVERNUM < $PGVER_10;
 
     # check binaries
@@ -1432,8 +1451,12 @@
         'pgsql_start: instance "%s" is not running, starting it as a secondary',
         $OCF_RESOURCE_INSTANCE );
 
-    # Create recovery.conf from the template file.
-    _create_recovery_conf();
+    if ( $PGVERNUM < $PGVER_12 ) {
+        # Create recovery.conf from the template file.
+        _create_recovery_conf ();
+    } else {
+        _create_standby_signal ();
+    }
 
     # Start the instance as a secondary.
     $rc = _pg_ctl_start();
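
In plain terms, the patch makes PAF follow the new PostgreSQL 12 layout: the settings that used to live in recovery.conf move into postgresql.conf/postgresql.auto.conf, and an empty standby.signal file replaces "standby_mode = on". A rough sketch of the equivalent manual setup (paths and connection values are illustrative, not taken from the patch):

```shell
# Sketch of the PostgreSQL 12 standby layout the patch produces
# (illustrative paths; in practice run as the postgres system user)
PGDATA=$(mktemp -d)    # stand-in for e.g. /var/lib/pgsql/12/data

# Settings formerly in recovery.conf now go into postgresql(.auto).conf
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
primary_conninfo = 'host=10.20.30.5 application_name=srv1'
recovery_target_timeline = 'latest'
EOF

# "standby_mode = on" is replaced by the mere presence of this empty file,
# which is what the new _create_standby_signal() creates before start
touch "$PGDATA/standby.signal"
```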

@loktissimo

It would be great to add this change to the main branch. I'm setting up PostgreSQL 12 and can test it.

@ioguix
Member

ioguix commented Nov 3, 2019

Hi guys,

I am working on this and should publish the release candidate this week.

@caseyallenshobe, as I already have a patch with some more cleanup and polishing on it, I'll keep mine. But I do appreciate the time you invested, thank you. As far as I can see, you can keep your patch in place until the official release; it looks safe.

@bdonfouet

bdonfouet commented Dec 19, 2019

Hello,
I'm testing 2.3_rc1 with PostgreSQL 12 on CentOS, but I can't get it to work.

output of pcs status command:

Cluster name: db_cluster
Stack: corosync
Current DC: master (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Thu Dec 19 12:51:36 2019
Last change: Thu Dec 19 12:11:23 2019 by root via cibadmin on master

2 nodes configured
5 resources configured

Online: [ master slave ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld	(ocf::heartbeat:pgsqlms):	FAILED master (blocked)
     pgsqld	(ocf::heartbeat:pgsqlms):	FAILED slave (blocked)
 pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Stopped
 slavedb_fence	(stonith:fence_vmware_soap):	Started master
 masterdb_fence	(stonith:fence_vmware_soap):	Started slave

Failed Resource Actions:
* pgsqld_stop_0 on master 'invalid parameter' (2): call=36, status=complete, exitreason='warning',
    last-rc-change='Thu Dec 19 12:23:30 2019', queued=1ms, exec=168ms
* pgsqld_stop_0 on slave 'invalid parameter' (2): call=36, status=complete, exitreason='warning',
    last-rc-change='Thu Dec 19 12:23:30 2019', queued=0ms, exec=163ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

output of pcs config --all command

Corosync Nodes:
 master slave
Pacemaker Nodes:
 master slave

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-12/bin pgdata=/var/lib/pgsql/12/data
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=xxxxxxx
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
 Resource: slavedb_fence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=xxxxxxx login=xxxxxx@vsphere.local passwd=xxxxxxx port=xxxxxx ssl_insecure=1
  Operations: monitor interval=60s (slavedb_fence-monitor-interval-60s)
 Resource: masterdb_fence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=xxxxx login=xxxxxx@vsphere.local passwd=xxxxx port=xxxxx ssl_insecure=1
  Operations: monitor interval=60s (masterdb_fence-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: masterdb_fence
    Disabled on: master (score:-INFINITY) (id:location-masterdb_fence-master--INFINITY)
  Resource: slavedb_fence
    Disabled on: slave (score:-INFINITY) (id:location-slavedb_fence-slave--INFINITY)
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: db_cluster
 dc-version: 1.1.20-5.el7-3c4c782f70
 have-watchdog: false
 last-lrm-refresh: 1576623829
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

I can provide more info on the config if needed, but since I'm really new to the Pacemaker/Corosync world, I will need your input.

Regards

@ioguix
Member

ioguix commented Dec 20, 2019

Hello @bdonfouet,

There are various places where pgsqlms returns $OCF_ERR_ARGS. Among them, the most common mistakes are:

  • wrong pgdata in the pgsqlms setup
  • wrong bindir in the pgsqlms setup
  • wrong wal_level in the postgresql setup
  • wrong primary_conninfo in the postgresql setup: it must contain application_name=$hostname
  • a recovery template that still exists with PostgreSQL v12 (make sure it does not)

But I'm quite surprised there's no explicit exitreason message; we always set it before exiting on error...

You could also share specific logs. If the cluster has never started, errors/warnings around the pattern pgsqld_monitor_0 should be interesting. Since I suspect something might be wrong in the RA setup, it probably fails on basic checks during the very first probe action.

@bdonfouet

Hi @ioguix ,
And thanks for your answer. Do the node names need to be the hostnames of the servers?
I will check my setup when I am back in front of my computer.
To easily remember "who is who", I used node names different from the hostnames, and I referenced these node names in the /etc/hosts file. Is that my mistake?

Thanks

@ioguix
Member

ioguix commented Dec 24, 2019

Do the node names need to be the hostnames of the servers?
To easily remember "who is who", I used node names different from the hostnames, and I referenced these node names in the /etc/hosts file. Is that my mistake?

You are not forced to use the hostnames in Corosync/Pacemaker, but it makes things much easier to set up and removes some traps.

From PAF's point of view, the application_name in primary_conninfo MUST be set to the local node name as defined in Corosync/Pacemaker.
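
For example, if `crm_node --list` reports the local node as srv1, the standby's configuration must carry exactly that name (the host address and node name here are illustrative):

```
# postgresql.conf (or an included file) on the node named "srv1"
primary_conninfo = 'host=10.20.30.5 application_name=srv1'
```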

@zeroone

zeroone commented Dec 29, 2019

Hi @ioguix ,
There seem to be some problems around the code block below in the pgsqlms file of v2.3 rc1.

$guc = qx{ $POSTGRES -C primary_conninfo -D "$pgdata" $start_opts};
unless ($guc =~ /['\s]application_name=$nodename['\s]/) {
    ocf_exit_reason( 'warning',
       q{Parameter "primary_conninfo" MUST contain 'application_name=%s'. }.
       q{It is currently set to '%s'}, $nodename, $guc );
        return $OCF_ERR_ARGS;
}
  1. The function is ocf_exit_reason, not ocf_log, so the 'warning' parameter is not needed. As a result, the message in the log file is just 'warning', with no error details.
  2. The regex /['\s]application_name=$nodename['\s]/ does not seem to match a hostname like 'ha1.example.com' correctly, so such a hostname never passes this check. Also, application_name at the beginning of primary_conninfo's value versus at the end produces different results.

I just commented out the code block above, and everything works fine with PostgreSQL v12 on CentOS 7.

I am not familiar with Perl; this is just some info for you.
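
The position-dependent behaviour can be reproduced outside the agent. The character class before application_name is what breaks: it requires a quote or whitespace in front, which never exists when application_name is the very first parameter of the unquoted `postgres -C primary_conninfo` output. A sketch with illustrative values, using an ERE that mirrors the Perl regex:

```shell
nodename=ha1.ftrans.local
guc="application_name=$nodename host=192.168.1.173 user=replication"

# ['[:space:]] (Perl: ['\s]) demands a character *before* application_name,
# so the match fails when application_name starts the unquoted string
echo "$guc" | grep -Eq "['[:space:]]application_name=$nodename['[:space:]]" \
  && echo match || echo "no match"   # prints "no match"

# Also accepting start-of-string makes the check pass again
echo "$guc" | grep -Eq "(^|['[:space:]])application_name=$nodename['[:space:]]" \
  && echo match                      # prints "match"
```

The actual fix committed to PAF may differ; this only illustrates the failure mode.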

@ioguix
Member

ioguix commented Dec 30, 2019

Hi @zeroone,

  1. The function is ocf_exit_reason, not ocf_log, so the 'warning' parameter is not needed

Good catch. Thank you!

  2. The regex /['\s]application_name=$nodename['\s]/ does not seem to match a hostname like 'ha1.example.com' correctly[...]

I haven't been able to reproduce. I tested on clusters using:

  • cluster 1:
    • hostnames srv1, srv2 and srv3
    • nodenames srv1.example.com, srv2.example.com and srv3.example.com
  • cluster 2:
    • hostnames srv1.example.com, srv2.example.com and srv3.example.com
    • nodenames srv1.example.com, srv2.example.com and srv3.example.com

After a very quick look at them, they were starting and doing fine.

E.g.:

[root@srv1 ~]# hostname
srv1.example.com

[root@srv1 ~]# crm_mon -DnA1

Node srv1.example.com: online
	fence_vm_srv2.example.com	(stonith:fence_virsh):	Started
	fence_vm_srv3.example.com	(stonith:fence_virsh):	Started
	pgsqld	(ocf::heartbeat:pgsqlms):	Master
	pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Started
Node srv2.example.com: online
	fence_vm_srv1.example.com	(stonith:fence_virsh):	Started
	pgsqld	(ocf::heartbeat:pgsqlms):	Slave
Node srv3.example.com: online
	pgsqld	(ocf::heartbeat:pgsqlms):	Slave

Node Attributes:
* Node srv1.example.com:
    + master-pgsqld                   	: 1001      
* Node srv2.example.com:
    + master-pgsqld                   	: 1000      
* Node srv3.example.com:
    + master-pgsqld                   	: 990       

[root@srv1 ~]# cat /var/lib/pgsql/12/data/conf.d/repli.conf 
primary_conninfo = 'host=10.20.30.5 application_name=srv1.example.com'

[root@srv1 ~]# /usr/pgsql-12/bin/postgres -C primary_conninfo
host=10.20.30.5 application_name=srv1.example.com

Could you report:

  • what is your hostname from the system point of view? hostname
  • what is your nodename from Pacemaker/Corosync point of view? crm_node --list
  • after fixing ocf_exit_reason, what is the actual error message?

Thank you for your report!

ioguix added a commit that referenced this issue Dec 30, 2019
Thanks to ZeroOne on github for catching and reporting this in issue #156.
@zeroone

zeroone commented Dec 31, 2019

@ioguix , thanks for your reply.

[ha1]$ hostname

ha1.ftrans.local
[ha1]$ sudo crm_node --list

1 ha1.ftrans.local member
2 ha2.ftrans.local member
[ha1]$ sudo vi /var/lib/pgsql/12/data/postgresql.auto.conf

listen_addresses = '*'
wal_keep_segments = '32'
hot_standby_feedback = 'on'
primary_conninfo = 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'

When using the pg_basebackup command from v12 with the -R option, it automatically writes the long list of default parameters into primary_conninfo's value.

[ha1]$ sudo -i -u postgres /usr/pgsql-12/bin/postgres -C primary_conninfo
application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any

With these configs, the errors are:

$ sudo pcs status

Failed Resource Actions:

* pgsqld_stop_0 on ha1.ftrans.local 'invalid parameter' (2): call=12, status=complete, exitreason='Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftra',
  last-rc-change='Tue Dec 31 09:49:31 2019', queued=0ms, exec=145ms
$ sudo vi /var/log/cluster/corosync.log

Dec 31 09:57:43 [85185] ha1.ftrans.local crm_resource:    error: unpack_rsc_op: Preventing pgsql-ha from re-starting on ha1.ftrans.local: operation monitor failed 'invalid parameter' (2)
pgsqlms(pgsqld)[85177]: Dec 31 09:57:43  ERROR: Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
'
Dec 31 09:57:43 [85068] ha1.ftrans.local       lrmd:   notice: operation_finished:      pgsqld_stop_0:85177:stderr [ ocf-exit-reason:Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any ]
Dec 31 09:57:43 [85068] ha1.ftrans.local       lrmd:   notice: operation_finished:      pgsqld_stop_0:85177:stderr [ ' ]

Thanks a lot.

ioguix added a commit that referenced this issue Dec 31, 2019
The old regex from the recovery.conf era (pre-v12 versions) was kept, while the output
format we parse is slightly different in v12. Since v12, we parse the output of
"postgres -C primary_conninfo", where there are no quotes around the whole value.
Because of this, the check reported a wrong error if application_name was the very
first parameter appearing in primary_conninfo.

Thanks to @zeroone on github for testing and reporting this bug with details in
issue #156.
@ioguix
Member

ioguix commented Dec 31, 2019

Hi @zeroone,

OK, I've been able to reproduce it when setting application_name as the very first parameter in primary_conninfo. I've just pushed a fix to the master branch.

Please, can you test it on your side and confirm it fixes your issue?

Thanks again,

@zeroone

zeroone commented Jan 1, 2020

@ioguix , now everything is working fine, thank you very much for your work!

@ioguix
Member

ioguix commented Jan 1, 2020

Great! And thank you for your test and report!

@ioguix
Member

ioguix commented Mar 9, 2020

PAF v2.3 has been released. I can now close this issue.

@ioguix ioguix closed this as completed Mar 9, 2020