
PAF with PostgreSQL 12 & no recovery.conf #156

Closed
ghost opened this issue Oct 31, 2019 · 14 comments

@ghost

ghost commented Oct 31, 2019

Hi,

I'm looking for information on how to use PostgreSQL 12 with PAF/Pacemaker. PostgreSQL 12 no longer uses recovery.conf, which makes PAF's recovery_template option seem pointless. How can PAF be configured to work with PostgreSQL 12, or is this not yet possible?

@ghost
Author

ghost commented Oct 31, 2019

I got it working with this modification to the resource agent script:

--- pgsqlms.orig	2019-10-31 17:40:25.522235840 +0000
+++ pgsqlms	2019-10-31 22:00:12.065127673 +0000
@@ -86,6 +86,7 @@
 # numeric pgsql versions
 my $PGVERNUM;
 my $PGVER_10  = 100000;
+my $PGVER_12  = 120000;
 
 # Run a query using psql.
 #
@@ -576,6 +577,22 @@
     };
 }
 
+sub _create_standby_signal {
+    my $fh;
+    my (undef, undef, $uid, $gid) = getpwnam ($system_user);
+    my $standby_signal_file = "$datadir/standby.signal";
+    ocf_log ('debug', '_create_standby_signal: create file "%s"', $standby_signal_file);
+    unless (open ($fh, '>', $standby_signal_file)) {
+        ocf_exit_reason ('Could not open file "%s": %s', $standby_signal_file, $!);
+        exit $OCF_ERR_CONFIGURED;
+    }
+    close $fh;
+    unless (chown $uid, $gid, $standby_signal_file) {
+        ocf_exit_reason ('Could not set owner of "%s"', $standby_signal_file);
+        exit $OCF_ERR_CONFIGURED;
+    };
+}
+
 # Parse and return various informations about the local PostgreSQL instance as
 # reported by its controldata file.
 #
@@ -1308,49 +1325,11 @@
         exit $OCF_ERR_ARGS;
     }
 
-    # check recovery template
-    if ( ! -f $recovery_tpl ) {
-        ocf_exit_reason( 'Recovery template file "%s" does not exist',
-            $recovery_tpl );
-        exit $OCF_ERR_ARGS;
-    }
-
-    # check content of the recovery template file
-    unless ( open( $fh, '<', $recovery_tpl ) ) {
-        ocf_exit_reason( 'Could not open file "%s": %s', $recovery_tpl, $! );
-        exit $OCF_ERR_ARGS;
-    }
-    @content = <$fh>;
-    close $fh;
-
     unless ( looks_like_number($maxlag) ) {
         ocf_exit_reason( 'maxlag is not a number: "%s"', $maxlag );
         exit $OCF_ERR_INSTALLED;
     }
 
-    unless ( grep /^\s*standby_mode\s*=\s*'?on'?\s*$/, @content ) {
-        ocf_exit_reason(
-            'Recovery template file must contain "standby_mode = on"' );
-        exit $OCF_ERR_ARGS;
-    }
-
-    unless ( grep /^\s*recovery_target_timeline\s*=\s*'?latest'?\s*$/, @content ) {
-        ocf_exit_reason(
-            "Recovery template file must contain \"recovery_target_timeline = 'latest'\""
-        );
-        exit $OCF_ERR_ARGS;
-    }
-
-    unless (
-        grep /^\s*primary_conninfo\s*=.*['\s]application_name=$nodename['\s]/,
-        @content
-    ) {
-        ocf_exit_reason(
-            'Recovery template file must contain in primary_conninfo parameter "application_name=%s"',
-            $nodename );
-        exit $OCF_ERR_ARGS;
-    }
-
     # check system user
     unless ( defined getpwnam $system_user ) {
         ocf_exit_reason( 'System user "%s" does not exist', $system_user );
@@ -1380,6 +1359,46 @@
         exit $OCF_ERR_INSTALLED;
     }
 
+    if ($PGVERNUM < $PGVER_12) {
+        # check recovery template
+        if ( ! -f $recovery_tpl ) {
+            ocf_exit_reason( 'Recovery template file "%s" does not exist',
+                $recovery_tpl );
+            exit $OCF_ERR_ARGS;
+        }
+
+        # check content of the recovery template file
+        unless ( open( $fh, '<', $recovery_tpl ) ) {
+            ocf_exit_reason( 'Could not open file "%s": %s', $recovery_tpl, $! );
+            exit $OCF_ERR_ARGS;
+        }
+        @content = <$fh>;
+        close $fh;
+
+        unless ( grep /^\s*standby_mode\s*=\s*'?on'?\s*$/, @content ) {
+            ocf_exit_reason(
+                'Recovery template file must contain "standby_mode = on"' );
+            exit $OCF_ERR_ARGS;
+        }
+
+        unless ( grep /^\s*recovery_target_timeline\s*=\s*'?latest'?\s*$/, @content ) {
+            ocf_exit_reason(
+                "Recovery template file must contain \"recovery_target_timeline = 'latest'\""
+            );
+            exit $OCF_ERR_ARGS;
+        }
+
+        unless (
+            grep /^\s*primary_conninfo\s*=.*['\s]application_name=$nodename['\s]/,
+            @content
+        ) {
+            ocf_exit_reason(
+                'Recovery template file must contain in primary_conninfo parameter "application_name=%s"',
+                $nodename );
+            exit $OCF_ERR_ARGS;
+        }
+    }
+
     $PGWALDUMP = "$bindir/pg_xlogdump" if $PGVERNUM < $PGVER_10;
 
     # check binaries
@@ -1432,8 +1451,12 @@
         'pgsql_start: instance "%s" is not running, starting it as a secondary',
         $OCF_RESOURCE_INSTANCE );
 
-    # Create recovery.conf from the template file.
-    _create_recovery_conf();
+    if ( $PGVERNUM < $PGVER_12 ) {
+        # Create recovery.conf from the template file.
+        _create_recovery_conf ();
+    } else {
+        _create_standby_signal ();
+    }
 
     # Start the instance as a secondary.
     $rc = _pg_ctl_start();
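
In plain terms, the patch makes PAF follow the new PostgreSQL 12 layout: the settings that used to live in recovery.conf move into postgresql.conf/postgresql.auto.conf, and an empty standby.signal file replaces "standby_mode = on". A rough sketch of the equivalent manual setup (paths and connection values are illustrative, not taken from the patch):

```shell
# Sketch of the PostgreSQL 12 standby layout the patch produces
# (illustrative paths; in practice run as the postgres system user)
PGDATA=$(mktemp -d)    # stand-in for e.g. /var/lib/pgsql/12/data

# Settings formerly in recovery.conf now go into postgresql(.auto).conf
cat >> "$PGDATA/postgresql.auto.conf" <<'EOF'
primary_conninfo = 'host=10.20.30.5 application_name=srv1'
recovery_target_timeline = 'latest'
EOF

# "standby_mode = on" is replaced by the mere presence of this empty file,
# which is what the new _create_standby_signal() creates before start
touch "$PGDATA/standby.signal"
```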

@loktissimo

It would be great to add this change to the main branch. I'm setting up PostgreSQL 12 and can test it.

@ioguix
Member

ioguix commented Nov 3, 2019

Hi guys,

I am working on this and should publish the release candidate this week.

@caseyallenshobe, as I already have a patch with some more cleanup and polishing on it, I'll keep mine. But I do appreciate the time you invested, thank you. As far as I can see, you can keep your patch in place until the official release; it looks safe.

@bdonfouet

bdonfouet commented Dec 19, 2019

Hello,
I'm testing 2.3_rc1 with PostgreSQL 12 on CentOS, but I can't get it to work.

output of pcs status command:

Cluster name: db_cluster
Stack: corosync
Current DC: master (version 1.1.20-5.el7-3c4c782f70) - partition with quorum
Last updated: Thu Dec 19 12:51:36 2019
Last change: Thu Dec 19 12:11:23 2019 by root via cibadmin on master

2 nodes configured
5 resources configured

Online: [ master slave ]

Full list of resources:

 Master/Slave Set: pgsql-ha [pgsqld]
     pgsqld	(ocf::heartbeat:pgsqlms):	FAILED master (blocked)
     pgsqld	(ocf::heartbeat:pgsqlms):	FAILED slave (blocked)
 pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Stopped
 slavedb_fence	(stonith:fence_vmware_soap):	Started master
 masterdb_fence	(stonith:fence_vmware_soap):	Started slave

Failed Resource Actions:
* pgsqld_stop_0 on master 'invalid parameter' (2): call=36, status=complete, exitreason='warning',
    last-rc-change='Thu Dec 19 12:23:30 2019', queued=1ms, exec=168ms
* pgsqld_stop_0 on slave 'invalid parameter' (2): call=36, status=complete, exitreason='warning',
    last-rc-change='Thu Dec 19 12:23:30 2019', queued=0ms, exec=163ms

Daemon Status:
  corosync: active/disabled
  pacemaker: active/disabled
  pcsd: active/enabled

output of pcs config --all command

Corosync Nodes:
 master slave
Pacemaker Nodes:
 master slave

Resources:
 Master: pgsql-ha
  Meta Attrs: notify=true
  Resource: pgsqld (class=ocf provider=heartbeat type=pgsqlms)
   Attributes: bindir=/usr/pgsql-12/bin pgdata=/var/lib/pgsql/12/data
   Operations: demote interval=0s timeout=120s (pgsqld-demote-interval-0s)
               methods interval=0s timeout=5 (pgsqld-methods-interval-0s)
               monitor interval=15s role=Master timeout=10s (pgsqld-monitor-interval-15s)
               monitor interval=16s role=Slave timeout=10s (pgsqld-monitor-interval-16s)
               notify interval=0s timeout=60s (pgsqld-notify-interval-0s)
               promote interval=0s timeout=30s (pgsqld-promote-interval-0s)
               reload interval=0s timeout=20 (pgsqld-reload-interval-0s)
               start interval=0s timeout=60s (pgsqld-start-interval-0s)
               stop interval=0s timeout=60s (pgsqld-stop-interval-0s)
 Resource: pgsql-master-ip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: cidr_netmask=24 ip=xxxxxxx
  Operations: monitor interval=10s (pgsql-master-ip-monitor-interval-10s)
              start interval=0s timeout=20s (pgsql-master-ip-start-interval-0s)
              stop interval=0s timeout=20s (pgsql-master-ip-stop-interval-0s)

Stonith Devices:
 Resource: slavedb_fence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=xxxxxxx login=xxxxxx@vsphere.local passwd=xxxxxxx port=xxxxxx ssl_insecure=1
  Operations: monitor interval=60s (slavedb_fence-monitor-interval-60s)
 Resource: masterdb_fence (class=stonith type=fence_vmware_soap)
  Attributes: ipaddr=xxxxx login=xxxxxx@vsphere.local passwd=xxxxx port=xxxxx ssl_insecure=1
  Operations: monitor interval=60s (masterdb_fence-monitor-interval-60s)
Fencing Levels:

Location Constraints:
  Resource: masterdb_fence
    Disabled on: master (score:-INFINITY) (id:location-masterdb_fence-master--INFINITY)
  Resource: slavedb_fence
    Disabled on: slave (score:-INFINITY) (id:location-slavedb_fence-slave--INFINITY)
Ordering Constraints:
  promote pgsql-ha then start pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory)
  demote pgsql-ha then stop pgsql-master-ip (kind:Mandatory) (non-symmetrical) (id:order-pgsql-ha-pgsql-master-ip-Mandatory-1)
Colocation Constraints:
  pgsql-master-ip with pgsql-ha (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master) (id:colocation-pgsql-master-ip-pgsql-ha-INFINITY)
Ticket Constraints:

Alerts:
 No alerts defined

Resources Defaults:
 No defaults set
Operations Defaults:
 No defaults set

Cluster Properties:
 cluster-infrastructure: corosync
 cluster-name: db_cluster
 dc-version: 1.1.20-5.el7-3c4c782f70
 have-watchdog: false
 last-lrm-refresh: 1576623829
 no-quorum-policy: ignore
 stonith-enabled: false

Quorum:
  Options:

I can provide more info on the config if needed, but since I'm really new to the Pacemaker/Corosync world, I will need your input.

Regards

@ioguix
Member

ioguix commented Dec 20, 2019

Hello @bdonfouet,

There are various places where pgsqlms returns $OCF_ERR_ARGS. Among them, the most common mistakes are:

  • wrong pgdata in the pgsqlms setup
  • wrong bindir in the pgsqlms setup
  • wrong wal_level in the postgresql setup
  • wrong primary_conninfo in the postgresql setup: it must contain application_name=$hostname
  • a recovery template that still exists with PostgreSQL v12 (make sure it does not)

But I'm quite surprised there's no explicit exitreason message; we always set it before exiting on error...

You could also share specific logs. If the cluster has never started, errors/warnings around the pattern pgsqld_monitor_0 should be interesting. Since I suspect something might be wrong in the RA setup, it probably fails on basic checks during the very first probe action.

@bdonfouet

Hi @ioguix ,
And thanks for your answer. Do the node names need to be the hostnames of the servers?
I will check my setup when I am back in front of my computer.
To easily remember "who is who", I used node names different from the hostnames, and I referenced these node names in the /etc/hosts file. Is that my mistake?

Thanks

@ioguix
Member

ioguix commented Dec 24, 2019

Do the node names need to be the hostnames of the servers?
To easily remember "who is who", I used node names different from the hostnames, and I referenced these node names in the /etc/hosts file. Is that my mistake?

You are not forced to use the hostnames in Corosync/Pacemaker, but it makes things much easier to set up and removes some traps.

From PAF's point of view, the application_name in primary_conninfo MUST be set to the local node name as defined in Corosync/Pacemaker.
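
For example, if `crm_node --list` reports the local node as srv1, the standby's configuration must carry exactly that name (the host address and node name here are illustrative):

```
# postgresql.conf (or an included file) on the node named "srv1"
primary_conninfo = 'host=10.20.30.5 application_name=srv1'
```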

@zeroone

zeroone commented Dec 29, 2019

Hi @ioguix ,
There seem to be some problems around the code block below in the pgsqlms file of v2.3 rc1.

$guc = qx{ $POSTGRES -C primary_conninfo -D "$pgdata" $start_opts};
unless ($guc =~ /['\s]application_name=$nodename['\s]/) {
    ocf_exit_reason( 'warning',
       q{Parameter "primary_conninfo" MUST contain 'application_name=%s'. }.
       q{It is currently set to '%s'}, $nodename, $guc );
        return $OCF_ERR_ARGS;
}
  1. The function is ocf_exit_reason, not ocf_log, so the 'warning' parameter is not needed. As a result, the message in the log file is just 'warning', with no error details.
  2. The regex /['\s]application_name=$nodename['\s]/ does not seem to match a hostname like 'ha1.example.com' correctly, so such a hostname never passes this check. Also, application_name at the beginning of primary_conninfo's value versus at the end produces different results.

I just commented out the code block above, and everything works fine with PostgreSQL v12 on CentOS 7.

I am not familiar with Perl; this is just some info for you.
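
The position-dependent behaviour can be reproduced outside the agent. The character class before application_name is what breaks: it requires a quote or whitespace in front, which never exists when application_name is the very first parameter of the unquoted `postgres -C primary_conninfo` output. A sketch with illustrative values, using an ERE that mirrors the Perl regex:

```shell
nodename=ha1.ftrans.local
guc="application_name=$nodename host=192.168.1.173 user=replication"

# ['[:space:]] (Perl: ['\s]) demands a character *before* application_name,
# so the match fails when application_name starts the unquoted string
echo "$guc" | grep -Eq "['[:space:]]application_name=$nodename['[:space:]]" \
  && echo match || echo "no match"   # prints "no match"

# Also accepting start-of-string makes the check pass again
echo "$guc" | grep -Eq "(^|['[:space:]])application_name=$nodename['[:space:]]" \
  && echo match                      # prints "match"
```

The actual fix committed to PAF may differ; this only illustrates the failure mode.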

@ioguix
Member

ioguix commented Dec 30, 2019

Hi @zeroone,

  1. The function is ocf_exit_reason, not ocf_log, so the 'warning' parameter is not needed

Good catch. Thank you!

  2. The regex /['\s]application_name=$nodename['\s]/ does not seem to match a hostname like 'ha1.example.com' correctly[...]

I haven't been able to reproduce. I tested on clusters using:

  • cluster 1:
    • hostnames srv1, srv2 and srv3
    • nodenames srv1.example.com, srv2.example.com and srv3.example.com
  • cluster 2:
    • hostnames srv1.example.com, srv2.example.com and srv3.example.com
    • nodenames srv1.example.com, srv2.example.com and srv3.example.com

After a very quick look at them, they were starting and doing fine.

E.g.:

[root@srv1 ~]# hostname
srv1.example.com

[root@srv1 ~]# crm_mon -DnA1

Node srv1.example.com: online
	fence_vm_srv2.example.com	(stonith:fence_virsh):	Started
	fence_vm_srv3.example.com	(stonith:fence_virsh):	Started
	pgsqld	(ocf::heartbeat:pgsqlms):	Master
	pgsql-master-ip	(ocf::heartbeat:IPaddr2):	Started
Node srv2.example.com: online
	fence_vm_srv1.example.com	(stonith:fence_virsh):	Started
	pgsqld	(ocf::heartbeat:pgsqlms):	Slave
Node srv3.example.com: online
	pgsqld	(ocf::heartbeat:pgsqlms):	Slave

Node Attributes:
* Node srv1.example.com:
    + master-pgsqld                   	: 1001      
* Node srv2.example.com:
    + master-pgsqld                   	: 1000      
* Node srv3.example.com:
    + master-pgsqld                   	: 990       

[root@srv1 ~]# cat /var/lib/pgsql/12/data/conf.d/repli.conf 
primary_conninfo = 'host=10.20.30.5 application_name=srv1.example.com'

[root@srv1 ~]# /usr/pgsql-12/bin/postgres -C primary_conninfo
host=10.20.30.5 application_name=srv1.example.com

Could you report:

  • what is your hostname from the system point of view? hostname
  • what is your nodename from Pacemaker/Corosync point of view? crm_node --list
  • after fixing ocf_exit_reason, what is the actual error message?

Thank you for your report!

ioguix added a commit that referenced this issue Dec 30, 2019
Thanks to ZeroOne on github for catching and reporting this in issue #156.
@zeroone

zeroone commented Dec 31, 2019

@ioguix , thanks for your reply.

[ha1]$ hostname

ha1.ftrans.local
[ha1]$ sudo crm_node --list

1 ha1.ftrans.local member
2 ha2.ftrans.local member
[ha1]$ sudo vi /var/lib/pgsql/12/data/postgresql.auto.conf

listen_addresses = '*'
wal_keep_segments = '32'
hot_standby_feedback = 'on'
primary_conninfo = 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any'

When using the pg_basebackup command from v12 with the -R option, it automatically writes the long list of default parameters into primary_conninfo's value.

[ha1]$ sudo -i -u postgres /usr/pgsql-12/bin/postgres -C primary_conninfo
application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any

With these configs, the errors are:

$ sudo pcs status

Failed Resource Actions:

* pgsqld_stop_0 on ha1.ftrans.local 'invalid parameter' (2): call=12, status=complete, exitreason='Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftra',
  last-rc-change='Tue Dec 31 09:49:31 2019', queued=0ms, exec=145ms
$ sudo vi /var/log/cluster/corosync.log

Dec 31 09:57:43 [85185] ha1.ftrans.local crm_resource:    error: unpack_rsc_op: Preventing pgsql-ha from re-starting on ha1.ftrans.local: operation monitor failed 'invalid parameter' (2)
pgsqlms(pgsqld)[85177]: Dec 31 09:57:43  ERROR: Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any
'
Dec 31 09:57:43 [85068] ha1.ftrans.local       lrmd:   notice: operation_finished:      pgsqld_stop_0:85177:stderr [ ocf-exit-reason:Parameter "primary_conninfo" MUST contain 'application_name=ha1.ftrans.local'. It is currently set to 'application_name=ha1.ftrans.local host=192.168.1.173 user=replication password=PASS port=5432 sslmode=prefer sslcompression=0 gssencmode=prefer krbsrvname=postgres target_session_attrs=any ]
Dec 31 09:57:43 [85068] ha1.ftrans.local       lrmd:   notice: operation_finished:      pgsqld_stop_0:85177:stderr [ ' ]

Thanks a lot.

ioguix added a commit that referenced this issue Dec 31, 2019
The old regex from the recovery.conf era (pre-v12 versions) was kept, while the output
format we parse is slightly different in v12. Since v12, we parse the output of
"postgres -C primary_conninfo", where there are no quotes around the whole value.
Because of this, the check reported a wrong error if application_name was the very
first parameter appearing in primary_conninfo.

Thanks to @zeroone on github for testing and reporting this bug with details in
issue #156.
@ioguix
Member

ioguix commented Dec 31, 2019

Hi @zeroone,

OK, I've been able to reproduce it when setting application_name as the very first parameter in primary_conninfo. I've just pushed a fix to the master branch.

Please, can you test it on your side and confirm it fixes your issue?

Thanks again,

@zeroone

zeroone commented Jan 1, 2020

@ioguix , now everything is working fine, thank you very much for your work!

@ioguix
Member

ioguix commented Jan 1, 2020

Great! And thank you for your test and report!

@ioguix
Member

ioguix commented Mar 9, 2020

PAF v2.3 has been released. I can now close this issue.

@ioguix ioguix closed this as completed Mar 9, 2020