[SOLVED] [PAF] automatic failover postgresql / pacemaker / corosync

ruizsebastien · 27/01/2017 18:13:03

Hello,

I have been trying in vain to get a PostgreSQL streaming replication cluster working with PAF/pacemaker/corosync.
CentOS 7.3
Postgresql : 9.3.15
PAF : 2.0.0
corosync : corosync-2.4.0-4.el7.x86_64
pcs : 0.9.152
pacemaker : pacemaker-1.1.15-11.el7_3.2.x86_64
I followed the Dalibo documentation (http://dalibo.github.io/PAF/Quick_Start-CentOS-7.html).
I also added a key generation step: corosync-keygen.

Everything is fine until the resources are created:

pcs status --full
Cluster name: cluster_pgs
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pouj-pgsql-4.xxx.lan (2) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Fri Jan 27 11:10:02 2017 Last change: Fri Jan 27 11:09:12 2017 by hacluster via crmd on pouj-pgsql-4.xxx.lan
2 nodes and 0 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
No resources

Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
* Node pouj-pgsql-3.xxx.lan (1):
PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

After creating the resources, everything goes wrong:

pcs status --full
Cluster name: cluster_pgs
Stack: corosync
Current DC: pouj-pgsql-3.xxx.lan (1) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Fri Jan 27 17:03:57 2017 Last change: Fri Jan 27 16:22:41 2017 by root via cibadmin on pouj-pgsql-3.xxx.lan
2 nodes and 6 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_virsh): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_virsh): Started pouj-pgsql-3.xxx.lan
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-4.xxx.lan (blocked)
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-3.xxx.lan (blocked)
pgsqld (ocf::heartbeat:pgsqlms): Stopped
pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Fri Jan 27 13:04:58 2017'
* Node pouj-pgsql-3.xxx.lan (1):
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Fri Jan 27 13:04:58 2017'
Failed Actions:
* pgsqld_stop_0 on pouj-pgsql-4.xxx.lan 'not configured' (6): call=22, status=complete, exitreason='none',
last-rc-change='Fri Jan 27 13:04:58 2017', queued=0ms, exec=68ms
* pgsqld_stop_0 on pouj-pgsql-3.xxx.lan 'not configured' (6): call=22, status=complete, exitreason='none',
last-rc-change='Fri Jan 27 13:04:58 2017', queued=0ms, exec=72ms

PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

Here is my corosync.conf:

totem {
version: 2
secauth: off
cluster_name: cluster_pgs
transport: udpu
}
nodelist {
node {
ring0_addr: pouj-pgsql-3.xxx.lan
nodeid: 1
}
node {
ring0_addr: pouj-pgsql-4.xxx.lan
nodeid: 2
}
}
quorum {
provider: corosync_votequorum
two_node: 1
}
logging {
to_logfile: yes
logfile: /var/log/cluster/corosync.log
to_syslog: yes
}

and my XML configuration file:

<cib crm_feature_set="3.0.10" validate-with="pacemaker-2.5" epoch="25" num_updates="0" admin_epoch="0" cib-last-written="Fri Jan 27 13:03:00 2017" update-origin="pouj-pgsql-4.xxx.lan" update-client="crmd" update-user="hacluster" have-quorum="1" dc-uuid="2">
<configuration>
<crm_config>
<cluster_property_set id="cib-bootstrap-options">
<nvpair id="cib-bootstrap-options-have-watchdog" name="have-watchdog" value="false"/>
<nvpair id="cib-bootstrap-options-dc-version" name="dc-version" value="1.1.15-11.el7_3.2-e174ec8"/>
<nvpair id="cib-bootstrap-options-cluster-infrastructure" name="cluster-infrastructure" value="corosync"/>
<nvpair id="cib-bootstrap-options-cluster-name" name="cluster-name" value="cluster_pgs"/>
<nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
<nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
</cluster_property_set>
</crm_config>
<nodes>
<node id="1" uname="pouj-pgsql-3.xxx.lan"/>
<node id="2" uname="pouj-pgsql-4.xxx.lan"/>
</nodes>
<resources>
<primitive class="stonith" id="fence_vm_pouj-pgsql-3.xxx.lan" type="fence_virsh">
<instance_attributes id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes">
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="pouj-pgsql-3.xxx.lan"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-ipaddr" name="ipaddr" value="192.168.12.13"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-login" name="login" value="root"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-port" name="port" value="pouj-pgsql-3.xxx.lan-c7"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-action" name="action" value="off"/>
<nvpair id="fence_vm_pouj-pgsql-3.xxx.lan-instance_attributes-passwd" name="passwd" value="ultra_secret"/>
</instance_attributes>
<operations>
<op id="fence_vm_pouj-pgsql-3.xxx.lan-monitor-interval-60s" interval="60s" name="monitor"/>
</operations>
</primitive>
<primitive class="stonith" id="fence_vm_pouj-pgsql-4.xxx.lan" type="fence_virsh">
<instance_attributes id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes">
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-pcmk_host_check" name="pcmk_host_check" value="static-list"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-pcmk_host_list" name="pcmk_host_list" value="pouj-pgsql-4.xxx.lan"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-ipaddr" name="ipaddr" value="192.168.12.14"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-login" name="login" value="root"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-port" name="port" value="pouj-pgsql-4.xxx.lan-c7"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-action" name="action" value="off"/>
<nvpair id="fence_vm_pouj-pgsql-4.xxx.lan-instance_attributes-passwd" name="passwd" value="ultra_secret"/>
</instance_attributes>
<operations>
<op id="fence_vm_pouj-pgsql-4.xxx.lan-monitor-interval-60s" interval="60s" name="monitor"/>
</operations>
</primitive>
<master id="pgsql-ha">
<primitive class="ocf" id="pgsqld" provider="heartbeat" type="pgsqlms">
<instance_attributes id="pgsqld-instance_attributes">
<nvpair id="pgsqld-instance_attributes-bindir" name="bindir" value="/logiciels/postgres/product/9.3/bin"/>
<nvpair id="pgsqld-instance_attributes-pgdata" name="pgdata" value="/xxx/admin"/>
</instance_attributes>
<operations>
<op id="pgsqld-start-interval-0s" interval="0s" name="start" timeout="60s"/>
<op id="pgsqld-stop-interval-0s" interval="0s" name="stop" timeout="60s"/>
<op id="pgsqld-promote-interval-0s" interval="0s" name="promote" timeout="30s"/>
<op id="pgsqld-demote-interval-0s" interval="0s" name="demote" timeout="120s"/>
<op id="pgsqld-monitor-interval-15s" interval="15s" name="monitor" role="Master" timeout="10s"/>
<op id="pgsqld-monitor-interval-16s" interval="16s" name="monitor" role="Slave" timeout="10s"/>
<op id="pgsqld-notify-interval-0s" interval="0s" name="notify" timeout="60s"/>
</operations>
</primitive>
<meta_attributes id="pgsql-ha-meta_attributes">
<nvpair id="pgsql-ha-meta_attributes-master-node-max" name="master-node-max" value="1"/>
<nvpair id="pgsql-ha-meta_attributes-clone-max" name="clone-max" value="3"/>
<nvpair id="pgsql-ha-meta_attributes-notify" name="notify" value="true"/>
<nvpair id="pgsql-ha-meta_attributes-master-max" name="master-max" value="1"/>
<nvpair id="pgsql-ha-meta_attributes-clone-node-max" name="clone-node-max" value="1"/>
</meta_attributes>
</master>
<primitive class="ocf" id="pgsql-master-ip" provider="heartbeat" type="IPaddr2">
<instance_attributes id="pgsql-master-ip-instance_attributes">
<nvpair id="pgsql-master-ip-instance_attributes-ip" name="ip" value="192.168.12.15"/>
<nvpair id="pgsql-master-ip-instance_attributes-cidr_netmask" name="cidr_netmask" value="24"/>
<nvpair id="pgsql-master-ip-instance_attributes-nic" name="nic" value="eth0:1"/>
</instance_attributes>
<operations>
<op id="pgsql-master-ip-start-interval-0s" interval="0s" name="start" timeout="20s"/>
<op id="pgsql-master-ip-stop-interval-0s" interval="0s" name="stop" timeout="20s"/>
<op id="pgsql-master-ip-monitor-interval-10s" interval="10s" name="monitor"/>
</operations>
</primitive>
</resources>
<constraints>
<rsc_location id="location-fence_vm_pouj-pgsql-3.xxx.lan-pouj-pgsql-3.xxx.lan--INFINITY" node="pouj-pgsql-3.xxx.lan" rsc="fence_vm_pouj-pgsql-3.xxx.lan" score="-INFINITY"/>
<rsc_location id="location-fence_vm_pouj-pgsql-4.xxx.lan-pouj-pgsql-4.xxx.lan--INFINITY" node="pouj-pgsql-4.xxx.lan" rsc="fence_vm_pouj-pgsql-4.xxx.lan" score="-INFINITY"/>
<rsc_colocation id="colocation-pgsql-master-ip-pgsql-ha-INFINITY" rsc="pgsql-master-ip" rsc-role="Started" score="INFINITY" with-rsc="pgsql-ha" with-rsc-role="Master"/>
<rsc_order first="pgsql-ha" first-action="promote" id="order-pgsql-ha-pgsql-master-ip-mandatory" symmetrical="false" then="pgsql-master-ip" then-action="start"/>
<rsc_order first="pgsql-ha" first-action="demote" id="order-pgsql-ha-pgsql-master-ip-mandatory-1" symmetrical="false" then="pgsql-master-ip" then-action="stop"/>
</constraints>
<rsc_defaults>
<meta_attributes id="rsc_defaults-options">
<nvpair id="rsc_defaults-options-migration-threshold" name="migration-threshold" value="5"/>
<nvpair id="rsc_defaults-options-resource-stickiness" name="resource-stickiness" value="10"/>
</meta_attributes>
</rsc_defaults>
</configuration>
<status>
<node_state id="2" uname="pouj-pgsql-4.xxx.lan" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
<lrm id="2">
<lrm_resources/>
</lrm>
<transient_attributes id="2">
<instance_attributes id="status-2">
<nvpair id="status-2-shutdown" name="shutdown" value="0"/>
</instance_attributes>
</transient_attributes>
</node_state>
<node_state id="1" uname="pouj-pgsql-3.xxx.lan" in_ccm="true" crmd="online" crm-debug-origin="do_state_transition" join="member" expected="member">
<lrm id="1">
<lrm_resources/>
</lrm>
<transient_attributes id="1">
<instance_attributes id="status-1">
<nvpair id="status-1-shutdown" name="shutdown" value="0"/>
</instance_attributes>
</transient_attributes>
</node_state>
</status>
</cib>

(I deliberately masked some information with "xxx")

If anyone can give me a hand, that would be great.

Thanks in advance.

Last edited by ruizsebastien (20/03/2017 15:03:47)

ioguix · 29/01/2017 15:21:23

Hello,

First, an important point: once you have the whole setup working, and before you run your failover tests, enable fencing and quorum. Your CIB currently disables both:

<configuration>
    <crm_config>
      <cluster_property_set id="cib-bootstrap-options">
...
        <nvpair id="cib-bootstrap-options-stonith-enabled" name="stonith-enabled" value="false"/>
        <nvpair id="cib-bootstrap-options-no-quorum-policy" name="no-quorum-policy" value="ignore"/>
      </cluster_property_set>
    </crm_config>

As for the problem itself, given the situation, I would say there is an error or an omission in the configuration of the PostgreSQL instances, detected by PAF in the "pgsql_validate_all" function (in the file "/usr/lib/ocf/resource.d/heartbeat/pgsqlms").

I expect the Pacemaker logs state it explicitly. In the file "/var/log/cluster/corosync.log", look for all lines starting with "pgsqlms" around the time of these errors, 'Fri Jan 27 13:04:58 2017' (adapt the command to your environment):

grep "^pgsqlms" /var/log/cluster/corosync.log
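
To narrow that search down to a given minute, a small helper along these lines may help. It is only a sketch: the function name `pgsqlms_log_at` is made up for illustration, and it assumes the agent's log lines look like `pgsqlms(pgsqld)[PID]: YYYY/MM/DD_HH:MM:SS LEVEL: ...`.

```shell
#!/bin/sh
# Hypothetical helper (not part of PAF): print the pgsqlms lines of a
# Pacemaker log whose timestamp starts with the given prefix,
# e.g. "2017/01/27_13:04".
pgsqlms_log_at() {
    logfile=$1
    stamp=$2
    # Keep only lines emitted by the pgsqlms agent, then match the
    # timestamp prefix as a literal string (no regex interpretation).
    grep '^pgsqlms' "$logfile" | grep -F ": $stamp"
}

# Example call (path and time to adapt to your environment):
# pgsqlms_log_at /var/log/cluster/corosync.log "2017/01/27_13:04"
```

Passing the prefix down to the minute rather than the second keeps the lines just before and after the failure in view.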

Most often, the errors concern the recovery.conf template, for example a wrong "application_name".

Unrelated, but also check the contents of your pg_hba.conf to make sure the "reject" rules are present and in the right place.

ruizsebastien · 30/01/2017 10:33:55

Thank you very much for your answers, Jehan-Guillaume,

I will study this in detail and keep you posted.

Regarding quorum, the Pacemaker docs advise against enabling it in an architecture with only 2 servers.
That seems logical, since a quorum needs at least three members.

ioguix · 30/01/2017 11:26:35

Regarding quorum, the Pacemaker docs advise against enabling it in an architecture with only 2 servers

That is no longer the case with modern versions of Corosync. See the note about quorum on the following documentation page: http://clusterlabs.org/doc/en-US/Pacema … lover.html

This note explains that quorum handling is now possible in a two-node cluster thanks to the "two_node" option in the Corosync configuration. This parameter implicitly enables "wait_for_all".

Enabling quorum in a two-node cluster is useful in the following scenario, for example:

* both cluster nodes start, the CRM starts the PostgreSQL instances, performs a promotion, etc.
* after some time, one of the two nodes crashes and gets fenced
* thanks to the "two_node" parameter, the surviving node keeps the quorum and PostgreSQL can keep running on it
* when the second node restarts, if it cannot join the rest of the cluster (network problem or other), the "wait_for_all" parameter prevents it from acquiring the quorum and it cannot start any resource.
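
To check that these two options are really in effect, one possible sketch is to inspect the Flags line printed by `corosync-quorumtool -s`. The function name `quorum_flags_ok` is hypothetical, and the flag names `2Node` and `WaitForAll` are what I would expect that tool to print on Corosync 2.x; verify them against your own output.

```shell
#!/bin/sh
# Hypothetical helper: reads `corosync-quorumtool -s` output on stdin and
# succeeds only if the "Flags:" line lists both 2Node and WaitForAll.
# Assumption: the flags are space-separated words on a line starting
# with "Flags:".
quorum_flags_ok() {
    # Extract everything after "Flags:" (empty if the line is absent).
    flags=$(sed -n 's/^Flags:[[:space:]]*//p')
    # Pad with spaces so each flag is matched as a whole word.
    case " $flags " in *" 2Node "*) ;; *) return 1 ;; esac
    case " $flags " in *" WaitForAll "*) ;; *) return 1 ;; esac
    return 0
}

# Example call on a cluster node:
# corosync-quorumtool -s | quorum_flags_ok && echo "two_node and wait_for_all active"
```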

ruizsebastien · 30/01/2017 14:30:28

Some progress (fencing and quorum OK, virtual IP OK):

pcs status --full
Cluster name: cluster_xxx
Stack: corosync
Current DC: pouj-pgsql-4.xxx.lan (2) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Mon Jan 30 13:12:41 2017 Last change: Mon Jan 30 12:21:27 2017 by root via cibadmin on pouj-pgsql-3.xxx.lan
2 nodes and 6 resources configured
Node pouj-pgsql-3.xxx.lan (1): UNCLEAN (online)
Node pouj-pgsql-4.xxx.lan (2): UNCLEAN (online)
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_virsh): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_virsh): Started pouj-pgsql-3.xxx.lan
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-4.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-3.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Stopped
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started pouj-pgsql-4.xxx.lan
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Mon Jan 30 13:07:56 2017'
* Node pouj-pgsql-3.xxx.lan (1):
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Mon Jan 30 13:07:56 2017'
Failed Actions:
* pgsqld_stop_0 on pouj-pgsql-4.xxx.lan 'unknown error' (1): call=22, status=complete, exitreason='none',
last-rc-change='Mon Jan 30 13:07:56 2017', queued=0ms, exec=96ms
* pgsqld_stop_0 on pouj-pgsql-3.xxx.lan 'unknown error' (1): call=22, status=complete, exitreason='none',
last-rc-change='Mon Jan 30 13:07:56 2017', queued=0ms, exec=84ms

PCSD Status:
pouj-pgsql-4.xxx.lan: Online
pouj-pgsql-3.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

I have several questions:

- does recovery.conf.pcmk have to be on all servers (that is what I did)?
(By my logic, I would say no: the file should only be on the slave.)
Contents of my recovery.conf.pcmk files:

pouj-pgsql-3.xxx.lan

standby_mode = 'on'
primary_conninfo = 'user=postgres password=ultrasecret host=192.168.12.15 application_name=pouj-pgsql-3.xxx.lan port=5432 client_encoding=UTF8'
recovery_target_timeline = 'latest'
restore_command = 'scp postgres@192.168.12.14:/xxx/archive/%f %p'

pouj-pgsql-4.xxx.lan

standby_mode = 'on'
primary_conninfo = 'user=postgres password=ultrasecret host=192.168.12.15 application_name=pouj-pgsql-4.xxx.lan port=5432 client_encoding=UTF8'
recovery_target_timeline = 'latest'
restore_command = 'scp postgres@192.168.12.13:/xxx/archive/%f %p'

- the virtual IP is placed on the server that is supposed to be the slave (pouj-pgsql-4.xxx.lan).
How can I make the other one the master?

[root@pouj-pgsql-4 ~]# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
link/ether 00:50:56:90:2f:02 brd ff:ff:ff:ff:ff:ff
inet 192.168.12.14/24 brd 192.168.12.255 scope global eth0
valid_lft forever preferred_lft forever
inet 192.168.12.15/24 brd 192.168.12.255 scope global secondary eth0:1
valid_lft forever preferred_lft forever
inet6 fe80::250:56ff:fe90:2f02/64 scope link
valid_lft forever preferred_lft forever

Here are the errors on server pouj-pgsql-3.xxx.lan:

pgsqlms(pgsqld)[3187]: 2017/01/30_13:06:56 ERROR: _confirm_stopped: instance "pgsqld" controldata indicates a running secondary instance, the instance has probably crashed
pgsqlms(pgsqld)[3258]: 2017/01/30_13:06:56 INFO: pgsql_notify: trying to start failing slave "pgsqld"...
pgsqlms(pgsqld)[3561]: 2017/01/30_13:07:56 ERROR: _confirm_stopped: instance "pgsqld" controldata indicates a running secondary instance, the instance has probably crashed
pgsqlms(pgsqld)[3561]: 2017/01/30_13:07:56 WARNING: pgsql_stop: unexpected state for instance "pgsqld" (returned 1)
PostgreSQL log:
2017-01-30 13:06:56 CET [3281]: [1-1] user=,db= LOG: database system was interrupted while in recovery at log time 2017-01-23 11:40:39 CET
scp: /P01CRMP01_1/archive/00000002.history: No such file or directory
2017-01-30 13:06:57 CET [3281]: [2-1] user=,db= LOG: entering standby mode
scp: /P01CRMP01_1/archive/000000010000003F00000039: No such file or directory
2017-01-30 13:06:57 CET [3281]: [3-1] user=,db= LOG: consistent recovery state reached at 3F/39000090
2017-01-30 13:06:57 CET [3281]: [4-1] user=,db= LOG: record with zero length at 3F/39000090
2017-01-30 13:06:57 CET [3305]: [1-1] user=,db= LOG: started streaming WAL from primary at 3F/39000000 on timeline 1
2017-01-30 13:06:57 CET [3307]: [1-1] user=postgres,db=postgres FATAL: the database system is starting up
2017-01-30 13:06:58 CET [3313]: [1-1] user=postgres,db=postgres FATAL: the database system is starting up

Here are the errors on server pouj-pgsql-4.xxx.lan:

pgsqlms(pgsqld)[2788]: 2017/01/30_13:06:55 ERROR: _confirm_stopped: instance "pgsqld" controldata indicates a running secondary instance, the instance has probably crashed
pgsqlms(pgsqld)[2863]: 2017/01/30_13:06:56 INFO: pgsql_notify: trying to start failing slave "pgsqld"...
pgsqlms(pgsqld)[2863]: 2017/01/30_13:06:57 INFO: pgsql_notify: state is "in archive recovery" after recovery attempt
pgsqlms(pgsqld)[3027]: 2017/01/30_13:07:56 ERROR: _confirm_role: psql could not connect to instance "pgsqld"
pgsqlms(pgsqld)[3027]: 2017/01/30_13:07:56 WARNING: pgsql_stop: unexpected state for instance "pgsqld" (returned 1)

PostgreSQL log:
2017-01-30 13:06:56 CET [2887]: [1-1] user=,db= LOG: database system was interrupted while in recovery at log time 2017-01-23 11:40:39 CET
Host key verification failed.^M
2017-01-30 13:06:57 CET [2887]: [2-1] user=,db= LOG: entering standby mode
Host key verification failed.^M
2017-01-30 13:06:57 CET [2887]: [3-1] user=,db= LOG: consistent recovery state reached at 3F/39000090
2017-01-30 13:06:57 CET [2887]: [4-1] user=,db= LOG: record with incorrect prev-link 3F/31000028 at 3F/39000090
2017-01-30 13:06:57 CET [2885]: [3-1] user=,db= LOG: database system is ready to accept read only connections
2017-01-30 13:06:57 CET [2905]: [1-1] user=,db= LOG: started streaming WAL from primary at 3F/39000000 on timeline 1
2017-01-30 13:06:57 CET [2947]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-01-30 13:07:56 CET [3037]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-01-30 13:07:56 CET [3039]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-01-30 13:09:20 CET [3266]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "192.168.12.13", user "postgres", database "postgres", SSL off
2017-01-30 13:09:20 CET [3268]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "192.168.12.13", user "postgres", database "postgres", SSL off

And one last question about setting up the cluster:
For the login I used "root". Is that correct?

pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-3.xxx.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-3.xxx.lan" ipaddr="192.168.12.13" login="root" port="pouj-pgsql-3.xxx.lan-c7" action="off" passwd="ultrasecret";
pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-4.xxx.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-4.xxx.lan" ipaddr="192.168.12.14" login="root" port="pouj-pgsql-4.xxx.lan-c7" action="off" passwd="ultrasecret";

Thank you very much for your help, this really is not simple...

ruizsebastien · 02/02/2017 10:19:30

Hello,

Could anyone help me with the questions above?
I am completely stuck and out of ideas.
I think my whole problem comes down to the configuration of the PostgreSQL instances, but I cannot see where...

Thanks.

ioguix · 02/02/2017 11:28:53

Hello,

Sorry for the wait, I was unavailable these last two days.

- does recovery.conf.pcmk have to be on all servers (that is what I did)?
(By my logic, I would say no: the file should only be on the slave.)

Yes, this template must be present on every node, which is in fact what the quick start in the documentation does: http://dalibo.github.io/PAF/Quick_Start … esql-setup

Nevertheless, I will take this question into account and state it more clearly in the documentation.

This file is only a template, which is copied as "recovery.conf" only if the local instance has to start as a standby. In a cluster, with switchovers, failovers and so on, any node must be able to be a standby. Most importantly, Pacemaker requires every resource to start as a "slave" on "start"; it may then be promoted to "master" with a "promote". That is why this template file is present everywhere.

- the virtual IP is placed on the server that is supposed to be the slave (pouj-pgsql-4.xxx.lan).

In theory, this IP is only placed on the node hosting the master instance. Once you have fixed the problem with the configuration of your PostgreSQL instances, we can check the behaviour of the IP address.

As for the location of the master, PAF determines it on the first start of the cluster by looking for the only instance that was shut down while it was in production (== not a standby). For all later cluster starts, the scores assigned to the instances since then are used (so 1001 for the master).
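
To see for yourself which instance was stopped as a primary, you can read the state recorded in the control file. The sketch below only illustrates the idea and is not PAF's actual code: the function names are hypothetical, and it assumes `pg_controldata` is run with an English locale so that the label "Database cluster state:" appears as such.

```shell
#!/bin/sh
# Hypothetical helpers: both read `pg_controldata` output on stdin.

# Print the value of the "Database cluster state:" line.
controldata_state() {
    sed -n 's/^Database cluster state:[[:space:]]*//p'
}

# Succeed only if the instance was cleanly shut down while in production,
# i.e. the state is exactly "shut down" (a cleanly stopped standby
# reports "shut down in recovery" instead).
was_clean_primary() {
    [ "$(controldata_state)" = "shut down" ]
}

# Example call, using the bindir and pgdata values from this thread:
# LC_ALL=C /logiciels/postgres/product/9.3/bin/pg_controldata /xxx/admin | was_clean_primary
```

Exactly one node should pass this test before the very first cluster start; any other state ("in production", "in archive recovery") means the instance was not stopped cleanly.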

Finally, about your logs:
* I am surprised that they all start with "ERROR: ... the instance has probably crashed"; when the cluster starts, the instances must be in a clean state
* likewise, I am surprised that one minute after this rough start of each instance, they are found to have been brutally interrupted again (all the more so if fencing does not work).
* how is it that at 13:06:57 BOTH instances enter streaming replication ("started streaming WAL from primary at 3F/39000000 on timeline 1") with a master, while both of them are standbys?

=> carefully check the "reject" rules in your "pg_hba.conf" files
=> remove the IP address 192.168.12.15 BEFORE starting the cluster
=> stop your cluster, start the PostgreSQL instances with one master and one standby, make sure replication works, stop all your instances cleanly, then start Pacemaker everywhere.
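
For the first point, a rough ordering check can be scripted. This is a sketch under assumptions, not a full pg_hba.conf parser: the function name `hba_reject_rules_first` is hypothetical, it only looks at lines whose database field is "replication", and it assumes the authentication method is the last field of the line.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the given pg_hba.conf has at least one
# "reject" rule for the replication database and no such reject rule
# comes after a replication rule using another method (trust, md5, ...),
# since PostgreSQL applies the first matching rule.
hba_reject_rules_first() {
    awk '
        /^[[:space:]]*(#|$)/ { next }     # skip comments and blank lines
        $2 != "replication"  { next }     # only replication rules matter
        $NF == "reject"      { if (allowed) bad = 1; seen = 1; next }
                             { allowed = 1 }   # any non-reject rule
        END { exit (seen && !bad) ? 0 : 1 }
    ' "$1"
}

# Example call (path to adapt):
# hba_reject_rules_first /xxx/admin/pg_hba.conf && echo "reject rules come first"
```

The check says nothing about whether the rejected addresses are the right ones for the local node; that still has to be verified by hand on each server.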

Regarding fencing, your configuration appears to be wrong:
* "ipaddr" must contain the IP of the hypervisor hosting the VM concerned
* regarding login="root", you must give here the user for the SSH connection to the hypervisor. Any system user allowed to start/stop a VM through the virsh tool will do. The PAF documentation gives an example with root, but in my test setups I use another user. See: http://dalibo.github.io/PAF/fencing.htm … -and-virsh
* make sure that "port" contains the name of the VM as configured on the hypervisor side (it may differ from the hostname)

I think most of the answers to your questions are in the PAF documentation. I would welcome feedback if anything there seems unclear or missing to you. Thanks!

ruizsebastien · 02/02/2017 11:52:00

Thank you, Jehan-Guillaume, for the answers.
I will test all of this quickly.

ruizsebastien · 03/02/2017 12:53:26

Hello,

It still does not work.
In /var/log/cluster/corosync.log I have this error, which makes me think there is still a problem with fencing:

stonith-ng: warning: log_action: fence_virsh[10564] stderr: [ Unable to connect/login to fencing device ]

To summarize, here is everything I have done so far (starting from scratch each time).
- 2 PostgreSQL instances with working streaming replication (I am sure of it!)
- both instances stopped.

2 servers: (IP / HYPERVISOR IP / HOSTNAME / HYPERVISOR NAME)
192.168.12.13 / 192.168.12.34 / pouj-pgsql-3.xxx.lan / POUJ-PGSQL-3
192.168.12.14 / 192.168.12.35 / pouj-pgsql-4.xxx.lan / POUJ-PGSQL-4

recovery.conf.pcmk:

standby_mode = 'on'
primary_conninfo = 'user=postgres password=xxxxx host=192.168.12.15 application_name=pouj-pgsql-3.xxx.lan port=5432 client_encoding=UTF8'
recovery_target_timeline = 'latest'
restore_command = 'scp postgres@192.168.12.14:/xxx/archive/%f %p'

pg_hba.conf:

host replication postgres 192.168.12.15/32 reject
host replication postgres pouj-pgsql-3 reject
host replication postgres 0.0.0.0/0 trust

/etc/hosts:

192.168.12.13 pouj-pgsql-3.xxx.lan
192.168.12.14 pouj-pgsql-4.xxx.lan
192.168.12.15 pgsql-ha
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

commands run to create the cluster:

systemctl disable postgresql-9.3
systemctl disable corosync
systemctl disable pacemaker
systemctl start pacemaker
systemctl start corosync
passwd hacluster
systemctl enable pcsd;
systemctl start pcsd;
pcs cluster auth pouj-pgsql-3.xxx.lan pouj-pgsql-4.xxx.lan -u hacluster
pcs cluster setup --name cluster_xxx pouj-pgsql-3.xxx.lan pouj-pgsql-4.xxx.lan
pcs cluster start --all

resource creation:

pcs cluster cib cluster1.xml
pcs -f cluster1.xml resource defaults migration-threshold=5;
pcs -f cluster1.xml resource defaults resource-stickiness=10;
pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-3.xxx.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-3.xxx.lan" ipaddr="192.168.12.34" login="postgres" port="POUJ-PGSQL-3" action="off" passwd="xxx";
pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-4.xxx.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-4.xxx.lan" ipaddr="192.168.12.35" login="postgres" port="POUJ-PGSQL-4" action="off" passwd="xxxx";
pcs -f cluster1.xml constraint location fence_vm_pouj-pgsql-3.xxx.lan avoids pouj-pgsql-3.xxx.lan=INFINITY;
pcs -f cluster1.xml constraint location fence_vm_pouj-pgsql-4.xxx.lan avoids pouj-pgsql-4.xxx.lan=INFINITY;
pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
bindir=/logiciels/postgres/product/9.3/bin pgdata=/xxxxx/admin \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s timeout=10s role="Master" \
op monitor interval=16s timeout=10s role="Slave" \
op notify timeout=60s

pcs -f cluster1.xml resource master pgsql-ha pgsqld \
master-max=1 master-node-max=1 \
clone-max=3 clone-node-max=1 notify=true

pcs -f cluster1.xml resource create pgsql-master-ip ocf:heartbeat:IPaddr2 ip=192.168.12.15 cidr_netmask=24 op monitor interval=10s
pcs -f cluster1.xml constraint colocation add pgsql-master-ip with master pgsql-ha INFINITY;
pcs -f cluster1.xml constraint order promote pgsql-ha then start pgsql-master-ip symmetrical=false;
pcs -f cluster1.xml constraint order demote pgsql-ha then stop pgsql-master-ip symmetrical=false;
pcs cluster cib-push cluster1.xml

cluster status:

Cluster name: cluster_xxx
Stack: corosync
Current DC: pouj-pgsql-3.xxx.lan (1) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Fri Feb 3 11:52:32 2017 Last change: Fri Feb 3 08:41:47 2017 by root via cibadmin on pouj-pgsql-3.xxx.lan
2 nodes and 6 resources configured
Node pouj-pgsql-3.xxx.lan (1): UNCLEAN (online)
Node pouj-pgsql-4.xxx.lan (2): UNCLEAN (online)
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_virsh): Stopped
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_virsh): Stopped
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-4.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): FAILED pouj-pgsql-3.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Stopped
pgsql-master-ip (ocf::heartbeat:IPaddr2): Stopped
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
fence_vm_pouj-pgsql-3.xxx.lan: migration-threshold=5 fail-count=1000000 last-failure='Fri Feb 3 08:42:03 2017'
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Fri Feb 3 08:42:50 2017'
* Node pouj-pgsql-3.xxx.lan (1):
fence_vm_pouj-pgsql-4.xxx.lan: migration-threshold=5 fail-count=1000000 last-failure='Fri Feb 3 08:42:03 2017'
pgsqld: migration-threshold=5 fail-count=1000000 last-failure='Fri Feb 3 08:42:50 2017'
Failed Actions:
* fence_vm_pouj-pgsql-3.xxx.lan_start_0 on pouj-pgsql-4.xxx.lan 'unknown error' (1): call=19, status=Error, exitreason='none',
last-rc-change='Fri Feb 3 08:41:50 2017', queued=0ms, exec=12494ms
* pgsqld_stop_0 on pouj-pgsql-4.xxx.lan 'unknown error' (1): call=24, status=complete, exitreason='none',
last-rc-change='Fri Feb 3 08:42:50 2017', queued=0ms, exec=93ms
* fence_vm_pouj-pgsql-4.xxx.lan_start_0 on pouj-pgsql-3.xxx.lan 'unknown error' (1): call=19, status=Error, exitreason='none',
last-rc-change='Fri Feb 3 08:41:49 2017', queued=0ms, exec=12495ms
* pgsqld_stop_0 on pouj-pgsql-3.xxx.lan 'unknown error' (1): call=24, status=complete, exitreason='none',
last-rc-change='Fri Feb 3 08:42:50 2017', queued=0ms, exec=83ms

PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

still stuck...

ioguix · 03/02/2017 13:17:41

Hello,

Regarding fencing, from what I can see, you allow the "postgres" user to connect over SSH to your hypervisors 192.168.12.34 and 192.168.12.35 in order to stop/start your instances, which are named POUJ-PGSQL-3 and POUJ-PGSQL-4 there. Is that right?

Run the test manually: is the "postgres" user allowed to connect over SSH to these hypervisors with the given password? If so, what does the command "virsh list --all" return? Do the following commands work?

# démarrer la VM
virsh start POUJ-PGSQL-3

# interrompre la VM (ps: ne la "détruit" pas)
virsh destroy POUJ-PGSQL-3

Concernant votre configuration, j'ai trouvé une erreur bénigne dans les lignes suivantes:

pcs -f cluster1.xml resource master pgsql-ha pgsqld \
    master-max=1 master-node-max=1                     \
    clone-max=3 clone-node-max=1 notify=true

The "clone-max" parameter must be set to 2.
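A corrected version of that command might look like the sketch below. It assumes the same two-node cluster and the same offline CIB file cluster1.xml used in the commands above; only clone-max changes.

```shell
# Sketch for a two-node cluster: one pgsqld clone per node, hence clone-max=2.
# Assumes cluster1.xml is the offline CIB file used in the commands above.
pcs -f cluster1.xml resource master pgsql-ha pgsqld \
    master-max=1 master-node-max=1 \
    clone-max=2 clone-node-max=1 notify=true
```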

Make sure that the host names "pouj-pgsql-3" and "pouj-pgsql-4" that you put in pg_hba.conf (without the full FQDN) resolve correctly, and that their reverse DNS does too. Otherwise, use the machines' IP addresses rather than their names in the pg_hba.conf reject rules.
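One way to check both directions from a node is sketched below. It is only an approximation of what PostgreSQL does: it relies on getent, which goes through the system's usual name sources (/etc/hosts, DNS). The function name check_fwd_rev is made up for this example.

```shell
# Check that a host name resolves to an address, and that this address
# resolves back to a name. check_fwd_rev is a hypothetical helper name.
check_fwd_rev() {
    name=$1
    # Forward lookup: keep the first address returned for the name.
    addr=$(getent hosts "$name" | awk 'NR == 1 { print $1 }')
    if [ -z "$addr" ]; then
        echo "forward lookup failed for $name" >&2
        return 1
    fi
    # Reverse lookup: keep the first name the address maps back to.
    rev=$(getent hosts "$addr" | awk 'NR == 1 { print $2 }')
    if [ -z "$rev" ]; then
        echo "reverse lookup failed for $addr ($name)" >&2
        return 1
    fi
    echo "$name -> $addr -> $rev"
}
```

It could be run on each node as: check_fwd_rev pouj-pgsql-3 && check_fwd_rev pouj-pgsql-4. If the name printed at the end differs from the one used in pg_hba.conf, using IP addresses in the reject rules is the simpler option.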

Finally, it seems the errors occurred during "stop" actions on both instances. Could you provide the logs covering what happened between cluster startup and that point?

ruizsebastien · 03/02/2017 15:38:34

Yes, actually for fencing I use the root user, like this (there was a mistake in my previous copy-paste):

pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-3.poujoulat.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-3.xxx.lan" ipaddr="192.168.12.34" login="root" port="POUJ-PGSQL-3" action="off" passwd="xxxxx";
pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-4.poujoulat.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-4.xxx.lan" ipaddr="192.168.12.35" login="root" port="POUJ-PGSQL-4" action="off" passwd="xxxxx";

As for the virsh commands:

[root@pouj-pgsql-3 ~]# virsh start POUJ-PGSQL-3
error: failed to connect to the hypervisor
error: Failed to connect socket to '/var/run/libvirt/libvirt-sock': No such file or directory

Last edited by ruizsebastien (03/02/2017 15:39:03)

ioguix · 03/02/2017 17:05:11

The virsh commands I gave you are to be run on the hypervisor, as stated in my previous message, not from the VMs (as your prompt shows).

ruizsebastien · 06/02/2017 11:02:46

Hello,

Just to be sure: the user given in the fencing creation command ("login="), in my case, is the root user of the local server (POUJ-PGSQL-3 and POUJ-PGSQL-4).
Is that correct?
(With the pcs GUI I can stop and start the servers.)

Otherwise, use the machines' IP addresses rather than their names in the pg_hba.conf reject rules.

Should I use the local IP addresses (eth0) or the ones seen by the hypervisor?

Could you provide the logs covering what happened between cluster startup and that point?

I'll redo a full test later today and send you that.

Make sure that the host names "pouj-pgsql-3" and "pouj-pgsql-4" that you put in pg_hba.conf (without the full FQDN) resolve correctly, and that their reverse DNS does too.

For the resolution test, "pouj-pgsql-3" and "pouj-pgsql-4" were missing from /etc/hosts on both servers... Now ping works correctly.
Is that enough?

Thanks for the help.

ioguix · 06/02/2017 11:38:19

Just to be sure: the user given in the fencing creation command ("login="), in my case, is the root user of the local server (POUJ-PGSQL-3 and POUJ-PGSQL-4).
Is that correct?

No. We are talking about fencing here. The point of fencing is to force the machine to stop (or reboot) without asking its opinion, so a local connection to the VM as root is of no use. Here, you have to connect to the hypervisor (192.168.12.34 or 192.168.12.35 in your case) to power off the designated VM. The "login=" account to use is therefore a system user on the hypervisor with administration rights over the VMs (see the commands I gave you last time).

Should I use the local IP addresses (eth0) or the ones seen by the hypervisor?

You must use the IP address of the hypervisor... and set the "port=" parameter to the name of the VM hosted on that hypervisor, which is the one it must power off.

For the resolution test, "pouj-pgsql-3" and "pouj-pgsql-4" were missing from /etc/hosts on both servers... Now ping works correctly.
Is that enough?

Not necessarily. PostgreSQL performs a name resolution AND a reverse name resolution (IP -> name) to make sure the incoming connection matches the pg_hba rule (see the PostgreSQL documentation). Still, I expect it should work now.
Otherwise, as I wrote, use the machines' IP addresses rather than their names in the pg_hba.conf reject rules. For your test setup, that is already a good starting point to validate the whole thing.

ruizsebastien · 06/02/2017 15:13:17

NOTE: the fencing problem (a user on the hypervisor) is in the hands of my system administrator; for now it is one of the points that is not resolved. But is it blocking for the "pure PostgreSQL" part?

pouj-pgsql-3 (master):

1 - Setting up streaming replication:
Streaming replication built, PostgreSQL logs OK:

2017-02-06 13:28:48 CET [6679]: [1-1] user=,db= LOG: autovacuum launcher started
2017-02-06 13:28:48 CET [6673]: [3-1] user=,db= LOG: database system is ready to accept connections

PostgreSQL replication status (streaming replication):

-----------------------------------------------------------
| NODE | IP            | ROLE    | WAL         | VERSION |
-----------------------------------------------------------
| 0    | 192.168.12.13 | primary | 3F/7D000130 | 9.3.15  |
| 1    | 192.168.12.14 | standby | 3F/7D000130 | 9.3.15  |
-----------------------------------------------------------

STOPPING THE INSTANCES:

2017-02-06 13:33:35 CET [6673]: [4-1] user=,db= LOG: received fast shutdown request
2017-02-06 13:33:35 CET [6673]: [5-1] user=,db= LOG: aborting any active transactions
2017-02-06 13:33:35 CET [6679]: [2-1] user=,db= LOG: autovacuum launcher shutting down
2017-02-06 13:33:35 CET [6676]: [1-1] user=,db= LOG: shutting down
2017-02-06 13:33:35 CET [6676]: [2-1] user=,db= LOG: database system is shut down
2017-02-06 13:33:35 CET [6879]: [1-1] user=postgres,db=[unknown] FATAL: the database system is shutting down

2 - Setting up the cluster:
pg_hba.conf:

host replication postgres 192.168.12.15/24 reject
host replication postgres 192.168.12.34/24 reject
host replication postgres 0.0.0.0/0 trust

recovery.conf.pcmk:

standby_mode = 'on'
primary_conninfo = 'user=postgres password=xxxx host=192.168.12.15 application_name=pouj-pgsql-3.xxx.lan port=5432 client_encoding=UTF8'
recovery_target_timeline = 'latest'
restore_command = 'cp /xxxx/archive/%f %p'

Installation of pacemaker/corosync/pcs/paf

Logs after starting the services:

systemctl status pacemaker;
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
Active: active (running) since lun. 2017-02-06 13:45:34 CET; 21ms ago
Docs: man:pacemakerd
http://clusterlabs.org/doc/en-US/Pacema … index.html
Main PID: 7720 (pacemakerd)
CGroup: /system.slice/pacemaker.service
└─7720 /usr/sbin/pacemakerd -f
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan systemd[1]: Started Pacemaker High Availability Cluster Manager.
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan systemd[1]: Starting Pacemaker High Availability Cluster Manager...
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan pacemakerd[7720]: notice: Additional logging available in /var/log/pacemaker.log
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan pacemakerd[7720]: notice: Switching to /var/log/cluster/corosync.log
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan pacemakerd[7720]: notice: Additional logging available in /var/log/cluster/corosync.log
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan pacemakerd[7720]: notice: Starting Pacemaker 1.1.15-11.el7_3.2
[root@pouj-pgsql-3 corosync]# systemctl status corosync;
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
Active: active (running) since lun. 2017-02-06 13:45:34 CET; 581ms ago
Process: 7703 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 7710 (corosync)
CGroup: /system.slice/corosync.service
└─7710 corosync
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [TOTEM ] adding new UDPU member {192.168.12.13}
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [TOTEM ] adding new UDPU member {192.168.12.14}
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [TOTEM ] A new membership (192.168.12.13:352) was formed. Members joined: 1
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [QUORUM] Members[1]: 1
févr. 06 13:45:33 pouj-pgsql-3.xxx.lan corosync[7710]: [MAIN ] Completed service synchronization, ready to provide service.
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan corosync[7703]: Starting Corosync Cluster Engine (corosync): [ OK ]
févr. 06 13:45:34 pouj-pgsql-3.xxx.lan systemd[1]: Started Corosync Cluster Engine.

Starting the cluster:

pcs cluster start --all
pouj-pgsql-4.xxx.lan: Starting Cluster...
pouj-pgsql-3.xxx.lan: Starting Cluster...
[root@pouj-pgsql-3 corosync]# pcs status --full
Cluster name: cluster_xxx
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pouj-pgsql-4.xxx.lan (2) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Mon Feb 6 13:50:05 2017 Last change: Mon Feb 6 13:45:57 2017 by hacluster via crmd on pouj-pgsql-4.xxx.lan
2 nodes and 0 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
No resources

Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
* Node pouj-pgsql-3.xxx.lan (1):
PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled

3 - Creating the resources:
ONLY ON pouj-pgsql-3 (since this is a cluster, it is done on one server only)

Excerpt from /var/log/cluster/corosync.log:

Feb 06 13:56:41 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8414] stderr: [ Unable to connect/login to fencing device ]
Feb 06 13:56:41 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8414] stderr: [ ]
Feb 06 13:56:41 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8414] stderr: [ ]
Feb 06 13:56:41 [7729] pouj-pgsql-3.xxx.lan stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_virsh (monitor). remaining timeout is 14
Feb 06 13:56:47 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8438] stderr: [ Unable to connect/login to fencing device ]
Feb 06 13:56:47 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8438] stderr: [ ]
Feb 06 13:56:47 [7729] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_virsh[8438] stderr: [ ]
Feb 06 13:57:01 [7729] pouj-pgsql-3.xxx.lan stonith-ng: notice: remote_op_done: Operation reboot of pouj-pgsql-3.xxx.lan by <no-one> for crmd.3962@pouj-pgsql-4.xxx.lan.ba7cd50f: No route to host
Feb 06 13:57:01 [7733] pouj-pgsql-3.xxx.lan crmd: notice: tengine_stonith_notify: Peer pouj-pgsql-3.xxx.lan was not terminated (reboot) by <anyone> for pouj-pgsql-4.xxx.lan: No route to host (ref=ba7cd50f-7838-4d94-b8b9-0ede6494f388) by client crmd.3962
Feb 06 13:57:13 [7729] pouj-pgsql-3.xxx.lan stonith-ng: notice: remote_op_done: Operation reboot of pouj-pgsql-3.xxx.lan by <no-one> for crmd.3962@pouj-pgsql-4.xxx.lan.703d9979: No route to host
Feb 06 13:57:13 [7733] pouj-pgsql-3.xxx.lan crmd: notice: tengine_stonith_notify: Peer pouj-pgsql-3.xxx.lan was not terminated (reboot) by <anyone> for pouj-pgsql-4.xxx.lan: No route to host (ref=703d9979-d622-435f-a86d-1709d0d6b6c1) by client crmd.3962

The PostgreSQL instance logs have not changed: no postgres process, no pid file, no recovery.conf file.

pouj-pgsql-4 (slave):

1 - Setting up streaming replication:
Streaming replication built, PostgreSQL logs OK:

2017-02-06 13:30:11 CET [3217]: [4-1] user=,db= LOG: consistent recovery state reached at 3F/7C000090
2017-02-06 13:30:11 CET [3217]: [5-1] user=,db= LOG: redo starts at 3F/7C000090
2017-02-06 13:30:11 CET [3215]: [3-1] user=,db= LOG: database system is ready to accept read only connections
cp: cannot stat ‘/xxx_1/archive/000000010000003F0000007D’: No such file or directory
2017-02-06 13:30:11 CET [3217]: [6-1] user=,db= LOG: unexpected pageaddr 3F/6C000000 in log segment 000000010000003F0000007D, offset 0
2017-02-06 13:30:11 CET [3223]: [1-1] user=,db= LOG: started streaming WAL from primary at 3F/7D000000 on timeline 1

STOPPING THE INSTANCES:

2017-02-06 13:34:22 CET [3215]: [4-1] user=,db= LOG: received fast shutdown request
2017-02-06 13:34:22 CET [3215]: [5-1] user=,db= LOG: aborting any active transactions
2017-02-06 13:34:22 CET [3219]: [1-1] user=,db= LOG: shutting down
2017-02-06 13:34:22 CET [3219]: [2-1] user=,db= LOG: database system is shut down

2 - Setting up the cluster:
pg_hba.conf:

host replication postgres 192.168.12.15/24 reject
host replication postgres 192.168.12.35/24 reject
host replication postgres 0.0.0.0/0 trust

recovery.conf.pcmk:

standby_mode = 'on'
primary_conninfo = 'user=postgres password=xxxx host=192.168.12.15 application_name=pouj-pgsql-4.xxx.lan port=5432 client_encoding=UTF8'
recovery_target_timeline = 'latest'
restore_command = 'cp /xxxx/archive/%f %p'

Installation of pacemaker/corosync/pcs/paf

Logs after starting the services:

systemctl status pacemaker;
● pacemaker.service - Pacemaker High Availability Cluster Manager
Loaded: loaded (/usr/lib/systemd/system/pacemaker.service; disabled; vendor preset: disabled)
Active: active (running) since lun. 2017-02-06 13:45:45 CET; 16ms ago
Docs: man:pacemakerd
http://clusterlabs.org/doc/en-US/Pacema … index.html
Main PID: 3949 (pacemakerd)
CGroup: /system.slice/pacemaker.service
└─3949 /usr/sbin/pacemakerd -f
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan systemd[1]: Started Pacemaker High Availability Cluster Manager.
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan systemd[1]: Starting Pacemaker High Availability Cluster Manager...
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan pacemakerd[3949]: notice: Additional logging available in /var/log/pacemaker.log
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan pacemakerd[3949]: notice: Switching to /var/log/cluster/corosync.log
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan pacemakerd[3949]: notice: Additional logging available in /var/log/cluster/corosync.log
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan pacemakerd[3949]: notice: Starting Pacemaker 1.1.15-11.el7_3.2
[root@pouj-pgsql-4 corosync]# systemctl status corosync;
● corosync.service - Corosync Cluster Engine
Loaded: loaded (/usr/lib/systemd/system/corosync.service; disabled; vendor preset: disabled)
Active: active (running) since lun. 2017-02-06 13:45:45 CET; 185ms ago
Process: 3936 ExecStart=/usr/share/corosync/corosync start (code=exited, status=0/SUCCESS)
Main PID: 3943 (corosync)
CGroup: /system.slice/corosync.service
└─3943 corosync
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [TOTEM ] adding new UDPU member {192.168.12.13}
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [TOTEM ] adding new UDPU member {192.168.12.14}
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [TOTEM ] A new membership (192.168.12.14:356) was formed. Members joined: 2
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [TOTEM ] A new membership (192.168.12.13:360) was formed. Members joined: 1
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [VOTEQ ] Waiting for all cluster members. Current votes: 1 expected_votes: 2
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [QUORUM] This node is within the primary component and will provide service.
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [QUORUM] Members[2]: 1 2
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3943]: [MAIN ] Completed service synchronization, ready to provide service.
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan corosync[3936]: Starting Corosync Cluster Engine (corosync): [ OK ]
févr. 06 13:45:45 pouj-pgsql-4.xxx.lan systemd[1]: Started Corosync Cluster Engine.

Starting the cluster:

pcs status --full
Cluster name: cluster_xxx
WARNING: no stonith devices and stonith-enabled is not false
Stack: corosync
Current DC: pouj-pgsql-4.xxx.lan (2) (version 1.1.15-11.el7_3.2-e174ec8) - partition with quorum
Last updated: Mon Feb 6 13:52:36 2017 Last change: Mon Feb 6 13:45:57 2017 by hacluster via crmd on pouj-pgsql-4.xxx.lan
2 nodes and 0 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
No resources

Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
* Node pouj-pgsql-3.xxx.lan (1):
PCSD Status:
pouj-pgsql-4.xxx.lan: Online
pouj-pgsql-3.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

3 - Creating the resources:
ONLY ON pouj-pgsql-3 (since this is a cluster, it is done on one server only)

Excerpt from /var/log/cluster/corosync.log:

Feb 06 13:58:59 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4552] stderr: [ Unable to connect/login to fencing device ]
Feb 06 13:58:59 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4552] stderr: [ ]
Feb 06 13:58:59 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4552] stderr: [ ]
Feb 06 13:58:59 [3958] pouj-pgsql-4.xxx.lan stonith-ng: info: internal_stonith_action_execute: Attempt 2 to execute fence_virsh (reboot). remaining timeout is 55
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4558] stderr: [ Unable to connect/login to fencing device ]
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4558] stderr: [ ]
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_virsh[4558] stderr: [ ]
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: info: update_remaining_timeout: Attempted to execute agent fence_virsh (reboot) the maximum number of times (2) allowed
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: error: log_operation: Operation 'reboot' [4558] (call 12 from crmd.3962) for host 'pouj-pgsql-3.xxx.lan' with device 'fence_vm_pouj-pgsql-3.xxx.lan' returned: -201 (Generic Pacemaker error)
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: notice: stonith_choose_peer: Couldn't find anyone to fence (reboot) pouj-pgsql-3.xxx.lan with any device
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: info: call_remote_stonith: None of the 2 peers are capable of fencing (reboot) pouj-pgsql-3.xxx.lan for crmd.3962 (1)
Feb 06 13:59:06 [3958] pouj-pgsql-4.xxx.lan stonith-ng: error: remote_op_done: Operation reboot of pouj-pgsql-3.xxx.lan by <no-one> for crmd.3962@pouj-pgsql-4.xxx.lan.d3545d8f: No route to host
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: tengine_stonith_callback: Stonith operation 12/29:15:0:53584abb-7a15-45bc-bd54-560bcac95718: No route to host (-113)
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: tengine_stonith_callback: Stonith operation 12 for pouj-pgsql-3.xxx.lan failed (No route to host): aborting transition.
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: abort_transition_graph: Transition aborted: Stonith failed | source=tengine_stonith_callback:749 complete=false
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: tengine_stonith_notify: Peer pouj-pgsql-3.xxx.lan was not terminated (reboot) by <anyone> for pouj-pgsql-4.xxx.lan: No route to host (ref=d3545d8f-3812-468f-9cf1-c3c5b0f6ccd0) by client crmd.3962
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: run_graph: Transition 15 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=11, Source=/var/lib/pacemaker/pengine/pe-warn-1.bz2): Complete
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: too_many_st_failures: Too many failures to fence pouj-pgsql-3.xxx.lan (11), giving up
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: info: do_log: Input I_TE_SUCCESS received in state S_TRANSITION_ENGINE from notify_crmd
Feb 06 13:59:06 [3962] pouj-pgsql-4.xxx.lan crmd: notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE | input=I_TE_SUCCESS cause=C_FSA_INTERNAL origin=notify_crmd

The PostgreSQL instance logs have not changed: no postgres process, no pid file, no recovery.conf file.

Last edited by ruizsebastien (06/02/2017 15:43:01)

ioguix · 06/02/2017 18:41:02

WARNING: no stonith devices and stonith-enabled is not false

Having no fencing available while leaving stonith-enabled on will prevent you from starting anything in your cluster. If you want to temporarily set fencing aside **just long enough to get your PostgreSQL instances started through Pacemaker**, you have to disable "stonith-enabled". But keep in mind that without fencing, your cluster (whether with PAF or anything else) will end up eating your data.
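As a sketch of that temporary workaround, the cluster property can be toggled with pcs. This assumes it is run as root on one cluster node against the live cluster, and it is only meant for a test setup.

```shell
# Temporary, test setup only: let resources start without a working fence device.
pcs property set stonith-enabled=false

# Later, once the fence devices work: turn fencing back on
# before trusting the cluster with real data.
pcs property set stonith-enabled=true
```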

In your "pg_hba.conf" files, the following lines are useless:

host replication postgres 192.168.12.34/24 reject
# or
host replication postgres 192.168.12.35/24 reject

1- as far as I can tell, the hypervisor will never connect to PostgreSQL (at least I see no reason for it to)
2- when you specify the address of a single machine, the mask must be "/32". As written, both of your rules reject the whole 192.168.12.0 range.
3- remember to keep these reject rules **BEFORE** any other authentication rule

That said, I suppose these wrong rules come from a confusion with the reject rules for replication from each machine to itself, i.e. its local IP (and not the hypervisor's)... For example, on machine pouj-pgsql-4, the rules must be:

host replication postgres 192.168.12.15/32 reject
host replication postgres 192.168.12.14/32 reject
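The "/24" versus "/32" point can be checked with a little shell arithmetic. The sketch below is not how PostgreSQL itself evaluates pg_hba.conf; it only reproduces the address/mask comparison. The helper names ip_to_int and cidr_matches are made up for this example.

```shell
# Convert a dotted-quad IPv4 address to an integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Succeed if address $1 falls inside the CIDR block $2 (e.g. 192.168.12.34/24).
cidr_matches() {
    net=${2%/*}
    bits=${2#*/}
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    [ $(( $(ip_to_int "$1") & mask )) -eq $(( $(ip_to_int "$net") & mask )) ]
}

# Both node addresses fall inside the /24 rule, so it rejects replication from both.
for ip in 192.168.12.13 192.168.12.14; do
    if cidr_matches "$ip" 192.168.12.34/24; then
        echo "$ip is matched by 192.168.12.34/24"
    fi
done
```

With "/32", only the exact address matches: cidr_matches 192.168.12.13 192.168.12.14/32 fails, while cidr_matches 192.168.12.14 192.168.12.14/32 succeeds.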

ruizsebastien · 14/03/2017 15:37:10

Hello,

I'm still working on it and I'm still stuck because of this part:

pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-3.poujoulat.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-3.xxx.lan" ipaddr="172.31.3.97" login="sru" port="pouj-pgsql-3" action="off" passwd="xxxxx";
pcs -f cluster1.xml stonith create fence_vm_pouj-pgsql-4.poujoulat.lan fence_virsh pcmk_host_check="static-list" pcmk_host_list="pouj-pgsql-4.xxx.lan" ipaddr="172.31.3.98" login="sru" port="pouj-pgsql-4" action="off" passwd="xxxxx";

I'm still waiting for my system administrators to give me the user that links the VMs and vCenter for fencing. But they are... reluctant...

But I have a question: is the PAF installation doc (https://dalibo.github.io/PAF/Quick_Start-CentOS-7.html) applicable in a VMware environment?
Has anyone already done it?

NOTE: the IPs above have changed. That's expected: I moved the VMs to another ESXi host.

Last edited by ruizsebastien (14/03/2017 15:38:26)

ioguix · 14/03/2017 17:41:26

Hello,

Yes, others have done it, but in that case you have to use the "fence_vmware" fencing agent rather than "fence_virsh".
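To see which VMware fencing agents are installed on a node and which parameters they expect, something like the following could be used. This is a sketch: it assumes the fence agents packages are installed, and it uses fence_vmware_soap only because that is the agent name appearing in the pcs status output later in this thread.

```shell
# List the installed fence agents whose name contains "vmware".
pcs stonith list vmware

# Show the parameters accepted by one of them before writing "pcs stonith create".
pcs stonith describe fence_vmware_soap
```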

ruizsebastien · 16/03/2017 15:50:28

Hello,

I've made good progress: fencing finally works, the nodes are online from pcs's point of view, and the virtual IP address is OK:

[root@pouj-pgsql-3 ~]# pcs status --full
Cluster name: cluster_xxx
Stack: corosync
Current DC: pouj-pgsql-3.xxx.lan (1) (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Thu Mar 16 14:09:38 2017 Last change: Thu Mar 16 13:54:49 2017 by root via cibadmin on pouj-pgsql-3.xxx.lan
2 nodes and 3 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-3.xxx.lan
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started pouj-pgsql-3.xxx.lan
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
* Node pouj-pgsql-3.xxx.lan (1):
fence_vm_pouj-pgsql-4.xxx.lan: migration-threshold=5 fail-count=1 last-failure='Thu Mar 16 13:47:25 2017'
Failed Actions:
* fence_vm_pouj-pgsql-4.xxx.lan_monitor_60000 on pouj-pgsql-3.xxx.lan 'not running' (7): call=17, status=complete, exitreason='none',
last-rc-change='Thu Mar 16 13:47:25 2017', queued=0ms, exec=0ms

PCSD Status:
pouj-pgsql-4.xxx.lan: Online
pouj-pgsql-3.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

However, as soon as I create the resources for PostgreSQL, things go wrong: the pouj-pgsql-3 server shuts down and the instance on pouj-pgsql-4 is started, but in recovery mode.

Here are the commands used to create the PostgreSQL resources:

# pgsqld
pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
bindir=/logiciels/postgres/product/9.3/bin pgdata=/xxx/admin \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s timeout=10s role="Master" \
op monitor interval=16s timeout=10s role="Slave" \
op notify timeout=60s \
# pgsql-ha
pcs -f cluster1.xml resource master pgsql-ha pgsqld \
master-max=1 master-node-max=1 \
clone-max=3 clone-node-max=2 notify=true

pcs -f cluster1.xml constraint colocation add pgsql-master-ip with master pgsql-ha INFINITY;
pcs -f cluster1.xml constraint order promote pgsql-ha then start pgsql-master-ip symmetrical=false;
pcs -f cluster1.xml constraint order demote pgsql-ha then stop pgsql-master-ip symmetrical=false;

Starting again from the beginning, step by step, I observe this:
As soon as I create the pgsqld resource, the instance on pouj-pgsql-4 is started but not the one on pouj-pgsql-3:

2 nodes and 4 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-3.xxx.lan
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started pouj-pgsql-3.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Started pouj-pgsql-4.xxx.lan

Excerpt from the corosync logs:

Mar 16 14:34:09 [16703] pouj-pgsql-3.xxx.lan pengine: info: LogActions: Leave fence_vm_pouj-pgsql-3.xxx.lan (Started pouj-pgsql-4.xxx.lan)
Mar 16 14:34:09 [16703] pouj-pgsql-3.xxx.lan pengine: info: LogActions: Leave fence_vm_pouj-pgsql-4.xxx.lan (Started pouj-pgsql-3.xxx.lan)
Mar 16 14:34:09 [16703] pouj-pgsql-3.xxx.lan pengine: info: LogActions: Leave pgsql-master-ip (Started pouj-pgsql-3.xxx.lan)
Mar 16 14:34:09 [16703] pouj-pgsql-3.xxx.lan pengine: notice: LogActions: Start pgsqld (pouj-pgsql-4.xxx.lan)
Mar 16 14:34:09 [16703] pouj-pgsql-3.xxx.lan pengine: notice: process_pe_message: Calculated transition 15, saving inputs in /var/lib/pacemaker/pengine/pe-input-8.bz2
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: info: do_state_transition: State transition S_POLICY_ENGINE -> S_TRANSITION_ENGINE | input=I_PE_SUCCESS cause=C_IPC_MESSAGE origin=handle_response
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: info: do_te_invoke: Processing graph 15 (ref=pe_calc-dc-1489671249-86) derived from /var/lib/pacemaker/pengine/pe-input-8.bz2
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: notice: te_rsc_command: Initiating monitor operation pgsqld_monitor_0 on pouj-pgsql-4.xxx.lan | action 6
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: notice: te_rsc_command: Initiating monitor operation pgsqld_monitor_0 locally on pouj-pgsql-3.xxx.lan | action 5
Mar 16 14:34:09 [16701] pouj-pgsql-3.xxx.lan lrmd: info: process_lrmd_get_rsc_info: Resource 'pgsqld' not found (3 active resources)
Mar 16 14:34:09 [16701] pouj-pgsql-3.xxx.lan lrmd: info: process_lrmd_rsc_register: Added 'pgsqld' to the rsc list (4 active resources)
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: info: do_lrm_rsc_op: Performing key=5:15:7:09e12e8c-e581-4c7d-a45b-6945b91a1b83 op=pgsqld_monitor_0
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: info: action_synced_wait: Managed pgsqlms_meta-data_0 process 24573 exited with rc=0
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: notice: process_lrm_event: Result of probe operation for pgsqld on pouj-pgsql-3.xxx.lan: 7 (not running) | call=49 key=pgsqld_monitor_0 confirmed=true cib-update=173
Mar 16 14:34:09 [19963] pouj-pgsql-3.xxx.lan crmd: notice: process_lrm_event: pouj-pgsql-3.xxx.lan-pgsqld_monitor_0:49 [ /tmp:5432 - no response\npg_ctl: no server running\n ]

Which direction should I look in?

ioguix · 17/03/2017 11:33:06

Hello,

This parameter is wrong: "clone-node-max=2". You cannot have two instances belonging to the same replication group on the same node...

For the rest, isolate the lines starting with "pgsqlms" in your logs; these are all the messages from the PAF agent itself. You may find other answers there.
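A minimal way to isolate those lines is sketched below. It assumes the log file is /var/log/cluster/corosync.log, as in the excerpts earlier in this thread; the function name paf_messages is made up for this example.

```shell
# Print only the PAF agent messages, i.e. the lines starting with "pgsqlms".
# The optional argument is the log file; the default path is the one used in this thread.
paf_messages() {
    grep '^pgsqlms' "${1:-/var/log/cluster/corosync.log}"
}
```

Running paf_messages on each node, right after creating the resources, shows what the agent did on that node. An empty result suggests the agent logged nothing on that node.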

ruizsebastien · 17/03/2017 11:49:03

Hello,
I made the change: clone-node-max=1

After creating the PostgreSQL resources, I get this:

[root@pouj-pgsql-3 corosync]# pcs status --full
Cluster name: cluster_xxx
Stack: corosync
Current DC: pouj-pgsql-4.xxx.lan (2) (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Fri Mar 17 10:39:12 2017 Last change: Fri Mar 17 10:38:38 2017 by root via cibadmin on pouj-pgsql-3.xxx.lan
2 nodes and 6 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-3.xxx.lan
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started pouj-pgsql-3.xxx.lan
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): Slave pouj-pgsql-4.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Stopped
pgsqld (ocf::heartbeat:pgsqlms): Stopped
Slaves: [ pouj-pgsql-4.xxx.lan ]
Stopped: [ pouj-pgsql-3.xxx.lan ]
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
* Node pouj-pgsql-4.xxx.lan (2):
Migration Summary:
* Node pouj-pgsql-4.xxx.lan (2):
fence_vm_pouj-pgsql-3.xxx.lan: migration-threshold=5 fail-count=1 last-failure='Fri Mar 17 09:37:34 2017'
* Node pouj-pgsql-3.xxx.lan (1):
fence_vm_pouj-pgsql-4.xxx.lan: migration-threshold=5 fail-count=1 last-failure='Fri Mar 17 09:37:44 2017'
Failed Actions:
* fence_vm_pouj-pgsql-3.xxx.lan_monitor_60000 on pouj-pgsql-4.xxx.lan 'unknown error' (1): call=37, status=Timed Out, exitreason='none',
last-rc-change='Fri Mar 17 09:37:14 2017', queued=0ms, exec=20019ms
* fence_vm_pouj-pgsql-4.xxx.lan_monitor_60000 on pouj-pgsql-3.xxx.lan 'unknown error' (1): call=17, status=Timed Out, exitreason='none',
last-rc-change='Fri Mar 17 09:37:24 2017', queued=0ms, exec=20015ms

PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

and right after that, the pouj-pgsql-3 VM shuts down (halt command sent by pacemaker, corosync or pcs, I don't know which).

The pgsqlms traces in the corosync log:

[root@pouj-pgsql-3 corosync]# grep "^pgsqlms" /var/log/cluster/corosync.log
[root@pouj-pgsql-3 corosync]#
-----------------------------------------------------------------------------
[root@pouj-pgsql-4 corosync]# grep "^pgsqlms" /var/log/cluster/corosync.log
pgsqlms(pgsqld)[1207]: 2017/03/17_10:38:42 INFO: pgsql_start: instance "pgsqld" started

There is also this, which is giving me trouble:

pcs -f cluster1.xml resource create pgsqld ocf:heartbeat:pgsqlms \
bindir=/logiciels/postgres/product/9.3/bin pgdata=/xxx/admin \
op start timeout=60s \
op stop timeout=60s \
op promote timeout=30s \
op demote timeout=120s \
op monitor interval=15s timeout=10s role="Master" \
op monitor interval=16s timeout=10s role="Slave" \
op notify timeout=60s

What should go in "pgdata"?
Is it the directory that contains postgresql.conf and the pg_xlog, pg_clog, pg_tblspc, etc. directories?

Last edited by ruizsebastien (17/03/2017 12:06:30)

ioguix · 17/03/2017 12:34:44

What should go in "pgdata"?
Is it the directory that contains postgresql.conf and the pg_xlog, pg_clog, pg_tblspc, etc. directories?

Allow me to point you to the relevant chapter of the documentation: http://dalibo.github.io/PAF/configurati … parameters

More generally, please go through that whole page again for the other prerequisites and configuration requirements.

Once again, if the documentation as a whole is not sufficient (the installation, configuration and fencing pages, the admin cookbooks, the quick starts), feel free to expand it or suggest improvements.

++

ruizsebastien · 18/03/2017 19:28:06

Hello,

I think I have gone as far as I can, rereading every doc I could find and rechecking all the prerequisites.
Everything works (fencing, virtual IP), but as soon as I create the PostgreSQL resources, it's a disaster...
Could you give me one last hand? After that I give up.

Here is the information I collected:

pouj-pgsql-3:

PostgreSQL logs (one hour offset because log_timezone is not set in postgresql.conf)

2017-03-18 17:02:08 GMT [5234]: [3-1] user=,db= LOG: restored log file "000000010000003F000000B5" from archive
2017-03-18 17:02:08 GMT [5234]: [4-1] user=,db= LOG: consistent recovery state reached at 3F/B5000090
2017-03-18 17:02:08 GMT [5234]: [5-1] user=,db= LOG: record with zero length at 3F/B5000090
2017-03-18 17:02:08 GMT [5232]: [3-1] user=,db= LOG: database system is ready to accept read only connections
2017-03-18 17:02:08 GMT [5271]: [1-1] user=,db= LOG: started streaming WAL from primary at 3F/B5000000 on timeline 1
2017-03-18 17:02:09 GMT [5289]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-03-18 17:02:09 GMT [5291]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-03-18 17:02:09 GMT [5293]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-03-18 17:02:09 GMT [5337]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-03-18 17:02:09 GMT [5339]: [1-1] user=postgres,db=postgres FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off
2017-03-18 17:03:09 GMT [5300]: [1-1] user=postgres,db=[unknown] LOG: terminating walsender process due to replication timeout

pg_hba.conf :

host replication postgres 172.31.3.99/32 reject
host replication postgres pouj-pgsql-3 reject
host replication postgres 0.0.0.0/0 trust

[root@pouj-pgsql-3 cluster]# grep "^pgsqlms" /var/log/cluster/corosync.log

pgsqlms(pgsqld)[5215]: 2017/03/18_18:02:09 ERROR: _confirm_role: psql could not connect to instance "pgsqld"
pgsqlms(pgsqld)[5215]: 2017/03/18_18:02:09 ERROR: pgsql_start: instance "pgsqld" is not running as a slave (returned 1)
pgsqlms(pgsqld)[5327]: 2017/03/18_18:02:09 ERROR: _confirm_role: psql could not connect to instance "pgsqld"
pgsqlms(pgsqld)[5327]: 2017/03/18_18:02:09 WARNING: pgsql_stop: unexpected state for instance "pgsqld" (returned 1)

[root@pouj-pgsql-3 cluster]# grep error corosync.log

Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: notice: process_lrm_event: Result of start operation for pgsqld on pouj-pgsql-3.xxx.lan: 1 (unknown error) | call=111 key=pgsqld_start_0 confirmed=true cib-update=100
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (49.12) pgsqld_start_0.111=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (49.12) pgsqld_start_0.111=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (49.14) pgsqld_start_0.21=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (49.14) pgsqld_start_0.21=unknown error: failed
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: notice: process_lrm_event: Result of stop operation for pgsqld on pouj-pgsql-3.xxx.lan: 1 (unknown error) | call=114 key=pgsqld_stop_0 confirmed=true cib-update=102
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (50.3) pgsqld_stop_0.114=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (50.3) pgsqld_stop_0.114=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (50.5) pgsqld_stop_0.24=unknown error: failed
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: info: process_graph_event: Detected action (50.5) pgsqld_stop_0.24=unknown error: failed
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:20 [1824] pouj-pgsql-3.xxx.lan stonith-ng: error: remote_op_done: Operation reboot of pouj-pgsql-3.xxx.lan by <no-one> for crmd.1826@pouj-pgsql-3.xxx.lan.9239a0fd: No such device

[root@pouj-pgsql-3 cluster]# grep warning corosync.log

Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 12 (pgsqld:0_start_0) on pouj-pgsql-3.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 12 (pgsqld:0_start_0) on pouj-pgsql-3.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 14 (pgsqld:1_start_0) on pouj-pgsql-4.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 14 (pgsqld:1_start_0) on pouj-pgsql-4.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for fence_vm_pouj-pgsql-4.xxx.lan on pouj-pgsql-3.xxx.lan: not running (7)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op start for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 3 (pgsqld_stop_0) on pouj-pgsql-3.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 3 (pgsqld_stop_0) on pouj-pgsql-3.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 5 (pgsqld_stop_0) on pouj-pgsql-4.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [1826] pouj-pgsql-3.xxx.lan crmd: warning: status_from_rc: Action 5 (pgsqld_stop_0) on pouj-pgsql-4.xxx.lan failed (target: 0 vs. rc: 1): Error
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op monitor for fence_vm_pouj-pgsql-4.xxx.lan on pouj-pgsql-3.xxx.lan: not running (7)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:0 on pouj-pgsql-3.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: pe_fence_node: Node pouj-pgsql-3.xxx.lan will be fenced because of resource failure(s)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: unpack_rsc_op_failure: Processing failed op stop for pgsqld:1 on pouj-pgsql-4.xxx.lan: unknown error (1)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: pe_fence_node: Node pouj-pgsql-4.xxx.lan will be fenced because of resource failure(s)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-3.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: check_migration_threshold: Forcing pgsql-ha away from pouj-pgsql-4.xxx.lan after 1000000 failures (max=5)
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: stage6: Scheduling Node pouj-pgsql-3.xxx.lan for STONITH
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: stage6: Scheduling Node pouj-pgsql-4.xxx.lan for STONITH
Mar 18 18:02:09 [16703] pouj-pgsql-3.xxx.lan pengine: warning: process_pe_message: Calculated transition 51 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-21.bz2
Mar 18 18:02:20 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5342] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:02:20 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5342] stderr: [ InsecureRequestWarning) ]
Mar 18 18:02:36 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5537] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:02:36 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5537] stderr: [ InsecureRequestWarning) ]
Mar 18 18:03:44 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5889] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:03:44 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[5889] stderr: [ InsecureRequestWarning) ]
Mar 18 18:04:51 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6193] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:04:51 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6193] stderr: [ InsecureRequestWarning) ]
Mar 18 18:05:59 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6544] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:05:59 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6544] stderr: [ InsecureRequestWarning) ]
Mar 18 18:07:06 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6920] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:07:06 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[6920] stderr: [ InsecureRequestWarning) ]
Mar 18 18:08:14 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7282] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:08:14 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7282] stderr: [ InsecureRequestWarning) ]
Mar 18 18:09:21 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7586] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:09:21 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7586] stderr: [ InsecureRequestWarning) ]
Mar 18 18:10:28 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7936] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:10:28 [1824] pouj-pgsql-3.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[7936] stderr: [ InsecureRequestWarning) ]

pouj-pgsql-4:

No PostgreSQL log entries.
pg_hba.conf:

host replication postgres 172.31.3.99/32 reject
host replication postgres pouj-pgsql-3 reject
host replication postgres 0.0.0.0/0 trust

[root@pouj-pgsql-4 cluster]# grep "^pgsqlms" /var/log/cluster/corosync.log

[root@pouj-pgsql-4 cluster]#

[root@pouj-pgsql-4 cluster]# grep error /var/log/cluster/corosync.log

[root@pouj-pgsql-4 cluster]#

[root@pouj-pgsql-4 cluster]# grep warning /var/log/cluster/corosync.log

Mar 18 18:01:12 [15463] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[15494] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:01:12 [15463] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[15494] stderr: [ InsecureRequestWarning) ]
Mar 18 18:01:21 [15463] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[15601] stderr: [ /usr/lib/python2.7/site-packages/urllib3/connectionpool.py:769: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.org/en/latest/security.html ]
Mar 18 18:01:21 [15463] pouj-pgsql-4.xxx.lan stonith-ng: warning: log_action: fence_vmware_soap[15601] stderr: [ InsecureRequestWarning) ]

ioguix · 18/03/2017 23:59:35

Hello,

Note that I am willing to help you, but your question about PGDATA made me think you had not read the documentation. Rather than writing long explanations here, we might as well make use of the documentation we spent time writing.

As for the rest, if the pg_hba.conf files shown here are complete, there are several errors. To begin with, the following PostgreSQL messages indicate that a connection through a local Unix socket is refused because no line in pg_hba.conf matches it:

FATAL: no pg_hba.conf entry for host "[local]", user "postgres", database "postgres", SSL off

I assume these connections come from PAF...

So add the following line at the top of the file:

local all postgres peer

That said, this does not leave much room for remote connections from an application...

Next, it seems that on "pouj-pgsql-4" you are rejecting replication connections not from "pouj-pgsql-4" but from "pouj-pgsql-3"...
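
For anyone hitting the same `no pg_hba.conf entry for host "[local]"` message, here is a minimal sketch of a check for Unix-socket rules in a pg_hba.conf file. The function name `list_local_rules` is hypothetical, and the file location under `$PGDATA` is an assumption to adapt to your own layout.

```shell
# Hypothetical helper: print the uncommented "local" rules of a pg_hba.conf file.
# The exit status is non-zero when the file has no such rule, which is the
# situation reported by 'no pg_hba.conf entry for host "[local]"'.
list_local_rules() {
    grep -E '^[[:space:]]*local[[:space:]]' "$1"
}

# Example call (the path is an assumption, adapt it to your pgdata):
# list_local_rules "$PGDATA/pg_hba.conf" || echo "no local rule found"
```

This only tells you whether a `local` rule exists. PostgreSQL uses the first matching line, so the rule's position in the file still has to be checked by eye.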

ruizsebastien · 19/03/2017 12:38:42

Thank you for your help.
Actually, I have known PostgreSQL well for several years, and my adventures with PAF have been so full of pitfalls and failures that I ended up doubting my own knowledge :-(
Hence my question about PGDATA...
For the pg_hba.conf files, I am making the suggested changes and restarting.
As for remote connections, that was intentional: I wanted to validate that the cluster works properly before opening connections to applications.
As for the last error: a bad copy and paste...

Thanks in any case for your advice :-)

-------------------------------------------------------
[UPDATE after fixes]
-------------------------------------------------------
After applying ioguix's latest recommendations, the cluster works :-)

[root@pouj-pgsql-3 corosync]# pcs status --full
Cluster name: cluster_xxx
Stack: corosync
Current DC: pouj-pgsql-3.xxx.lan (1) (version 1.1.15-11.el7_3.4-e174ec8) - partition with quorum
Last updated: Sun Mar 19 11:59:00 2017 Last change: Sun Mar 19 11:53:22 2017 by root via crm_attribute on pouj-pgsql-3.xxx.lan
2 nodes and 6 resources configured
Online: [ pouj-pgsql-3.xxx.lan (1) pouj-pgsql-4.xxx.lan (2) ]
Full list of resources:
fence_vm_pouj-pgsql-3.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-4.xxx.lan
fence_vm_pouj-pgsql-4.xxx.lan (stonith:fence_vmware_soap): Started pouj-pgsql-3.xxx.lan
pgsql-master-ip (ocf::heartbeat:IPaddr2): Started pouj-pgsql-3.xxx.lan
Master/Slave Set: pgsql-ha [pgsqld]
pgsqld (ocf::heartbeat:pgsqlms): Master pouj-pgsql-3.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Slave pouj-pgsql-4.xxx.lan
pgsqld (ocf::heartbeat:pgsqlms): Stopped
Masters: [ pouj-pgsql-3.xxx.lan ]
Slaves: [ pouj-pgsql-4.xxx.lan ]
Node Attributes:
* Node pouj-pgsql-3.xxx.lan (1):
+ master-pgsqld : 1001
* Node pouj-pgsql-4.xxx.lan (2):
+ master-pgsqld : 1000
Migration Summary:
* Node pouj-pgsql-3.xxx.lan (1):
fence_vm_pouj-pgsql-4.xxx.lan: migration-threshold=5 fail-count=1 last-failure='Sun Mar 19 11:07:07 2017'
* Node pouj-pgsql-4.xxx.lan (2):
Failed Actions:
* fence_vm_pouj-pgsql-4.xxx.lan_monitor_60000 on pouj-pgsql-3.xxx.lan 'not running' (7): call=75, status=complete, exitreason='none',
last-rc-change='Sun Mar 19 11:07:07 2017', queued=0ms, exec=1ms

PCSD Status:
pouj-pgsql-3.xxx.lan: Online
pouj-pgsql-4.xxx.lan: Online
Daemon Status:
corosync: active/disabled
pacemaker: active/disabled
pcsd: active/enabled

I do wonder, though, why the heartbeat is "stopped".

Last edited by ruizsebastien (19/03/2017 13:23:49)

Forums PostgreSQL.fr
