Distroname and release: Debian Wheezy

Pacemaker and Corosync HA - 2 Node setup

In this guide we will set up an HA failover solution using Corosync and Pacemaker, in an Active/Passive configuration.

  • HA load balancer and proxy for TCP and HTTP based applications. Provides the floating IP which we should use as "pointer".
    We will not use this here.
  • Pacemaker: resource manager. Can for example start/stop corosync, apache etc., using Resource Agents.
  • Resource Agents. Can monitor resources, for example apache, and take action if one of these resources is not running, like moving the floating IP.
There is a list of resource agents on http://www.linux-ha.org/wiki/Resource_Agents
There are two types of resources: LSB (Linux Standard Base) and OCF (Open Cluster Framework).
  • LSB - resources provided by Linux distributions (/etc/init.d/[service] scripts)
  • OCF - Virtual IP-Addresses, monitor health status, start/stop resources
  • Corosync: communication layer. Provides reliable communication, manages cluster membership.

Installation and Setup


    • Host names must resolve on all nodes, via /etc/hosts or DNS
    • NTP must be installed and configured on all nodes
    Simple example of host entries.
    /etc/hosts
    <IP of node 1>	ldap1 testserver01
    <IP of node 2>	ldap2 testserver02


    We will install pacemaker; it should pull in corosync as a dependency. If not, install corosync as well.
    apt-get install pacemaker


    Edit /etc/corosync/corosync.conf. The bind address (bindnetaddr) is the network address of the cluster subnet, NOT the IP address of the node.
    The mcastaddr is Debian's default, which is OK.
    interface {
            # The following values need to be set based on your environment
            ringnumber: 0
            bindnetaddr: <network address>
            mcastaddr: <Debian default>
            mcastport: 5405
    }
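
Because the bind address must be the network address rather than a node's own IP, it can be worked out by AND-ing the node's IP with its netmask. A minimal sketch in POSIX shell; the function name network_address and the sample addresses are invented for illustration and are not part of this setup:

```shell
#!/bin/sh
# network_address IP NETMASK
# Prints the IPv4 network address (IP AND netmask) in dotted-quad form.
network_address() {
    # Run in a subshell so the IFS change does not leak to the caller.
    (
        IFS=.
        # Word splitting on "." yields 8 octets:
        # $1..$4 belong to the IP, $5..$8 to the netmask.
        set -- $1 $2
        echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
    )
}

# Hypothetical node address and netmask.
network_address 192.168.1.17 255.255.255.0   # prints 192.168.1.0
```

The printed value is what would go into bindnetaddr for a node on that subnet.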
    We also want corosync to start pacemaker automatically. If we do not do this, we will have to start pacemaker manually.
    ver: 0 tells corosync to start pacemaker automatically. Setting it to 1 requires starting pacemaker manually!
    service {
     	# Load the Pacemaker Cluster Resource Manager
     	ver:       0
     	name:      pacemaker
    }
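    Now we check the file for syntax and configuration errors by running corosync in the foreground (-f); it should report any configuration errors on startup. Stop it again with Ctrl+C.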
    corosync -f
    Copy/paste the content of corosync.conf, or scp the file to the second node.
    scp /etc/corosync/corosync.conf
    Now we are ready to start corosync on the first node. On Debian Wheezy, the init script apparently has a "bug" which makes corosync fail to start unless it is set up to start at boot.
    Make corosync start at boot time in /etc/default/corosync.
    # start corosync at boot [yes|no]
    START=yes
    Or disable the on-boot check in the init script... Not really preferred, but it works.
    #if [ "$START" != "yes" ]; then
    #        exit 0
    #fi
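
The START setting can also be checked and flipped from a script. A minimal sketch, assuming the setting lives in /etc/default/corosync (the usual place for init script defaults on Debian; verify this on your system). Both function names are hypothetical:

```shell
#!/bin/sh
# corosync_start_enabled FILE
# Succeeds if FILE contains an uncommented START=yes line.
corosync_start_enabled() {
    grep -q '^START=yes[[:space:]]*$' "$1"
}

# enable_corosync_start FILE
# Rewrites an uncommented START=... line to START=yes, or appends one
# if the file has none. Goes through a temporary file for portability.
enable_corosync_start() {
    file=$1
    tmp=$(mktemp) || return 1
    if grep -q '^START=' "$file"; then
        sed 's/^START=.*/START=yes/' "$file" > "$tmp"
    else
        cat "$file" > "$tmp"
        echo 'START=yes' >> "$tmp"
    fi
    cat "$tmp" > "$file"   # overwrite in place, keeping owner and mode
    rm -f "$tmp"
}

# Usage on a node, as root (path is an assumption):
#   corosync_start_enabled /etc/default/corosync ||
#       enable_corosync_start /etc/default/corosync
```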
    Start corosync
    /etc/init.d/corosync start
    Check status using the crm_mon command. The -1 parameter tells it to only run once, and not "forever".
    crm_mon -1
    Last updated: Fri Dec  6 17:07:03 2013
    Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
    Stack: openais
    Current DC: testserver01 - partition WITHOUT quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    Online: [ testserver01 ]
    Copy the config file to the second node, if this has not been done already.
    scp /etc/corosync/corosync.conf testserver02:/etc/corosync/
    Now on the second node, try to start corosync
    /etc/init.d/corosync start
    Check the status again. We should now hopefully see the second node joining. If not, check the firewalls: corosync needs UDP traffic on the mcastport (5405 here) and on mcastport - 1 (5404).
    crm_mon -1
    Last updated: Fri Dec  6 17:07:56 2013
    Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
    Stack: openais
    Current DC: testserver01 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    Online: [ testserver01 testserver02 ]
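
To check node membership from a script rather than by eye, the Online line of the crm_mon -1 output can be parsed. A small sketch; online_nodes is a hypothetical helper name, and the sample input is copied from the output above:

```shell
#!/bin/sh
# online_nodes
# Reads `crm_mon -1` output on stdin and prints the node names from the
# "Online: [ ... ]" line, one per line.
online_nodes() {
    awk '$1 == "Online:" {
        for (i = 2; i <= NF; i++)
            if ($i != "[" && $i != "]") print $i
    }'
}

# Sample input taken from the status output shown above.
online_nodes <<'EOF'
Stack: openais
Current DC: testserver01 - partition with quorum
Online: [ testserver01 testserver02 ]
EOF
```

On a cluster node the same helper would be fed live data with crm_mon -1 | online_nodes.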
    We can check for syntax and common errors using crm_verify -L

    Syntax Check, STONITH and QUORUM.

    crm_verify -L
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
    Errors found during check: config not valid
      -V may provide more details
    And we have an ERROR! Since this is only a 2 node cluster, we want to disable STONITH (Shoot The Other Node In The Head).
    crm configure property stonith-enabled=false
    In a 2 node cluster, stopping one of the two nodes leaves the remaining node without a majority of the votes, so it loses quorum and stops its resources.
    So tell the cluster to ignore the loss of quorum.
    sudo crm configure property no-quorum-policy=ignore
    Try running crm_verify -L again; it now shows no errors.
    crm_verify -L
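
The quorum problem in a 2 node cluster comes down to simple arithmetic: a partition has quorum only when it holds more than half of the expected votes. A simplified model of that rule (has_quorum is an illustrative name, not a Pacemaker command):

```shell
#!/bin/sh
# has_quorum ONLINE_VOTES EXPECTED_VOTES
# Succeeds only if the online votes form a strict majority.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

# Walk through the two possible states of a 2 node cluster.
for online in 1 2; do
    if has_quorum "$online" 2; then
        echo "$online of 2 votes: quorum"
    else
        echo "$online of 2 votes: no quorum"
    fi
done
```

With 2 expected votes a single surviving node can never hold a majority, which is why no-quorum-policy=ignore is set above.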

    Adding Resources: VIP

    Adding Floating IP / VIP
    crm configure primitive VIP ocf:IPaddr2 params ip= nic=eth0 op monitor interval=10s
    Now we should have added a VIP/floating IP. We can test this with a simple ping; it should respond from both nodes.
    ping -c 3
    PING ( 56(84) bytes of data.
    64 bytes from icmp_req=1 ttl=64 time=0.012 ms
    64 bytes from icmp_req=2 ttl=64 time=0.011 ms
    64 bytes from icmp_req=3 ttl=64 time=0.011 ms
    --- ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 1999ms
    rtt min/avg/max/mdev = 0.011/0.011/0.012/0.002 ms

    Adding Resources: Services

    The services MUST respond with the correct response codes (the LSB exit codes of the init script), or else this will not work as intended!
    We will be using LSB resources, in these examples.
    OCF resources are located in the folder /usr/lib/ocf/resource.d/heartbeat/
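
A quick way to see whether an init script returns the status codes the cluster relies on: the LSB specification expects the status action to exit with 0 when the service is running and 3 when it is stopped. A hedged sketch; lsb_status_matches is a made-up helper name:

```shell
#!/bin/sh
# lsb_status_matches SCRIPT EXPECTED_CODE
# Runs "SCRIPT status" and compares its exit code with EXPECTED_CODE.
lsb_status_matches() {
    script=$1
    expected=$2
    rc=0
    "$script" status >/dev/null 2>&1 || rc=$?
    if [ "$rc" -eq "$expected" ]; then
        echo "OK: $script status returned $rc"
        return 0
    fi
    echo "MISMATCH: $script status returned $rc, expected $expected"
    return 1
}

# Usage on a node, as root:
#   /etc/init.d/postfix stop;  lsb_status_matches /etc/init.d/postfix 3
#   /etc/init.d/postfix start; lsb_status_matches /etc/init.d/postfix 0
```

If either check reports a mismatch, the monitor operation is likely to misjudge the state of the service.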

    Adding the postfix service.
    crm configure primitive HA-postfix lsb:postfix op monitor interval=15s
    Now both nodes should be able to start/stop postfix. Let's test the setup.
    crm_mon -1
    Last updated: Fri Dec  6 17:33:47 2013
    Last change: Fri Dec  6 17:33:25 2013 via cibadmin on testserver01
    Stack: openais
    Current DC: testserver02 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    Online: [ testserver01 testserver02 ]
     VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
     HA-postfix	(lsb:postfix):	Started testserver02
    As we can see, the VIP and postfix are running on different servers, which definitely WILL break our cluster!
    We can solve this in different ways: either by creating a COLOCATION constraint, or by adding the resources to the same group.
    I will create a group, because it is easier to manage: I can migrate the full group at once, instead of the single resources.
    crm configure group HA-Group VIP HA-postfix
    If we check the status again, we can see that the two resources are now running on the same server.
    crm_mon -1
    Last updated: Fri Dec  6 17:40:55 2013
    Last change: Fri Dec  6 17:40:06 2013 via cibadmin on testserver02
    Stack: openais
    Current DC: testserver02 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    Online: [ testserver01 testserver02 ]
     Resource Group: HA-Group
         VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
         HA-postfix	(lsb:postfix):	Started testserver01
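
Whether two resources ended up on the same node can also be verified from a script by parsing saved crm_mon -1 output. A sketch under the assumption that resource lines look like the ones shown above; resource_node and on_same_node are hypothetical helper names:

```shell
#!/bin/sh
# resource_node NAME
# Reads `crm_mon -1` output on stdin and prints the node on which
# resource NAME is reported as Started.
resource_node() {
    awk -v rsc="$1" 'NF >= 3 && $1 == rsc && $(NF - 1) == "Started" { print $NF }'
}

# on_same_node FILE RSC...
# Succeeds if every named resource in the saved output FILE is started,
# and all of them are on one and the same node.
on_same_node() {
    file=$1
    shift
    first=
    for rsc in "$@"; do
        node=$(resource_node "$rsc" < "$file")
        if [ -z "$node" ]; then
            return 1
        fi
        if [ -z "$first" ]; then
            first=$node
        fi
        if [ "$node" != "$first" ]; then
            return 1
        fi
    done
    return 0
}

# Usage on a node:
#   crm_mon -1 > /tmp/status.txt
#   on_same_node /tmp/status.txt VIP HA-postfix && echo "co-located"
```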

    Automatic Migration

    If a resource fails for some reason, for example postfix crashes and cannot start again, we want to migrate it to another server.
    By default the migration-threshold is not defined (set to infinity), so the resource will never be migrated.

    After 3 failures, migrate the resource to another node, and expire the failure after 60 seconds. This allows the resource to move back to this node automatically.
    We can add migration-threshold=3 and failure-timeout=60s to the configuration, using "crm configure edit".
    crm configure edit
    primitive HA-postfix lsb:postfix \
            op monitor interval="15s" \
            meta target-role="Started" migration-threshold="3" failure-timeout="60s"
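
The same two meta attributes can probably also be set without opening an editor, using the resource meta subcommand of the crm shell. A sketch that assumes your crm shell version supports crm resource meta <rsc> set <attr> <value>; the run and set_failover_policy helpers are invented here, and DRY_RUN only prints the commands instead of executing them:

```shell
#!/bin/sh
# run CMD...
# Prints the command when DRY_RUN is non-empty, otherwise executes it.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_failover_policy RSC THRESHOLD TIMEOUT
# Sets migration-threshold and failure-timeout on resource RSC.
set_failover_policy() {
    run crm resource meta "$1" set migration-threshold "$2"
    run crm resource meta "$1" set failure-timeout "$3"
}

# Preview only; unset DRY_RUN to apply the change on a cluster node.
DRY_RUN=1
set_failover_policy HA-postfix 3 60s
```

Check the result afterwards with crm configure show before relying on it.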


    Cleaning up the errors

    We can clean up the old "log" entries, for errors like below.
    HA-apache_monitor_15000 (node=testserver01, call=173, rc=7, status=complete): not running
    HA-apache_monitor_15000 (node=testserver02, call=9, rc=7, status=complete): not running
    crm resource cleanup HA-apache
    Cleaning up HA-apache on testserver01
    Cleaning up HA-apache on testserver02
    Waiting for 3 replies from the CRMd... OK

    Modifying Resources Manually

    Resources can be easily edited using the "edit" tool.
    crm configure edit

    Deleting a resource

    A resource must be stopped before we can delete it. If it is a member of a group, the group must be stopped and deleted first.
    List the resources.
    crm_resource --list
     VIP	(ocf::heartbeat:IPaddr2) Started 
     HA-apache	(lsb:apache2) Started 

    Stopping and deleting a resource.

    crm resource stop HA-apache
    crm configure delete HA-apache

    Migrate / Move Resource

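    Migrate a resource to a specific node. This adds a cli-prefer location constraint (visible in the configuration below) that pins the resource to that node.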
    crm_resource --resource HA-Group --move --node testserver01

    View configuration

    crm configure show
    node testserver01
    node testserver02
    primitive VIP ocf:heartbeat:IPaddr2 \
    	params ip="" nic="eth0" \
    	op monitor interval="10s"
    primitive postfix lsb:postfix \
    	op monitor interval="15s" \
    	meta target-role="Started" migration-threshold="1" failure-timeout="60s"
    group HA-Group VIP postfix
    location cli-prefer-HA-Group HA-Group \
    	rule $id="cli-prefer-rule-HA-Group" inf: #uname eq testserver01
    property $id="cib-bootstrap-options" \
    	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    	cluster-infrastructure="openais" \
    	expected-quorum-votes="2" \
    	stonith-enabled="false" \
    	last-lrm-refresh="1386613142" \

    View status and fail counts

    crm_mon -1 --fail
    Last updated: Sat Dec  7 12:39:58 2013
    Last change: Fri Dec  6 19:51:41 2013 via cibadmin on testserver01
    Stack: openais
    Current DC: testserver01 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    Online: [ testserver01 testserver02 ]
     Resource Group: HA-Group
         VIP	(ocf::heartbeat:IPaddr2):	Started testserver02
         postfix	(lsb:postfix):	Started testserver02
    Migration summary:
    * Node testserver02: 
    * Node testserver01: 
       postfix: migration-threshold=1 fail-count=1 last-failure='Sat Dec  7 12:39:47 2013'
    Failed actions:
        postfix_monitor_15000 (node=testserver01, call=8, rc=7, status=complete): not running
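
The migration summary can be scanned for resources that have hit their threshold. A sketch that assumes the summary lines keep the "name: migration-threshold=N fail-count=M" layout shown above; over_threshold is a hypothetical helper name:

```shell
#!/bin/sh
# over_threshold
# Reads `crm_mon -1 --fail` output on stdin and prints every resource
# whose fail-count has reached its migration-threshold.
over_threshold() {
    awk '
        /migration-threshold=/ && /fail-count=/ {
            name = $1
            sub(/:$/, "", name)
            threshold = ""
            fails = ""
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^migration-threshold=/) {
                    threshold = $i
                    sub(/^[^=]*=/, "", threshold)
                }
                if ($i ~ /^fail-count=/) {
                    fails = $i
                    sub(/^[^=]*=/, "", fails)
                }
            }
            # A fail-count of INFINITY always counts as over the limit.
            if (fails == "INFINITY" || fails + 0 >= threshold + 0)
                print name
        }'
}

# Usage on a node:
#   crm_mon -1 --fail | over_threshold
```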
    Ref: http://clusterlabs.org/wiki/Example_configurations
    Do not trust the author's words! POC, tests and experience are key.

    Copyright LinuxLasse.net 2009 - 2025 All Rights Reserved.
