Distroname and release: Debian Wheezy

Pacemaker and Corosync HA - 2 Node setup

In this guide we will set up an HA failover solution using Corosync and Pacemaker, in an Active/Passive configuration.

HAPROXY:
  • HA load balancer and proxy for TCP and HTTP based applications. It would sit behind the floating IP, which clients use as the "pointer" to the active node.
    We will not use this here.
Pacemaker:
  • Resource Manager. Can for example start/stop services such as apache or postfix, using Resource Agents.
  • Resource Agents. Can monitor resources, for example apache, and take action if one of these resources is not running, like moving the floating IP.
There is a list of resource agents at http://www.linux-ha.org/wiki/Resource_Agents
There are two types of resources: LSB (Linux Standard Base) and OCF (Open Cluster Framework).
  • LSB - resources provided by Linux distributions (/etc/init.d/[service] scripts)
  • OCF - virtual IP addresses, health status monitoring, start/stop of resources
Corosync:
  • Communication layer. Provides reliable communication, manages cluster membership.

Installation and Setup

    Prerequisites

    • Host entries or working DNS resolution for all node names
    • NTP must be installed and configured on all nodes
    Simple example of host entries.
    /etc/hosts
    10.0.2.1	ldap1 testserver01
    10.0.2.100	ldap2 testserver02
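
Before continuing, it can help to confirm on each node that the hosts file really maps every node name. The following is a minimal sketch; hosts_lookup is a hypothetical helper written for this guide, not part of corosync or pacemaker, and it only reads a hosts file (it does not query DNS).

```shell
#!/bin/sh
# Print the IP address that a hosts file maps a name to (first match wins).
# hosts_lookup is an illustrative helper; the file defaults to /etc/hosts.
hosts_lookup() {
    name=$1
    file=${2:-/etc/hosts}
    awk -v name="$name" '
        /^[[:space:]]*#/ { next }          # skip comment lines
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^#/) break       # ignore trailing comments
                if ($i == name) { print $1; found = 1; exit }
            }
        }
        END { exit found ? 0 : 1 }
    ' "$file"
}

# Example: hosts_lookup testserver02
```

With the entries above, looking up testserver02 should print 10.0.2.100, and a name that is missing from the file makes the function return a non-zero status.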
    

    Installation

    We will install pacemaker; it should pull in corosync as a dependency. If it does not, install corosync as well.
    apt-get install pacemaker
    

    Configuration

    Edit /etc/corosync/corosync.conf. The bindnetaddr is the network address of the cluster subnet, NOT the IP address of the node.
    The mcastaddr is Debian's default, which is OK.
    /etc/corosync/corosync.conf
    interface {
            # The following values need to be set based on your environment
            ringnumber: 0
            bindnetaddr: 10.0.2.0
            mcastaddr: 226.94.1.1
            mcastport: 5405
       }
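
The network address is the node IP combined with the netmask by a bitwise AND. As a rough sketch, the helper below (network_address is a made-up name) does this per octet. The netmask 255.255.255.0 is an assumption, since this guide does not state it; use the netmask of your own interface.

```shell
#!/bin/sh
# Compute the network address (IP AND netmask), octet by octet.
# network_address is an illustrative helper, not a corosync tool.
network_address() {
    ip=$1 mask=$2
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $ip $mask
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# Assumed netmask 255.255.255.0; prints 10.0.2.0
network_address 10.0.2.1 255.255.255.0
```

Both example nodes (10.0.2.1 and 10.0.2.100) give the same result under that assumed netmask, which is why one bindnetaddr value works on both.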
    
    We also want corosync to start pacemaker automatically. If we do not do this, we will have to start pacemaker manually.
    ver: 0 tells corosync to start pacemaker automatically. Setting it to 1 means pacemaker must be started manually!
    /etc/corosync/corosync.conf
    service {
     	# Load the Pacemaker Cluster Resource Manager
     	ver:       0
     	name:      pacemaker
    }
    
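    Now we check the configuration by running corosync in the foreground. Note that -f does not only check the file: it starts corosync in the foreground, which makes configuration errors visible at startup. Stop it with Ctrl+C before continuing.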
    corosync -f
    
    Copy/paste the content of corosync.conf to the second node, or scp the file.
    scp /etc/corosync/corosync.conf 10.0.2.100:/etc/corosync/corosync.conf
    
    Now we are ready to start corosync on the first node. On Debian Wheezy, the init script apparently has a "bug" which makes corosync fail to start unless it is set up to start at boot.
    Make corosync start at boot time.
    /etc/default/corosync
    # start corosync at boot [yes|no]
    START=yes
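
The same change can be scripted, which is handy when it has to be repeated on the second node. This is only a sketch: enable_corosync_start is a hypothetical helper, and it assumes the file uses a plain START=... line as shown above.

```shell
#!/bin/sh
# Set START=yes in a corosync defaults file, adding the line if it is missing.
# enable_corosync_start is an illustrative helper; the path defaults to
# /etc/default/corosync as used above.
enable_corosync_start() {
    file=${1:-/etc/default/corosync}
    if grep -q '^START=' "$file"; then
        tmp=$(mktemp) || return 1
        # Rewrite through a temporary file so the original keeps its permissions.
        sed 's/^START=.*/START=yes/' "$file" > "$tmp" && cat "$tmp" > "$file"
        rm -f "$tmp"
    else
        echo 'START=yes' >> "$file"
    fi
}

# Example (as root): enable_corosync_start
```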
    
    Or disable the START check in the init script. Not really preferred, but it works.
    /etc/init.d/corosync
    #if [ "$START" != "yes" ]; then
    #        exit 0
    #fi
    
    Start corosync
    /etc/init.d/corosync start
    
    Check status using the crm_mon command. The -1 parameter tells it to only run once, and not "forever".
    crm_mon -1
    
    ============
    Last updated: Fri Dec  6 17:07:03 2013
    Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
    Stack: openais
    Current DC: testserver01 - partition WITHOUT quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    ============
    
    Online: [ testserver01 ]
    
    If you have not already done so, copy the config file to the second node.
    scp /etc/corosync/corosync.conf testserver02:/etc/corosync/
    
    Now on the second node, try to start corosync
    /etc/init.d/corosync start
    
    Check the status again. We should now hopefully see the second node joining. If not, check the firewalls!
    crm_mon -1
    
    ============
    Last updated: Fri Dec  6 17:07:56 2013
    Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
    Stack: openais
    Current DC: testserver01 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    0 Resources configured.
    ============
    
    Online: [ testserver01 testserver02 ]
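
For scripted checks, the "Online:" line can be parsed instead of read by eye. The sketch below assumes the crm_mon -1 output format shown above; all_online is a made-up helper name.

```shell
#!/bin/sh
# Succeed only if every named node appears on the "Online:" line of
# crm_mon -1 output read from standard input.
# all_online is an illustrative helper, not a pacemaker command.
all_online() {
    online=$(sed -n 's/^ *Online: \[ \(.*\) \] *$/\1/p')
    for node in "$@"; do
        case " $online " in
            *" $node "*) ;;
            *) echo "$node is not online" >&2; return 1 ;;
        esac
    done
}

# Example: crm_mon -1 | all_online testserver01 testserver02
```

A non-zero exit status means at least one of the named nodes has not joined yet.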
    
    Syntax Check, STONITH and QUORUM

    We can check for syntax and common errors using crm_verify -L.

    crm_verify -L
    
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
    crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
    Errors found during check: config not valid
      -V may provide more details
    
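    And we have an ERROR! No STONITH (Shoot The Other Node In The Head) resources are defined. Since this is only a simple 2 node setup, we disable STONITH. As the output notes, clusters with shared data need STONITH to ensure data integrity.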
    crm configure property stonith-enabled=false
    
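    In a 2 node cluster, stopping one of the two nodes leaves the remaining node without quorum, because the voting system can no longer reach a majority. By default that node then stops its resources.
    So tell the cluster to ignore the loss of QUORUM.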
    sudo crm configure property no-quorum-policy=ignore
    
    Try running crm_verify -L again; it now shows no errors.
    crm_verify -L
    

    Adding Resources: VIP

    Adding Floating IP / VIP
    crm configure primitive VIP ocf:heartbeat:IPaddr2 params ip=10.0.2.200 nic=eth0 op monitor interval=10s
    
    We should now have a VIP/floating IP. We can test this with a simple ping; it should respond when pinged from either node.
    ping 10.0.2.200 -c 3
    PING 10.0.2.200 (10.0.2.200) 56(84) bytes of data.
    64 bytes from 10.0.2.200: icmp_req=1 ttl=64 time=0.012 ms
    64 bytes from 10.0.2.200: icmp_req=2 ttl=64 time=0.011 ms
    64 bytes from 10.0.2.200: icmp_req=3 ttl=64 time=0.011 ms
    
    --- 10.0.2.200 ping statistics ---
    3 packets transmitted, 3 received, 0% packet loss, time 1999ms
    rtt min/avg/max/mdev = 0.011/0.011/0.012/0.002 ms
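
A ping only shows that the address answers, not which node currently holds it. One way to check is to look for the address in the output of ip on each node. This is a sketch; has_address is a hypothetical helper, and eth0 is taken from the nic parameter above.

```shell
#!/bin/sh
# Succeed if the given IPv4 address appears in `ip -o -4 addr show` output
# read from standard input. has_address is an illustrative helper.
has_address() {
    # -F: match the text literally, so the dots are not regex wildcards.
    grep -qF "inet $1/"
}

# Example, run on each node:
#   ip -o -4 addr show dev eth0 | has_address 10.0.2.200 && echo "VIP is here"
```

Only the node where the VIP resource is started should report the address.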
    

    Adding Resources: Services

    The service init scripts MUST respond with the correct LSB exit codes, or else this will not work as intended!
    http://www.linux-ha.org/wiki/LSB_Resource_Agents
    We will be using LSB resources in these examples.
    OCF resources are located in the folder /usr/lib/ocf/resource.d/heartbeat/

    Adding the postfix service.
    crm configure primitive HA-postfix lsb:postfix op monitor interval=15s
    
    Now both nodes should be able to start/stop postfix. Let's test the setup.
    crm_mon -1
    
    ============
    Last updated: Fri Dec  6 17:33:47 2013
    Last change: Fri Dec  6 17:33:25 2013 via cibadmin on testserver01
    Stack: openais
    Current DC: testserver02 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    ============
    
    Online: [ testserver01 testserver02 ]
    
     VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
     HA-postfix	(lsb:postfix):	Started testserver02
    
    As we can see, the VIP and postfix are running on different servers, which definitely WILL break our cluster!
    We can solve this in different ways: either by creating a COLOCATION constraint, or by adding the resources to the same group.
    I will create a group, because it is easier to manage: I can migrate the full group at once instead of the single resources.
    crm configure group HA-Group VIP HA-postfix
    
    If we check the status again, we can see that the two resources are now running on the same server.
    crm_mon -1
    
    ============
    Last updated: Fri Dec  6 17:40:55 2013
    Last change: Fri Dec  6 17:40:06 2013 via cibadmin on testserver02
    Stack: openais
    Current DC: testserver02 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    ============
    
    Online: [ testserver01 testserver02 ]
    
     Resource Group: HA-Group
         VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
         HA-postfix	(lsb:postfix):	Started testserver01
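
For comparison, the colocation route mentioned earlier could look roughly like the sketch below. It would be used instead of the group, not in addition to it. The constraint ids postfix-with-VIP and VIP-before-postfix are made up, and the constraint syntax is an assumption that should be checked against "crm configure help colocation" and "crm configure help order" on your version. The script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Print the crm commands for the colocation alternative instead of running
# them; set DRY_RUN=0 to execute them on a cluster node.
# The constraint ids postfix-with-VIP and VIP-before-postfix are made up.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Keep HA-postfix on whichever node runs VIP ...
run crm configure colocation postfix-with-VIP inf: HA-postfix VIP
# ... and start VIP before HA-postfix.
run crm configure order VIP-before-postfix inf: VIP HA-postfix
```

A group implies both the placement and the start order, which is part of why it is the simpler choice here.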
    

    Automatic Migration

    If a resource fails for some reason, for example postfix crashes and cannot start again, we want to migrate it to another server.
    By default the migration-threshold is not defined (set to infinity), which means the resource is never migrated.

    After 3 failures, migrate the resource away from the node, and expire the failures after 60 seconds. This allows it to move back to this node automatically.
    We can add migration-threshold=3 and failure-timeout=60s to the configuration using "crm configure edit".
    crm configure edit
    
    primitive HA-postfix lsb:postfix \
            op monitor interval="15s" \
            meta target-role="Started" migration-threshold="3" failure-timeout="60s"
    

    Commands

    Cleaning up the errors

    We can clean up the old "log" entries for errors like the ones below (the examples in this section use an apache resource named HA-apache).
    HA-apache_monitor_15000 (node=testserver01, call=173, rc=7, status=complete): not running
    HA-apache_monitor_15000 (node=testserver02, call=9, rc=7, status=complete): not running
    
    crm resource cleanup HA-apache
    
    Cleaning up HA-apache on testserver01
    Cleaning up HA-apache on testserver02
    Waiting for 3 replies from the CRMd... OK
    

    Modifying Resources Manually

    Resources can be easily edited using the "edit" tool.
    crm configure edit
    

    Deleting a resource

    A resource must be stopped before we can delete it. If it is a member of a group, the group must be stopped and deleted first.
    List the resources.
    crm_resource --list
     VIP	(ocf::heartbeat:IPaddr2) Started 
     HA-apache	(lsb:apache2) Started 
    

    Stopping and deleting a resource.

    crm resource stop HA-apache
    crm configure delete HA-apache
    

    Migrate / Move Resource

    Migrate a resource to a specific node.
    crm_resource --resource HA-Group --move --node testserver01
    

    View configuration

    crm configure show
    
    node testserver01
    node testserver02
    primitive VIP ocf:heartbeat:IPaddr2 \
    	params ip="10.0.2.200" nic="eth0" \
    	op monitor interval="10s"
    primitive postfix lsb:postfix \
    	op monitor interval="15s" \
    	meta target-role="Started" migration-threshold="1" failure-timeout="60s"
    group HA-Group VIP postfix
    location cli-prefer-HA-Group HA-Group \
    	rule $id="cli-prefer-rule-HA-Group" inf: #uname eq testserver01
    property $id="cib-bootstrap-options" \
    	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
    	cluster-infrastructure="openais" \
    	expected-quorum-votes="2" \
    	stonith-enabled="false" \
    	last-lrm-refresh="1386613142" \
    	no-quorum-policy="ignore"
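
The location constraint cli-prefer-HA-Group in this output was left behind by the manual move. The sketch below lists such leftovers from the configuration; list_move_constraints is a hypothetical helper, and matching on the cli- prefix is an assumption based on the id shown above.

```shell
#!/bin/sh
# List the ids of location constraints left behind by a manual move, read
# from `crm configure show` output on standard input.
# list_move_constraints is an illustrative helper.
list_move_constraints() {
    awk '$1 == "location" && $2 ~ /^cli-/ { print $2 }'
}

# Example: crm configure show | list_move_constraints
```

With a score of inf: the group stays tied to testserver01 until the constraint is removed, for example with "crm resource unmigrate HA-Group".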
    

    View status and fail counts

    crm_mon -1 --failcounts
    
    ============
    Last updated: Sat Dec  7 12:39:58 2013
    Last change: Fri Dec  6 19:51:41 2013 via cibadmin on testserver01
    Stack: openais
    Current DC: testserver01 - partition with quorum
    Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
    2 Nodes configured, 2 expected votes
    2 Resources configured.
    ============
    
    Online: [ testserver01 testserver02 ]
    
     Resource Group: HA-Group
         VIP	(ocf::heartbeat:IPaddr2):	Started testserver02
         postfix	(lsb:postfix):	Started testserver02
    
    Migration summary:
    * Node testserver02: 
    * Node testserver01: 
       postfix: migration-threshold=1 fail-count=1 last-failure='Sat Dec  7 12:39:47 2013'
    
    Failed actions:
        postfix_monitor_15000 (node=testserver01, call=8, rc=7, status=complete): not running
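
The migration summary can also be checked from a script, for example to warn when a resource has used up its migration-threshold on a node. This sketch assumes the line format shown above; at_threshold is a made-up helper name.

```shell
#!/bin/sh
# From the fail-count output of crm_mon on standard input, print every
# resource whose fail-count has reached its migration-threshold.
# at_threshold is an illustrative helper.
at_threshold() {
    awk '/migration-threshold=/ && /fail-count=/ {
        name = $1
        sub(/:$/, "", name)                # drop the trailing colon
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "migration-threshold") limit = kv[2]
            if (kv[1] == "fail-count") fails = kv[2]
        }
        if (fails + 0 >= limit + 0) print name, fails "/" limit
    }'
}

# Example: pipe the status output shown above into at_threshold
```

For the output above it prints "postfix 1/1", which matches postfix having been moved away from testserver01.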
    
    Ref: http://clusterlabs.org/wiki/Example_configurations
    Copyright LinuxLasse.net 2009 - 2017 All Rights Reserved.
