Distroname and release: Debian Wheezy

Pacemaker and Corosync HA - 2 Node setup

In this howto we will set up an HA failover solution using Corosync and Pacemaker, in an Active/Passive configuration.

HAPROXY:

HA load balancer and proxy for TCP and HTTP based applications. It is usually reached through the floating IP, which clients use as a "pointer".
We will not use it here.

Pacemaker:

Resource Manager. Can for example start/stop services such as apache or postfix, using Resource Agents.
Resource Agents monitor resources, for example apache. If one of these resources is not running, Pacemaker takes action, like restarting it or moving the floating IP.

There is a list of resource agents at http://www.linux-ha.org/wiki/Resource_Agents
There are two types of resource agents used here: LSB (Linux Standard Base) and OCF (Open Cluster Framework).

LSB - init scripts provided by the Linux distribution (/etc/init.d/[service])
OCF - cluster-aware scripts that can manage virtual IP addresses, monitor health status, start/stop resources

Corosync:

Communication layer. Provides reliable communication, manages cluster membership.

Installation and Setup

Prerequisites

Host entries or DNS resolution, so the nodes can resolve each other's names
NTP must be installed and configured on all nodes

Simple example of host entries.

/etc/hosts
10.0.2.1	ldap1 testserver01
10.0.2.100	ldap2 testserver02

Installation

We will install pacemaker. It should pull in corosync as a dependency; if not, install corosync as well.

apt-get install pacemaker

Configuration

Edit /etc/corosync/corosync.conf. The bindnetaddr is the network address of the cluster subnet, NOT the IP address of the node.
The mcastaddr is Debian's default, which is OK.

/etc/corosync/corosync.conf
interface {
        # The following values need to be set based on your environment
        ringnumber: 0
        bindnetaddr: 10.0.2.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
   }
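
Since bindnetaddr must be the network address rather than a node address, it can help to derive it from a node's IP and netmask. The following is a minimal sketch in POSIX shell. The helper name network_address is made up for this example, and the 255.255.255.0 netmask is an assumption, since the setup above does not state one.

```shell
# network_address IP NETMASK
# Prints the network address: each IP octet ANDed with the netmask octet.
# Hypothetical helper, not part of corosync.
network_address() {
    old_ifs=$IFS
    IFS=.
    # Split both dotted quads into eight positional parameters.
    set -- $1 $2
    IFS=$old_ifs
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

# Assumed /24 netmask for both nodes from /etc/hosts.
network_address 10.0.2.1 255.255.255.0
network_address 10.0.2.100 255.255.255.0
```

Both calls print 10.0.2.0, which matches the bindnetaddr used above.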

We also want corosync to start pacemaker automatically. If we do not do this, we will have to start pacemaker manually.
ver: 0 tells corosync to start pacemaker automatically. Setting it to 1 requires starting pacemaker manually!

/etc/corosync/corosync.conf
service {
 	# Load the Pacemaker Cluster Resource Manager
 	ver:       0
 	name:      pacemaker
}

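Now we check the configuration by running corosync in the foreground. Note that -f is not a dedicated syntax check: it starts the daemon in the foreground, so stop it with Ctrl+C once it has come up without complaints.

corosync -f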

Copy/paste the content of corosync.conf, or scp the file to the second node.

scp /etc/corosync/corosync.conf 10.0.2.100:/etc/corosync/corosync.conf

Now we are ready to start corosync on the first node. On Debian Wheezy, the init script exits without starting corosync unless START=yes is set in /etc/default/corosync.
So make corosync start at boot time.

/etc/default/corosync
# start corosync at boot [yes|no]
START=yes

Or comment out the START check in the init script. Not really preferred, but it works.

/etc/init.d/corosync
#if [ "$START" != "yes" ]; then
#        exit 0
#fi
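
For the first option (START=yes), the edit can also be scripted, which is handy when preparing both nodes. A minimal sketch, assuming GNU sed as shipped with Debian. The function name is made up, and the file path is passed in so it can be tried on a copy first.

```shell
# enable_corosync_start FILE
# Sets START=yes in a corosync defaults file (normally /etc/default/corosync).
# Hypothetical helper; appends the line if no START= entry exists yet.
enable_corosync_start() {
    if grep -q '^START=' "$1"; then
        sed -i 's/^START=.*/START=yes/' "$1"
    else
        echo 'START=yes' >> "$1"
    fi
}
```

As root, this would be called as: enable_corosync_start /etc/default/corosync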

Start corosync

/etc/init.d/corosync start

Check status using the crm_mon command. The -1 parameter tells it to only run once, and not "forever".

crm_mon -1

============
Last updated: Fri Dec  6 17:07:03 2013
Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
Stack: openais
Current DC: testserver01 - partition WITHOUT quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ testserver01 ]

Copy the config file to the second node, if you have not done so already.

scp /etc/corosync/corosync.conf testserver02:/etc/corosync/

Now on the second node, set START=yes in /etc/default/corosync as well, then start corosync.

/etc/init.d/corosync start

Check the status again. We should now see the second node joining. If not, check that the firewalls allow the corosync traffic between the nodes (UDP port 5405, the mcastport set above, and 5404).

crm_mon -1

============
Last updated: Fri Dec  6 17:07:56 2013
Last change: Fri Dec  6 17:05:08 2013 via crmd on testserver01
Stack: openais
Current DC: testserver01 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ testserver01 testserver02 ]
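
When scripting the setup, the "Online:" line can be checked instead of reading the output by eye. This is a rough sketch that relies only on the crm_mon -1 output format shown above; the function names are made up for this example.

```shell
# online_nodes: read crm_mon -1 output on stdin and print one online
# node name per line, taken from the "Online: [ ... ]" line.
online_nodes() {
    awk '/^Online:/ { for (i = 3; i < NF; i++) print $i }'
}

# node_is_online NAME: succeed if NAME is listed as online.
node_is_online() {
    online_nodes | grep -qxF "$1"
}
```

For example: crm_mon -1 | node_is_online testserver02 || echo "testserver02 has not joined"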

We can check for syntax and common errors using crm_verify -L

Syntax Check, STONITH and QUORUM.

crm_verify -L

crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
  -V may provide more details

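And we have an ERROR! Since this is only a simple 2 node cluster, we want to disable STONITH (Shoot The Other Node In The Head). As the output above notes, clusters with shared data need STONITH to ensure data integrity, so do not disable it there.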

crm configure property stonith-enabled=false

In a 2 node cluster, if one of the two nodes stops, the remaining node holds only 1 of 2 votes and loses quorum, so by default it stops its resources.
So we tell Pacemaker to ignore the loss of quorum.

crm configure property no-quorum-policy=ignore
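
The reason a 2 node cluster has trouble with quorum is simple arithmetic: a partition has quorum only when it holds more than half of the expected votes. A small sketch of that rule follows. The function name is made up, and it only illustrates the majority rule; it does not query Pacemaker.

```shell
# has_quorum PRESENT EXPECTED
# Succeeds when PRESENT votes are a strict majority of EXPECTED votes.
has_quorum() {
    [ $(( $1 * 2 )) -gt "$2" ]
}

has_quorum 2 2 && echo "2 of 2 votes: quorum"
has_quorum 1 2 || echo "1 of 2 votes: no quorum"
has_quorum 2 3 && echo "2 of 3 votes: quorum"
```

With 2 expected votes a single surviving node can never reach a majority, which is why this setup ignores quorum; a cluster with 3 nodes does not have that problem.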

Try running crm_verify -L again; it now shows no errors.

crm_verify -L

Adding Resources: VIP

Adding Floating IP / VIP

crm configure primitive VIP ocf:heartbeat:IPaddr2 params ip=10.0.2.200 nic=eth0 op monitor interval=10s

Now we should have added a VIP/Floating IP. We can test this with a simple ping; the VIP should respond when pinged from either node.

ping 10.0.2.200 -c 3
PING 10.0.2.200 (10.0.2.200) 56(84) bytes of data.
64 bytes from 10.0.2.200: icmp_req=1 ttl=64 time=0.012 ms
64 bytes from 10.0.2.200: icmp_req=2 ttl=64 time=0.011 ms
64 bytes from 10.0.2.200: icmp_req=3 ttl=64 time=0.011 ms

--- 10.0.2.200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.011/0.011/0.012/0.002 ms
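
Ping only shows that the VIP answers, not which node currently holds it. One way to check is to look for the address on the interface of each node. The sketch below parses the one-line output of ip -o -4 addr show; the helper name holds_ip is made up for this example.

```shell
# holds_ip ADDRESS
# Reads `ip -o -4 addr show` output on stdin and succeeds if ADDRESS
# is configured (the field after "inet" is ADDRESS/PREFIX).
holds_ip() {
    awk -v want="$1" '
        {
            for (i = 1; i < NF; i++)
                if ($i == "inet") {
                    split($(i + 1), a, "/")
                    if (a[1] == want) found = 1
                }
        }
        END { exit !found }'
}
```

For example, on each node: ip -o -4 addr show dev eth0 | holds_ip 10.0.2.200 && echo "VIP is on this node"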

Adding Resources: Services

The init scripts MUST return the correct LSB exit codes, or else this will not work as intended!
http://www.linux-ha.org/wiki/LSB_Resource_Agents
We will be using LSB resources in these examples.
OCF resources are located in the folder /usr/lib/ocf/resource.d/heartbeat/
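
As for the exit codes of LSB scripts: the LSB specification expects the status action to return 0 when the service is running and 3 when it is not running. The sketch below checks only that one point and is no substitute for the full checklist on the linux-ha page; the function name is made up for this example.

```shell
# lsb_status_ok SCRIPT
# Runs "SCRIPT status" and succeeds only if the exit code is one the
# cluster can interpret cleanly: 0 (running) or 3 (not running).
lsb_status_ok() {
    "$1" status >/dev/null 2>&1
    rc=$?
    [ "$rc" -eq 0 ] || [ "$rc" -eq 3 ]
}
```

For example: lsb_status_ok /etc/init.d/postfix || echo "postfix init script returns unexpected status codes"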

Adding the postfix service.

crm configure primitive HA-postfix lsb:postfix op monitor interval=15s

Now both nodes should be able to start/stop postfix. Let's test the setup.

crm_mon -1

============
Last updated: Fri Dec  6 17:33:47 2013
Last change: Fri Dec  6 17:33:25 2013 via cibadmin on testserver01
Stack: openais
Current DC: testserver02 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
 HA-postfix	(lsb:postfix):	Started testserver02

As we can see, the VIP and postfix are running on different servers, which definitely WILL break our setup: mail sent to the VIP reaches a node where postfix is not running.
We can solve this in different ways: either by creating a COLOCATION constraint, or by adding the resources to the same group.
I will create a group, because it is easier to manage: I can migrate the full group at once, instead of the single resources.

crm configure group HA-Group VIP HA-postfix

If we check the status again, we can see that the two resources are now running on the same server.

crm_mon -1

============
Last updated: Fri Dec  6 17:40:55 2013
Last change: Fri Dec  6 17:40:06 2013 via cibadmin on testserver02
Stack: openais
Current DC: testserver02 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 Resource Group: HA-Group
     VIP	(ocf::heartbeat:IPaddr2):	Started testserver01
     HA-postfix	(lsb:postfix):	Started testserver01
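
Whether all resources really ended up on one node can also be checked from a script. The following sketch relies only on the crm_mon -1 line format shown above ("... Started <node>"); the function names are made up for this example.

```shell
# started_on: read crm_mon -1 output on stdin and print the distinct
# nodes on which resources are reported as "Started".
started_on() {
    awk 'NF >= 2 && $(NF - 1) == "Started" { print $NF }' | sort -u
}

# resources_colocated: succeed if all started resources share one node.
resources_colocated() {
    [ "$(started_on | grep -c .)" -eq 1 ]
}
```

For example: crm_mon -1 | resources_colocated || echo "resources are spread over several nodes"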

Automatic Migration

If a resource fails for some reason, for example postfix crashes and cannot start again, we want to migrate it to the other server.
By default the migration-threshold is not defined (set to infinity), so Pacemaker will never migrate it.

We want to migrate the resource away from a node after 3 failures, and expire the failures after 60 seconds. This allows the resource to move back to this node later.
We can add migration-threshold=3 and failure-timeout=60s to the configuration, using "crm configure edit".

crm configure edit

primitive HA-postfix lsb:postfix \
        op monitor interval="15s" \
        meta target-role="Started" migration-threshold="3" failure-timeout="60s"

Commands

Cleaning up the errors

We can clean up old failure entries from the status output, like the ones below.

HA-apache_monitor_15000 (node=testserver01, call=173, rc=7, status=complete): not running
HA-apache_monitor_15000 (node=testserver02, call=9, rc=7, status=complete): not running

crm resource cleanup HA-apache

Cleaning up HA-apache on testserver01
Cleaning up HA-apache on testserver02
Waiting for 3 replies from the CRMd... OK

Modifying Resources Manually

Resources can be easily edited using the "edit" tool.

crm configure edit

Deleting a resource

A resource must be stopped before we can delete it. If it is a member of a group, the group must be stopped and deleted first.
List the resources.

crm_resource --list
 VIP	(ocf::heartbeat:IPaddr2) Started 
 HA-apache	(lsb:apache2) Started

Stopping and deleting a resource.

crm resource stop HA-apache
crm configure delete HA-apache

Migrate / Move Resource

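Migrate a resource to a specific node. Note that this adds a location constraint (cli-prefer-HA-Group in the configuration below) that pins the group to that node until the constraint is removed, for example with crm_resource --resource HA-Group --un-move.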

crm_resource --resource HA-Group --move --node testserver01

View configuration

crm configure show

node testserver01
node testserver02
primitive VIP ocf:heartbeat:IPaddr2 \
	params ip="10.0.2.200" nic="eth0" \
	op monitor interval="10s"
primitive postfix lsb:postfix \
	op monitor interval="15s" \
	meta target-role="Started" migration-threshold="1" failure-timeout="60s"
group HA-Group VIP postfix
location cli-prefer-HA-Group HA-Group \
	rule $id="cli-prefer-rule-HA-Group" inf: #uname eq testserver01
property $id="cib-bootstrap-options" \
	dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
	cluster-infrastructure="openais" \
	expected-quorum-votes="2" \
	stonith-enabled="false" \
	last-lrm-refresh="1386613142" \
	no-quorum-policy="ignore"

View status and fail counts

crm_mon -1 --fail

============
Last updated: Sat Dec  7 12:39:58 2013
Last change: Fri Dec  6 19:51:41 2013 via cibadmin on testserver01
Stack: openais
Current DC: testserver01 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 Resource Group: HA-Group
     VIP	(ocf::heartbeat:IPaddr2):	Started testserver02
     postfix	(lsb:postfix):	Started testserver02

Migration summary:
* Node testserver02: 
* Node testserver01: 
   postfix: migration-threshold=1 fail-count=1 last-failure='Sat Dec  7 12:39:47 2013'

Failed actions:
    postfix_monitor_15000 (node=testserver01, call=8, rc=7, status=complete): not running
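
The migration summary can be evaluated from a script as well, for example to get a warning when a resource has used up its migration-threshold on a node. This is a rough sketch that relies only on the "migration-threshold=N fail-count=M" line format shown above; the function name is made up for this example.

```shell
# over_threshold: read `crm_mon -1 --fail` output on stdin and print the
# name of every resource whose fail-count has reached its threshold.
over_threshold() {
    awk '/migration-threshold=/ {
        limit = fails = ""
        name = $1
        sub(/:$/, "", name)            # "postfix:" -> "postfix"
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "migration-threshold") limit = kv[2]
            if (kv[1] == "fail-count") fails = kv[2]
        }
        if (fails + 0 >= limit + 0) print name
    }'
}
```

For example: crm_mon -1 --fail | over_threshold

Applied to the output above this prints postfix, since fail-count=1 has reached migration-threshold=1 on testserver01.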

Ref: http://clusterlabs.org/wiki/Example_configurations