Distroname and release: Debian Wheezy
Pacemaker and Corosync HA - 2 Node setup
In this setup we will build an HA failover solution using Corosync and Pacemaker, in an Active/Passive setup.

HAPROXY:
- HA loadbalancer and proxy for TCP and HTTP based applications. Provides the floating IP which we should use as the "pointer". We will not use it here.

PACEMAKER:
- Resource Manager. Can for example start/stop corosync, apache etc., using Resource Agents.
- Resource Agents. Can monitor resources, for example apache, and take action if one of these resources is not running, like moving the floating IP.
There are two types of resources: LSB (Linux Standard Base) and OCF (Open Cluster Framework).
- LSB - resources provided by the Linux distribution (the /etc/init.d/[service] scripts)
- OCF - virtual IP addresses, health monitoring, start/stop of resources

COROSYNC:
- Communication layer. Provides reliable communication and manages cluster membership.
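The LSB and OCF resource agents available on a node can be listed from the crm shell. A small sketch, assuming the crm shell that ships with Pacemaker on Wheezy:

# List the resource agent classes known to the cluster
crm ra classes
# List LSB agents (the init scripts in /etc/init.d)
crm ra list lsb
# List OCF agents provided by the heartbeat provider
crm ra list ocf heartbeat
# Show the parameters a specific agent accepts
crm ra info ocf:heartbeat:IPaddr2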
Installation and Setup
Prerequisites
- Name resolution for all nodes, either via /etc/hosts or DNS
- NTP must be installed and configured on all nodes
/etc/hosts
10.0.2.1 ldap1 testserver01
10.0.2.100 ldap2 testserver02
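To cover the NTP prerequisite, a minimal setup on Debian Wheezy could look like this (a sketch, assuming the stock ntp package and its default pool servers are fine for your environment):

# Install the NTP daemon on both nodes
apt-get install ntp
# After a few minutes, verify that the daemon is synchronising against its peers
ntpq -p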
Installation
We will install pacemaker; it should pull in corosync as a dependency. If not, install corosync as well.
apt-get install pacemaker
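To confirm that both packages actually ended up installed (a quick sanity check, not part of the original steps):

# Both pacemaker and corosync should be listed with state "ii"
dpkg -l pacemaker corosync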
Configuration
Edit /etc/corosync/corosync.conf. The bindnetaddr is the network address, NOT the node's IP. The mcastaddr is Debian's default, which is OK.
/etc/corosync/corosync.conf
interface {
# The following values need to be set based on your environment
ringnumber: 0
bindnetaddr: 10.0.2.0
mcastaddr: 226.94.1.1
mcastport: 5405
}
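If you are unsure what the network address for bindnetaddr should be, the kernel routing table shows it for the cluster interface. A small sketch, assuming the cluster runs on eth0:

# The connected route shows the network address, e.g. "10.0.2.0/24 ... src 10.0.2.1"
ip route show dev eth0 scope link
# Or read the address and prefix directly from the interface
ip -o -f inet addr show eth0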
We also want corosync to start pacemaker automatically; otherwise we would have to start pacemaker manually. ver: 0 tells corosync to start pacemaker automatically. Setting it to 1 would require a manual start of pacemaker!
/etc/corosync/corosync.conf
service {
# Load the Pacemaker Cluster Resource Manager
ver: 0
name: pacemaker
}
Now we run a syntax/configuration check on the file.
corosync -f
Copy/paste the content of corosync.conf, or scp the file to the second node.
scp /etc/corosync/corosync.conf 10.0.2.100:/etc/corosync/corosync.conf
Now we are ready to start corosync on the first node. On Debian Wheezy, the init script apparently has a "bug" which makes corosync fail to start unless it is set up to start at boot.
Make corosync start at boot time.
/etc/default/corosync
# start corosync at boot [yes|no]
START=yes
Or disable the boot-time check in the init script... Not really preferred, but it works.
/etc/init.d/corosync
#if [ "$START" != "yes" ]; then
# exit 0
#fi
Start corosync
/etc/init.d/corosync start
Check the status using the crm_mon command. The -1 parameter tells it to run only once, instead of "forever".
crm_mon -1
============
Last updated: Fri Dec 6 17:07:03 2013
Last change: Fri Dec 6 17:05:08 2013 via crmd on testserver01
Stack: openais
Current DC: testserver01 - partition WITHOUT quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ testserver01 ]
Copy the config file to the second node.
scp /etc/corosync/corosync.conf testserver02:/etc/corosync/
Now, on the second node, try to start corosync.
/etc/init.d/corosync start
Check the status again. We should now hopefully see the second node joining. If not, check the firewalls!
crm_mon -1
============
Last updated: Fri Dec 6 17:07:56 2013
Last change: Fri Dec 6 17:05:08 2013 via crmd on testserver01
Stack: openais
Current DC: testserver01 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
0 Resources configured.
============

Online: [ testserver01 testserver02 ]
We can check for syntax and common errors using crm_verify -L.
Syntax Check, STONITH and QUORUM.
crm_verify -L
crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Resource start-up disabled since no STONITH resources have been defined
crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: Either configure some or disable STONITH with the stonith-enabled option
crm_verify[6616]: 2013/12/06_17:09:29 ERROR: unpack_resources: NOTE: Clusters with shared data need STONITH to ensure data integrity
Errors found during check: config not valid
-V may provide more details
And we have an ERROR! Since this is only a 2-node cluster, we want to disable STONITH (Shoot The Other Node In The Head).
crm configure property stonith-enabled=false
If you stop one of the two nodes in a 2-node cluster, the node which is still up fails as well, because the voting system (quorum) fails.
So disable QUORUM
sudo crm configure property no-quorum-policy=ignore
Try running crm_verify -L again, and it should show no errors.
crm_verify -L
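To double check that both properties were stored, they can be grepped out of the live configuration (a small extra check, using the property names set above):

# Both should show up under cib-bootstrap-options
crm configure show | grep -E 'stonith-enabled|no-quorum-policy'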
Adding Resources: VIP
Adding the Floating IP / VIP.
crm configure primitive VIP ocf:heartbeat:IPaddr2 params ip=10.0.2.200 nic=eth0 op monitor interval=10s
Now we should have added a VIP/Floating IP. We can test this with a simple ping; it should respond no matter which of the two nodes currently holds it.
ping 10.0.2.200 -c 3
PING 10.0.2.200 (10.0.2.200) 56(84) bytes of data.
64 bytes from 10.0.2.200: icmp_req=1 ttl=64 time=0.012 ms
64 bytes from 10.0.2.200: icmp_req=2 ttl=64 time=0.011 ms
64 bytes from 10.0.2.200: icmp_req=3 ttl=64 time=0.011 ms

--- 10.0.2.200 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 1999ms
rtt min/avg/max/mdev = 0.011/0.011/0.012/0.002 ms
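To see which node currently holds the VIP, we can either ask the cluster or look at the interface directly. A sketch, using the resource name VIP and interface eth0 from above:

# Ask Pacemaker where the resource is running
crm_resource --resource VIP --locate
# Or, on each node, check whether the address is bound to eth0
ip addr show eth0 | grep 10.0.2.200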
Adding Resources: Services
The services' init scripts MUST respond with the correct response codes, or else this will not work as intended!
http://www.linux-ha.org/wiki/LSB_Resource_Agents
We will be using LSB resources in these examples.
The OCF resource agents are located in the folder /usr/lib/ocf/resource.d/heartbeat/
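A quick way to sanity check an init script's exit codes is to run its status action in both the running and the stopped state. A rough sketch for postfix, assuming the Debian init script implements the status action (per LSB, status should return 0 while running and 3 while stopped):

# While postfix is running, status should exit with 0
/etc/init.d/postfix status; echo "exit code: $?"
# Stop it and check again; status should now exit with 3
/etc/init.d/postfix stop
/etc/init.d/postfix status; echo "exit code: $?"
# Start it again before handing control over to the cluster
/etc/init.d/postfix start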
Adding the Postfix service.
crm configure primitive HA-postfix lsb:postfix op monitor interval=15s
Now both nodes should be able to start/stop postfix. Let's test the setup.
crm_mon -1
============
Last updated: Fri Dec 6 17:33:47 2013
Last change: Fri Dec 6 17:33:25 2013 via cibadmin on testserver01
Stack: openais
Current DC: testserver02 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 VIP        (ocf::heartbeat:IPaddr2):       Started testserver01
 HA-postfix (lsb:postfix):  Started testserver02
As we can see, the VIP and postfix are running on different servers, which definitely WILL break our cluster!
We can solve this in different ways: either by creating a COLOCATION constraint, or by adding the resources to the same group.
I will create a group, because it is easier to manage: I can migrate the full group at once instead of the individual resources. (The colocation alternative is sketched below.)
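For reference, the colocation approach would look roughly like this. This is only a sketch; the constraint names VIP-with-postfix and postfix-after-VIP are made up for the example:

# Keep HA-postfix on the same node as the VIP
crm configure colocation VIP-with-postfix inf: HA-postfix VIP
# Optionally also start the VIP before postfix
crm configure order postfix-after-VIP inf: VIP HA-postfix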
crm configure group HA-Group VIP HA-postfix
If we check the status again, we can see that the two resources are now running on the same server.
crm_mon -1
============
Last updated: Fri Dec 6 17:40:55 2013
Last change: Fri Dec 6 17:40:06 2013 via cibadmin on testserver02
Stack: openais
Current DC: testserver02 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 Resource Group: HA-Group
     VIP        (ocf::heartbeat:IPaddr2):       Started testserver01
     HA-postfix (lsb:postfix):  Started testserver01
Automatic Migration
If a resource fails for some reason, e.g. postfix crashes and cannot start again, we want to migrate it to another server. By default the migration-threshold is not defined (i.e. set to infinity), which means the resource will never be migrated.
We want to migrate the resource after 3 failures, and expire the failure count after 60 seconds, which allows the resource to move back to this node automatically.
We can add migration-threshold=3 and failure-timeout=60s to the configuration, using "crm configure edit".
crm configure edit
primitive HA-postfix lsb:postfix \
        op monitor interval="15s" \
        meta target-role="Started" migration-threshold="3" failure-timeout=60s
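With this in place, the per-node fail count is what triggers the migration. It can be inspected and reset from the crm shell; a short sketch, assuming the resource and node names used above:

# Show how often HA-postfix has failed on testserver01
crm resource failcount HA-postfix show testserver01
# Reset the counter by hand (the cleanup command below does this as well)
crm resource failcount HA-postfix delete testserver01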
Commands
Cleaning up the errors
We can clean up the old "log" entries, for errors like the ones below.
HA-apache_monitor_15000 (node=testserver01, call=173, rc=7, status=complete): not running
HA-apache_monitor_15000 (node=testserver02, call=9, rc=7, status=complete): not running
crm resource cleanup HA-apache
Cleaning up HA-apache on testserver01
Cleaning up HA-apache on testserver02
Waiting for 3 replies from the CRMd... OK
Modifying Resources Manually
Resources can be easily edited using the "edit" tool.
crm configure edit
Deleting a resource
A resource must be stopped before we can delete it. If it is a member of a group, the group must be stopped and deleted first.
List the resources.
crm_resource --list
 VIP        (ocf::heartbeat:IPaddr2) Started
 HA-apache  (lsb:apache2) Started
Stopping and deleting a resource.
crm resource stop HA-apache
crm configure delete HA-apache
Migrate / Move Resource
Migrate a resource to a specific node.
crm_resource --resource HA-Group --move --node testserver01
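Note that the move creates a location constraint (the cli-prefer-HA-Group entry visible in the configuration below), which keeps the group pinned to that node. To let the cluster place it freely again, remove the constraint; a sketch, assuming this crm_resource version supports --un-move (older releases call it --un-migrate):

# Remove the constraint created by the --move above
crm_resource --resource HA-Group --un-move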
View configuration
crm configure show
node testserver01
node testserver02
primitive VIP ocf:heartbeat:IPaddr2 \
        params ip="10.0.2.200" nic="eth0" \
        op monitor interval="10s"
primitive postfix lsb:postfix \
        op monitor interval="15s" \
        meta target-role="Started" migration-threshold="1" failure-timeout="60s"
group HA-Group VIP postfix
location cli-prefer-HA-Group HA-Group \
        rule $id="cli-prefer-rule-HA-Group" inf: #uname eq testserver01
property $id="cib-bootstrap-options" \
        dc-version="1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        last-lrm-refresh="1386613142" \
        no-quorum-policy="ignore"
View status and fail counts
crm_mon -1 --fail
============
Last updated: Sat Dec 7 12:39:58 2013
Last change: Fri Dec 6 19:51:41 2013 via cibadmin on testserver01
Stack: openais
Current DC: testserver01 - partition with quorum
Version: 1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff
2 Nodes configured, 2 expected votes
2 Resources configured.
============

Online: [ testserver01 testserver02 ]

 Resource Group: HA-Group
     VIP        (ocf::heartbeat:IPaddr2):       Started testserver02
     postfix    (lsb:postfix):  Started testserver02

Migration summary:
* Node testserver02:
* Node testserver01:
   postfix: migration-threshold=1 fail-count=1 last-failure='Sat Dec 7 12:39:47 2013'

Failed actions:
    postfix_monitor_15000 (node=testserver01, call=8, rc=7, status=complete): not running

Ref: http://clusterlabs.org/wiki/Example_configurations