Keepalived Loadbalancer

Introduction

This page describes how to set up an LVS (Linux Virtual Server) / ipvsadm / keepalived based loadbalancer for MySQL (Galera) loadbalancing.

While the setup is more involved than with simple user-space daemons and comes with more constraints / requirements, the resulting solution is the cleanest with regards to high-level design, and the most robust and best performing MySQL loadbalancing solution we are aware of.

The instructions on this page have been worked out and tested on Debian (latest verified version: 8.9). It should be possible to transfer this information to other distributions / versions.

LVS is a Linux kernel module and has been included in the mainline kernel since roughly the 2.4 series (around 2003), see http://www.linuxvirtualserver.org/. Most of the available documentation seems very outdated; however, this code is part of the standard upstream Linux kernel and as such fully supported. It is nevertheless tricky to find recent reference documentation or howtos.

The project homepage http://www.linuxvirtualserver.org/ has some applicable information, in particular on their wiki. There is also a HOWTO at http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/index.html which has proven useful while writing this article. But above all, consult the manpages; they are up to date and precise.

Some terminology:

  • The keepalived node(s) are called keepalived nodes or loadbalancer nodes.
  • The nodes keepalived is loadbalancing for, for example OX nodes or Galera nodes, are called server nodes.

High Level Design

The solution consists of several components.

The main component is a set of kernel modules which implement the actual loadbalancing / forwarding functionality (ip_vs, ip_vs_rr, probably more).

There is a command line tool called ipvsadm to manage the kernel's loadbalancing configuration.
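
As a quick sanity check (a hedged sketch, not strictly necessary since ipvsadm and keepalived usually load the modules on demand), you can verify that the LVS kernel side is available like this:

# modprobe ip_vs
# modprobe ip_vs_rr
# cat /proc/net/ip_vs
IP Virtual Server version 1.2.1 (size=4096)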

It is possible to run an ipvsadm daemon which allows synchronization of connection states to a standby / slave ipvsadm/LVS instance, so that on failover "most" connections can remain intact. This is out of scope for this document; it is mentioned here so that you are aware of it and do not confuse it with the keepalived daemon (see below).

LVS with ipvsadm can run standalone. This is helpful during setup and testing. However, for production it lacks the functionality to health-check the loadbalancing targets (i.e. database servers) and adjust the loadbalancer tables accordingly. To do this, a separate user-space instance / daemon is required, and this is the functionality provided by keepalived.

Routing methods

LVS provides several modes of routing. We will describe here Direct Routing (DR) and Tunneling (TUN). There are more routing methods available which might be interesting in special cases, but they are not covered in this document.

When unsure, follow the TUN path. It seems more robust in certain environments than the DR path.

Direct Routing

Direct Routing works by taking a packet addressed to the loadbalancer's virtual / loadbalancer IP, replacing the destination MAC with the MAC of the designated target server, and re-sending the packet.

This requires the servers to accept packets for the given IP, so they need the corresponding IP configured on some local loopback / dummy device. It must be ensured that the servers do not answer ARP requests for that IP; otherwise there is a race condition on whose ARP response (server or loadbalancer) reaches the client first, leading to unwanted results. This is called "the ARP problem" in the documentation, and many possible solutions are given there; with current kernels, however, the method explained below works reliably.

Response packets are sent directly from the server to the client; they do not pass through the loadbalancer, but appear to come from a source whose IP does not match the MAC address.

In addition to the requirement of being able to configure additional "secondary" IPs on the involved machines, this method also requires that no involved networking component (routers, virtualization hypervisors, etc.) discards packets which seem "forged" (e.g. IPs not matching MACs). This is typically not a problem in "classical" networking infrastructures, but it becomes more and more problematic in modern virtualized / cloud infrastructures.

Tunneling

With the tunneling method, the loadbalancer encapsulates the packet in an IPIP tunnel packet and sends it to the corresponding server.

It also requires that the servers have the virtual / loadbalancer IP configured locally, but here on a "tunl" device. The same "ARP problem" as explained in the Direct Routing section above has to be covered, with the same solution. Again, responses go directly from the servers to the clients, not passing through the loadbalancer.

The Tunneling method generally works better in modern virtualized / cloud environments.

NAT method

We have not worked out / tested a NAT based setup yet, but it sounds promising for even more restrictive cloud environments, where routers typically reject packets with mismatching IPs/MACs. Feedback welcome.

Software installation on the loadbalancer node

Packages are installed from standard repos using

# apt-get install keepalived 

This will install the required dependencies like ipvsadm etc.

Contrary to earlier Debian releases, there is currently no need to configure any special service for loading kernel modules and the like. In older Debian versions (like Squeeze), the /etc/default/{ipvsadm,keepalived} files needed some tweaking to trigger kernel module loading (which did not seem to happen automatically). This is no longer the case; if you are working on such an old (historical!) Debian version, you may have to investigate there.
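
Should you nevertheless need to load the modules explicitly (e.g. on such an older system), a minimal, hedged approach would be an entry in /etc/modules, the classic Debian mechanism for modules to load at boot:

# cat >> /etc/modules << EOF
ip_vs
ip_vs_rr
EOF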

Configuring IPv4 forwarding is also not required, although it is claimed to be in some places. If experimenting with other routing methods it may become necessary; it has been verified that it is not required with DR or TUN.
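
For completeness, should you experiment with a routing method that does need it, enabling IPv4 forwarding would look like the following (again: not needed for DR or TUN):

# sysctl -w net.ipv4.ip_forward=1
# echo "net.ipv4.ip_forward=1" >> /etc/sysctl.conf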

Configuration

The configuration examples given below assume a setup like the following:

10.0.0.1 database server / galera node 1
10.0.0.2 database server / galera node 2
10.0.0.3 database server / galera node 3
10.0.0.4 loadbalancer primary IP
10.0.0.5 database client, e.g. OX middleware node
10.0.0.10 loadbalancer virtual IP for reading (round-robin)
10.0.0.11 loadbalancer virtual IP for writing (persistent routing / dedicated write node)

Note: with DR and TUN it is not possible to rewrite port numbers while forwarding; thus, for each loadbalancer endpoint, the loadbalancer needs an additional virtual IP. (It is not possible to configure the endpoints on different ports of the same, e.g. primary, IP of the loadbalancer.)

Manual configuration / testing

Networking adjustments on the server nodes

The server nodes need the loadbalancer virtual IP(s) configured on some local network device in order for the server processes to be able to accept traffic addressed to these IPs.

For DR, it seems natural to configure a dummy device. For TUN, you need a tunl device.

For testing, you can do it manually on the given nodes:

# for TUN
ip link set up tunl0
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
# for DR
ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
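
If the tunl0 or dummy0 device does not exist yet on a server node, the corresponding kernel module probably has to be loaded first; the module names below are the usual mainline ones (adjust if your kernel differs):

# for TUN
modprobe ipip
# for DR
modprobe dummy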

Then, you solve the "ARP Problem" by

echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
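
These two settings do not survive a reboot. The production section below bakes them into /etc/network/interfaces; an equivalent alternative would be a sysctl drop-in like the following (file name is just an example):

# /etc/sysctl.d/10-lvs-arp.conf
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2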

Loadbalancer

The loadbalancer also needs the virtual IPs configured as secondary IPs:

ip addr add 10.0.0.10/32 dev eth0
ip addr add 10.0.0.11/32 dev eth0

Then the loadbalancer endpoints themselves can be configured with ipvsadm:

# For TUN
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -i -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -i -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -i -w 10
# For DR
# Round-Robin / read instance
/sbin/ipvsadm -A -t 10.0.0.10:3306 -s rr
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.10:3306 -r 10.0.0.3 -g -w 10
# Persistent / write instance
/sbin/ipvsadm -A -t 10.0.0.11:3306 -s rr -p 86400 -M 0.0.0.0
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.1 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.2 -g -w 10
/sbin/ipvsadm -a -t 10.0.0.11:3306 -r 10.0.0.3 -g -w 10

Note: you need to restart the MySQL service after the networking adjustments; otherwise, the MySQL daemon will not accept packets with the virtual IP as target IP. This has cost quite a few people a lot of time.
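
On Debian 8 with the standard packages, the restart would look something like the following; the service name may differ for your MySQL / Galera flavour:

# service mysql restart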

Note: to save effort, you can start by testing with a single server node and extend the configuration to all three nodes later.

Note: to view the current LVS configuration, use

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.10:mysql rr
  -> 10.0.0.1:mysql               Tunnel  10     0          0
  -> 10.0.0.2:mysql               Tunnel  10     0          0
  -> 10.0.0.3:mysql               Tunnel  10     0          0
TCP  10.0.0.11:mysql rr persistent 86400
  -> 10.0.0.1:mysql               Tunnel  10     0          0
  -> 10.0.0.2:mysql               Tunnel  10     0          0
  -> 10.0.0.3:mysql               Tunnel  10     0          0

Note: to stop / start over, use ipvsadm -C.

Note: you can use ipvsadm -S / ipvsadm -R for easier iterative testing (see manpage).

# ipvsadm -S
-A -t 10.0.0.10:mysql -s rr
-a -t 10.0.0.10:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.10:mysql -r 10.0.0.3:mysql -i -w 10
-A -t 10.0.0.11:mysql -s rr -p 86400
-a -t 10.0.0.11:mysql -r 10.0.0.1:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.2:mysql -i -w 10
-a -t 10.0.0.11:mysql -r 10.0.0.3:mysql -i -w 10
# ipvsadm -S > ipvsadm.conf
# ipvsadm -R < ipvsadm.conf

Testing

You should then be able to verify functionality from the client / OX middleware node with something like the following (authentication command line arguments omitted for brevity):

# while true; do mysql -h10.0.0.10 -B -N -e "select @@hostname;"; sleep 1; done
db3
db2
db1
db3
db2
db1
[...]
^C
# while true; do mysql -h10.0.0.11 -B -N -e "select @@hostname;"; sleep 1; done
db3
db3
db3
db3
db3
db3
[...]
^C

If it does not work:

  • Remember that you need to restart the MySQL server after the networking adjustments
  • Use tcpdump to find out on which node (loadbalancer or server) your packets actually arrive (see the sketch after this list)
  • Use arp -a to verify the server nodes did not advertise the virtual IP addresses with their MAC
  • Verify that the usual candidates like iptables (off by default on Debian; may vary in your installation), selinux/apparmor (if using SLES or RHEL) or additional firewalls are not spoiling your testing
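
A minimal tcpdump sketch for the second point, with interface name and port as assumed in the example setup:

# run on the loadbalancer and on a server node, respectively
tcpdump -n -i eth0 port 3306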

Please verify the manual setup before proceeding to the persistent / production configuration.

Persistent / production configuration

Networking adjustments on the server nodes

It is possible to add the configuration to /etc/network/interfaces:

# TUN example
# existing eth0 configuration
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
# add the following
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip link set up tunl0
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev tunl0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev tunl0
    pre-down ip addr del 10.0.0.11/32 dev tunl0
    pre-down ip addr del 10.0.0.10/32 dev tunl0
    pre-down ip link set down tunl0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce
# DR example
# existing eth0 configuration
auto eth0
iface eth0 inet static
    address 10.0.0.XYZ
    netmask 255.255.255.0
# add the following
    pre-up echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    pre-up echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
    post-up ip addr add 10.0.0.10/32 brd 10.0.0.10 dev dummy0
    post-up ip addr add 10.0.0.11/32 brd 10.0.0.11 dev dummy0
    pre-down ip addr del 10.0.0.11/32 dev dummy0
    pre-down ip addr del 10.0.0.10/32 dev dummy0
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_ignore
    post-down echo 0 > /proc/sys/net/ipv4/conf/all/arp_announce

Keepalived configuration (health checks skipped)

Note: keepalived will manage the secondary IPs on the loadbalancer, so there is no need to hard-wire them in /etc/network/interfaces or the like. Rather, deconfigure any secondary IPs that may still be present from the previous manual testing.

Create a config file /etc/keepalived/keepalived.conf for basic functionality testing like the following:

global_defs {
  # This should be unique.
  router_id galera-lb
}

vrrp_instance mysql_pool {
  # The interface we listen on.
  interface eth0

  # The default state: one node should be MASTER, the others should be set to BACKUP.
  state MASTER
  priority 101

  # This should be the same on all participating load balancers.
  virtual_router_id 19

  # Set the interface whose status to track to trigger a failover.
  track_interface {
    eth0
  }

  # Password for the loadbalancers to share.
  authentication {
    auth_type PASS
    auth_pass Twagipmiv3
  }

  # This is the IP address that floats between the loadbalancers.
  virtual_ipaddress {
   10.0.0.10/32 dev eth0
   10.0.0.11/32 dev eth0
  }
}

# Here we add the virtual mysql read node
virtual_server 10.0.0.10 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
  }
  real_server 10.0.0.2 3306 {
    weight 10
  }
  real_server 10.0.0.3 3306 {
    weight 10
  }
}

# Here we add the virtual mysql write node
virtual_server 10.0.0.11 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # the following two options implement that active-passive behavior
  persistence_timeout 86400
  # make sure all OX nodes are included in that netmask
  persistence_granularity 0.0.0.0

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
  }
  real_server 10.0.0.2 3306 {
    weight 10
  }
  real_server 10.0.0.3 3306 {
    weight 10
  }
}

The file should be self-explanatory if you followed the manual configuration explanations above. The only unexpected items are directives like state and priority, which will be explained below for the multi-keepalived setup.

The example uses TUN; for DR, just replace TUN with DR in the virtual_server definitions.

After a service keepalived restart you should be able to execute the same client connectivity tests as shown above. (Remember to cleanly unconfigure your manual setup beforehand in order not to measure a false success.)
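
Cleaning up the manual test setup on the loadbalancer and starting keepalived could look like this (IPs as in the example setup):

# clear the manually created LVS tables
ipvsadm -C
# remove the manually configured secondary IPs; keepalived will add them itself
ip addr del 10.0.0.10/32 dev eth0
ip addr del 10.0.0.11/32 dev eth0
# (re)start keepalived
service keepalived restart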

Keepalived configuration (with health checks)

We can configure health checks in keepalived.conf:

global_defs {
  # This should be unique.
  router_id galera-lb
}

vrrp_instance mysql_pool {
  # The interface we listen on.
  interface eth0

  # The default state: one node should be MASTER, the others should be set to BACKUP.
  state MASTER
  priority 101

  # This should be the same on all participating load balancers.
  virtual_router_id 19

  # Set the interface whose status to track to trigger a failover.
  track_interface {
    eth0
  }

  # Password for the loadbalancers to share.
  authentication {
    auth_type PASS
    auth_pass Twagipmiv3
  }

  # This is the IP address that floats between the loadbalancers.
  virtual_ipaddress {
   10.0.0.10/32 dev eth0
   10.0.0.11/32 dev eth0
  }
}

# Here we add the virtual mysql read node
virtual_server 10.0.0.10 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.1"
      misc_timeout 5
    }
  }
  real_server 10.0.0.2 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.2"
      misc_timeout 5
    }
  }
  real_server 10.0.0.3 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.3"
      misc_timeout 5
    }
  }
}

# Here we add the virtual mysql write node
virtual_server 10.0.0.11 3306 {
  delay_loop 6
  # Round robin, but you can use whatever fits your needs.
  lb_algo rr

  lb_kind TUN
  protocol TCP

  # the following two options implement that active-passive behavior
  persistence_timeout 86400
  # make sure all OX nodes are included in that netmask
  persistence_granularity 0.0.0.0

  # For each server add the following.
  real_server 10.0.0.1 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.1"
      misc_timeout 5
    }
  }
  real_server 10.0.0.2 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.2"
      misc_timeout 5
    }
  }
  real_server 10.0.0.3 3306 {
    weight 10
    MISC_CHECK {
      misc_path "/etc/keepalived/checker.pl 10.0.0.3"
      misc_timeout 5
    }
  }
}

The config file is the same as before, except for the added MISC_CHECK sections.

We now need a checker.pl script which checks the Galera replication status and exits 0 if the node is fine and 1 if it is not.

A sample script is given below. Consult your DBAs on the checks to be performed, i.e. when a node should be considered "available" and when not.

#!/usr/bin/perl

# dominik.epple@open-xchange.com, 2013-06-10

use strict;
use warnings;

#
# config section
#
our $username="checker";
our $password="aicHupdakek3";
our $debug=0;

our %checks=(
  #"wsrep_cluster_size" => "3",
  "wsrep_ready" => "ON",
  "wsrep_local_state" => "4" # Synced
);
#
# config section end
#

our $host=$ARGV[0] or die "usage: $0 <IP of galera node>";

use DBI;
our $dbh = DBI->connect("DBI:mysql:;host=$host", $username, $password
                   ) || die "Could not connect to database: $DBI::errstr";

our $results = $dbh->selectall_hashref("show status like '%wsrep%'", 'Variable_name') or die "Error trying to selectall_hashref";

our %cr=();

foreach my $id (keys %$results) {
  $::cr{$id}=$results->{$id}->{"Value"};
}

$dbh->disconnect();

for my $k (keys %checks) {
  if(exists $::cr{$k}) {
    if($::checks{$k} ne $::cr{$k}) {
      print STDERR "$0: warning: mismatch in $k: expected $::checks{$k}, got $::cr{$k}\n";
      exit(1);
    }
    else {
      print STDERR "$0: info: match in $k: expected $::checks{$k}, got $::cr{$k}\n" if($::debug);
    }
  }
  else {
    print STDERR "$0: warning: no check result for $k (want $::checks{$k})\n";
  }
}

exit(0);

As a dependency, the script requires the Perl MySQL interface module:

# apt-get install libdbd-mysql-perl

Don't forget to configure a corresponding DB user for the checker script. Execute on a Galera node:

CREATE USER 'checker'@'%' IDENTIFIED BY 'aicHupdakek3';
FLUSH PRIVILEGES;
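
With the user and the script in place, a quick manual test from the loadbalancer could look like this; an exit code of 0 means the node is considered healthy:

# chmod +x /etc/keepalived/checker.pl
# /etc/keepalived/checker.pl 10.0.0.1
# echo $?
0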

Adding a second Keepalived node for redundancy

With a single keepalived node we have a single point of failure. It is possible to add a second keepalived node which communicates with the first one and transitions from backup state to master state upon failure of the first node.

To set up a second keepalived node as described above, create a keepalived node identical to the first one, with the following changes to the configuration file /etc/keepalived/keepalived.conf:

  • Change the router_id (to the hostname, for example)
  • Change the state to BACKUP
  • Change the priority to something lower than the master's priority (e.g. 100)

Make sure the virtual_router_id and authentication information is the same on the backup keepalived node as on the master keepalived node.
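
For illustration, the backup node's keepalived.conf would then differ only in lines like the following (values illustrative, everything else identical to the master):

global_defs {
  # e.g. the hostname of the second loadbalancer
  router_id galera-lb-2
}

vrrp_instance mysql_pool {
  [...]
  state BACKUP
  priority 100
  # must match the master
  virtual_router_id 19
  [...]
}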

Now the backup node will notice the master going down and take over. Automatic failback also happens.

Keepalived will automatically manage the secondary IPs, so there is no need for any additional clustering software like corosync/pacemaker.

Keepalived monitoring

ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --stats
ipvsadm -Ln -t $LOADBALANCER_IP:$LOADBALANCER_PORT --rate
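
For continuous observation during testing, something like the following can be handy (IP and port as in the example setup); ipvsadm -Lnc additionally shows the current connection and persistence entries:

watch -n 1 ipvsadm -Ln -t 10.0.0.10:3306 --stats
ipvsadm -Lnc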