DRBD

Overview

DRBD for OpenSSI gives organizations and institutions a cost-effective platform for deploying highly available clustered services. The primary focus of this technology is high availability in an OpenSSI environment. Our audience includes those seeking an effective "off-the-shelf" solution to increase the availability of their enterprise applications, transactional databases, or network services.

DRBD provides highly available (HA) data storage, data protection, data redundancy, and data replication for Linux. The semantics of DRBD are similar to those of shared storage, except that DRBD requires only commodity off-the-shelf (COTS) hardware and an IP network. DRBD for OpenSSI enables users to deploy HA clusters without shared storage hardware.

This new technology goes beyond traditional DRBD limitations by incorporating OpenSSI advantages such as high-performance computing, process load-balancing, automatic service and network failover, active-active connection load-balancing (e.g. UltraMonkey), shared root, shared IPC space, and a single point of management designed with high availability in mind.

An OpenSSI-modified DRBD tarball is available at http://www.openssi.org/contrib/drbd-ssi/ and at http://radian.org/~roger/drbd-ssi/

Requirements

Prior experience with DRBD will help tremendously.

  • Minimum recommended user level:
    • Linux systems: Intermediate to Advanced.
    • Linux networking: Intermediate.
    • Clustering: Intermediate.

Status

The status indicates the state of DRBD operation and DRBD failover in OpenSSI. It is based on the latest available OpenSSI-modified DRBD package.

DRBD is not a filesystem, but our modifications allow for seamless filesystem failover. Do not confuse this status with filesystem failover.

DRBD failover in OpenSSI takes less than a second, excluding OpenSSI cluster manager timeouts.

OpenSSI-1.9

DRBD devices are at /dev/drbd*. This requires an updated /sbin/mkinitrd and/or /bin/mount. Also replace drbd/0 with drbd0 (s!drbd/0!drbd0) in /etc/rc.sysrecover and /etc/init.d/SSIfailover.
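The device-path substitution above can be sketched in Python as follows. This is a minimal illustration, not the procedure the OpenSSI tools use: the function name and the sample script line are hypothetical, and the function only transforms text passed to it rather than editing /etc/rc.sysrecover or /etc/init.d/SSIfailover in place.

```python
def rewrite_device_paths(script_text: str) -> str:
    """Replace the OpenSSI-1.2 style name 'drbd/0' with the 1.9 style 'drbd0'.

    Unlike a sed expression without the 'g' flag, str.replace rewrites
    every occurrence, not only the first one on each line.
    """
    return script_text.replace("drbd/0", "drbd0")


# Hypothetical line, for illustration only.
sample = "fsck -p /dev/drbd/0\n"
print(rewrite_device_paths(sample), end="")
```

Review the rewritten text before installing it, since other device numbers are left untouched by this sketch.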

  • Debian: Good.
  • SUSE: Unknown.

OpenSSI-1.2

DRBD devices are at /dev/drbd/*

  • Debian: Good.
  • Fedora Core 2: Stable.
  • RHEL: Unknown.

Roadmap

  • Automatic DRBD failover for highly available shared root CFS. 100% done. (stable) Jaideep, Roger.
  • Documentation for OpenSSI users. En-chiang, Aneesh.
  • Porting base changes to Debian OpenSSI. Aneesh.
  • Support for complex configurations with multiple DRBD devices and multiple local and remote CFS mounts. 100% done. (stable) Roger.
  • More reliable DRBD connection management to satisfy CFS requirements. 100% done. (good) Roger.
  • Multiple DRBD device failover using ci-linux for HA-CFS hardmounts. 100% done. (good) Roger.
  • Failover handling of StandAlone DRBD devices with ci-linux. 90% done. (Works on potential init nodes.) Roger.
  • DRBD split-brain recovery using ci-linux. 90% done. (testing) Roger.
  • OpenSSI initrd configuration to support DRBD compiled into kernel. 50% done. (kernel-2.6 only) Roger.


  • Support IO fencing to manage split-brain, e.g. STONITH. 0% done. (drbd-0.8.x)
  • Support synchronous cluster filesystems that mount Primary/Primary, e.g. Lustre, RedHat GFS, Oracle OCFS. 0% done. (drbd-0.8.x)
  • Enhanced split-brain recovery compatibility with ci-linux. 0% done. (drbd-0.8.x)

Performance Tuning

  • Increase al-extents.
  • Significant performance improvement with kernel-2.6 due to better IO schedulers. Use deadline IO scheduler.
  • Kernel: Recompile OpenSSI kernel for your system architecture, adjust backlogs.
  • Memory: Increase vm.lower_zone_protection to at least DRBD max-buffers.
  • Networking: QoS or dedicated network, MTU >= 4200, offloading, channel bonding, ...
  • Local storage (backing storage): RAID 0 (data striping)
  • Filesystem: ...
  • Optional: Load hard drive settings in initrd before DRBD connects (kernel-2.4).


Sample guidelines for settings in /etc/drbd.conf

 resource root {
 
  # The original DRBD author suggests protocol C.
  # My own benchmarks show protocol B is often faster than protocol C (drbd-0.7.11).
  # protocol B;
  # Use protocol C when doing transparent filesystem failover in OpenSSI.
  protocol C;
 
  net{
   # XXX For OpenSSI clusters, timeout + ping-int < CLMS nodedown timeout.
   max-buffers 2048 up to 131072, larger usually better;
   max-epoch-size 2048 up to 20000, larger usually better;
   sndbuf-size approx. 2-3x BDP;
   ko-count greater than zero but should not exceed CLMS nodedown timeout;
  }
 
  syncer{
   rate approx. 40% your max backing device write/read speed, whichever less;
  }
 
  <snip>
 }
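The sizing guidelines above involve a little arithmetic, which the following Python sketch works through. It is an illustration under stated assumptions, not a tuning tool: the function names and the example link and disk figures are hypothetical, BDP is taken to mean the bandwidth-delay product of the replication link, and the timing check assumes `timeout` is given in tenths of a second and `ping-int` in seconds (verify against the drbd.conf man page for your release).

```python
def bandwidth_delay_product(bandwidth_bits_per_s: int, rtt_us: int) -> int:
    """Bytes in flight on the link: bandwidth (bits/s) * RTT (microseconds)."""
    return bandwidth_bits_per_s * rtt_us // 8_000_000


def sndbuf_range(bandwidth_bits_per_s: int, rtt_us: int) -> tuple[int, int]:
    """sndbuf-size guideline: roughly 2x to 3x the BDP, in bytes."""
    bdp = bandwidth_delay_product(bandwidth_bits_per_s, rtt_us)
    return 2 * bdp, 3 * bdp


def syncer_rate(read_bytes_per_s: int, write_bytes_per_s: int) -> int:
    """Syncer guideline: about 40% of the slower of read and write speed."""
    return min(read_bytes_per_s, write_bytes_per_s) * 40 // 100


def detection_fits_clms(timeout_tenths: int, ping_int_s: int,
                        clms_nodedown_s: int) -> bool:
    """Check timeout + ping-int < CLMS nodedown timeout.

    Assumes timeout is in 0.1 s units and ping-int is in seconds.
    """
    return timeout_tenths + 10 * ping_int_s < 10 * clms_nodedown_s


# Hypothetical example: 1 Gbit/s link, 200 microsecond round trip,
# backing device reading 60 MB/s and writing 50 MB/s.
print(sndbuf_range(1_000_000_000, 200))
print(syncer_rate(60_000_000, 50_000_000))
print(detection_fits_clms(60, 10, 20))
```

Integer arithmetic is used throughout so the results do not depend on floating-point rounding. Treat the outputs as starting points for benchmarking rather than final values.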

Fedora Core users: remember to run mkinitrd --cfs --drbd ... and ssi-ksync after making changes to /etc/drbd.conf.


Sample DRBD status output with the latest DRBD-SSI release candidate in OpenSSI-1.9.2pre and the background sync rate set to 30MB/sec:

 version: 0.7.14 (api:77/proto:74)
 SVN Revision: 1989 build by root@node1, 2005-11-20 11:05:03
  0: cs:SyncTarget st:Secondary/Primary ld:Inconsistent
     ns:0 nr:11547452 dw:11717024 dr:0 al:0 bm:2315 lo:85 pe:0 ua:85 ap:0
         [==================>.] sync'ed: 94.8% (625/11873)M 
         finish: 0:00:13 speed: 47,628 (30,228) K/sec
  1: cs:PausedSyncT st:Secondary/Primary ld:Inconsistent
     ns:0 nr:26944 dw:82432 dr:0 al:0 bm:909 lo:0 pe:0 ua:0 ap:0  
  2: cs:PausedSyncT st:Secondary/Primary ld:Inconsistent
     ns:0 nr:0 dw:0 dr:0 al:0 bm:4878 lo:0 pe:0 ua:0 ap:0
  3: cs:PausedSyncT st:Secondary/Primary ld:Inconsistent
     ns:0 nr:0 dw:0 dr:0 al:0 bm:5119 lo:0 pe:0 ua:0 ap:0
  4: cs:Unconfigured
  5: cs:Unconfigured

Caveats

The following should not happen when proper IO fencing is set up to prevent DRBD split-brain. Work has been done to automate the procedure below for recovering after a DRBD split-brain, but the code requires more testing.

It is easy to assume the OpenSSI init node is always the node with good data. However, this may not hold if all the nodes are restored at the same time, because OpenSSI guarantees that the preferred init node comes up first. The DRBD meta-data determines the synchronization direction regardless of a node's Primary/Secondary state. [This doesn't seem to be true with DRBD in kernel-2.6.]

If DRBD has been split for some undetermined interval during which both nodes update their meta-data, DRBD may no longer recognize the split-brain and may treat the node with more modifications as the data synchronization source (drbd-0.7.11). The cluster administrator should therefore bring up the DRBD nodes in sequence and perform the following steps to guarantee the proper sync direction. Take this precaution after any split-brain.

 1. Bring up the node with good data (good node).
 2. drbdsetup --skip-sync all devices on this DRBD node.
 3. Bring up the secondary DRBD node.
 4. Verify that the DRBD devices on the good node are the SyncSource.  In the event your good
    node becomes SyncTarget, you must correct the situation manually.
 5. If the DRBD sync directions are correct, you may skip steps 6 and 7.
 6. Disconnect and invalidate the appropriate DRBD devices on the secondary DRBD node.
 7. Reconnect all DRBD devices.
 8. Enable DRBD background synchronization on the good node.
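The check in steps 4 and 5 can be sketched as a small Python helper. It takes the connection states observed on the good node and reports which devices are syncing in the wrong direction and so need steps 6 and 7. The function name and the input format are hypothetical, and the set of target states is an assumption limited to the two states that appear in the status output on this page.

```python
# Connection states meaning "this node is receiving data" (assumed set).
TARGET_STATES = {"SyncTarget", "PausedSyncT"}


def devices_to_correct(good_node_states: dict[int, str]) -> list[int]:
    """Return device minors on the good node that are sync targets.

    An empty list means the sync directions are correct and the
    disconnect/invalidate/reconnect steps can be skipped.
    """
    return sorted(
        minor
        for minor, state in good_node_states.items()
        if state in TARGET_STATES
    )


# Hypothetical states as seen on the good node.
observed = {0: "SyncSource", 1: "PausedSyncT", 2: "SyncTarget"}
print(devices_to_correct(observed))
```

The helper only identifies the affected devices. The invalidation itself must still be carried out by hand on the secondary DRBD node, as described in the steps above.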

Test for Split-Brain Recovery

This is a preliminary test with SSI failover enabled where node 1 is our init node. The results indicate node 2 also discovered the split-brain and has gone offline. This is correct because node 1 did not really go down. No STONITH was used here.

 Fedora Core release 3 (Heidelberg)
 Kernel 2.6.10-bk7-ssi3 on an i686
 
 node1 login: KD
 Entering kdb (current=0xc0423b20, pid 0) due to Keyboard Entry
 kdb> 
 ..............PAUSE 20 SECONDS.............
 kdb> go
 drbd0: meta connection shut down by peer.
 drbd1: meta connection shut down by peer.
 drbd0: short read expecting header on sock: r=0
 ics_llsendmsg: node 2 chan 4, error 32
 ics_llsendmsg: node 2 chan 4, error 32
 ics_llsendmsg: node 2 chan 4, error 32
 ics_llsendmsg: node 2 chan 4, error 32
 ics_llsendmsg: node 2 chan 4, error 32
 ics_llsendmsg: node 2 chan 4, error 32
 drbd2: meta connection shut down by peer.
 ics_llsendmsg: node 2 chan 4, error 32
 drbd1: short read expecting header on sock: r=0
 drbd2: short read expecting header on sock: r=0
 drbd0: incompatible states (both Primary!)
 drbd0: Peer node to resolve split-brain.
 drbd1: incompatible states (both Primary!)
 drbd1: Peer node to resolve split-brain.
 drbd0: short read expecting header on sock: r=-104
 drbd0: meta connection shut down by peer.
 drbd2: incompatible states (both Primary!)
 drbd2: Peer node to resolve split-brain.
 drbd1: short read expecting header on sock: r=-104
 drbd1: meta connection shut down by peer.
 drbd2: sock_sendmsg returned -104
 drbd2: short sent ReportBitMap size=4096 sent=332
 drbd2: meta connection shut down by peer.
 drbd2: short read expecting header on sock: r=0
 Node 2 has gone down!!!
 drbd0: drbd_nodedown: Doing CLMS nodedown callback for service 9.
 drbd1: drbd_nodedown: Doing CLMS nodedown callback for service 10.
 drbd2: drbd_nodedown: Doing CLMS nodedown callback for service 11.
 fsck 1.35 (28-Feb-2004)
 INIT: +++ nodedown completed on node 2

Links

DRBD on Debian Sarge
