Problems

From OpenSSI

Revision as of 03:58, 7 July 2007 by Krstic (Talk | contribs)

Any problems that you've observed with the current version of OpenSSI and their work-arounds can be posted here.

Directory /boot empty after failover

Problem: After configuring a secondary node for root failover, then shutting down the cluster and booting the secondary root node by itself, there is nothing mounted on /boot. Of course, this is only a problem if you configured separate boot partitions.

Reason: This is due to a limitation in the clusterized version of /etc/fstab. It lets you mount a partition on all nodes, a particular node, or one of a set of failover nodes (which requires shared access to the same filesystem). There's no option to mount on the root node (whichever node that might be at the moment). The /boot entry in /etc/fstab specifies the original installation node, so only that node's boot partition is mounted whenever it's part of the cluster.

Work-around: Assume that the first root node (the original installation node) is node 1, and the secondary root node is node 2. Copy the contents of node 1's /boot partition into the /boot _directory_ of the root filesystem, then remove the /boot entry from /etc/fstab. You will still boot from the separate /boot partitions on each node (they're still listed in /etc/clustertab as the boot devices for nodes 1 and 2), but the authoritative copy of the boot materials will be in the /boot directory of the root.
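
The work-around above might be scripted roughly as follows. This is a minimal sketch, not an OpenSSI tool: the function names stage_boot_files and disable_boot_entry are made up for illustration, and it assumes the /boot line in the clusterized /etc/fstab still carries the mount point in its second field. Try it on a copy of the file first.

```shell
#!/bin/sh
# Hypothetical helpers; all paths are arguments so they can be tried on copies.

# Copy everything under SRC (e.g. the mounted /boot partition) into DST,
# preserving permissions and timestamps.
stage_boot_files() {
    src=$1
    dst=$2
    mkdir -p "$dst" && cp -a "$src"/. "$dst"/
}

# Comment out every non-comment line whose mount point (field 2) is /boot.
disable_boot_entry() {
    fstab=$1
    awk '$1 !~ /^#/ && $2 == "/boot" { print "#" $0; next } { print }' \
        "$fstab" > "$fstab.new" &&
        cat "$fstab.new" > "$fstab" &&
        rm -f "$fstab.new"
}

# Possible sequence on node 1 (assumption: /boot must be unmounted to reach
# the underlying /boot directory of the root filesystem):
#   stage_boot_files /boot /tmp/boot-copy
#   umount /boot
#   stage_boot_files /tmp/boot-copy /boot
#   disable_boot_entry /etc/fstab
```

The entry is commented out rather than deleted so that the original line stays visible if the change has to be reverted.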

Boot stops after IP address obtained

Problem: Users report that sometimes cluster nodes stop booting after obtaining an address by DHCP.

Reason: This is a problem with the TFTP daemon on Debian.

Solution: Installing atftpd and running it in standalone mode seems to fix the problem. The following options are used to invoke atftpd:

--daemon --port 69 --retry-timeout 2 --no-multicast --maxthread 100 /tftpboot
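
One way to make these options persistent is sketched below. It assumes the Debian atftpd package reads USE_INETD and OPTIONS from /etc/default/atftpd, which may not hold for every release, so check your package's init script before relying on it. The function name write_atftpd_defaults is hypothetical.

```shell
#!/bin/sh
# Write a defaults file that runs atftpd standalone with the options above.
# The target path is an argument; /etc/default/atftpd is an assumption.
write_atftpd_defaults() {
    cat > "$1" <<'EOF'
USE_INETD=false
OPTIONS="--daemon --port 69 --retry-timeout 2 --no-multicast --maxthread 100 /tftpboot"
EOF
}

# Example (as root), followed by a restart of the atftpd service:
#   write_atftpd_defaults /etc/default/atftpd
```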


Fixing RPM database error after upgrading glibc using the provided openssi-rh9-1.2.0-glibc.i686.tar on RH 9

Problem: rpm operations, such as querying a package, may fail with this error message:

   error:db4 error(-30989) from dbcursor->c_get: DB_PAGE_NOTFOUND: requested page not found

Reason: The RPM database has most likely suffered minor corruption.

Solution: Do the following as soon as possible after the glibc update.

1. Back up your current RPM database (-v is verbose, so you can check that the correct files are copied; -a preserves timestamps and permissions):
   # cp -v -a /var/lib/rpm /tmp/rpm-backup
2. Delete all __db files:
   # rm -f /var/lib/rpm/__db*
3. Rebuild the RPM database:
   # rpm -vv --rebuilddb
4. Confirm that everything is fine by listing all currently installed RPMs:
   # rpm -qa
5. If something goes wrong, copy the files from /tmp/rpm-backup back into /var/lib/rpm (again with -a). Note the trailing /. so that the files, not the rpm-backup directory itself, land in /var/lib/rpm:
   # cp -a -v /tmp/rpm-backup/. /var/lib/rpm/
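
The steps above can be collected into a small script. This is only a sketch: the function names and the RPMDB and BACKUP variables are illustrative, and the rebuild itself is left as a comment so the helpers can be tried against a scratch directory first.

```shell
#!/bin/sh
# Locations from the steps above; override them to rehearse on a copy.
RPMDB=${RPMDB:-/var/lib/rpm}
BACKUP=${BACKUP:-/tmp/rpm-backup}

# Step 1: back up the database, refusing to overwrite an earlier backup.
backup_rpmdb() {
    if [ -e "$BACKUP" ]; then
        echo "backup already exists: $BACKUP" >&2
        return 1
    fi
    cp -a "$RPMDB" "$BACKUP"
}

# Step 2: remove the __db* environment files only.
clear_db_files() {
    rm -f "$RPMDB"/__db*
}

# Step 5: copy the backed-up files (not the directory itself) back in place.
restore_rpmdb() {
    cp -a "$BACKUP"/. "$RPMDB"/
}

# Possible sequence (as root):
#   backup_rpmdb && clear_db_files && rpm -vv --rebuilddb && rpm -qa
#   restore_rpmdb    # only if the rebuild went wrong
```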

Reference: Hintskinks

PASV FTP connections on CVIP

Problem: Passive-mode FTP connections to the CVIP fail to establish.

Reason #1: While FTP is load-balanced, LVS is not aware that connections to the passive ports need to be routed to the realserver that established the FTP control channel.

Reason #2: HA-LVS sometimes rejects connections to ports on the CVIP not listed in the IPVS services table.

Solution #1: Enable persistent connections in LVS for the FTP service. For LVS-NAT you only need to insmod ip_vs_ftp with appropriate module parameters.

Solution #2: Register all the passive listening ports in HA-LVS with setport_weight. (No need to use ipvsadm.)

Porting Woes

Problem: Hangs on boot of root node. We are porting OpenSSI to the PowerPC platform. Currently we have Debian 3.1 Sarge as our development platform and a 2.6 kernel. We have managed to get the kernel to compile, but it hangs on booting, after connecting to clms. The pre-cluster process seems to stop there. Any help would be appreciated.

We are guessing the tools did not build correctly. Can they be tested without an OpenSSI kernel running?

Node 2 Boot Fails

Problem: Node 2 hangs on boot, then panics and drops into the debugger.

Platform: Debian Sarge with dhcpd3 & atftpd.

Boot screen error:
   .....
   Setting up IP spoofing protection: rp_filter
   Configuring network interfaces ... Disabling Privacy extentions on device c05ec740(lo)
   nm_send: Error 22 sending imalive!
   nm_send: Error 22 sending imalive!
   Kernel panic - not syncing: lost network connection to all potential root nodes!
   Instruction(i) breakpoint #0 at 0xc01272e0 (adjusted)
   0xc01272e0 panik_hook: int3

Solution: /cluster/node2/etc/network/interfaces had eth0 set up the same as node1's eth0. After changing the static address to the one assigned by DHCP, node 2 booted correctly.
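
A quick way to spot this kind of clash is to look for the same static address in more than one node's interfaces file. The helper below is a hypothetical sketch that assumes Debian-style "address" lines; which files to pass in depends on your cluster layout.

```shell
#!/bin/sh
# Print every value of an "address" line that occurs more than once
# across the given interfaces files.
duplicate_addresses() {
    awk '$1 == "address" { print $2 }' "$@" | sort | uniq -d
}

# Example: compare node 2's file with another node's copy
# (the second path is a placeholder):
#   duplicate_addresses /cluster/node2/etc/network/interfaces /path/to/node1/interfaces
```

No output means no address is repeated in the files given.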

Kernel IP Forwarding Disabled After Failover

Problem: In OpenSSI-1.9.x (as of this writing), IPv4 forwarding is disabled in the kernel after init failover and when HA-LVS failover is also done.

Solution: Only if you require IP forwarding, append echo 1 > /proc/sys/net/ipv4/ip_forward to /etc/rc.d/rc.sysrecover on Fedora Core.
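
To avoid appending the line twice, the change can be made idempotent. The sketch below uses a made-up helper name, ensure_forwarding_line, and takes the file path as an argument so it can be tested on a scratch file before touching /etc/rc.d/rc.sysrecover.

```shell
#!/bin/sh
# Append the ip_forward line to FILE unless an identical line is already there.
ensure_forwarding_line() {
    file=$1
    line='echo 1 > /proc/sys/net/ipv4/ip_forward'
    grep -qxF "$line" "$file" 2>/dev/null || printf '%s\n' "$line" >> "$file"
}

# Example (as root, Fedora Core):
#   ensure_forwarding_line /etc/rc.d/rc.sysrecover
```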
