Problems
Any problems that you've observed with the current version of OpenSSI and their work-arounds can be posted here.
Directory /boot empty after failover
Problem: After configuring a secondary node for root failover, then shutting down the cluster and booting the secondary root node by itself, there is nothing mounted on /boot. Of course, this is only a problem if you configured separate boot partitions.
Reason: This is due to a limitation in the clusterized version of /etc/fstab. It lets you mount a partition on all nodes, a particular node, or one of a set of failover nodes (which requires shared access to the same filesystem). There's no option to mount on the root node (whichever node that might be at the moment). The /boot entry in /etc/fstab specifies the original installation node, so only that node's boot partition is mounted whenever it's part of the cluster.
Work-around: Assume that the first root node (the original installation node) is node 1, and the secondary root node is node 2. Copy the contents of node 1's /boot partition into the /boot _directory_ of the root filesystem, then remove the /boot entry from /etc/fstab. You will still boot from the separate /boot partitions on each node (they're still listed in /etc/clustertab as the boot devices for nodes 1 and 2), but the authoritative copy of the boot materials will be in the /boot directory of the root.
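The copy step of this work-around could be sketched in shell as follows. This is only a sketch under assumptions not stated above: it is run as root on node 1 while node 1's boot partition is mounted on /boot, the scratch directory /root/boot-stage is a made-up path, and copy_tree is a helper name invented for illustration. Keep a backup of /etc/fstab before editing it.

```shell
# copy_tree SRC DST: copy the contents of SRC into DST, preserving
# permissions, ownership and timestamps (hypothetical helper).
copy_tree() {
    mkdir -p "$2" && cp -a "$1"/. "$2"/
}

# Intended sequence, run by hand as root on node 1:
#   copy_tree /boot /root/boot-stage   # stage node 1's boot partition contents
#   umount /boot                       # expose the /boot directory of the root
#   copy_tree /root/boot-stage /boot   # populate the root's /boot directory
#   (then remove the /boot entry from /etc/fstab)
```

Staging through a scratch directory is needed because the /boot directory of the root is hidden for as long as the boot partition is mounted on top of it.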
Boot stops after IP address obtained
Problem: Users report that sometimes cluster nodes stop booting after obtaining an address by DHCP.
Reason: This is a problem with the TFTP daemon on Debian.
Solution: Installing atftpd and running it in standalone mode seems to fix the problem. The following options are used to invoke atftpd:
--daemon --port 69 --retry-timeout 2 --no-multicast --maxthread 100 /tftpboot
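On Debian, the atftpd package normally reads its startup settings from /etc/default/atftpd. Assuming that layout (it is not described above and may differ between package versions), the options could be recorded roughly like this:

```shell
# /etc/default/atftpd (assumed location; check your package version)
# Run atftpd as a standalone daemon rather than from inetd.
USE_INETD=false
# Options taken from the list above.
OPTIONS="--daemon --port 69 --retry-timeout 2 --no-multicast --maxthread 100 /tftpboot"
```

If the package's init script already passes --daemon itself, it can be dropped from OPTIONS. Any tftp entry left in the inetd configuration should be disabled so that the two do not compete for port 69.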
Fixing RPM database error after upgrading glibc using provided openssi-rh9-1.2.0-glibc.i686.tar on RH 9
Problem: After the glibc upgrade, rpm operations such as querying a package may fail with this error message: error:db4 error(-30989) from dbcursor->c_get: DB_PAGE_NOTFOUND: requested page not found
Reason: The RPM database has most likely suffered minor corruption.
Solution: Perform the following steps as soon as possible after the glibc update.
1. Back up the current RPM database. The -v flag lists each file so you can check that the correct files are copied; -a preserves timestamps and permissions.
# cp -v -a /var/lib/rpm /tmp/rpm-backup
2. Delete all __db files.
# rm -f /var/lib/rpm/__db*
3. Rebuild the RPM database.
# rpm -vv --rebuilddb
4. Confirm that everything works by listing all currently installed RPMs.
# rpm -qa
5. If something goes wrong, restore the database by copying the files in /tmp/rpm-backup back into /var/lib/rpm (again using -a).
# cp -a -v /tmp/rpm-backup/. /var/lib/rpm/
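Steps 1 and 2 can be combined into a small shell helper so that the __db files are never deleted without a backup in place. This is a minimal sketch; the function name backup_and_clean_rpmdb is invented for illustration, and it assumes the backup directory does not exist yet.

```shell
# backup_and_clean_rpmdb DBDIR BACKUPDIR
# Copy DBDIR to BACKUPDIR (which must not exist yet), then delete the
# __db* files from DBDIR. Nothing is deleted if the copy fails.
backup_and_clean_rpmdb() {
    db=$1
    backup=$2
    if [ -e "$backup" ]; then
        echo "backup target $backup already exists" >&2
        return 1
    fi
    cp -a "$db" "$backup" || return 1
    rm -f "$db"/__db*
}

# Typical use as root, followed by steps 3 and 4:
#   backup_and_clean_rpmdb /var/lib/rpm /tmp/rpm-backup &&
#       rpm -vv --rebuilddb &&
#       rpm -qa
```

Refusing to reuse an existing backup directory avoids overwriting a good backup with an already damaged database on a second attempt.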
Reference: Hintskinks
PASV FTP connections on CVIP
Problem: FTP passive-mode (PASV) connections to the CVIP fail to be established.
Reason #1: While FTP is load-balanced, LVS is not aware that connections to the passive ports need to be routed to the realserver that established the FTP control channel.
Reason #2: HA-LVS sometimes rejects connections to ports on the CVIP not listed in the IPVS services table.
Solution #1: Enable persistent connections in LVS for the FTP service. For LVS-NAT you only need to insmod ip_vs_ftp with appropriate module parameters.
Solution #2: Register all the passive listening ports in HA-LVS with setport_weight. (No need to use ipvsadm.)
Porting Woes
Problem: The root node hangs on boot. We are porting OpenSSI to the PowerPC platform. Our development platform is currently Debian 3.1 Sarge with a 2.6 kernel. We have managed to get the kernel to compile, but it hangs while booting, after connecting to CLMS. The pre-cluster process seems to stop there. Any help would be appreciated.
We are guessing the tools did not build correctly. Can they be tested without an OpenSSI kernel running?
Node 2 Boot Fails
Problem: Node 2 hangs on boot, then panics and drops into the debugger.
Platform: Debian Sarge with dhcpd3 & atftpd.
Boot Screen Error:
.....
Setting up IP spoofing protection: rp_filter
Configuring network interfaces ...
Disabling Privacy extentions on device c05ec740(lo)
nm_send: Error 22 sending imalive!
nm_send: Error 22 sending imalive!
Kernel panic - not syncing: lost network connection to all potential root nodes!
Instruction(i) breakpoint #0 at 0xc01272e0 (adjusted)
0xc01272e0 panik_hook: int3
Solution: /cluster/node2/etc/network/interfaces configured node 2's eth0 identically to node 1's eth0, including the same static address. After changing the static address to the DHCP-assigned one, node 2 booted successfully.
Kernel IP Forwarding Disabled After Failover
Problem: In OpenSSI 1.9.x (as of this writing), IPv4 forwarding is disabled in the kernel after an init failover, when an HA-LVS failover is also performed.
Solution: If you require IP forwarding, append the following line to /etc/rc.d/rc.sysrecover (on Fedora Core):
echo 1 > /proc/sys/net/ipv4/ip_forward
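To avoid adding the line twice when the fix is applied more than once, the append can be made idempotent. The helper below is a minimal sketch; ensure_line is a name invented for illustration and is not part of OpenSSI.

```shell
# ensure_line FILE LINE: append LINE to FILE unless an identical line
# is already present (hypothetical helper).
ensure_line() {
    grep -qxF -- "$2" "$1" 2>/dev/null || printf '%s\n' "$2" >> "$1"
}

# As root on Fedora Core:
#   ensure_line /etc/rc.d/rc.sysrecover 'echo 1 > /proc/sys/net/ipv4/ip_forward'
```

After the next failover, cat /proc/sys/net/ipv4/ip_forward should print 1 if the line took effect.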

