You aRe My Z: How to salvage your CVM cluster node when one node has crashed?

Background: This is a real experience in which a node crashed and became unbootable after a kernel patch activity. Unfortunately, no backup had been taken and the mirror had not been broken before the activity. The cluster runs Veritas VCS and CVM for NFS file sharing.

Solution overview: Clone the OS disk from the working node, then modify the configuration to restore the crashed node's host information.

Detailed solution:
1. Use ufsdump to clone the OS from the working node onto a new hard disk (mounted at /newdisk). Also touch the install-db file to prevent VxVM from starting up, modify /newdisk/etc/system and /newdisk/etc/vfstab, and remove the root-done file.
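
The preparation in step 1 could be sketched roughly as below. This is only an illustration, not the exact procedure used here: it assumes the working node's root disk was VxVM-encapsulated (so /etc/system carries the vxio root-device lines), and the function name prepare_clone is made up for the example. The vfstab edit is left to be done by hand because the device names differ on every system.

```shell
#!/bin/sh
# Sketch only: prepare a cloned root (mounted at $1) so it boots without VxVM.
# The copy itself is typically done beforehand with something like:
#   ufsdump 0f - / | (cd /newdisk && ufsrestore rf -)
# ASSUMPTION: the source root was VxVM-encapsulated; check the real files first.
prepare_clone() {
    root=$1
    statedir="$root/etc/vx/reconfig.d/state.d"

    # Comment out the VxVM root-device lines ('*' starts a comment in /etc/system).
    # A backup copy is kept as system.orig.
    cp "$root/etc/system" "$root/etc/system.orig" &&
    sed -e 's/^rootdev:.*vxio.*/* &/' \
        -e 's/^set vxio:vol_rootdev_is_volume.*/* &/' \
        "$root/etc/system.orig" > "$root/etc/system" || return 1

    # install-db stops VxVM from starting at boot; root-done is removed as in step 1.
    mkdir -p "$statedir" &&
    touch "$statedir/install-db" &&
    rm -f "$statedir/root-done"
}
```

It would be called as `prepare_clone /newdisk` once the clone has been restored. The root entries in /newdisk/etc/vfstab still need to be pointed at the new disk's slices manually.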
2. Unplug all network and FC connections on the crashed node.
3. Boot the crashed node from the new hard disk. Because it is an exact copy of the working node, the hostname, IP address, etc. must be changed.
The following files must be changed to restore the correct host information:
/etc/hostname.[interface]
/etc/nodename
/etc/hosts
/etc/llttab
/etc/llthosts
/etc/gabtab
Other files if necessary.
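
To double-check that nothing was missed in the files listed above, a small helper can list which of them still mention the working node's name. This is a minimal sketch; the function name find_stale_refs and the example host name are made up for illustration.

```shell
#!/bin/sh
# Sketch only: list host-specific files under ROOT that still mention OLDNAME.
# Usage: find_stale_refs ROOT OLDNAME   (e.g. find_stale_refs / workingnode)
find_stale_refs() {
    root=$1
    oldname=$2
    for f in "$root"/etc/hostname.* \
             "$root/etc/nodename" "$root/etc/hosts" \
             "$root/etc/llttab" "$root/etc/llthosts" "$root/etc/gabtab"
    do
        # Skip files that do not exist (including an unmatched hostname.* glob).
        [ -f "$f" ] || continue
        # -l prints only the name of a file that contains a match.
        grep -l "$oldname" "$f"
    done
    return 0
}
```

A match in /etc/hosts or /etc/llthosts is expected, since those files list every cluster node. A match in /etc/nodename, /etc/hostname.[interface] or /etc/llttab usually means the file still carries the working node's identity.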

4. Start up VxVM manually.
    # vxconfigd -m disable
    # vxdctl hostid [crashnode]
    # vxdctl enable
    # rm /etc/vx/reconfig.d/state.d/install-db
5. Disable any rc2.d and rc3.d scripts (SFHA) for applications you do not want started on the next reboot. Then reboot the crashed node.
6. Reconnect the FC cables.
    # format (you should see the SAN storage)
    # vxdctl enable
    # vxdisk list (you should see the SAN storage)
7. Freeze the crashed node (run these commands on the working node).

    # haconf -makerw
    # hasys -freeze -persistent [crashnode]
    # haconf -dump -makero

8. Connect the network cables (including heartbeat) and perform a ping test.

9. Copy /etc/VRTSvcs/conf/config/main.cf from the working node to the crashed node.

10. Bring up LLT and GAB.
     # /etc/rc2.d/S70llt start
     # /etc/rc2.d/S92gab start
     # lltstat -vvn
     # gabconfig -a (port a should appear and the membership should include the crashed node)
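
The membership check in step 10 can also be scripted instead of read by eye. The sketch below assumes the usual `gabconfig -a` line format ("Port a gen <generation> membership 01") and a two-node cluster with node IDs 0 and 1. Neither is stated above, so adjust the expected membership string to your own llthosts. The function name gab_port_has_members is made up for the example.

```shell
#!/bin/sh
# Sketch only: read `gabconfig -a` output on stdin and succeed only if the
# given port shows exactly the expected membership string.
# Usage: gabconfig -a | gab_port_has_members PORT MEMBERS
# ASSUMPTION: lines look like "Port a gen <generation> membership 01".
gab_port_has_members() {
    # Output is discarded; only the exit status matters.
    grep "^Port $1 gen .* membership $2\$" >/dev/null
}
```

For example, `gabconfig -a | gab_port_has_members a 01 && echo "port a OK"` would print the message only when both node 0 and node 1 have joined port a.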
11. Unfreeze the crashed node.
     # haconf -makerw
     # hasys -unfreeze -persistent [crashnode]
     # haconf -dump -makero
12. Bring up HA.
     # hastart (run on the crashed node; the node should come back to the running state)
13. Bring up CVM.
     # hagrp -online cvm -sys [crashnode]
     The cvm group should come online.
     # vxdctl -c mode (the crashed node should be the slave)
14. Bring up the rest of the resource groups.
