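Recently I was testing HACMP52 at a customer site and came across a new tool for checking whether an HACMP configuration is correct. When I first ran it the tests passed, but one of the machines kept getting killed, so I hit the books to find out why. Here are my notes.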
HACMP52 Administration and Troubleshooting Guide
Order of the tests in an automated test run:
1. General topology tests
2. Resource group tests on non-concurrent resource groups
3. Catastrophic failure test
General topology tests
1. Bring a node up and start cluster services on all available nodes
The test is started on node_a, and cluster services are started on both nodes.
2. Stop cluster services gracefully on a node
Stops cluster services on node_a.
3. Restart cluster services on the node that was stopped
Restarts cluster services on node_a.
4. Stop cluster services with takeover on another node
Stops cluster services on node_b and moves its resources over (takeover) to node_a.
5. Restart cluster services on the node that was stopped
Restarts cluster services on node_b.
6. Forces cluster services to stop on another node
Forces cluster services to stop on node_a.
7. Restart cluster services on the node that was stopped
Restarts cluster services on node_a.
Resource group tests on non-concurrent resource groups
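For non-concurrent resource groups, all resource group tests run on the node that currently owns the resource group, unless the startup policy specifies otherwise.
1. Bring a local network down on a node to produce a resource group fallover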
2. Recover the previously failed network
3. Bring an application server down and recover from the application failure
Catastrophic Failure Test
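The catastrophic test stops one randomly selected node in the cluster, a node that owns an active resource group at that moment. Once the node has been stopped, the machine has to be started again by hand.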
CLSTRMGR_KILL, node1, Kill the cluster manager on a node
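This is why, once the tests have succeeded, the machine that started the test gets shut down (when running in cascading mode with the test started from the primary node): the tool kills the cluster manager directly.
Below is the description of the test procedure, taken from the opening section of the test log on the machine: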
-------------------------------------------------------
| Building Test Queue
-------------------------------------------------------
Test Plan: /usr/es/sbin/cluster/cl_testtool/auto_topology
Event 1: NODE_UP: NODE_UP,ALL,Start cluster services on all available nodes
-------------------------------------------------------
| Validate NODE_UP
-------------------------------------------------------
Event node: ALL
Configured nodes: node_a node_b
Event 2: NODE_DOWN_GRACEFUL: NODE_DOWN_GRACEFUL,node1,Stop cluster services gracefully on a node
-------------------------------------------------------
| Validate NODE_DOWN_GRACEFUL
-------------------------------------------------------
Event node: node_b
Configured nodes: node_a node_b
Event 3: NODE_UP: NODE_UP,node1,Restart cluster services on the node that was stopped
-------------------------------------------------------
| Validate NODE_UP
-------------------------------------------------------
Event node: node_b
Configured nodes: node_a node_b
Event 4: NODE_DOWN_TAKEOVER: NODE_DOWN_TAKEOVER,node2,Stop cluster services with takeover on a node
-------------------------------------------------------
| Validate NODE_DOWN_TAKEOVER
-------------------------------------------------------
Event node: node_a
Configured nodes: node_a node_b
Event 5: NODE_UP: NODE_UP,node2,Restart cluster services on the node that was stopped
-------------------------------------------------------
| Validate NODE_UP
-------------------------------------------------------
Event node: node_a
Configured nodes: node_a node_b
Event 6: NODE_DOWN_FORCED: NODE_DOWN_FORCED,node3,Stop cluster services forced on a node
-------------------------------------------------------
| Validate NODE_DOWN_FORCED
-------------------------------------------------------
Event node: node_b
Configured nodes: node_a node_b
Event 7: NODE_UP: NODE_UP,node3,Restart cluster services on the node that was stopped
At this point, the test run is finished.
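To make the "Event N:" lines in the log above easier to read, here is a small Python sketch that splits them apart. It assumes each plan entry is three comma-separated fields (event name, node placeholder, description), which is inferred only from the log lines shown here and not from the documented test plan format. The names `PlanEntry`, `parse_plan_entry` and `parse_event_line` are made up for this illustration.

```python
from typing import NamedTuple, Tuple


class PlanEntry(NamedTuple):
    event: str        # e.g. NODE_DOWN_GRACEFUL
    node: str         # placeholder as printed in the log: ALL, node1, node2, ...
    description: str  # free-text comment


def parse_plan_entry(entry: str) -> PlanEntry:
    """Split one 'EVENT,node,description' entry as it appears in the log.

    Assumption: only the first two commas are separators, so a comma
    inside the description stays part of the description.
    """
    event, node, description = entry.split(",", 2)
    return PlanEntry(event.strip(), node.strip(), description.strip())


def parse_event_line(line: str) -> Tuple[int, PlanEntry]:
    """Parse a log line of the form 'Event N: EVENT: EVENT,node,description'."""
    head, _name, entry = line.split(":", 2)
    number = int(head.split()[1])  # 'Event 2' -> 2
    return number, parse_plan_entry(entry)


if __name__ == "__main__":
    sample = ("Event 4: NODE_DOWN_TAKEOVER: NODE_DOWN_TAKEOVER,node2,"
              "Stop cluster services with takeover on a node")
    print(parse_event_line(sample))
```

Note that the node field is only a placeholder (node1, node2, ...). The real node picked for each event shows up in the "Event node:" line of the matching Validate block.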
-------------------------------------------------------
| Building Test Queue
-------------------------------------------------------
Test Plan: /usr/es/sbin/cluster/cl_testtool/auto_cluster_kill
Event 1: CLSTRMGR_KILL: CLSTRMGR_KILL,node1,Kill the cluster manager on a node
-------------------------------------------------------
| Validate CLSTRMGR_KILL
-------------------------------------------------------
Event node: node_a
Configured nodes: node_a node_b
###########################################################################
## Starting Cluster Test Tool: -c -e /usr/es/sbin/cluster/cl_testtool/auto_cluster_kill##
###########################################################################
===========================================================================
||
|| Starting Test 1 - CLSTRMGR_KILL: Kill the cluster manager on a node
||
===========================================================================
-------------------------------------------------------
| Executing Command for CLSTRMGR_KILL
-------------------------------------------------------
/usr/es/sbin/cluster/cl_testtool/cl_testtool_ctrl -e CLSTRMGR_KILL -m execute 'node_a'
-------------------------------------------------------
After the tests finish, the tool builds one more test queue and shuts down the machine of the node that started the test.
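To see quickly which real node each event hits (and so which machine is going to be killed), the Validate blocks in the log can be paired with their "Event node:" lines. The following is a rough Python sketch that relies only on the log layout pasted above; `validated_nodes` is a hypothetical helper, not part of the Cluster Test Tool.

```python
from typing import List, Optional, Tuple


def validated_nodes(log_text: str) -> List[Tuple[str, str]]:
    """Pair each '| Validate EVENT' header with the 'Event node:' line after it."""
    pairs: List[Tuple[str, str]] = []
    current_event: Optional[str] = None
    for raw in log_text.splitlines():
        line = raw.strip()
        if line.startswith("| Validate "):
            # Remember the event name until its 'Event node:' line turns up.
            current_event = line[len("| Validate "):].strip()
        elif line.startswith("Event node:") and current_event is not None:
            pairs.append((current_event, line.split(":", 1)[1].strip()))
            current_event = None
    return pairs


if __name__ == "__main__":
    excerpt = """\
-------------------------------------------------------
| Validate CLSTRMGR_KILL
-------------------------------------------------------
Event node: node_a
Configured nodes: node_a node_b
"""
    for event, node in validated_nodes(excerpt):
        print(event, "->", node)
```

Run against the excerpt above, this pairs CLSTRMGR_KILL with node_a, the node that started the test.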
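If you want the machine that was shut down to come back up automatically after the test, the following steps should do it (I haven't tried this myself, heh):
1. Edit the /etc/cluster/hacmp.term file to change the default action the system takes after an abnormal exit. The clexit.rc script checks whether hacmp.term exists; if hacmp.term is executable, clexit.rc calls hacmp.term instead of automatically shutting the system down.
2. Before running the cluster test tool, configure the node for automatic IPL (auto-Initial Program Load (IPL)).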
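The decision described in step 1 can be sketched in a few lines of Python. This only models the check as described above (file present and executable, then call it, otherwise halt). It is not the actual clexit.rc code, and the function name `action_after_abnormal_exit` and the returned strings are made up for illustration.

```python
import os


def action_after_abnormal_exit(term_path: str = "/etc/cluster/hacmp.term") -> str:
    """Return which branch would be taken, following the description above.

    This only models the decision; it never runs hacmp.term and never
    halts anything.
    """
    if os.path.isfile(term_path) and os.access(term_path, os.X_OK):
        return "run hacmp.term"
    return "halt system"


if __name__ == "__main__":
    print(action_after_abnormal_exit())
```

On a machine with no executable /etc/cluster/hacmp.term this reports the halt branch, which matches the default behaviour seen in the test: the node simply goes down.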
In the new HACMP52 you can also view the clstat output through WebSMIT and a browser, which gives a direct view of the cluster state. It keeps getting easier to use, heh.
Source: 领测软件测试网 https://www.ltesting.net/