Linux High-Performance Clusters - Resource Management and System Management

Posted: 2007-7-04 12:06 | Author: admin



Author: Jin Ge  Category: Linux cluster technology  Date: 2004.03.24


1 Cluster job management


2 Job management software in Beowulf clusters


2.1 PBS

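PBS (Portable Batch System) is a flexible batch system developed by NASA. It is used on cluster systems, supercomputers, and massively parallel systems. PBS has the following main features:
Portability: it conforms to the POSIX 1003.2 standard and can be used in shell, batch, and other environments.

OpenPBS (http://www.OpenPBS.org/) is the open-source implementation of PBS. For the commercial version of PBS, see http://www.pbspro.com/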

2.2 Maui

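Maui is an advanced job scheduler. It uses aggressive scheduling policies to optimize resource utilization and reduce job response time. Maui's resource and workload management allows advanced parameter configuration: job priority, scheduling and allocation, fairness and fairshare, and reservation policy. Maui's QoS mechanism allows direct delivery of resources and services, policy exemption, and restricted access to specified features. Maui's advanced resource reservation infrastructure gives precise control over when, where, by whom, and how resources are used. The reservation infrastructure fully supports non-intrusive metascheduling.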




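3 Cluster system management


3.1 Resource management


3.2 Event services


3.3 Distributed commands and files


Distributed command functionality is usually provided by a distributed shell, generally called dsh (distributed shell) or psh (parallel shell). A distributed shell can be implemented on top of rsh or ssh.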
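
The idea of building a distributed shell on top of rsh or ssh can be sketched in a few lines. This is an illustration only, not how dsh or psh are actually implemented: the function name dsh_run, the DSH_RSH variable, and the node-file format (one hostname per line) are all assumptions, and the loop runs the nodes one after another rather than in parallel.

```shell
#!/bin/sh
# Minimal distributed-shell sketch: run one command on every listed node.
# DSH_RSH is a hypothetical variable naming the remote-shell program
# (ssh by default; rsh would also work where it is still enabled).
DSH_RSH="${DSH_RSH:-ssh}"

# dsh_run NODEFILE COMMAND...
# NODEFILE holds one hostname per line; blank lines and '#' comments are skipped.
dsh_run() {
    nodefile=$1
    shift
    while read -r node; do
        case "$node" in
            ''|'#'*) continue ;;
        esac
        # Prefix each output line with the node name so results can be told apart.
        # </dev/null keeps the remote shell from swallowing the rest of NODEFILE.
        $DSH_RSH "$node" "$@" </dev/null 2>&1 | sed "s/^/$node: /"
    done <"$nodefile"
}
```

With a file nodes.txt listing the cluster nodes, `dsh_run nodes.txt uptime` would print the uptime of each node, one prefixed line per node.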


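3.4 Monitoring and diagnostics


3.5 Hardware control

Remote power management: mainly powering nodes off and on, rebooting them remotely, and querying node power status. The IBM eServer Cluster 1300 uses ASM for this.
Remote console: when a remote node has a problem, or some special software requires it, an administrator needs to log in directly on the node. A KVM switch can meet this need, but it becomes complicated when there are many nodes. A KVM switch also has to be switched by hand and cannot be driven from software. A terminal server overcomes these shortcomings: it connects to the nodes' serial ports and presents each serial port as a terminal device on the management node. This requires some corresponding configuration of the node's operating system.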

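3.6 System installation

Network boot: set the nodes to be installed to boot from the network, then have the management node reboot them remotely. After starting, a network-booted node obtains a small operating system kernel from the boot server. Network boot generally uses Intel's PXE (Preboot Execution Environment) standard. PXELinux is a boot loader that supports PXE network boot. It can start a small Linux kernel on the network-booted node and run a specified init program, which takes care of the rest of the installation.
Network installation: this kernel fetches the installation packages or system image from the installation server (usually a file server) and performs the installation locally. Several Linux tools can do network-based installation. Typical examples are KickStart, ALICE (Automatic Linux Installation and Configuration Environment), SIS (System Installation Suite), and PartImage. These tools fall into the following categories:
a. Script-based installation: the installation is controlled by an installation script and can be configured by modifying that script. The installation server is really just a file server that supplies the packages to install. Apart from the packages not coming from local media, this method differs little from a local installation and goes through the same steps (configuring hardware, installing packages, configuring the system, and so on). KickStart belongs to this category. Script-based installation is flexible but operating-system dependent; KickStart, for example, supports only Red Hat Linux.
b. Imaging-based installation: unlike script-based installation, imaging does not go through the steps of a local installation. It simply copies a system image stored on the file server to the local disk. The image comes from a reference machine that has already been installed and configured. Imaging is independent of the operating system, but it depends on the file systems supported by the network-booted kernel. A major drawback of imaging is that it is hard to provide a configuration method that is independent of the operating system. PartImage is an imaging tool. SIS is a hybrid of the script and imaging approaches: it uses the Linux chroot command to install a virtual operating system image under a directory on the installation server, and it lets users supply shell scripts for post-installation configuration.
c. Cloning-based installation: like imaging, cloning uses a system image, but here the image is a clone of a disk partition on the reference machine. Cloning therefore does not need to recognize the file system type inside the image; it is independent of the file system and depends only on the disk device types (IDE or SCSI) supported by the kernel. As with imaging, a major drawback is that it is hard to provide a configuration method that is independent of the operating system. Cloning is also less efficient than imaging. A clone can be made simply with the dd command.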
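
The dd-based cloning mentioned in item c could look roughly like the sketch below. The function names and the gzip compression step are illustrative assumptions, not part of any of the tools named above. In real use the source and target would be partition devices, and writing to the wrong device destroys its contents.

```shell
#!/bin/sh
# Cloning sketch: copy a partition block-for-block with dd.
# The function names and the gzip compression step are illustrative assumptions.

# clone_save SOURCE IMAGE: read SOURCE (e.g. a partition device) into a compressed image.
clone_save() {
    dd if="$1" bs=1024k 2>/dev/null | gzip -c >"$2"
}

# clone_restore IMAGE TARGET: write the image back onto TARGET, overwriting it.
clone_restore() {
    gzip -dc "$1" | dd of="$2" bs=1024k 2>/dev/null
}
```

Because dd copies every block whether or not it holds data, the image is as large as the partition before compression, which is one reason cloning is less efficient than imaging.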

Tool         Method                   Supported systems                            Network protocols
KickStart    Script                   Red Hat Linux                                NFS, FTP
SIS          Script/Imaging hybrid    Red Hat Linux, SuSE Linux, Turbo Linux, …    rsync
PartImage    Imaging                  EXT2, FAT, NTFS, HPFS…                       Proprietary protocol
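
For the network-boot step described in this section, a PXELINUX configuration for a KickStart install might be generated as sketched below. The helper name write_pxe_config, the label name, the kernel and initrd file names, and the kickstart location passed in are all hypothetical examples; only the directive names (DEFAULT, PROMPT, LABEL, KERNEL, APPEND) are PXELINUX's own.

```shell
#!/bin/sh
# Sketch: write a PXELINUX configuration that boots a small install kernel.
# The label name, kernel/initrd file names and kickstart location are
# hypothetical examples; the ks= boot option applies to KickStart installs.

# write_pxe_config FILE KS_LOCATION
write_pxe_config() {
    cat >"$1" <<EOF
DEFAULT install
PROMPT 0
LABEL install
    KERNEL vmlinuz
    APPEND initrd=initrd.img ks=$2
EOF
}
```

The output file would normally be placed where the boot server's TFTP service exposes the PXELINUX configuration, for example as pxelinux.cfg/default under the TFTP root (an assumption about the server layout).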

3.7 Domain management



4 Several cluster system management packages



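4.1 CSM

IBM CSM (Cluster Systems Management) is the system management software for the IBM eServer Cluster 1300. Part of IBM's Linux cluster strategy is to port the PSSP software that runs on the RS/6000 SP platform to xSeries-based Linux clusters. Most of CSM's functionality comes from the SP platform, but it also integrates WebSM 2000, xSeries, open-source tools, and other technologies. CSM is a full-featured management tool that is still under active development.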

4.2 XCAT

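XCAT is system management software for the IBM eServer Cluster 1300, developed by Egan Ford. It is written mostly as shell scripts and is quite compact, yet it implements most of what cluster system management requires and is an excellent management tool.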

4.3 Mon


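Item    CSM    XCAT    Mon
Supported cluster systems    IBM eServer Cluster 1300    IBM eServer Cluster 1300    Not tied to any particular cluster system
Supported operating systems    Red Hat, SuSE    Red Hat; nodes can be installed with other operating systems, even Windows, through Imaging and Cloning    Developed on Linux but best known for running on Solaris; easy to port to other Unix and non-Unix operating systems
Resource management    Unified, extensible, and comprehensive resource management, but its power makes it complex to use    Essentially none    Essentially none
Event services    Event publish/subscribe mechanism with many predefined system events and event responses    To be integrated with Mon in the future for event services    Supported
Configuration management    Supported    None    None
Monitoring and diagnostics    Distributed shell (dsh); SNMP    Parallel shell (psh); parallel ping (pping)    SNMP
Hardware control    Remote power management (rpower); remote console (rconsole)    Remote power management (rpower); remote console (rcon, wcon)    None
System installation    KickStart and SIS; PXE    KickStart, Imaging, and Cloning; PXE and etherboot    None
Domain management    Comprehensive    Essentially none    Essentially none
Integration    Apart from the required open-source packages, not integrated with any other software; the underlying resource management and event services expose programming interfaces, so integration is straightforward, and higher layers can integrate through command invocation    Automatically installs PBS, Maui, Myrinet, and MPI; support for the SgridEngine Scheduler is planned    Essentially none; integration through the command line should be possible
Ease of use    Powerful command-line tools and a simple GUI    Command-line tools; integration with Ganglia is planned to provide some GUI    Command-line and web-based tools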

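6 About the author

Jin Ge is a software engineer at IBM who leads Linux cluster system development at the IBM China Development Center. He can be reached at jinge@cn.ibm.com.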



HowTo steps

The following steps are what we use to install PBS from scratch on our systems. Please send corrections and additions to Ole.H.Nielsen@fysik.dtu.dk.

Ensure that tcl8.0 and tk8.0 are installed on the system. Look into the PBS docs to find out about these packages. The homepage is at
http://www.scriptics.com/products/tcltk/. Get Linux RPMs from your favorite distribution, or build it yourself on other UNIXes.
If you installed the PBS binary RPMs on Linux, skip to step 4.

Configure PBS for your choice of spool-directory and the central server machine (named "zeise" in our examples):
./configure --set-server-home=/var/spool/PBS --set-default-server=zeise

On Compaq Tru64 UNIX make sure that you use the Compaq C compiler instead of GNU gcc by doing "setenv CC cc". You should add this flag to the above configure command: --set-cflags="-g3 -O2". It is also important that the path /var/spool/PBS does not include any soft links, such as /var -> /usr/var, since this triggers a bug in the PBS code.
If you compiled PBS for a different architecture before, make sure to clean up before running configure:

  gmake distclean

Run a GNU-compatible make in order to build PBS.
On AIX 4.1.5 edit src/tools/Makefile to add a library: LIBS = -lld

On Compaq Tru64 UNIX use the native Compaq C-compiler:

gmake CC=cc

The default CFLAGS are "-g -O2", but the Compaq compiler requires "-g3 -O2" for optimization. Set this with:
./configure (flags) --set-cflags="-g3 -O2"

After the make has completed, install the PBS files as the root superuser:
gmake install

Create the "nodes" file in the central server's (zeise) directory /var/spool/PBS/server_priv containing hostnames, see the PBS 2.2 Admin Guide p.8 (Sec. 2.2 "Installation Overview" point 8.). Substitute the spool-directory name /var/spool/PBS by your own choice (the Linux RPM uses /var/spool/pbs). Check the file /var/spool/PBS/pbs_environment and ensure that important environment variables (such as the TZ timezone variable) have been included by the installation process. Add any required variables in this file.

Initialize the PBS server daemon and scheduler:
/usr/local/sbin/pbs_server -t create
/usr/local/sbin/pbs_sched

The "-t create" option should only be used once, at installation time.
The pbs_server and pbs_sched daemons should be started at boot time. On Linux this is done automatically by /etc/rc.d/init.d/pbs. Otherwise use your UNIX's standard method (e.g. /etc/rc.local) to run the following commands at boot time:

/usr/local/sbin/pbs_server -a true
/usr/local/sbin/pbs_sched

The "-a true" sets the scheduling attribute to True, so that jobs may start running.

Create queues using the "qmgr" command; see the manual pages for "pbs_server_attributes" and "pbs_queue_attributes". The "print server" command in qmgr lists the server configuration. Its output can be used as input to qmgr, so this is also a way to back up your server setup. You may save the output of qmgr (for example, the setup listed below) in a file, removing the first 2 lines, which are not valid commands. Then pipe this file into qmgr with "cat file | qmgr" and everything is configured in a couple of seconds.
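
The backup-and-restore round trip described above might be scripted as follows. This is a minimal sketch: the QMGR variable and the function names are hypothetical, and it assumes that qmgr's -c option (run a single command non-interactively) produces only commands and # comment lines. If your version prints extra header lines, strip them before restoring.

```shell
#!/bin/sh
# Sketch of a qmgr backup/restore round trip.
# QMGR is a hypothetical override for the path to the qmgr binary.
QMGR="${QMGR:-qmgr}"

# pbs_backup FILE: save the server and queue definitions as qmgr commands.
pbs_backup() {
    $QMGR -c "print server" >"$1"
}

# pbs_restore FILE: replay a saved configuration into qmgr.
pbs_restore() {
    $QMGR <"$1"
}
```

Running pbs_backup before each change to the queue setup gives a plain-text file that can be kept under version control and replayed with pbs_restore.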
Our current configuration is:

# qmgr
Max open servers: 4
Qmgr: print server
# Create queues and set their attributes.
# Create and define queue verylong
create queue verylong
set queue verylong queue_type = Execution
set queue verylong Priority = 40
set queue verylong max_running = 10
set queue verylong resources_max.cput = 72:00:00
set queue verylong resources_min.cput = 12:00:01
set queue verylong resources_default.cput = 72:00:00
set queue verylong enabled = True
set queue verylong started = True
# Create and define queue long
create queue long
set queue long queue_type = Execution
set queue long Priority = 60
set queue long max_running = 10
set queue long resources_max.cput = 12:00:00
set queue long resources_min.cput = 02:00:01
set queue long resources_default.cput = 12:00:00
set queue long enabled = True
set queue long started = True
# Create and define queue medium
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 80
set queue medium max_running = 10
set queue medium resources_max.cput = 02:00:00
set queue medium resources_min.cput = 00:20:01
set queue medium resources_default.cput = 02:00:00
set queue medium enabled = True
set queue medium started = True
# Create and define queue small
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small max_running = 10
set queue small resources_max.cput = 00:20:00
set queue small resources_default.cput = 00:20:00
set queue small enabled = True
set queue small started = True                  
# Create and define queue default
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = small
set queue default route_destinations += medium
set queue default route_destinations += long
set queue default route_destinations += verylong
set queue default enabled = True
set queue default started = True
# Set server attributes.
set server scheduling = True
set server max_user_run = 6
set server acl_host_enable = True
set server acl_hosts = *.fysik.dtu.dk
set server acl_hosts += *.alpha.fysik.dtu.dk
set server default_queue = default
set server log_events = 63
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.cput = 01:00:00
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
set server default_node = 1#shared    
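
To see how the cput limits in this listing partition jobs among the four execution queues, the following sketch models the "default" route queue. It is a simplified illustration, not PBS code: it assumes destinations are tried in the listed order and that the first queue whose resources_min.cput and resources_max.cput admit the request accepts the job. The function names are hypothetical, and the 00:00:00 lower bound for "small" stands in for its missing resources_min.cput.

```shell
#!/bin/sh
# Simplified model of how the "default" route queue places a job, assuming
# destinations are tried in order and the first queue whose cput limits admit
# the request wins. The real decision is made inside pbs_server.

# to_seconds HH:MM:SS: print the time as a number of seconds.
to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# route_queue CPUT: print the execution queue for a requested cput time.
route_queue() {
    want=$(to_seconds "$1")
    # Columns: name, resources_min.cput, resources_max.cput (from the listing above).
    while read -r name min max; do
        if [ "$want" -ge "$(to_seconds "$min")" ] &&
           [ "$want" -le "$(to_seconds "$max")" ]; then
            echo "$name"
            return 0
        fi
    done <<EOF
small    00:00:00 00:20:00
medium   00:20:01 02:00:00
long     02:00:01 12:00:00
verylong 12:00:01 72:00:00
EOF
    echo "rejected"
    return 1
}
```

Under this model a job that takes the server default of resources_default.cput = 01:00:00 lands in "medium", and a request above 72:00:00 fits no queue.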

Install the PBS software on the client nodes, repeating steps 1-3 above.

Configure the PBS nodes so that they know the server: Check that the file /var/spool/PBS/server_name contains the name of the PBS server (zeise in this example), and edit it if appropriate. Also make sure that this hostname resolves correctly (with or without the domain-name), otherwise the pbs_server may refuse connections from the qmgr command.
Create the file /var/spool/PBS/mom_priv/config on all PBS nodes (server and clients) with the contents:

# The central server must be listed:
$clienthost zeise

where the correct servername must replace "zeise". You may add other relevant lines as recommended in the manual, for example for restricting access and for logging:
$logevent 0x1ff
$restricted *.your.domain.name

(list the domain names that you want to give access).
For maintenance of the configuration file, we use rdist to duplicate /var/spool/PBS/mom_priv/config from the server to all PBS nodes.
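
If the MOM configuration is generated rather than edited by hand, a small helper along these lines could produce it before it is distributed. The name write_mom_config is hypothetical; the directives are only the ones quoted above ($clienthost, $logevent, $restricted).

```shell
#!/bin/sh
# Sketch: generate the pbs_mom configuration shown above.
# write_mom_config is a hypothetical helper; the directives are the ones
# quoted in the text ($clienthost, $logevent, $restricted).

# write_mom_config FILE SERVER [DOMAIN_PATTERN]
write_mom_config() {
    {
        echo '# The central server must be listed:'
        echo "\$clienthost $2"
        echo '$logevent 0x1ff'
        # The access restriction is optional; emit it only when a pattern is given.
        if [ -n "${3:-}" ]; then
            echo "\$restricted $3"
        fi
    } >"$1"
}
```

For example, `write_mom_config /var/spool/PBS/mom_priv/config zeise '*.your.domain.name'` would recreate the file described in this step.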

Start the MOM mini-servers on both the server and the client nodes:

/usr/local/sbin/pbs_mom

or "/etc/rc.d/init.d/pbs start" on Linux. Make sure that MOM is started at boot time. See discussion under point 5.
On Compaq Tru64 UNIX 4.0E+F there may be a problem with starting pbs_mom too soon. Some network problem makes pbs_mom report errors in an infinite loop, which fills up the logfiles' filesystem within a short time. Several people told me that they don't have this problem, so it's not understood at present.
The following section is only relevant if you have this problem on Tru64 UNIX.

On Tru64 UNIX start pbs_mom from the last entry in /etc/inittab:

# Portable Batch System batch execution mini-server
pbsmom::once:/etc/rc.pbs > /dev/console 2>&1

The file /etc/rc.pbs delays the startup of pbs_mom:
# Portable Batch System (PBS) startup
# On Digital UNIX, pbs_mom fills up the mom_logs directory
# within minutes after reboot.  Try to sleep at startup
# in order to avoid this.
# Directory holding pbs_mom; adjust if PBS was installed elsewhere.
PBS_SBIN=/usr/local/sbin
if [ -x $PBS_SBIN/pbs_mom ]; then
    echo PBS startup.
    # Sleep for a while
    sleep 120
    $PBS_SBIN/pbs_mom       # MOM
    echo Done.
else
    echo Could not execute PBS commands !
fi

Queues defined above do not work until you start them:
qstart  default small medium long verylong
qenable default small medium long verylong

This needs to be done only once, when you install PBS.

Make sure that the PBS server has all nodes correctly defined. Use the pbsnodes -a command to list all nodes.
Add nodes using the qmgr command:

# qmgr
Max open servers: 4
Qmgr: create node node99 properties=ev67

where the node-name is node99 with the properties=ev67. Alternatively, you may simply list the nodes in the file /var/spool/PBS/server_priv/nodes:
server:ts ev67
node99 ev67

The :ts indicates a time-shared node; nodes without :ts are cluster nodes where batch jobs may execute. The second column lists the properties that you associate with the node. Restart pbs_server after manually editing the nodes file.
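
For a larger cluster, the nodes file in the format just described could be generated from a host list instead of typed by hand. The helper below is a hypothetical sketch that assumes every node carries the same single property.

```shell
#!/bin/sh
# Sketch: write a PBS nodes file with one time-shared server entry followed
# by cluster nodes. make_nodes_file is a hypothetical helper name.

# make_nodes_file FILE SERVER PROPERTY NODE...
make_nodes_file() {
    out=$1
    server=$2
    prop=$3
    shift 3
    {
        # The server is marked :ts (time-shared), as in the example above.
        echo "${server}:ts $prop"
        # Every remaining argument becomes a cluster node with the same property.
        for n in "$@"; do
            echo "$n $prop"
        done
    } >"$out"
}
```

Called as `make_nodes_file /var/spool/PBS/server_priv/nodes server ev67 node99`, it reproduces the two-line example above.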

After you first set up your system, you need to set the server scheduling attribute to true to get jobs to actually run. This will normally be done for you at boot time (see point 5 in this file), but for this first time you will need to do it by hand using the qmgr command:
# qmgr
Max open servers: 4
Qmgr: set server scheduling=true

