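Linux High-Performance Clusters - Resource Management and System Management
Author: 金戈 Category: Linux cluster technology Date: 2004.03.24
This article is the fifth part of the Linux High-Performance Clusters series. This part begins by introducing cluster systems…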
http://www.leftworld.net/wenzhang/show.php?id=541
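OpenPBS source code
OpenPBS (http://www.OpenPBS.org/) is an open-source implementation of PBS (Portable Batch System). For the commercial version of PBS, see http://www.pbspro.com/.
PBS is a scalable job-queueing and workload-management system, originally developed for NASA. It runs in networked, multi-platform UNIX environments, including heterogeneous clusters of workstations, supercomputers, and massively parallel systems.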
HowTo steps
The following steps are what we use to install PBS from scratch on our systems. Please send corrections and additions to Ole.H.Nielsen@fysik.dtu.dk.
Ensure that tcl8.0 and tk8.0 are installed on the system. Look into the PBS docs to find out about these packages. The homepage is at http://www.scriptics.com/products/tcltk/. Get Linux RPMs from your favorite distribution, or build them yourself on other UNIXes.
If you installed the PBS binary RPMs on Linux, skip to step 4.
Configure PBS for your choice of spool-directory and the central server machine (named "zeise" in our examples):
./configure --set-server-home=/var/spool/PBS --set-default-server=zeise
On Compaq Tru64 UNIX make sure that you use the Compaq C-compiler instead of the GNU gcc by doing "setenv CC cc". You should add these flags to the above configure command: --set-cflags="-g3 -O2". It is also important that the path /var/spool/PBS does not include any soft-links, such as /var -> /usr/var, since this triggers a bug in the PBS code.
If you compiled PBS for a different architecture before, make sure to clean up before running configure:
gmake distclean
Run a GNU-compatible make in order to build PBS.
On AIX 4.1.5 edit src/tools/Makefile to add a library: LIBS = -lld
On Compaq Tru64 UNIX use the native Compaq C-compiler:
gmake CC=cc
The default CFLAGS are "-g -O2", but the Compaq compiler requires "-g3 -O2" for optimization. Set this with:
./configure (flags) --set-cflags="-g3 -O2"
After the make has completed, install the PBS files as the root superuser:
gmake install
Create the "nodes" file in the central server's (zeise) directory /var/spool/PBS/server_priv containing hostnames, see the PBS 2.2 Admin Guide p.8 (Sec. 2.2 "Installation Overview" point 8.). Substitute the spool-directory name /var/spool/PBS by your own choice (the Linux RPM uses /var/spool/pbs). Check the file /var/spool/PBS/pbs_environment and ensure that important environment variables (such as the TZ timezone variable) have been included by the installation process. Add any required variables in this file.
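The check of pbs_environment described above can be sketched as a small shell helper. This is only an illustration: the function name missing_env_vars is hypothetical, and it assumes the file holds one NAME=value entry per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): print every requested variable
# name that has no NAME=value line in the given environment file.
# Assumption: the file contains one NAME=value entry per line.
# Usage: missing_env_vars FILE NAME...
missing_env_vars() {
    env_file=$1
    shift
    for name in "$@"; do
        # Look for "NAME=" at the start of a line
        if ! grep -q "^${name}=" "$env_file" 2>/dev/null; then
            echo "$name"
        fi
    done
}

# Example (path from the text above; adjust to your spool directory):
#   missing_env_vars /var/spool/PBS/pbs_environment TZ PATH
```

If the function prints nothing, every listed variable is present; any name it prints should be added to the file by hand.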
Initialize the PBS server daemon and scheduler:
/usr/local/sbin/pbs_server -t create
/usr/local/sbin/pbs_sched
The "-t create" should only be executed once, at the time of installation!
The pbs_server and pbs_sched should be started at boot time: On Linux this is done automatically by /etc/rc.d/init.d/pbs. Otherwise use your UNIX's standard method (e.g. /etc/rc.local) to run the following commands at boot time:
/usr/local/sbin/pbs_server -a true
/usr/local/sbin/pbs_sched
The "-a true" sets the scheduling attribute to True, so that jobs may start running.
Create queues using the "qmgr" command; see the manual pages for "pbs_server_attributes" and "pbs_queue_attributes". List the server configuration with the print server command. The output can be used as input to qmgr, so this is a way to back up your server setup. You may save the output of qmgr (for example, the setup listed below) in a file, removing the first 2 lines, which are not valid commands. Pipe this file into qmgr like this: cat file | qmgr and everything is configured in a couple of seconds!
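The cleanup of a saved qmgr session described above can be sketched as a filter. The function name clean_qmgr_dump and the file names in the comments are hypothetical; the filter matches the session lines by pattern instead of counting lines, so it also works if the shell prompt line was captured.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): read a saved interactive qmgr
# session on stdin and drop the lines that are not qmgr commands:
# the shell prompt line, the "Max open servers" banner, and the
# "Qmgr:" prompt line. Everything else passes through unchanged.
clean_qmgr_dump() {
    grep -v -e '^# qmgr$' -e '^Max open servers:' -e '^Qmgr:'
}

# Example (file names are placeholders):
#   clean_qmgr_dump < session.txt > server.conf
#   cat server.conf | qmgr
# A non-interactive dump avoids the cleanup altogether:
#   qmgr -c "print server" > server.conf
```

The comment lines starting with "#" are kept because they are part of the print server output that qmgr reads back.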
Our current configuration is:
# qmgr
Max open servers: 4
Qmgr: print server
#
# Create queues and set their attributes.
#
#
# Create and define queue verylong
#
create queue verylong
set queue verylong queue_type = Execution
set queue verylong Priority = 40
set queue verylong max_running = 10
set queue verylong resources_max.cput = 72:00:00
set queue verylong resources_min.cput = 12:00:01
set queue verylong resources_default.cput = 72:00:00
set queue verylong enabled = True
set queue verylong started = True
#
# Create and define queue long
#
create queue long
set queue long queue_type = Execution
set queue long Priority = 60
set queue long max_running = 10
set queue long resources_max.cput = 12:00:00
set queue long resources_min.cput = 02:00:01
set queue long resources_default.cput = 12:00:00
set queue long enabled = True
set queue long started = True
#
# Create and define queue medium
#
create queue medium
set queue medium queue_type = Execution
set queue medium Priority = 80
set queue medium max_running = 10
set queue medium resources_max.cput = 02:00:00
set queue medium resources_min.cput = 00:20:01
set queue medium resources_default.cput = 02:00:00
set queue medium enabled = True
set queue medium started = True
#
# Create and define queue small
#
create queue small
set queue small queue_type = Execution
set queue small Priority = 100
set queue small max_running = 10
set queue small resources_max.cput = 00:20:00
set queue small resources_default.cput = 00:20:00
set queue small enabled = True
set queue small started = True
#
# Create and define queue default
#
create queue default
set queue default queue_type = Route
set queue default max_running = 10
set queue default route_destinations = small
set queue default route_destinations += medium
set queue default route_destinations += long
set queue default route_destinations += verylong
set queue default enabled = True
set queue default started = True
#
# Set server attributes.
#
set server scheduling = True
set server max_user_run = 6
set server acl_host_enable = True
set server acl_hosts = *.fysik.dtu.dk
set server acl_hosts += *.alpha.fysik.dtu.dk
set server default_queue = default
set server log_events = 63
set server mail_from = adm
set server query_other_jobs = True
set server resources_default.cput = 01:00:00
set server resources_default.neednodes = 1
set server resources_default.nodect = 1
set server resources_default.nodes = 1
set server scheduler_iteration = 60
set server default_node = 1#shared
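To make the CPU-time windows of the four execution queues easier to see, here is a rough sketch that maps a requested cput value to the queue whose limits contain it. It only mirrors the resources_min.cput and resources_max.cput values from the listing; the real decision is made by pbs_server. The function names are hypothetical, and the sketch assumes the route queue hands a job to the first destination whose limits accept it.

```shell
#!/bin/sh
# Hypothetical helpers (not part of PBS).

# Convert HH:MM:SS to seconds. awk is used so that fields with a
# leading zero such as "08" are read as decimal numbers.
cput_to_seconds() {
    echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Print the queue whose cput window (taken from the listing above)
# contains the requested CPU time, or "none" if the request exceeds
# every resources_max.cput.
queue_for_cput() {
    secs=$(cput_to_seconds "$1")
    if   [ "$secs" -le 1200 ];   then echo small      # up to 00:20:00
    elif [ "$secs" -le 7200 ];   then echo medium     # 00:20:01 to 02:00:00
    elif [ "$secs" -le 43200 ];  then echo long       # 02:00:01 to 12:00:00
    elif [ "$secs" -le 259200 ]; then echo verylong   # 12:00:01 to 72:00:00
    else echo none
    fi
}

# Example: the server default of 01:00:00 falls in the medium window.
#   queue_for_cput 01:00:00
```

Under these assumptions a job submitted without a cput request picks up the server default of 01:00:00 and ends up in the medium queue.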
Install the PBS software on the client nodes, repeating steps 1-3 above.
Configure the PBS nodes so that they know the server: Check that the file /var/spool/PBS/server_name contains the name of the PBS server (zeise in this example), and edit it if appropriate. Also make sure that this hostname resolves correctly (with or without the domain-name), otherwise the pbs_server may refuse connections from the qmgr command.
Create the file /var/spool/PBS/mom_priv/config on all PBS nodes (server and clients) with the contents:
# The central server must be listed:
$clienthost zeise
where the correct servername must replace "zeise". You may add other relevant lines as recommended in the manual, for example for restricting access and for logging:
$logevent 0x1ff
$restricted *.your.domain.name
(list the domain names that you want to give access).
For maintenance of the configuration file, we use rdist to duplicate /var/spool/PBS/mom_priv/config from the server to all PBS nodes.
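As an illustration of the MOM configuration described above, the following sketch prints the file contents for a given server name. The function name mom_config is hypothetical; the directives are the ones shown in this HowTo, and the $restricted line is only emitted when a pattern is supplied.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): print a minimal MOM config.
# Usage: mom_config SERVER [RESTRICTED_PATTERN]
mom_config() {
    server=$1
    restricted=${2:-}
    echo '# The central server must be listed:'
    printf '$clienthost %s\n' "$server"
    echo '$logevent 0x1ff'
    # Only add an access restriction when a pattern was given
    if [ -n "$restricted" ]; then
        printf '$restricted %s\n' "$restricted"
    fi
}

# Example (server name and path from the text above):
#   mom_config zeise > /var/spool/PBS/mom_priv/config
```

Generating the file from one function keeps the server name in a single place, which may help when the same file has to be copied to many nodes.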
Start the MOM mini-servers on both the server and the client nodes:
/usr/local/sbin/pbs_mom
or "/etc/rc.d/init.d/pbs start" on Linux. Make sure that MOM is started at boot time. See discussion under point 5.
On Compaq Tru64 UNIX 4.0E+F there may be a problem with starting pbs_mom too soon. Some network problem makes pbs_mom report errors in an infinite loop, which fills up the logfiles' filesystem within a short time! Several people told me that they don't have this problem, so it's not understood at present.
The following section is only relevant if you have this problem on Tru64 UNIX.
On Tru64 UNIX start pbs_mom from the last entry in /etc/inittab:
# Portable Batch System batch execution mini-server
pbsmom::once:/etc/rc.pbs > /dev/console 2>&1
The file /etc/rc.pbs delays the startup of pbs_mom:
#!/bin/sh
#
# Portable Batch System (PBS) startup
#
# On Digital UNIX, pbs_mom fills up the mom_logs directory
# within minutes after reboot. Try to sleep at startup
# in order to avoid this.
PBSDIR=/usr/local/sbin
if [ -x ${PBSDIR}/pbs_mom ]; then
        echo PBS startup.
        # Sleep for a while
        sleep 120
        ${PBSDIR}/pbs_mom               # MOM
        echo Done.
else
        echo Could not execute PBS commands !
fi
Queues defined above do not work until you start them:
qstart default small medium long verylong
qenable default small medium long verylong
This needs to be done only once, at the time when you install PBS.
Make sure that the PBS server has all nodes correctly defined. Use the pbsnodes -a command to list all nodes.
Add nodes using the qmgr command:
# qmgr
Max open servers: 4
Qmgr: create node node99 properties=ev67
where the node-name is node99 with the properties=ev67. Alternatively, you may simply list the nodes in the file /var/spool/PBS/server_priv/nodes:
server:ts ev67
node99 ev67
The :ts indicates a time-shared node; nodes without :ts are cluster nodes where batch jobs may execute. The second column lists the properties that you associate with the node. Restart the pbs_server after manually editing the nodes file.
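The nodes file format described above can be checked with a short filter that lists the cluster nodes, optionally restricted to one property. The function name cluster_nodes is hypothetical, and skipping blank lines and lines starting with "#" is a defensive assumption, not something this HowTo states about the file format.

```shell
#!/bin/sh
# Hypothetical helper (not part of PBS): read a nodes file on stdin and
# print the names of cluster nodes, i.e. entries without the :ts suffix.
# If a property is given as the first argument, only nodes that list
# that property are printed.
cluster_nodes() {
    awk -v prop="${1:-}" '
        NF == 0 || $1 ~ /^#/ { next }   # assumption: skip blanks and comments
        $1 ~ /:ts$/ { next }            # skip time-shared nodes
        {
            if (prop == "") { print $1; next }
            for (i = 2; i <= NF; i++)
                if ($i == prop) { print $1; next }
        }
    '
}

# Example (path from the text above):
#   cluster_nodes ev67 < /var/spool/PBS/server_priv/nodes
```

Comparing this list with the output of pbsnodes -a is one way to spot a node that was left out of the file.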
After you first set up your system, you need to set the server scheduling attribute to true to get the jobs to actually run. This will normally be done for you at boot time (see point 5 in this file), but this first time you will need to do it by hand using the qmgr command:
# qmgr
Max open servers: 4
Qmgr: set server scheduling=true