SGE Cluster Documentation

The Duke Mathematics Sun Grid Engine Cluster consists of 16 Xeon E1650v3 hexa-core 3.5 GHz systems, each with 64GB of RAM, connected via Gigabit Ethernet. These machines are named grid1.math.duke.edu through grid16.math.duke.edu.

The head node, grid.math.duke.edu, is similarly configured but should not be used for computational jobs. Jobs should be submitted from grid.math.duke.edu via the qsub command described below.

You can also log in directly to one of the grid?.math.duke.edu machines, but the preferred method is the qsub command, since the scheduler makes the best use of our resources for the largest number of users.

You might also want to look at Mike Gratton's Brief Sun Grid Engine Guide which provides some helpful scripts and another tutorial on using the Sun Grid Engine.

Preliminary Comments

SGE can be difficult to install and configure, but if you log in to grid.math.duke.edu, everything you need is already preconfigured and ready to run.

The following is an excerpt from Jeffrey B. Layton's page formerly at http://docs.warewult-cluster.org/contrib/sge.html, with minor cosmetic and local changes (the original page is no longer available; the link points to an archived version of the page).

Using SGE

Now that we have SGE installed and configured, let's test it. We'll start with a very simple example that just runs the date command on a single node. Create a job script called sge-date.job containing the following:

#!/bin/bash
#$ -cwd
/bin/date

Lines beginning with #$ are special comments that are passed to SGE as options. Let's look at what this script will do:

  • #!/bin/bash : The first line tells SGE to run the job script using the bash shell.
  • #$ -cwd : The second line tells SGE to run the job in, and write its output files to, the directory from which you submitted it.
  • /bin/date : The third line is the actual command to be run. In this case, it's the date command.
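As a sketch, you can create this script straight from the shell with a here-document (the filename matches the one used above; the quoted delimiter keeps the shell from expanding anything inside):

```shell
# Create the sge-date.job script with a here-document.
cat > sge-date.job <<'EOF'
#!/bin/bash
#$ -cwd
/bin/date
EOF

# Sanity check: exactly one #$ directive line should be present.
grep -c '^#\$' sge-date.job    # prints 1
```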

Now let's run this command by submitting the job script to SGE.

grid{sge}2: qsub sge-date.job
your job 2 ("sge-date.job") has been submitted

Notice that you can (and should) run this job script as an ordinary user. After you submit the job, SGE assigns it a unique Job ID; in this case, the Job ID is 2. When the job is done, SGE creates two files in the directory from which you submitted the job. For this job, the first one, sge-date.job.e2, contains any error messages from SGE and/or the job; this is where you look for problems if your job fails. The second file, sge-date.job.o2, contains the output from the job (things written to stdout).

The qsub command submits a job to SGE. There are several options you can use when submitting a job; see the qsub man page for details.
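Since every #$ line is just an embedded qsub option, you can see at a glance what a script will request by extracting those lines. A small sketch (demo.job is a throwaway name used only for this illustration):

```shell
# Create a small job script to inspect.
cat > demo.job <<'EOF'
#!/bin/bash
#$ -cwd
#$ -j y
/bin/date
EOF

# Print the embedded SGE options (everything after "#$ " on each directive line).
sed -n 's/^#\$ //p' demo.job    # prints: -cwd, then -j y
```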

SGE tools

There are various commands for submitting your jobs to SGE, tracking their status, and manipulating them. Let's briefly go over these commands.

qstat

This command shows the status of SGE and of the jobs that are running or waiting to run (queued). Let's explore what qstat can do for us. Create a simple job script, called sleeper.sh, that does nothing but sleep for 60 seconds and then exit.

#!/bin/bash
#$ -cwd
sleep 60

While this job does nothing useful, it lets us see qstat output for several jobs at once. So let's submit six copies of this job in quick succession and run qstat. Here's the output from my cluster.

grid{sge}4: qsub sleeper.sh
...
grid{sge}9: qsub sleeper.sh
grid{sge}10: qstat
job-ID  prior name       user         state submit/start at     queue      master  ja-task-ID
---------------------------------------------------------------------------------------------
      5     0 sleeper.sh laytonj      r     02/18/2004 18:55:11 admin1.q   MASTER
      4     0 sleeper.sh laytonj      r     02/18/2004 18:55:11 admin2.q   MASTER
      6     0 sleeper.sh laytonj      r     02/18/2004 18:55:11 admin3.q   MASTER
      7     0 sleeper.sh laytonj      r     02/18/2004 18:55:11 admin4.q   MASTER
      8     0 sleeper.sh laytonj      qw    02/18/2004 18:55:05
      9     0 sleeper.sh laytonj      qw    02/18/2004 18:55:17


Because each node has its own queue, qstat output can be a bit confusing for parallel jobs. In the example above, four jobs are running (numbers 4 through 7). Jobs 8 and 9 are waiting to run, as indicated by the qw (queued, waiting) state.
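Since the state is the fifth column of qstat's listing, a quick awk pipeline can count your waiting jobs. The sketch below runs against a captured sample of the listing above; on the cluster you would feed qstat straight into the same pipeline (qstat | awk '$5 == "qw"' | wc -l):

```shell
# Sample qstat lines (the state is field 5: "r" running, "qw" waiting).
sample='      5     0 sleeper.sh laytonj      r     02/18/2004 18:55:11 admin1.q   MASTER
      8     0 sleeper.sh laytonj      qw    02/18/2004 18:55:05
      9     0 sleeper.sh laytonj      qw    02/18/2004 18:55:17'

# Keep only lines whose 5th field is "qw", then count them.
printf '%s\n' "$sample" | awk '$5 == "qw"' | wc -l    # prints 2
```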

You can get more information by adding the -f option to qstat. Here's the same situation shown with qstat -f.


$ qstat -f
queuename            qtype used/tot. load_avg arch      states
----------------------------------------------------------------------------
admin1.q             BIP   1/1       0.00     glinux
     11     0 sleeper.sh laytonj      r     02/18/2004 18:57:41 MASTER
----------------------------------------------------------------------------
admin2.q             BIP   1/1       0.00     glinux
     10     0 sleeper.sh laytonj      r     02/18/2004 18:57:41 MASTER
----------------------------------------------------------------------------
admin3.q             BIP   1/1       0.00     glinux
     12     0 sleeper.sh laytonj      r     02/18/2004 18:57:41 MASTER
----------------------------------------------------------------------------
admin4.q             BIP   1/1       0.01     glinux
     13     0 sleeper.sh laytonj      r     02/18/2004 18:57:41 MASTER

############################################################################
 - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS - PENDING JOBS
############################################################################
     14     0 sleeper.sh laytonj      qw    02/18/2004 18:57:28
     15     0 sleeper.sh laytonj      qw    02/18/2004 18:57:28

The output is fairly straightforward, and after a little practice interpreting it becomes second nature.

qdel

This command allows you to delete a job from SGE, for example one you submitted by mistake or one you want to stop while it is running. Find the Job ID with the qstat command, then type:

grid{sge}11: qdel JOB_ID

where JOB_ID is the Job ID for that particular job. You can also list several Job IDs at once, and qdel -u USERNAME deletes all of that user's jobs.

qmon

SGE includes a GUI called qmon. It can be used by the SGE administrator to administer SGE, and by users to submit jobs. Try the qmon command and explore what it can do (try the various buttons).

qhost

The qhost command gives you the status of the nodes managed by SGE. Here is an example from my cluster.


$ qhost
HOSTNAME             ARCH       NPROC  LOAD   MEMTOT   MEMUSE   SWAPTO   SWAPUS
-------------------------------------------------------------------------------
global               -              -     -        -        -        -        -
admin1               glinux         1  0.00   495.5M     6.2M   515.8M      0.0
admin2               glinux         1  0.01   242.3M     5.9M   515.8M      0.0
admin3               glinux         1  0.00   495.5M     6.3M   515.8M      0.0
admin4               glinux         1  0.01   242.3M     5.4M   517.7M      0.0


Notice that the load on the nodes is zero (nothing is running). The output lists the number of CPUs per node (NPROC), the total memory available on the node (MEMTOT), the memory in use (MEMUSE), the swap space available (SWAPTO), and the used swap space (SWAPUS).
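As a sketch, the NPROC column (field 3) can be summed with awk to get the cluster's total CPU count; the global summary line carries only "-" placeholders and is skipped. Demonstrated below against a captured sample of the qhost data above; on the cluster you would pipe qhost into the same awk program:

```shell
# Sample qhost data lines (NPROC is field 3; the "global" line has "-").
hosts='global               -              -     -        -        -
admin1               glinux         1  0.00   495.5M     6.2M
admin2               glinux         1  0.01   242.3M     5.9M'

# Sum field 3 over the real hosts, skipping the "-" placeholder.
printf '%s\n' "$hosts" | awk '$3 != "-" { n += $3 } END { print n }'    # prints 2
```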

Sample Job Scripts

I've seen some very complicated job scripts (mostly for PBS). They become overly complicated very quickly and are then difficult to understand and edit. I'm going to give you some simple scripts that you can use for your SGE jobs. I didn't write these, but I trust the people who did (I've tested them and they work just fine on my cluster).

I'm going to present a simple script for serial jobs. That is, jobs that only run on a single node. Then I'll present a sample script for MPI jobs for MPICH, and then a sample script for LAM-MPI jobs. Finally, I'll present a sample script for PVM jobs.

Serial Jobs

#!/bin/bash
#
# Set the name of the job.
#$ -N sge-date-run
#
# Make sure that the .e and .o files arrive in the
# working directory
#$ -cwd
#
# Merge standard output and standard error into one file
#$ -j y
#
# My code is re-runnable
#$ -r y
#
# The max walltime for this job is 31 minutes
#$ -l h_rt=00:31:00

(Program command here)

Recall that the #$ symbol combination marks an SGE option inside the script. The same options can also be given on the qsub command line (for example, qsub -l h_rt=00:31:00 myjob.sh), where they override the corresponding #$ directives.

MPI: MPICH

#!/bin/sh
# 
# EXAMPLE MPICH SCRIPT FOR SGE
# To use, change "MPICH_JOB", "NUMBER_OF_CPUS" 
# and "MPICH_PROGRAM_NAME" to real values. 
#
# Your job name 
#$ -N MPICH_JOB
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for MPICH. Set your number of processors here.
#$ -pe mpich NUMBER_OF_CPUS 
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.
#

echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/machines

#
# Use full pathname to make sure we are using the right mpirun
#

/usr/bin/mpirun -np $NSLOTS \
-machinefile $TMPDIR/machines MPICH_PROGRAM_NAME

#
# Commands to do something with the data after the
# program has finished.
#

You will have to change the mpirun path to match where it is installed on your system.

MPI: LAM-MPI

#!/bin/sh
# 
# EXAMPLE LAM SCRIPT FOR SGE
# To use, change "LAM_JOB", "NUMBER_OF_CPUS" 
# and "LAM_PROGRAM_NAME" to real values. 
#
# Your job name 
#$ -N LAM_JOB
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for LAM. Set your number of processors here. 
#$ -pe lam NUMBER_OF_CPUS
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.

echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/hostfile

#
# This MUST be in your LAM run script, otherwise
# multiple LAM jobs will NOT RUN 

export LAM_MPI_SOCKET_SUFFIX=$JOB_ID.$JOB_NAME

#
# Use full pathname to make sure we are using the right mpirun

/usr/mpi/lam/bin/mpirun -np $NSLOTS LAM_PROGRAM_NAME

#
# Commands to do something with the data after the
# program has finished.
#

You will have to change the mpirun path to match where it is installed on your system.

PVM

#!/bin/sh
# 
# EXAMPLE PVM SCRIPT FOR SGE
# To use, change "PVM_JOB", "NUMBER_OF_CPUS" 
# and "PVM_PROGRAM_NAME" to real values. 
#
# Your job name 
#$ -N PVM_JOB
#
# Use current working directory
#$ -cwd
#
# Join stdout and stderr
#$ -j y
#
# pe request for PVM. Set your number of processors here. 
#$ -pe pvm NUMBER_OF_CPUS
#
# Run job through bash shell
#$ -S /bin/bash
#
# The following is for reporting only. It is not really needed
# to run the job. It will show up in your output file.

echo "Got $NSLOTS processors."
echo "Machines:"
cat $TMPDIR/hostfile

#
# This MUST be in your PVM run script, otherwise
# PVM jobs will NOT RUN

export PVM_VMID=$JOB_ID.$JOB_NAME

#
# Run the PVM program:

PVM_PROGRAM_NAME

#
# Commands to do something with the data after the
# program has finished.
#

Parting Comments

SGE is a very powerful scheduling/queuing system. It has many options to help effectively use all of your resources (your nodes). Take a little bit of time to look at the man pages. The SGE mailing list is also very friendly. Don't hesitate to post there and ask questions. Then when you become an expert you can help others.

Acknowledgements

I want to thank Greg Kurtzer for his hard work with Warewulf and for packaging SGE so effectively. I also want to thank him for his answers to my silly questions about SGE as I started to learn it. Finally, I want to thank Doug Eadline for his help with SGE and some of the sample scripts.

Copyright, Jeffrey B. Layton, 2004.