Submitting Jobs

All jobs run on the HPC systems must be submitted via the queuing system, Univa Grid Engine. To submit a job, write a script describing the resources your job needs and how to run it, then submit the script with the qsub command.

Job Scripts

This is a simple job script requesting a single core, 1GB of RAM and 24 hours of runtime:

#!/bin/sh
#$ -cwd           # Set the working directory for the job to the current directory
#$ -pe smp 1      # Request 1 core
#$ -l h_rt=24:0:0 # Request 24 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM
./code

This can then be run with qsub:

qsub example.sh

The main purpose of a job script is to request the required resources and set up the environment. The primary resources to request on the cluster are cores, memory and runtime.

Don't forget to load the application you need using the module command.
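
For example, to list the available modules and load one before running your code (the module name here is illustrative):

# List available application modules
module avail
# Load a module (replace "example" with the application you need)
module load example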

Array jobs

Note that if you intend to submit multiple similar jobs, you should submit them as an array instead.

This reduces load on the scheduler and streamlines the job submission process. Please see the Arrays section for more details.
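
As a minimal sketch, an array job uses the -t option to define the task range, and each task reads $SGE_TASK_ID to select its work (the input file naming here is an assumption for illustration):

#!/bin/sh
#$ -cwd           # Set the working directory for the job to the current directory
#$ -pe smp 1      # Request 1 core per task
#$ -l h_rt=1:0:0  # Request 1 hour runtime per task
#$ -l h_vmem=1G   # Request 1GB RAM per task
#$ -t 1-10        # Run 10 tasks, numbered 1 to 10
./code input.$SGE_TASK_ID  # Each task processes its own input file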

Resource Specification

Resources are specified at the top of the file using the #$ <resource> syntax. As a minimum you should specify the following in your job scripts:

  • The wallclock runtime of your job, using -l h_rt. This defaults to 1 hour (see below).
  • The amount of memory your job requires, using -l h_vmem.

Grid Engine supports the concept of "soft requests", which the scheduler will attempt to meet where possible; in general these should not be used here.

Memory Requests

For single-core or smp jobs, memory requests are handled per core, so requesting 2 cores and 5G per core results in 2 cores and 10G in total.

So a request of:

#$ -l h_vmem=5G

would result in 5G being allocated, whereas a request of:

#$ -pe smp 2
#$ -l h_vmem=5G

would result in 10G being allocated. If your job consumes more than this amount, it will be killed by the scheduler.

If you fail to specify memory units, your job submission will fail with an error. This is to avoid accidentally requesting 1 byte for your job, which would be the default if units were not specified.
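
For example, the following requests include explicit units:

#$ -l h_vmem=512M  # 512 megabytes per core
#$ -l h_vmem=2G    # 2 gigabytes per core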

Parallel Job Memory

For parallel jobs, each node is booked for exclusive use and the job has access to all available memory on each node: you do not need to specify a memory requirement.

Using all memory on a node

Note that the available memory on a node will be slightly lower than the theoretical maximum, due to memory consumed by Operating System processes. If your job requires a whole node, omitting the -l h_vmem option and adding -l excl will request a full node and ensure that the maximum amount of available memory is available to your job. Note that the queuing time for a whole node is considerably longer than for small jobs.

Requesting exclusive use of a node

If requesting exclusive use of a node, please ensure that you correctly set slots to fill the node. Nodes booked for exclusive use do not need memory requirements specified.
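
As a sketch, assuming a 32-core node (check the core count of your chosen node type):

#!/bin/sh
#$ -cwd            # Set the working directory for the job to the current directory
#$ -pe smp 32      # Fill all cores on the node (assumed 32-core node)
#$ -l excl         # Request exclusive use of the node
#$ -l h_rt=24:0:0  # Request 24 hour runtime
./code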

Job Runtimes

Jobs have a default wallclock time limit of 1 hour. In general, if your jobs take more than a few days to run, you should implement checkpointing to enable the job to be restarted if necessary. It is recognised, however, that this may involve a lot of effort for jobs which only need slightly longer. It is therefore possible to request a longer runtime, up to 240 hours, using -l h_rt=240:0:0 in the job script. The 240 hour limit is a global setting and cannot be changed for individual users; it is necessary to allow us to perform maintenance on the cluster with 10 days' notice.

Please note however:

  • Jobs requesting more than 72 hours may not be scheduled as quickly as those requesting less.
  • Jobs are automatically killed once the wallclock limit expires; this cannot be changed once a job is running.
  • Jobs running for longer than 5 days are at risk of being killed if the system urgently needs to be shut down for maintenance. In this case, we would try to contact affected users first.

If you have jobs which will need more than 240 hours, or regularly use more than 72 hours you should consider:

  • Attempting to parallelise the job
  • Considering whether the job can be broken into smaller parts
  • Profiling the code to check for bottlenecks
  • Implementing checkpointing (a method of regularly dumping the job's state so that it can be restarted)

Queues

Users should not specify a queue or project when submitting jobs: the scheduler is configured to assign jobs correctly to nodes, with preference for restricted nodes if you have access to them.

In addition to the primary queue, there is a queue designed to minimise waiting times for short jobs and interactive sessions. This short queue runs on restricted and public nodes and is automatically selected if your runtime request is 1 hour or less.

Environment Specification

The job script specifies the environment the job will execute in; this includes loaded modules and environment variables.

A number of environment variables are also made available by UGE:

Variable              Description
SGE_O_WORKDIR         Jobscript directory
SGE_CWD_PATH          Working directory
TMPDIR                Job-specific temporary directory
JOB_NAME              Job name
QUEUE                 Queue
PE                    Parallel environment
NSLOTS                Number of slots
NHOSTS                Number of hosts
SGE_HGR_m_mem_free    Total memory requested (slots * mem)
SGE_HGR_gpu           ID of the GPU granted
SGE_BINDING           Cores the job is bound to on the host
SGE_TASK_FIRST        First task in array
SGE_TASK_LAST         Last task in array
SGE_TASK_STEPSIZE     Array step size
SGE_TASK_ID           Current task ID in array
PE_HOSTFILE           Location of the host file for MPI
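
For example, a job script might use these variables to report basic job information and stage data into the job-specific temporary directory (the input file name is illustrative):

#!/bin/sh
#$ -cwd
#$ -pe smp 2
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

echo "Job $JOB_NAME running with $NSLOTS slots on $NHOSTS host(s)"
# Stage input into the job-specific temporary directory
cp input.dat $TMPDIR/
./code $TMPDIR/input.dat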


Other Options

Option                Description
-pe                   Request a parallel environment. This will usually be smp, to request a number of cores on the same node, or parallel, to request more than one whole machine.
-l exclusive=true     Request exclusive use of a node.
-m e                  Send an email once the job is completed. Alternatively, use -m bea to be notified when the job begins, ends or is aborted.
-M email@example.com  Specify an alternate destination address for emails.
-o                    Redirect the standard output from the job.
-e                    Redirect the standard error from the job.
-cwd                  Execute the job from the current working directory.
-V                    Forward the current environment to the context of the job (note that LD_LIBRARY_PATH does not get passed to the job for security reasons). Although useful for interactive sessions, this should be avoided for batch submissions as it may cause module conflicts and path issues.
-j y                  Combine the standard output and the standard error stream.
-N                    Give a specific name to the job. If this parameter is not specified, the job name will be the same as the name of the script being run.
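
Several of these options are commonly combined. For example (the job name, output file and email address are placeholders):

#!/bin/sh
#$ -cwd                  # Run from the current working directory
#$ -N myjob              # Name the job "myjob"
#$ -j y                  # Combine stdout and stderr
#$ -o myjob.log          # Redirect the combined output to this file
#$ -m bea                # Email when the job begins, ends or is aborted
#$ -M email@example.com  # Destination address for notifications
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
./code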

Job Names

Note that the following characters are not allowed in job names: " , / : ' \ [ ] { } | ( ) @ % and whitespace.

For full details on qsub options see the qsub man page.

Job Output

Output from the job will include an output and an error file unless the -j y option is given to join the streams. The default naming scheme for these files is <jobname>.o<jobID> and <jobname>.e<jobID> and will contain all text sent to the standard output (stdout) and standard error (stderr) respectively.

For jobs using a parallel environment, in addition to the stdout and stderr files there may be parallel environment stdout and stderr files. These have the default naming scheme of <jobname>.po<jobID> and <jobname>.pe<jobID>.

For array jobs the filenames will also include the task ID, e.g. <jobname>.o<jobID>.<taskID>
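
For example, an array job named example with job ID 1234 and three tasks would produce files named:

example.o1234.1
example.e1234.1
example.o1234.2
example.e1234.2
example.o1234.3
example.e1234.3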

Example Job Scripts

SMP Job

This should be used for jobs which can use multiple CPU cores on a single machine, e.g. using OpenMP.

#!/bin/sh
#$ -cwd           # Set the working directory for the job to the current directory
#$ -j y           # Join stdout and stderr
#$ -pe smp 4      # Request 4 CPU cores
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM / core, i.e. 4GB total

module load example

./code

Parallel Job

Jobs which require multi-node parallel processing, such as MPI, are run on nodes with a low-latency InfiniBand interconnect. Your submission script will need to specify which InfiniBand-connected nodes to use.

mpirun slots

Using the $NSLOTS variable will automatically pass the number of requested cores to the mpirun command.

#!/bin/sh
#$ -cwd                 # Set the working directory for the job to the current directory
#$ -j y
#$ -pe parallel 128     # Request 128 cores/4 nxv nodes
#$ -l infiniband=nxv    # Choose infiniband island (ccn nxn nxv)
#$ -l h_rt=24:0:0       # Request 24 hour runtime

module load intelmpi

mpirun -np $NSLOTS ./code

Please run only multi-node jobs in the parallel pe, since single-node jobs tend to block larger parallel jobs from running.

Interactive Jobs

The qlogin command can be used to schedule an interactive job. It accepts the same arguments as qsub (e.g. for requesting runtime). Instead of running a script, it presents a shell which can be used to run interactive commands on the node.
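
For example, to request an interactive session with 2 cores, 1GB of RAM per core and a 1 hour runtime:

qlogin -pe smp 2 -l h_vmem=1G -l h_rt=1:0:0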

It is also possible to run interactive jobs with X forwarding. To do this you should ssh -X to the headnode then use qsh to start the interactive job (or use qlogin but copy the $DISPLAY environment variable from the headnode).

Typically, interactive jobs should be used to troubleshoot issues or assist with setting up your batch job. They should not be used as an alternative to writing submission scripts for qsub, as this wastes computing resources that would otherwise be used by queued jobs. Please note that the headnodes are rebooted reasonably frequently, so interactive jobs should only be used for short tasks.

At busy times you may not be able to get an interactive session, as there may be no spare cores to immediately service your request; however, you will increase your chances considerably by choosing the default runtime of 1 hour.

Node Selection

In some situations, you may require your job to run on a certain node type (to select a certain CPU architecture, or to ensure the runtime is not exceeded by array tasks executing on different node types). Use -l cpuarch=intel or -l cpuarch=amd to force jobs to run only on nodes matching the requested architecture. Similarly, use -l node_type=dn (or sm/nxv) to force a node type. Please note that this will substantially increase queuing times, and should only be requested with good reason.
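
For example, to restrict an existing job script to AMD nodes at submission time:

qsub -l cpuarch=amd example.sh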

Job Holds

Jobs can be held from running via the use of qhold or qsub with the -hold_jid argument. Holding a job means it remains in the queue but will not be considered for scheduling until the hold is removed or the requirement is met. This allows users and administrators greater control over queued jobs.

In the event of issues with a job, an administrator may hold the job until the issue is resolved. Users are unable to remove these holds but they may still delete the job if required.

Jobs can be held with the following commands:

# Hold a job
qhold <job_id>

# Release a job
qrls <job_id>

Queuing Dependent Jobs

To hold a job until a preceding job has completed, -hold_jid can be used.

# Preceding job:
$ qsub job_one.sh
$ qstat
500 0.00000 job_one abc123       qw    09/19/2016 10:11:35 1
# Held job
$ qsub job_two.sh -hold_jid 500
$ qstat
500 0.00000 job_one abc123       qw    09/19/2016 10:11:35 1
501 0.00000 job_two abc123       hqw   09/19/2016 10:12:30 1

Once the first job completes, the second job will be released for scheduling; this enables processing steps to be broken down into separate jobs.

Further Information

  • The man pages on the cluster systems give information on the queuing system and MPI functions.
  • The MPI Specifications contain information on the MPI functions including examples and advice.
  • HPC Libraries