Submitting jobs¶
All jobs run on the HPC systems should be submitted via the queuing system,
Univa Grid Engine.
To submit a job, write a script that describes the resources your job needs
and how to run it; the script is then submitted via the `qsub` command.
Common commands used to manipulate submitted jobs are:

- `qstat` - checks the status of submitted jobs (man page).
- `qhold` - holds a job from running (man page).
- `qrls` - releases a job from a held status (man page).
- `qdel` - deletes a job from the queue, or a specific task from an array (man page).
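As an illustration (using a hypothetical job ID of 123456), a typical sequence might look like:

```bash
qstat             # check the status of your submitted jobs
qhold 123456      # hold job 123456 so it will not be scheduled
qrls 123456       # release the held job
qdel 123456       # delete the job from the queue
```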
Please also see our video which describes the job submission process, checking node and queue statuses, exit codes and job optimisation (cores, memory and runtime).
Job scripts¶
Writing your job scripts with a text editor¶
We recommend that you write your job scripts using a text editor installed on the cluster, such as:
- vim
- emacs
- nano (preferred by beginners - requires `module load nano` to use)
Job scripts written on Windows
Job scripts transferred from a Windows PC may contain invisible control characters (such as Windows line endings) that result in job submission failures, and require conversion before use on the cluster.
To open a file called `example.sh`, run `vim example.sh` or `nano example.sh`.
If the file exists in the current directory, it will be opened for editing,
otherwise a new, empty file will be created.
Simple job script¶
This is a simple job script requesting a single core, 1GB of RAM and 1 hour runtime:
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -pe smp 1 # Request 1 core
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM
./code
This can then be run with `qsub`:
qsub example.sh
The main purpose of a job script is to request the required resources and set up the environment. The main resources to be requested on the cluster are cores, memory and runtime.
Don't forget to load the application you need using the `module` command.
Array jobs
Note that if you intend to submit multiple similar jobs, you should submit them as an array instead.
This reduces load on the scheduler and streamlines the job submission process. Please see the Arrays section for more details.
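A minimal sketch of an array job (the task range and input file name are illustrative only) might look like:

```bash
#!/bin/bash
#$ -cwd               # Set the working directory for the job to the current directory
#$ -pe smp 1          # Request 1 core per task
#$ -l h_rt=1:0:0      # Request 1 hour runtime per task
#$ -l h_vmem=1G       # Request 1GB RAM per task
#$ -t 1-10            # Run 10 tasks; each task receives its own SGE_TASK_ID

./code --input data_${SGE_TASK_ID}.txt
```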
Resource specification¶
Resources are specified at the top of the file using the `#$ <resource>` syntax. As a minimum you should specify the following in your job scripts:

- The wallclock runtime of your job, using `-l h_rt`. This defaults to 1 hour (see below).
- The amount of memory your job requires, using `-l h_vmem`.
Grid Engine supports the concept of "soft requests", which the scheduler will attempt to meet where possible; in general these should not be used here.
Memory requests¶
For all jobs requesting `-pe smp`, memory requests are handled per core, so
requesting 2 cores and 5G would result in 2 cores and 10G.

So a request of:

#$ -l h_vmem=5G

would result in 5G being allocated, whereas a request of:

#$ -pe smp 2
#$ -l h_vmem=5G

would result in 10G being allocated. If your job consumes more than this
amount, it will be killed by the scheduler.
If you fail to specify memory units, your job submission will fail with an error. This is to avoid accidentally requesting just 1 byte for your job, which is how a request without units would be interpreted.
GPU job Memory
Do not request more than 11G of memory per core (`-l h_vmem=11G`) for GPU
jobs, as it could lock out other users from using free GPUs.
SMP job Memory
With SMP jobs, if you don't specify memory with `-l h_vmem` you will get
a default value of 1GB per core requested.
Parallel job Memory
For parallel jobs, each node is booked for exclusive use and uses all available memory on each node: you do not need to specify a memory requirement.
Jobs requiring large amounts of RAM
If you have jobs with very large RAM requirements you may want to make use
of our public highmem nodes. You will need to pass the `-l highmem`
option in the job script. Jobs which do not require very
large amounts of memory should be submitted to the regular nodes by not
adding the `-l highmem` option.
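As an illustration (the memory figure is an arbitrary example, not a recommendation), a highmem request might look like:

```bash
#$ -pe smp 2
#$ -l h_rt=24:0:0
#$ -l h_vmem=128G     # illustrative figure: 2 cores x 128G = 256G total
#$ -l highmem         # place the job on a highmem node
```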
Requesting exclusive use of a node
Note that available memory on a node will be slightly lower than the
theoretical maximum, due to memory consumed by Operating System processes.
If your job requires a whole node, omitting the `-l h_vmem` option and
choosing `-l exclusive` will request a full node and ensure that the
maximum amount of available memory is used by your job. Note that the
queuing time for a whole node is considerably longer than with small jobs.

If requesting exclusive use of a node, please ensure that you correctly set
slots to fill the node. Nodes booked for exclusive use do not need memory
requirements specified.
Requesting exclusive access may cause your job to queue for a long period because the scheduler needs to allocate all resources in a single node. As exclusive jobs block other jobs from running on the same node concurrently, please ensure the jobs are utilising all cores and memory correctly.
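A minimal sketch of an exclusive request is shown below (the 32-core slot count is an assumption; set it to match the node type you are targeting):

```bash
#$ -pe smp 32         # assumed core count: set slots to fill the node
#$ -l exclusive       # request the whole node; no h_vmem required
#$ -l h_rt=24:0:0
```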
Job runtimes¶
Runtime, signified by `h_rt` in the job script, defines the maximum length
of time a job is allowed to run for. If the `h_rt` value is exceeded, the job
will be killed automatically by the scheduler.
While the number of cores and RAM quantity will impact the queuing time significantly, the maximum runtime has a very low impact on the queuing time. Since the job scheduler will kill a job that exceeds the runtime value, for most users the best thing to do is specify either:
- 1 hour (to take advantage of the high priority short queue; this is also the default runtime if `h_rt` is not specified)
- 240 hours (10 days - the maximum runtime)
There are some edge cases that could mean a job with an `h_rt` value of 1 or 3
days will get queued ahead of a 10 day job, but these usually relate to
situations where we have reserved resources at a future date (e.g. maintenance
periods). There may also be some small benefit for users who run large
multi-node parallel jobs, since we commonly reserve resources for
large jobs, but the main thing to remember is that you don't want your job
to die due to some arbitrary limit you have artificially set, so most users
should just set 240 hours, unless they want to access the short queue.
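For reference, the two corresponding directives are:

```bash
#$ -l h_rt=1:0:0      # 1 hour: eligible for the high priority short queue
#$ -l h_rt=240:0:0    # 240 hours: the maximum runtime
```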
The 240 hour limit is a global setting, and cannot be changed for individual jobs or users. If you are submitting long running jobs, you should consider:
- Attempting to parallelise the job
- Considering whether the job can be broken into smaller parts
- Profiling the code to check for bottlenecks
- Implementing checkpointing (a method of regularly dumping the job's state so that it can be restarted - check if your application supports this)
Queues¶
Users should not specify a queue or project when submitting jobs: the scheduler is configured to assign jobs correctly to nodes, with preference for restricted nodes if you have access to them.
High Priority Short Queue¶
For short jobs of up to one hour, we have created a high priority short queue to run these jobs with minimal queuing time, which is useful for job script validation, code syntax checking and resource optimisation. The short queue has been enabled on a variety of nodes, including those which have been purchased by other research groups that are not available to all users for jobs longer than 1 hour.
An example use case for the short queue is to check that a job starts running correctly, modules have been loaded, the environment set, execution commands are found and their arguments can be understood, all in a short period of time. If there was an error and you submitted with more than 1 hour, it could be a few hours or days before the job starts running and immediately fails due to a typo or incorrect settings.
If you would like to do some quick debugging or testing, run `qlogin` without
any parameters to start an interactive job on the short queue, since a default
value of 1 hour is applied if you do not specify an `h_rt` value.
Environment specification¶
The job script specifies the environment the job will execute in; this includes loaded modules and environment variables.
A number of environment variables are also made available by UGE:
| Variable | Description |
|---|---|
| `SGE_CWD_PATH` | Working directory |
| `SGE_O_WORKDIR` | Working directory when submitted |
| `TMPDIR` | Job specific temporary directory |
| `JOB_NAME` | Job Name |
| `QUEUE` | Queue |
| `PE` | Parallel Environment |
| `NSLOTS` | Number of slots |
| `NHOSTS` | Number of hosts |
| `SGE_HGR_m_mem_free` | Total memory requested (slots * mem) |
| `SGE_HGR_gpu` | ID of GPU granted |
| `SGE_BINDING` | Cores job is bound to on host |
| `SGE_TASK_FIRST` | First task in array |
| `SGE_TASK_LAST` | Last task in array |
| `SGE_TASK_STEPSIZE` | Array step size |
| `SGE_TASK_ID` | Current Task ID in array |
| `PE_HOSTFILE` | Location of host file for MPI |
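As an illustrative (and hypothetical) example, a job script could report its allocation and use the job-specific temporary directory like this:

```bash
echo "Job ${JOB_NAME} running on ${NHOSTS} host(s) with ${NSLOTS} slot(s)"
echo "Bound to cores: ${SGE_BINDING}"

# use the job-specific temporary directory for scratch files
cd ${TMPDIR}
```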
Additional information for these options can be found on the relevant specific pages.
Other options¶
| Option | Description |
|---|---|
| `-pe` | request a parallel environment - this will be either `smp` to request a number of cores on the same node, or `parallel` to request more than one whole machine. |
| `-l exclusive` | requests exclusive use of a serial node (automatically applied to parallel nodes). |
| `-m e` | send an email once the job is completed. You can alternatively use `-m bea` to get notifications on job start, finish and abortion. Please avoid sending mail for array jobs, as the resulting quantity of messages tends to cause email service disruption. |
| `-M email@example.com` | specify an alternate destination address for emails. |
| `-o` | redirect the standard output from the job. |
| `-e` | redirect the standard error from the job. |
| `-cwd` | execute the job from the current working directory. |
| `-V` | forward the current environment to the context of the job (note that `LD_LIBRARY_PATH` does not get passed to the job for security reasons). Although useful for interactive sessions, it should be avoided for batch submissions as it may cause module conflicts and path issues. |
| `-j y` | combine the standard output and the standard error stream. |
| `-N` | give a specific name to the job. If this parameter is not specified, the job name will be the same as the name of the script being run. |
Job Names
Note the following characters are not allowed in job names: ",/:'\[]{}|()@%,` and whitespace.
For full details on `qsub` options see the `qsub` man page.
Job output¶
Output from the job will include an output and an error file unless the `-j y`
option is given to join the streams. The default naming scheme for these files
is `<jobname>.o<jobid>` and `<jobname>.e<jobid>`, and they will contain all text
sent to standard output and standard error respectively.

For array jobs the filenames will also include the task ID, such as
`<jobname>.o<jobid>.<taskid>`
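For example, a job submitted from example.sh with a hypothetical job ID of 123456 would produce:

```
example.sh.o123456    # standard output
example.sh.e123456    # standard error
```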
Example job scripts¶
Serial job (single-core)¶
Jobs which do not support multi-threading should use the single-core serial job script as seen below:
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y # Join stdout and stderr
#$ -pe smp 1 # Request 1 CPU core
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM / core, i.e. 1GB total
module load example
./code
Adjust the memory and runtime requests as applicable. Please refer to the memory and runtime sections for more information.
Serial job (multi-core)¶
A multi-core serial job should be used for jobs which can use multiple CPU cores on a single machine, such as those using OpenMP, and only those jobs. Requesting many cores for a job which cannot use them is wasteful.
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y # Join stdout and stderr
#$ -pe smp 4 # Request 4 CPU cores
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM / core, i.e. 4GB total
module load example
./code --threads ${NSLOTS}
Please check the application documentation for a threading option (common
options include but are not limited to: `--threads`, `-t`, `--cores`,
`--multicore`, `--parallel` and `-p`). We recommend using the value of
`NSLOTS` to reference the number of cores requested rather than a hard-coded
value, to ease the process when scaling up your job.
If you are running an application which supports OpenMP, you should check
that the `OMP_NUM_THREADS` variable has been set correctly to the value of
`NSLOTS`, otherwise your application may run with poor performance. Some
application modules may automatically set this if unset when loaded. Check the
application documentation pages for more information.
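As a minimal sketch, the following lines could be added to the multi-core job script above to set the value (if it is not already set) before launching the code:

```bash
# set OMP_NUM_THREADS from the requested slot count if the module has not already done so
export OMP_NUM_THREADS=${OMP_NUM_THREADS:-${NSLOTS}}

./code
```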
Parallel job¶
Jobs which require multi-node parallel processing, such as those which use MPI, are run on nodes with a low-latency infiniband connection. Your submission script will need to include the choice of infiniband-connected nodes to use.
mpirun slots
Using the `$NSLOTS` variable will automatically pass the number of
requested cores to the `mpirun` command.
Exporting variables in multi-node MPI jobs
When using MPI jobs across multiple nodes you may need to explicitly export environment variables to each of the MPI processes, such as when using libraries which are not found by default. The way to do this is explained in the documentation for the specific MPI implementation you are using.

For example, to export the environment variable `LD_LIBRARY_PATH` using Open MPI you should use:

mpirun -x LD_LIBRARY_PATH ...

With Intel MPI environment variables are usually exported by default, but when they are not, the corresponding option would look like:

mpirun -genvlist LD_LIBRARY_PATH ...
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y
#$ -pe parallel 96 # Request 96 cores/2 ddy nodes
#$ -l infiniband=ddy-i # Choose infiniband island (ddy-i)
#$ -l h_rt=240:0:0 # Request 240 hours runtime
module load intelmpi
mpirun -np $NSLOTS ./code
Single-node jobs on infiniband nodes
Only run multi-node jobs in the `parallel` pe; single-node jobs should be
run in the `smp` pe so they don't block larger parallel jobs from running.
Large parallel jobs
If your job is requesting 120 or more cores, your job might queue for a long period of time and may benefit from running on a Tier 2 cluster instead of Apocrita. If you are unsure about your eligibility or have any questions about Tier 2 facilities, please contact us.
Interactive jobs¶
The `qlogin` command can be used to schedule an interactive job. This accepts
the same arguments as `qsub` (e.g. for requesting runtime). Instead of running a
script, this presents a shell which can be used to run interactive commands on
the node.
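For example (the resource values are purely illustrative), an interactive session with two cores and 2GB of RAM per core could be requested with:

```bash
qlogin -pe smp 2 -l h_vmem=2G -l h_rt=1:0:0
```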
If you require X forwarding inside your interactive job, firstly pass the
`-X` option to your SSH command when connecting to the cluster. Within your
`qlogin` session, the `$DISPLAY` environment variable can be used to display
graphical windows on your machine.
Interactive sessions should be used sparingly (for interactive debugging, etc.)
since they use resources which could be used by batch jobs. You must have an
active connection to the cluster throughout the entirety of an interactive
job. If your connection drops, the session will end and you may lose your
results. Batch jobs submitted with qsub
do not require an active
connection and will continue to run whilst you are not connected to the
cluster.
At busy times you may not be able to get an interactive session as there may be no spare cores to immediately service your request; however, you will increase your chances considerably if you choose the default runtime of 1 hour.
Node selection¶
In some situations, you may require your job to run on a certain node type
(for example to select a certain CPU architecture or GPU type). All node
selections are in the format `-l <complex>[=<value>]`, where the value is only
required for non-boolean complexes; the value can be either omitted or set
to `TRUE` or `true` when requesting boolean complexes. See the below table
for a list of supported node selections:
| Complex Name | Description | Supported Values |
|---|---|---|
| `cpuarch` | CPU architecture | `intel` and `amd` |
| `exclusive` | Request entire nodes | (boolean) |
| `gpu` | Request nodes with GPU support | (boolean) |
| `gpu_type` | Run on specific GPU types | `ampere` and `volta` |
| `gpuhighmem` | GPU nodes with a large amount of RAM | (boolean) |
| `highmem` | Nodes with a large amount of RAM | (boolean) |
| `infiniband` | Parallel jobs within an InfiniBand island | (node type) |
| `node_type` | Run on a specific node type | (node type) |
| `owned` | Run on owned / restricted nodes (if eligible) | (boolean) |
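For illustration (the chosen values are examples only), these selections appear in a job script as follows:

```bash
#$ -l cpuarch=amd     # non-boolean complex: a value is required
#$ -l owned           # boolean complex: the value may be omitted
```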
Job holds¶
Jobs can be held from running via the use of `qhold` or `qsub` with the
`-hold_jid` argument. Holding a job means it remains in the queue but will not
be considered for scheduling until the hold is removed or the requirement is
met. This allows users and administrators greater control over queued jobs.
In the event of issues with a job, an administrator may hold the job until the issue is resolved. Users are unable to remove these holds but they may still delete the job if required.
Jobs can be held with the following commands:
# Hold a job
qhold <job_id>
# Release a job
qrls <job_id>
Queuing dependent jobs¶
To hold a job until a preceding job has completed, `-hold_jid` can be used.
# Preceding job:
$ qsub job_one.sh
$ qstat
500 0.00000 job_one abc123 qw 09/19/2020 10:11:35 1
# Held job
$ qsub -hold_jid 500 job_two.sh
$ qstat
500 0.00000 job_one abc123 qw 09/19/2020 10:11:35 1
501 0.00000 job_two abc123 hqw 09/19/2020 10:12:30 1
Once the first job completes the second job will be released for scheduling. This enables processing steps to be broken down into separate jobs. An example where this may be useful is within a pipeline, where a subsequent stage requires output from an earlier step, so this ensures a linear progression.
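Dependencies can also be scripted; a minimal sketch using `qsub -terse` (which prints just the new job's ID) might look like:

```bash
# submit the first step and capture its job ID
JOB_ONE=$(qsub -terse job_one.sh)

# submit the second step, held until the first job has completed
qsub -hold_jid ${JOB_ONE} job_two.sh
```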
Further information¶
- The man pages on the cluster systems give information on the queuing system and MPI functions.