Submitting jobs

All jobs run on the HPC systems should be submitted via the queuing system, Univa Grid Engine. To submit a job it is necessary to write a script which describes the resources your job needs and how to run the job. This is then submitted via the qsub command.

Common commands used to manipulate submitted jobs are:

  • qstat checks the status of submitted jobs (man page).
  • qhold holds a job from running (man page).
  • qrls releases a job from a held status (man page).
  • qdel deletes a job from the queue, or a specific task from an array (man page).

Job scripts

Writing your job scripts with a text editor

We recommend that you write your job scripts using a text editor installed on the cluster, such as:

  • vim
  • emacs
  • nano (preferred by beginners - requires module load nano to use)

Job scripts written on Windows

Job scripts transferred from a Windows PC contain invisible control characters that result in job submission failures, and require conversion to use on the cluster.
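
The invisible characters are typically the carriage returns that Windows editors place at the end of each line. The following is a minimal sketch of detecting and stripping them with standard tools, assuming carriage returns are the only problem. It builds its own sample file in a scratch directory, and the name example.sh is purely illustrative.

```shell
#!/bin/bash
# Work in a scratch directory so that no real job script is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a sample script with Windows (CRLF) line endings.
printf '#!/bin/bash\r\n#$ -cwd\r\n./code\r\n' > example.sh

# Count the lines that contain a carriage return.
cr=$'\r'
before=$(grep -c "$cr" example.sh || true)

# Strip every carriage return, writing to a new file before replacing the original.
tr -d '\r' < example.sh > example.unix.sh && mv example.unix.sh example.sh

after=$(grep -c "$cr" example.sh || true)
echo "Lines with CR before: $before, after: $after"
```

If the dos2unix utility is available on your system, running dos2unix example.sh performs the same conversion.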

To open a file called example.sh, run vim example.sh or nano example.sh. If the file exists in the current directory, it will be opened for editing, otherwise a new, empty file will be created.

Simple job script

This is a simple job script requesting a single core, 1GB of RAM and a 1 hour runtime:

#!/bin/bash
#$ -cwd           # Set the working directory for the job to the current directory
#$ -pe smp 1      # Request 1 core
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM
./code

This can then be run with qsub:

qsub example.sh

The main purpose of a job script is to request the required resources and set up the environment. The main resources to be requested on the cluster are cores, memory and runtime.

Don't forget to load the application you need using the module command.

Array jobs

Note that if you intend to submit multiple similar jobs, you should submit them as an array instead.

This reduces load on the scheduler and streamlines the job submission process. Please see the Arrays section for more details.

Resource specification

Resources are specified at the top of the file using the #$ <resource> syntax. As a minimum you should specify the following in your job scripts:

  • The wallclock runtime of your job, using -l h_rt. This defaults to 1 hour (see below).
  • The amount of memory your job requires, using -l h_vmem.

Grid engine supports the concept of "soft requests" which the scheduler will attempt to meet where possible - in general these should not be used here.

Memory requests

For all jobs requesting -pe smp, memory requests are handled per core, so requesting 2 cores and 5G would result in 2 cores and 10G.

So a request of:

#$ -l h_vmem=5G

would result in 5G being allocated, whereas a request of:

#$ -pe smp 2
#$ -l h_vmem=5G

would result in 10G being allocated. If your job consumes more than this amount, it will be killed by the scheduler.
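
The per-core arithmetic above can be sketched as a small calculation. The function names total_mem_gb and per_core_gb are hypothetical helpers written for this illustration. They are not part of Grid Engine, and they assume whole-gigabyte values.

```shell
#!/bin/bash
# total_mem_gb CORES PER_CORE_GB
# Total memory allocated to an smp job: the h_vmem value is applied per core.
total_mem_gb() {
    local cores=$1 per_core_gb=$2
    echo $(( cores * per_core_gb ))
}

# per_core_gb CORES TOTAL_GB
# Smallest whole-gigabyte h_vmem value that provides at least TOTAL_GB overall.
per_core_gb() {
    local cores=$1 total_gb=$2
    echo $(( (total_gb + cores - 1) / cores ))   # integer ceiling division
}

echo "2 cores at h_vmem=5G -> $(total_mem_gb 2 5)G in total"
echo "10G across 4 cores   -> request h_vmem=$(per_core_gb 4 10)G"
```

The second helper works in the opposite direction: given the total memory an application needs, it suggests the per-core value to put in the job script.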

If you fail to specify memory units, your job submission will fail with an error. This is to avoid accidentally requesting 1 byte for your job, which would be the default if units were not specified.

SMP job Memory

With SMP jobs, if you don't specify memory with -l h_vmem you will get a default value of 1GB / core requested.

Parallel job Memory

For parallel jobs, each node is booked for exclusive use and uses all available memory on each node: you do not need to specify a memory requirement.

Jobs requiring large amounts of RAM

If you have jobs with very large RAM requirements you may want to make use of our public highmem nodes. You will need to pass the -l highmem option in the job script. Jobs which do not require very large amounts of memory should submit to the regular nodes by not adding the -l highmem option.

Requesting exclusive use of a node

Note that available memory on a node will be slightly lower than the theoretical maximum, due to memory consumed by Operating System processes. If your job requires a whole node, omitting the -l h_vmem option and choosing -l exclusive will request a full node and ensure that the maximum amount of available memory is used by your job. Note that the queuing time for a whole node is considerably longer than with small jobs.

If requesting exclusive use of a node, please ensure that you correctly set slots to fill the node. Nodes booked for exclusive use do not need memory requirements specified.

Requesting exclusive access may cause your job to queue for a long period because the scheduler needs to allocate all resources in a single node. As exclusive jobs block other jobs from running on the same node concurrently, please ensure the jobs are utilising all cores and memory correctly.

Job runtimes

Runtime, signified by h_rt in the job script, defines the maximum length of time a job is allowed to run for. If the h_rt value is exceeded, the job will be killed automatically by the scheduler.

While the number of cores and RAM quantity will impact the queuing time significantly, the maximum runtime has a very low impact on the queuing time. Since the job scheduler will kill a job that exceeds the runtime value, for most users the best thing to do is specify either:

  • 1 hour (to take advantage of the high priority short queue; this is also the default runtime if h_rt is not specified)
  • 240 hours (10 days - the maximum runtime)

There are some edge cases in which a job with an h_rt value of 1 or 3 days will be queued ahead of a 10 day job, but these usually relate to situations where we have reserved resources at a future date (e.g. maintenance periods). There may also be a small benefit for users who run large multi-node parallel jobs, since we commonly reserve resources for large jobs. The main thing to remember is that you don't want your job to die due to an arbitrary limit you have set yourself, so most users should just set 240 hours unless they want to access the short queue.

The 240 hour limit is a global setting, and cannot be changed for individual jobs or users. If you are submitting long running jobs, you should consider:

  • Attempting to parallelise the job
  • Breaking the job into smaller parts
  • Profiling the code to check for bottlenecks
  • Implementing checkpointing (a method of regularly dumping the job's state so that it can be restarted - check if your application supports this)
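
As a rough illustration of checkpointing, the toy script below records the last completed step in a file and resumes from it when restarted. The file name checkpoint.txt and the five-step workload are hypothetical. Real applications usually provide their own checkpoint mechanism, which should be preferred where it exists.

```shell
#!/bin/bash
# Hypothetical checkpoint file; a real job would store its own state here.
checkpoint=${CHECKPOINT_FILE:-checkpoint.txt}
total_steps=5

# Resume from the step after the last completed one if a checkpoint exists.
start=1
if [[ -f $checkpoint ]]; then
    start=$(( $(<"$checkpoint") + 1 ))
fi

for (( step = start; step <= total_steps; step++ )); do
    echo "Running step $step of $total_steps"
    # ... the real work for this step would go here ...

    # Record the completed step. Writing to a temporary file first means an
    # interrupted write never leaves a half-written checkpoint behind.
    echo "$step" > "$checkpoint.tmp" && mv "$checkpoint.tmp" "$checkpoint"
done

echo "All steps complete"
```

If the job is killed part-way through, resubmitting the same script continues from the first unfinished step rather than starting again.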

Queues

Users should not specify a queue or project when submitting jobs: the scheduler is configured to assign jobs correctly to nodes, with preference for restricted nodes if you have access to them.

High Priority Short Queue

For short jobs of up to one hour, we have created a high priority short queue to run these jobs with minimal queuing time, which is useful for job script validation, code syntax checking and resource optimisation. The short queue has been enabled on a variety of nodes, including those which have been purchased by other research groups that are not available to all users for jobs longer than 1 hour.

An example use case for the short queue is to check that a job starts running correctly, modules have been loaded, the environment set, execution commands are found and their arguments can be understood, all in a short period of time. If there was an error and you submitted with more than 1 hour, it could be a few hours or days before the job starts running and immediately fails due to a typo or incorrect settings.

If you would like to do some quick debugging or testing, run qlogin without any parameters to start an interactive job on the short queue, since a default runtime of 1 hour is applied if you do not specify an h_rt value.

Environment specification

The job script specifies the environment the job will execute in; this includes loaded modules and environment variables.

A number of environment variables are also made available by UGE:

Variable              Description
SGE_O_WORKDIR         Jobscript directory
SGE_CWD_PATH          Working directory
TMPDIR                Job specific temporary directory
JOB_NAME              Job Name
QUEUE                 Queue
PE                    Parallel Environment
NSLOTS                Number of slots
NHOSTS                Number of hosts
SGE_HGR_m_mem_free    Total memory requested (slots * mem)
SGE_HGR_gpu           ID of GPU granted
SGE_BINDING           Cores job is bound to on host
SGE_TASK_FIRST        First task in array
SGE_TASK_LAST         Last task in array
SGE_TASK_STEPSIZE     Array step size
SGE_TASK_ID           Current Task ID in array
PE_HOSTFILE           Location of host file for MPI
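
A short sketch of reading some of these variables from within a job script follows. The function name report_job_env is a hypothetical helper, and the placeholder text shown for unset variables is an assumption made so that the script can also be run outside the scheduler, where none of the variables exist.

```shell
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

# report_job_env: print a selection of the scheduler-provided variables.
report_job_env() {
    local var
    for var in JOB_NAME QUEUE PE NSLOTS NHOSTS TMPDIR SGE_O_WORKDIR SGE_TASK_ID; do
        # ${!var} expands the variable whose name is stored in var.
        echo "$var=${!var:-<not set>}"
    done
}

report_job_env
```

Printing these values at the top of a job can make the output file easier to interpret later, for example to confirm how many slots were granted.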

Additional information for these options can be found on specific pages:

Other options

Option                  Description
-pe                     request a parallel environment - this will be either smp to request a number of cores on the same node or parallel to request more than one whole machine.
-l exclusive            requests exclusive use of a serial node (automatically applied to parallel nodes).
-m e                    send an email once the job is completed. You can alternatively use -m bea to get notifications on job start, finish and abort. Please avoid sending mail for array jobs, as the resulting quantity of messages tends to cause email service disruption.
-M email@example.com    specify an alternate destination address for emails.
-o                      redirect the standard output from the job.
-e                      redirect the standard error from the job.
-cwd                    execute the job from the current working directory.
-V                      forward the current environment to the context of the job (note that LD_LIBRARY_PATH does not get passed to the job for security reasons). Although useful for interactive sessions, it should be avoided for batch submissions as it may cause module conflicts and path issues.
-j y                    combine the standard output and the standard error stream.
-N                      give a specific name to the job. If this parameter is not specified, the job name will be the same as the name of the script being run.

Job Names

Note the following characters are not allowed in job names: ",/:'\[]{}|()@%,` and whitespace.
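
A job name can be checked against this list before submission. The sketch below uses a hypothetical helper, valid_job_name, which is not part of Grid Engine. The rejection of empty names is an extra sanity check added for this illustration rather than a rule stated above.

```shell
#!/bin/bash
# The characters listed above as forbidden, plus space, tab and newline.
forbidden=$'",/:\'\\[]{}|()@%` \t\n'

# valid_job_name NAME: succeeds if NAME is non-empty and contains no
# forbidden character.
valid_job_name() {
    local name=$1 i
    [[ -n $name ]] || return 1
    for (( i = 0; i < ${#name}; i++ )); do
        # Quoting the character makes the match literal rather than a glob.
        if [[ $forbidden == *"${name:i:1}"* ]]; then
            return 1
        fi
    done
    return 0
}

for candidate in "align_run1" "align run1" "align:run1"; do
    if valid_job_name "$candidate"; then
        echo "ok:      $candidate"
    else
        echo "invalid: $candidate"
    fi
done
```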

For full details on qsub options see the qsub man page.

Job output

Output from the job will include an output and an error file unless the -j y option is given to join the streams. The default naming scheme for these files is <jobname>.o<jobID> and <jobname>.e<jobID> and will contain all text sent to the standard output (stdout) and standard error (stderr) respectively.

For array jobs the filenames will also include the task id e.g. <jobname>.o<jobid>.<taskid>
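
The naming scheme can be expressed as a small helper, which may be handy when a wrapper script needs to locate the output of a job. The function output_files is hypothetical, and the job name example.sh and the job IDs are made up for illustration.

```shell
#!/bin/bash
# output_files JOB_NAME JOB_ID [TASK_ID]
# Print the default stdout and stderr file names for a job; pass TASK_ID for
# a task within an array job.
output_files() {
    local name=$1 id=$2 task=${3:-}
    local suffix=""
    if [[ -n $task ]]; then
        suffix=".$task"
    fi
    echo "${name}.o${id}${suffix}"
    echo "${name}.e${id}${suffix}"
}

output_files example.sh 500      # ordinary job
output_files example.sh 501 3    # task 3 of an array job
```

With -j y only the .o file is written, since the error stream is merged into it.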

Example job scripts

Serial job (single-core)

Jobs which do not support multi-threading should utilise the single-core serial job script as seen below:

#!/bin/bash
#$ -cwd           # Set the working directory for the job to the current directory
#$ -j y           # Join stdout and stderr
#$ -pe smp 1      # Request 1 CPU core
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM / core, i.e. 1GB total

module load example

./code

Adjust the memory and runtime requests as applicable. Please refer to the memory and runtime sections for more information.

Serial job (multi-core)

This should be used for jobs which can use multiple CPU cores on a single machine, e.g. using OpenMP. Only request multiple cores if your application can utilise them all, otherwise cores will be unused, which is wasteful.

#!/bin/bash
#$ -cwd           # Set the working directory for the job to the current directory
#$ -j y           # Join stdout and stderr
#$ -pe smp 4      # Request 4 CPU cores
#$ -l h_rt=1:0:0  # Request 1 hour runtime
#$ -l h_vmem=1G   # Request 1GB RAM / core, i.e. 4GB total

module load example

./code --threads ${NSLOTS}

Please check the application documentation for a threading option (common options include but are not limited to: --threads, -t, --cores, --multicore, --parallel and -p). We recommend using the value of NSLOTS to reference the number of cores requested rather than a hard-coded value, to ease the process when scaling up your job.

If you are running an application which supports OpenMP, you should check that the OMP_NUM_THREADS variable has been set correctly to the value of NSLOTS, otherwise your application may run with poor performance. Some application modules may automatically set this if unset when loaded. Check the application documentation pages for more information.
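
One way to make sure the two values agree is to set OMP_NUM_THREADS explicitly in the job script before launching the application. This is a minimal sketch: the helper name set_omp_threads is hypothetical, and the fallback to 1 when NSLOTS is unset is an assumption added so that the snippet also runs outside a job.

```shell
#!/bin/bash
# set_omp_threads: match the OpenMP thread count to the cores granted by the
# scheduler. NSLOTS is only defined inside a job, so fall back to 1 elsewhere.
set_omp_threads() {
    export OMP_NUM_THREADS="${NSLOTS:-1}"
}

set_omp_threads
echo "OMP_NUM_THREADS=${OMP_NUM_THREADS}"
```

In a real job script the export line would sit after any module load commands and before the line that starts the application.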

Parallel job

Jobs which require multi-node parallel processing, such as MPI, are run on nodes with a low-latency InfiniBand connection. Your submission script will need to include the choice of InfiniBand-connected nodes to use.

mpirun slots

Using the $NSLOTS variable will automatically pass the number of requested cores to the mpirun command.

#!/bin/bash
#$ -cwd                 # Set the working directory for the job to the current directory
#$ -j y
#$ -pe parallel 48      # Request 48 cores/2 sdv nodes
#$ -l infiniband=sdv-i  # Choose infiniband island (nxn sdv-i)
#$ -l h_rt=240:0:0      # Request 240 hours runtime

module load intelmpi

mpirun -np $NSLOTS ./code

Single-node jobs on infiniband nodes

Only run multi-node jobs in the parallel PE. Single-node jobs should be run in the smp PE so that they do not block larger parallel jobs from running.

Large parallel jobs

If your job is requesting 120 or more cores, your job might queue for a long period of time and may benefit from running on a Tier 2 cluster instead of Apocrita. If you are unsure about your eligibility or have any questions about Tier 2 facilities, please contact us.

Interactive jobs

The qlogin command can be used to schedule an interactive job. This accepts the same arguments as qsub (e.g. for requesting runtime). Instead of running a script this presents a shell which can be used to run interactive commands on the node.

If you require X forwarding inside your interactive job, firstly pass the -X option to your SSH command when connecting to the cluster. Within your qlogin session, the $DISPLAY environment variable can be used to display graphical windows on your machine.

Interactive sessions should be used sparingly (for interactive debugging, etc.) since they use resources which could be used by batch jobs. You must have an active connection to the cluster throughout the entirety of an interactive job. If your connection drops, the session will end and you may lose your results. Batch jobs submitted with qsub do not require an active connection and will continue to run whilst you are not connected to the cluster.

At busy times you may not be able to get an interactive session, as there may be no spare cores to immediately service your request. However, you will increase your chances considerably if you choose the default runtime of 1 hour.

Node selection

In some situations, you may require your job to run on a certain node type (for example to select a certain CPU architecture or GPU type). All node selections are in the format -l <complex>[=<value>], where the value is only required for non-boolean complexes. When requesting boolean complexes, the value can be omitted or set to TRUE or true. See the table below for a list of supported node selections:

Complex Name    Description                                      Supported Values
avx2            Nodes which support AVX2                         (boolean)
cpuarch         CPU architecture                                 intel and amd
exclusive       Request entire nodes                             (boolean)
gpu             Request nodes with GPU support                   (boolean)
gpu_type        Run on specific GPU types                        kepler and volta
highmem         Nodes with a large amount of RAM                 (boolean)
infiniband      Parallel jobs within an InfiniBand island        (node type)
node_type       Run on a specific node type                      (node type)
owned           Run on owned / restricted nodes (if eligible)    (boolean)
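
As an illustration of the -l <complex>[=<value>] format, the sketch below requests one boolean and one non-boolean complex in the same job script. The particular combination is chosen only to show the syntax. Whether it can be satisfied, and how long it queues, depends on the nodes actually available.

```shell
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
#$ -l avx2            # Boolean complex: the value may be omitted
#$ -l cpuarch=amd     # Non-boolean complex: a value is required

# Report which node the job landed on; uname -n prints the host name.
node_name=$(uname -n)
echo "Running on ${node_name}"
```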

AVX2 Instruction set

Most of our nodes now support the AVX2 instruction set. However, our legacy SM & NXN nodes do not. Use the "avx2" complex to explicitly select a node that does support AVX2. More information is available here.

Job holds

Jobs can be held from running via the use of qhold or qsub with the -hold_jid argument. Holding a job means it remains in the queue but will not be considered for scheduling until the hold is removed or the requirement is met. This allows users and administrators greater control over queued jobs.

In the event of issues with a job, an administrator may hold the job until the issue is resolved. Users are unable to remove these holds but they may still delete the job if required.

Jobs can be held with the following commands:

# Hold a job
qhold <job_id>

# Release a job
qrls <job_id>

Queuing dependent jobs

To hold a job until a preceding job has completed, -hold_jid can be used.

# Preceding job:
$ qsub job_one.sh
$ qstat
500 0.00000 job_one abc123       qw    09/19/2020 10:11:35 1

# Held job
$ qsub -hold_jid 500 job_two.sh
$ qstat
500 0.00000 job_one abc123       qw    09/19/2020 10:11:35 1
501 0.00000 job_two abc123       hqw   09/19/2020 10:12:30 1

Once the first job completes the second job will be released for scheduling. This enables processing steps to be broken down into separate jobs. An example where this may be useful is within a pipeline, where a subsequent stage requires output from an earlier step, so this ensures a linear progression.
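
The chaining can be scripted so that the job ID does not have to be copied by hand from qstat. The sketch below relies on the -terse option of qsub, which prints only the job ID. The submit function, the DRY_RUN switch and the placeholder IDs are hypothetical additions so that the logic can be tried on a machine without the scheduler; set DRY_RUN=0 to submit for real.

```shell
#!/bin/bash
# DRY_RUN=1 (the default in this sketch) prints placeholder job IDs instead of
# calling qsub, so the chaining logic can be tried without a scheduler.
DRY_RUN=${DRY_RUN:-1}
fake_id=500   # Placeholder ID used only in dry-run mode

# submit ARGS...: submit a job and print only its job ID.
submit() {
    if [[ $DRY_RUN == 1 ]]; then
        echo "dry run: qsub -terse $*" >&2
        echo "$fake_id"
    else
        qsub -terse "$@"   # -terse makes qsub print just the job ID
    fi
}

first_id=$(submit job_one.sh)
fake_id=$(( fake_id + 1 ))
second_id=$(submit -hold_jid "$first_id" job_two.sh)

echo "job_two.sh (job $second_id) is held until job $first_id completes"
```

Further stages can be added in the same way, each one holding on the ID returned by the previous submission.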

Further information

  • The man pages on the cluster systems give information on the queuing system and MPI functions.