Submitting jobs¶
All jobs run on the HPC systems should be submitted via the queuing system,
Univa Grid Engine.
To submit a job, write a script that describes the resources your job needs
and how to run it; the script is then submitted via the `qsub` command.
Common commands used to manipulate submitted jobs are:

- `qstat` - checks the status of submitted jobs (man page).
- `qhold` - holds a job from running (man page).
- `qrls` - releases a job from a held status (man page).
- `qdel` - deletes a job from the queue, or a specific task from an array (man page).
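As an illustration (using a hypothetical job ID of 123456), a typical sequence might look like:

```bash
qstat             # check the status of your submitted jobs
qhold 123456      # hold job 123456 so it will not be scheduled
qrls 123456       # release the held job
qdel 123456       # delete the job from the queue
```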
Please also see our video which describes the job submission process, checking node and queue statuses, exit codes and job optimisation (cores, memory and runtime).
Job scripts¶
Writing your job scripts with a text editor¶
We recommend that you write your job scripts using a text editor installed on the cluster, such as:
- vim
- emacs
- nano (preferred by beginners - requires `module load nano` to use)
Job scripts written on Windows
Job scripts transferred from a Windows PC may contain invisible control characters (such as Windows line endings) that result in job submission failures, and require conversion before use on the cluster.
To open a file called `example.sh`, run `vim example.sh` or `nano example.sh`.
If the file exists in the current directory, it will be opened for editing,
otherwise a new, empty file will be created.
Simple job script¶
This is a simple job script requesting a single core, 1GB of RAM and 1 hour runtime:
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -pe smp 1 # Request 1 core
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM
./code
This can then be run with `qsub`:
qsub example.sh
The main purpose of a job script is to request the required resources and set up the environment. The main resources to be requested on the cluster are cores, memory and runtime.
Don't forget to load the application you need using the `module` command.
Array jobs
Note that if you intend to submit multiple similar jobs, you should submit them as an array instead.
This reduces load on the scheduler and streamlines the job submission process. Please see the Arrays section for more details.
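A minimal sketch of an array job (the task range and input file name are illustrative only) might look like:

```bash
#!/bin/bash
#$ -cwd               # Set the working directory for the job to the current directory
#$ -pe smp 1          # Request 1 core per task
#$ -l h_rt=1:0:0      # Request 1 hour runtime per task
#$ -l h_vmem=1G       # Request 1GB RAM per task
#$ -t 1-10            # Run 10 tasks; each task receives its own SGE_TASK_ID

./code --input data_${SGE_TASK_ID}.txt
```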
Resource specification¶
Resources are specified at the top of the file using the `#$ <resource>` syntax. As a minimum you should specify the following in your job scripts:

- The wallclock runtime of your job, using `-l h_rt`. This defaults to 1 hour (see below).
- The amount of memory your job requires, using `-l h_vmem`.
Grid Engine supports the concept of "soft requests", which the scheduler will attempt to meet where possible; in general these should not be used here.
Memory requests¶
For all jobs requesting `-pe smp`, memory requests are handled per core, so
requesting 2 cores and 5G would result in 2 cores and 10G.

So a request of:

#$ -l h_vmem=5G

would result in 5G being allocated, whereas a request of:

#$ -pe smp 2
#$ -l h_vmem=5G

would result in 10G being allocated. If your job consumes more than this
amount, it will be killed by the scheduler.
If you fail to specify memory units, your job submission will fail with an error. This is to avoid accidentally requesting just 1 byte for your job, which is how a request without units would be interpreted.
GPU job Memory
Do not request more than 11G of memory per core (`-l h_vmem=11G`) for GPU
jobs, as it could lock out other users from using free GPUs.
SMP job Memory
With SMP jobs, if you don't specify memory with `-l h_vmem` you will get
a default value of 1GB per core requested.
Parallel job Memory
For parallel jobs, each node is booked for exclusive use and uses all available memory on each node: you do not need to specify a memory requirement.
Jobs requiring large amounts of RAM
If you have jobs with very large RAM requirements you may want to make use
of our public highmem nodes. You will need to pass the `-l highmem`
option in the job script. Jobs which do not require very
large amounts of memory should be submitted to the regular nodes by not
adding the `-l highmem` option.
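As an illustration (the memory figure is an arbitrary example, not a recommendation), a highmem request might look like:

```bash
#$ -pe smp 2
#$ -l h_rt=24:0:0
#$ -l h_vmem=128G     # illustrative figure: 2 cores x 128G = 256G total
#$ -l highmem         # place the job on a highmem node
```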
Requesting exclusive use of a node
Note that available memory on a node will be slightly lower than the
theoretical maximum, due to memory consumed by Operating System processes.
If your job requires a whole node, omitting the `-l h_vmem` option and
choosing `-l exclusive` will request a full node and ensure that the
maximum amount of available memory is used by your job. Note that the
queuing time for a whole node is considerably longer than with small jobs.

If requesting exclusive use of a node, please ensure that you correctly set
slots to fill the node. Nodes booked for exclusive use do not need memory
requirements specified.
Requesting exclusive access may cause your job to queue for a long period because the scheduler needs to allocate all resources in a single node. As exclusive jobs block other jobs from running on the same node concurrently, please ensure the jobs are utilising all cores and memory correctly.
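A minimal sketch of an exclusive request is shown below (the 32-core slot count is an assumption; set it to match the node type you are targeting):

```bash
#$ -pe smp 32         # assumed core count: set slots to fill the node
#$ -l exclusive       # request the whole node; no h_vmem required
#$ -l h_rt=24:0:0
```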
Job runtimes¶
Runtime, signified by `h_rt` in the job script, defines the maximum length
of time a job is allowed to run for. If the `h_rt` value is exceeded, the job
will be killed automatically by the scheduler.
While the number of cores and RAM quantity will impact the queuing time significantly, the maximum runtime has a very low impact on the queuing time. Since the job scheduler will kill a job that exceeds the runtime value, for most users the best thing to do is specify either:
- 1 hour (to take advantage of the high priority short queue; this is also the default runtime if `h_rt` is not specified)
- 240 hours (10 days - the maximum runtime)
There are some edge cases that could mean a job with an `h_rt` value of 1 or 3
days will get queued ahead of a 10 day job, but these usually relate to
situations where we have reserved resources at a future date (e.g. maintenance
periods). There may also be some small benefit for users who run large
multi-node parallel jobs, since we commonly reserve resources for
large jobs, but the main thing to remember is that you don't want your job
to die due to some arbitrary limit you have artificially set, so most users
should just set 240 hours, unless they want to access the short queue.
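For reference, the two corresponding directives are:

```bash
#$ -l h_rt=1:0:0      # 1 hour: eligible for the high priority short queue
#$ -l h_rt=240:0:0    # 240 hours: the maximum runtime
```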
The 240 hour limit is a global setting, and cannot be changed for individual jobs or users. If you are submitting long running jobs, you should consider:
- Attempting to parallelise the job
- Considering whether the job can be broken into smaller parts
- Profiling the code to check for bottlenecks
- Implementing checkpointing (a method of regularly dumping the job's state so that it can be restarted - check if your application supports this)
Queues¶
Users should not specify a queue or project when submitting jobs: the scheduler is configured to assign jobs correctly to nodes, with preference for restricted nodes if you have access to them.
High Priority Short Queue¶
For short jobs of up to one hour, we have created a high priority short queue to run these jobs with minimal queuing time, which is useful for job script validation, code syntax checking and resource optimisation. The short queue has been enabled on a variety of nodes, including those which have been purchased by other research groups that are not available to all users for jobs longer than 1 hour.
An example use case for the short queue is to check that a job starts running correctly, modules have been loaded, the environment set, execution commands are found and their arguments can be understood, all in a short period of time. If there was an error and you submitted with more than 1 hour, it could be a few hours or days before the job starts running and immediately fails due to a typo or incorrect settings.
If you would like to do some quick debugging or testing, run `qlogin` without
any parameters to start an interactive job on the short queue, since a default
value of 1 hour is applied if you do not specify an `h_rt` value.
Environment specification¶
The job script specifies the environment the job will execute in; this includes loaded modules and environment variables.
A number of environment variables are also made available by UGE:
| Variable | Description |
|---|---|
| `SGE_CWD_PATH` | Working directory |
| `SGE_O_WORKDIR` | Working directory when submitted |
| `TMPDIR` | Job specific temporary directory |
| `JOB_NAME` | Job Name |
| `QUEUE` | Queue |
| `PE` | Parallel Environment |
| `NSLOTS` | Number of slots |
| `NHOSTS` | Number of hosts |
| `SGE_HGR_m_mem_free` | Total memory requested (slots * mem) |
| `SGE_HGR_gpu` | ID of GPU granted |
| `SGE_BINDING` | Cores job is bound to on host |
| `SGE_TASK_FIRST` | First task in array |
| `SGE_TASK_LAST` | Last task in array |
| `SGE_TASK_STEPSIZE` | Array step size |
| `SGE_TASK_ID` | Current Task ID in array |
| `PE_HOSTFILE` | Location of host file for MPI |
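As an illustrative (and hypothetical) example, a job script could report its allocation and use the job-specific temporary directory like this:

```bash
echo "Job ${JOB_NAME} running on ${NHOSTS} host(s) with ${NSLOTS} slot(s)"
echo "Bound to cores: ${SGE_BINDING}"

# use the job-specific temporary directory for scratch files
cd ${TMPDIR}
```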
Additional information for these options can be found on the relevant specific pages.
Other options¶
| Option | Description |
|---|---|
| `-pe` | request a parallel environment - this will be either `smp` to request a number of cores on the same node, or `parallel` to request more than one whole machine. |
| `-l exclusive` | requests exclusive use of a serial node (automatically applied to parallel nodes). |
| `-m e` | send an email once the job is completed. You can alternatively use `-m bea` to get notifications on job start, finish and abortion. Please avoid sending mail for array jobs, as the resulting quantity of messages tends to cause email service disruption. |
| `-M email@example.com` | specify an alternate destination address for emails. |
| `-o` | redirect the standard output from the job. |
| `-e` | redirect the standard error from the job. |
| `-cwd` | execute the job from the current working directory. |
| `-V` | forward the current environment to the context of the job (note that `LD_LIBRARY_PATH` does not get passed to the job for security reasons). Although useful for interactive sessions, it should be avoided for batch submissions as it may cause module conflicts and path issues. |
| `-j y` | combine the standard output and the standard error stream. |
| `-N` | give a specific name to the job. If this parameter is not specified, the job name will be the same as the name of the script being run. |
Job Names
Note the following characters are not allowed in job names: ",/:'\[]{}|()@%,` and whitespace.
For full details on `qsub` options see the `qsub` man page.
Job output¶
Output from the job will include an output and an error file unless the `-j y`
option is given to join the streams. The default naming scheme for these files
is `<jobname>.o<jobid>` and `<jobname>.e<jobid>`, and they will contain all text
sent to standard output and standard error respectively.

For array jobs the filenames will also include the task ID, such as
`<jobname>.o<jobid>.<taskid>`
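For example, a job submitted from example.sh with a hypothetical job ID of 123456 would produce:

```
example.sh.o123456    # standard output
example.sh.e123456    # standard error
```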
Example job scripts¶
Serial job (single-core)¶
Jobs which do not support multi-threading should use the single-core serial job script as seen below:
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y # Join stdout and stderr
#$ -pe smp 1 # Request 1 CPU core
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM / core, i.e. 1GB total
module load example
./code
Adjust the memory and runtime requests as applicable. Please refer to the memory and runtime sections for more information.
Serial job (multi-core)¶
A multi-core serial job should be used for jobs which can use multiple CPU cores on a single machine, such as those using OpenMP, and only those jobs. Requesting many cores for a job which cannot use them is wasteful.
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y # Join stdout and stderr
#$ -pe smp 4 # Request 4 CPU cores
#$ -l h_rt=1:0:0 # Request 1 hour runtime
#$ -l h_vmem=1G # Request 1GB RAM / core, i.e. 4GB total
module load example
./code --threads ${NSLOTS}
Please check the application documentation for a threading option (common
options include but are not limited to: `--threads`, `-t`, `--cores`,
`--multicore`, `--parallel` and `-p`). We recommend using the value of
`NSLOTS` to reference the number of cores requested rather than a hard-coded
value, to ease the process when scaling up your job.
If you are running an application which supports OpenMP, you should check
that the `OMP_NUM_THREADS` variable has been set correctly to the value of
`NSLOTS`, otherwise your application may run with poor performance. Some
application modules may automatically set this if unset when loaded. Check the
application documentation pages for more information.
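As a minimal sketch, the following lines could be added to the multi-core job script above to set the value (if it is not already set) before launching the code:

```bash
# set OMP_NUM_THREADS from the requested slot count if the module has not already done so
export OMP_NUM_THREADS=${OMP_NUM_THREADS:-${NSLOTS}}

./code
```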
Parallel job¶
Jobs which require multi-node parallel processing, such as those which use MPI, are run on nodes with a low-latency infiniband connection. Your submission script will need to include the choice of infiniband-connected nodes to use.
mpirun slots
Using the `$NSLOTS` variable will automatically pass the number of
requested cores to the `mpirun` command.
Exporting variables in multi-node MPI jobs
When using MPI jobs across multiple nodes you may need to explicitly export environment variables to each of the MPI processes, such as when using libraries which are not found by default. The way to do this is explained in the documentation for the specific MPI implementation you are using.

For example, to export the environment variable `LD_LIBRARY_PATH` using Open MPI you should use:

mpirun -x LD_LIBRARY_PATH ...

With Intel MPI environment variables are usually exported by default, but when they are not, the corresponding option would look like:

mpirun -genvlist LD_LIBRARY_PATH ...
#!/bin/bash
#$ -cwd # Set the working directory for the job to the current directory
#$ -j y
#$ -pe parallel 96 # Request 96 cores/2 ddy nodes
#$ -l infiniband=ddy-i # Choose infiniband island (ddy-i)
#$ -l h_rt=240:0:0 # Request 240 hours runtime
module load intelmpi
mpirun -np $NSLOTS ./code
Single-node jobs on infiniband nodes
Only run multi-node jobs in the `parallel` pe; single-node jobs should be
run in the `smp` pe so they don't block larger parallel jobs from running.
Large parallel jobs
If your job is requesting 120 or more cores, your job might queue for a long period of time and may benefit from running on a Tier 2 cluster instead of Apocrita. If you are unsure about your eligibility or have any questions about Tier 2 facilities, please contact us.
Interactive jobs¶
The `qlogin` command can be used to schedule an interactive job. This accepts
the same arguments as `qsub` (e.g. for requesting runtime). Instead of running a
script, this presents a shell which can be used to run interactive commands on
the node.
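For example (the resource values are purely illustrative), an interactive session with two cores and 2GB of RAM per core could be requested with:

```bash
qlogin -pe smp 2 -l h_vmem=2G -l h_rt=1:0:0
```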
If you require X forwarding inside your interactive job, firstly pass the
`-X` option to your SSH command when connecting to the cluster. Within your
`qlogin` session, the `$DISPLAY` environment variable can be used to display
graphical windows on your machine.
Interactive sessions should be used sparingly (for interactive debugging, etc.)
since they use resources which could be used by batch jobs. You must have an
active connection to the cluster throughout the entirety of an interactive
job. If your connection drops, the session will end and you may lose your
results. Batch jobs submitted with qsub
do not require an active
connection and will continue to run whilst you are not connected to the
cluster.
At busy times you may not be able to get an interactive session as there may be no spare cores to immediately service your request; however, you will increase your chances considerably if you choose the default runtime of 1 hour.
Node selection¶
In some situations, you may require your job to run on a certain node type
(for example to select a certain CPU architecture or GPU type). All node
selections are in the format `-l <complex>[=<value>]`, where the value is only
required for non-boolean complexes; the value can be either omitted or set
to `TRUE` or `true` when requesting boolean complexes. See the below table
for a list of supported node selections:
| Complex Name | Description | Supported Values |
|---|---|---|
| `cpuarch` | CPU architecture | `intel` and `amd` |
| `exclusive` | Request entire nodes | (boolean) |
| `gpu` | Request nodes with GPU support | (boolean) |
| `gpu_type` | Run on specific GPU types | `ampere` and `volta` |
| `gpuhighmem` | GPU nodes with a large amount of RAM | (boolean) |
| `highmem` | Nodes with a large amount of RAM | (boolean) |
| `infiniband` | Parallel jobs within an InfiniBand island | (node type) |
| `node_type` | Run on a specific node type | (node type) |
| `owned` | Run on owned / restricted nodes (if eligible) | (boolean) |
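For illustration (the chosen values are examples only), these selections appear in a job script as follows:

```bash
#$ -l cpuarch=amd     # non-boolean complex: a value is required
#$ -l owned           # boolean complex: the value may be omitted
```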
Job holds¶
Jobs can be held from running via the use of `qhold` or `qsub` with the
`-hold_jid` argument. Holding a job means it remains in the queue but will not
be considered for scheduling until the hold is removed or the requirement is
met. This allows users and administrators greater control over queued jobs.
In the event of issues with a job, an administrator may hold the job until the issue is resolved. Users are unable to remove these holds but they may still delete the job if required.
Jobs can be held with the following commands:
# Hold a job
qhold <job_id>
# Release a job
qrls <job_id>
Queuing dependent jobs¶
To hold a job until a preceding job has completed, `-hold_jid` can be used.
# Preceding job:
$ qsub job_one.sh
$ qstat
500 0.00000 job_one abc123 qw 09/19/2020 10:11:35 1
# Held job
$ qsub -hold_jid 500 job_two.sh
$ qstat
500 0.00000 job_one abc123 qw 09/19/2020 10:11:35 1
501 0.00000 job_two abc123 hqw 09/19/2020 10:12:30 1
Once the first job completes the second job will be released for scheduling. This enables processing steps to be broken down into separate jobs. An example where this may be useful is within a pipeline, where a subsequent stage requires output from an earlier step, so this ensures a linear progression.
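Dependencies can also be scripted; a minimal sketch using `qsub -terse` (which prints just the new job's ID) might look like:

```bash
# submit the first step and capture its job ID
JOB_ONE=$(qsub -terse job_one.sh)

# submit the second step, held until the first job has completed
qsub -hold_jid ${JOB_ONE} job_two.sh
```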
Further information¶
- The man pages on the cluster systems give information on the queuing system and MPI functions.