Array jobs¶
A common requirement is to be able to run the same job a large number of times, with different input parameters. Whilst this could be done by submitting lots of individual jobs, a more efficient and robust way is to use an array job. Using an array job also allows you to circumvent the maximum jobs per user limitation, and manage the submission process more elegantly.
Arrays can be thought of as a for loop:
for NUM in 1 2 3
do
echo $NUM
done
Is equivalent to:
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-3
echo ${SGE_TASK_ID}
Here the -t flag configures the number of iterations in your qsub script, and the counter (the equivalent of $NUM in the for loop example) is $SGE_TASK_ID.
To run an array job, use the -t option to specify the range of tasks to run.
When the job runs, the script is executed once per task, with $SGE_TASK_ID set to each value specified by -t. The values for -t can be any integer range, with an optional step size. In the following example, -t 20-30:5 produces the values 20, 25 and 30, and runs 3 tasks.
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 20-30:5
echo "Sleeping for ${SGE_TASK_ID} seconds"
sleep ${SGE_TASK_ID}
The only difference between the individual tasks is the value of the $SGE_TASK_ID environment variable. This value can be used to reference different parameter sets etc. from within a job.
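For example, a minimal sketch that selects a parameter set directly inside the job script using a bash array (the --alpha values here are hypothetical placeholders; the sections below show a more common pattern using a list file):
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-3
# Hypothetical parameter sets; bash arrays are zero-indexed, task ids start at 1
PARAMS=("--alpha 0.1" "--alpha 0.5" "--alpha 0.9")
echo "Task ${SGE_TASK_ID} using parameters: ${PARAMS[$((SGE_TASK_ID - 1))]}"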
Output files for array jobs include the task id, to differentiate the output from each task, e.g.:
testarray.o123.1
testarray.o123.2
testarray.o123.3
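To view the output of all tasks together once they have finished, a simple glob over the task suffix works (using the example job id above):
cat testarray.o123.*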
Running a single task from an array job
If you need to run a single task from an array job (for example, the 5th task of 100 hit the execution time limit and you wish to run it again), you can pass the task number with the -t option when submitting: qsub -t 5 array_job.sh
Email notifications and large arrays
Please ensure that job email notifications are not enabled in job scripts for arrays with lots of tasks, as the sending of a large number of email messages causes problems with the receiving mail servers, and even service disruption.
Processing files¶
If you need to process lots of files, then you can set up an appropriate list using ls -1, e.g. if your files are all named EN<something>.txt:
ls -1 EN*.txt > list_of_files.txt
Now find out how many files there are:
$ wc -l list_of_files.txt
35 list_of_files.txt
Then set the -t value to the appropriate number:
#$ -t 1-35
You can then use sed to select the correct line of the file for each iteration:
INPUT_FILE=$(sed -n "${SGE_TASK_ID}p" list_of_files.txt)
This results in the final script:
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-35
INPUT_FILE=$(sed -n "${SGE_TASK_ID}p" list_of_files.txt)
example-program < $INPUT_FILE
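As a sketch, rather than hard-coding the range in the script, the same range could be supplied on the qsub command line at submission time, derived from the line count of the list (assuming the script above is saved as array_job.sh and does not already contain a -t directive):
qsub -t 1-$(wc -l < list_of_files.txt) array_job.sh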
Processing directories¶
Consider processing the contents of a collection of 1000 directories, called test1 to test1000.
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-1000
cd test${SGE_TASK_ID}
./program < input
Tasks are started in order of the array index.
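If you need to create such a directory layout for testing, shell brace expansion is one quick way to do it (a sketch; adjust the name and count to match your own data):
mkdir test{1..1000}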
Passing arguments to an application¶
The following example runs an application with differing arguments obtained from a text file:
$ cat list_of_args.txt
-i 50 52 54 -s 10
-i 60 62 64 -s 20
-i 70 72 74 -s 30
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y
#$ -l h_rt=1:0:0
#$ -t 1-3
INPUT_ARGS=$(sed -n "${SGE_TASK_ID}p" list_of_args.txt)
./program $INPUT_ARGS
This results in 3 job tasks being submitted, each using a different set of input arguments taken from the corresponding line of the text file.
Task concurrency¶
Task concurrency (-tc N) is the number of array tasks allowed to run at the same time. This can be used to limit the number of tasks running for larger jobs, and for jobs that may impact storage performance.
If you are running code that may read or write the same files on the filesystem, you may need to use this option to avoid filesystem blocking. Additionally, large numbers of tasks starting or finishing at the same moment put an extra load on the scheduler; using the -tc throttle can limit this.
#!/bin/bash
#$ -cwd # Run the code from the current directory
#$ -pe smp 1
#$ -l h_vmem=1G
#$ -j y # Merge the standard output and standard error
#$ -l h_rt=1:0:0 # Limit each task to 1 hr
#$ -t 1-1000
#$ -tc 5
cd test${SGE_TASK_ID}
./program < input
Concurrency default value
If a tc value is not supplied, we set a default value of 100 for array jobs. This is to avoid accidental impact on shared resources such as storage. We allow you to set a higher concurrency value than this, but please be vigilant of any potential issues your job might cause, such as each array task writing to a single file.
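For example, rather than every task appending to one shared file, each task can write to its own file keyed on the task id (a sketch based on the directory example above; the output file name is illustrative):
cd test${SGE_TASK_ID}
./program < input > output.${SGE_TASK_ID}.txt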
You can alter the tc value while the job is running with qalter. For example, to change the concurrency of an array job to a value of ten:
qalter -tc 10 <jobid>
Deleting specific tasks from a queued array job¶
If you want to delete only certain tasks from an array (for example, tasks 1-10 are running, but you want to delete 20-60), use the -t option for qdel (see the man page).
In our example, if our jobid is 3388, and we want to delete tasks 20-60 from our array, and leave the rest running, do:
qdel 3388 -t 20-60
Need help?¶
If you need help writing or using array job submission scripts, please see Getting Help.