Tuning Jobs

Tuning your jobs can substantially reduce the time they spend on the cluster, while ensuring that resources are not wasted.

Cores

Asking for more cores can mean a longer overall wait for results. For example, a particular job might run for three hours on four cores and one hour on twelve cores, but if the twelve-core request sits in the queue for eight hours before starting, the results do not come back for nine hours rather than three.

It is also important to make sure your job can actually use more than one core before requesting several; otherwise the job may queue for days for no benefit, and the unused cores are wasted while it runs.
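
As an illustration, a multi-core request on a Grid Engine based scheduler usually goes through a parallel environment. The sketch below assumes the commonly used smp parallel environment and a hypothetical application that accepts a thread-count option; check your cluster and application documentation for the real names.

#!/bin/bash
#$ -cwd
#$ -pe smp 4                            # request 4 cores on a single node
# NSLOTS is set by the scheduler to the number of cores granted
./my_parallel_app --threads ${NSLOTS}   # hypothetical application and option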

Walltime

If your job reliably finishes in much less walltime than requested, it is advisable to reduce the requested walltime, since the scheduler can then prioritise your job over others requesting, say, 10 days. However, it is more important that a job completes successfully, so it's better to overestimate substantially until you reliably know your job's runtime. For well-behaved jobs, overestimating walltime does not actually waste resources, unlike overestimating memory requirements.
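
For instance, on a Grid Engine based scheduler the walltime is typically requested with the h_rt resource; the figures below are purely illustrative.

# Generous request while the runtime is still unknown
#$ -l h_rt=240:0:0
# Once the job is known to finish reliably in about 5 hours,
# a tighter request lets the scheduler prioritise it
#$ -l h_rt=6:0:0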

Memory tuning

Asking for the right amount of memory is also important to ensure the job starts running as soon as it can. Ask for too much and the job may queue for a long time, and the whole requested allocation is reserved exclusively while the job runs; ask for too little, and the job will be terminated by the scheduler. Either way, resources are wasted.

Memory requirements for exclusive node access

For jobs using one or more full nodes, such as parallel jobs and smp jobs requesting -l exclusive=true, the memory requirement does not need to be specified, since the full memory of each node will be available.
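
A minimal sketch of such a request, using the exclusive=true resource mentioned above (the smp parallel environment name and core count are assumptions):

#$ -pe smp 12
#$ -l exclusive=true   # whole node reserved: no h_vmem request is needed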

Try an initial sensible value, based on the data size and previous experience with the application. If you aren't sure, the standard approach is to start with the default (1GB per core) and enable email notifications in your job submission script with -m e and -M example@example.com. The requested memory can easily be changed in the job script. When the job finishes, an email is sent containing the Max vmem value, which shows the memory used by the job. Depending on whether the memory was under- or over-estimated, use this information to refine the memory request for the next job.
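
A minimal sketch of such an initial run, using the options mentioned above (replace the address and application with your own):

#$ -l h_vmem=1G             # the default 1GB per core as a starting point
#$ -m e                     # send an email when the job ends
#$ -M example@example.com   # address to notify
./my_application            # hypothetical application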

Memory requests with array jobs

When running an array job, please run a single task from the array first to establish its memory requirements. Submitting large array jobs that subsequently fail by exceeding their memory request, or that use only a small fraction of the requested memory, wastes valuable cluster resources.
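
For example, with a Grid Engine array job a single task can be submitted on its own first, and the full range submitted only once the memory request has been refined (the script name and task range are illustrative):

# Run a single task to measure memory usage
qsub -t 1 array_job.sh
# After adjusting h_vmem in the script, submit the full array
qsub -t 1-1000 array_job.sh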

Analysing Job history via the stats site

Once you have jobs that run successfully, you can view details of your recent jobs on the Apocrita stats page - use your Apocrita credentials for access, then select View your jobs detail from the side menu. This will show a list of dates on which your jobs finished: select the date you're interested in, and a list of jobs and their attributes will be displayed.

At this point, the easiest tuning you can do is to compare the requested memory (Req mem) and maximum memory used (Max mem used) fields, both of which are shown in GB. If there is a discrepancy between the memory you requested and the amount you used, reduce the requested memory for the next iteration of the job. This will give your job a better chance of starting sooner than it otherwise would have done.

For example, given the following output for an SMP job running on 6 cores, the total memory used (Max vmem) is around 54GB, so the job could be run by requesting 9GB per core with -l h_vmem=9G. Requesting only the memory you need allows the scheduler to run more jobs, so requests are fulfilled more quickly.

Job 123456 (test.sh) Complete
User             = abc123
Queue            = test.q@node99
Host             = node99.apocrita
Start Time       = 09/11/2013 18:03:03
End Time         = 09/11/2013 22:59:06
User Time        = 17:22:55
System Time      = 00:45:04
Wallclock Time   = 04:56:03
CPU              = 19:03:18
Max vmem         = 53.776G
Exit Status      = 0
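
Here the per-core figure is simply the peak usage divided by the number of cores: 53.776G over 6 cores is just under 9G per core, so a subsequent submission of this job might request (the smp parallel environment name is assumed, as before):

#$ -pe smp 6
#$ -l h_vmem=9G   # 6 cores x 9GB = 54GB, just above the observed 53.776G peak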

Storage

Temporary local node storage is significantly faster than accessing GPFS. The scheduler sets environment variables specifying available temporary space:

TMPDIR=/tmp/3596165.1.serial.q
TMP=/tmp/3596165.1.serial.q

Copying datasets to these directories before use will speed up access. To take full advantage of this, you can:

# Copy all datafiles to ${TMPDIR}
cp /data/Example/data ${TMPDIR}
# Run script outputting to ${TMPDIR}
./script -i ${TMPDIR}/data -o ${TMPDIR}/output_data
# Copy output data back
cp ${TMPDIR}/output_data /data/Example/

More detail is available in the Storage section of this site.