Tuning Jobs

Tuning your jobs can substantially reduce the time they spend on the cluster, while ensuring that resources are not wasted.

Cores

Asking for more cores can mean a longer overall wait for results. For example, a particular job might run for three hours on four cores and one hour on twelve cores, but if the twelve-core request sits in the queue for eight hours before starting, the results do not come back for nine hours rather than three.

It is also important to make sure your job can actually use more than one core before requesting several; otherwise the job may queue for days for no benefit, and the unused cores are wasted while it runs.
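
As an illustration, a multi-core request on a Grid Engine based scheduler usually goes through a parallel environment. The sketch below assumes the commonly used smp parallel environment and a hypothetical application that accepts a thread-count option; check your cluster and application documentation for the real names.

#!/bin/bash
#$ -cwd
#$ -pe smp 4                            # request 4 cores on a single node
# NSLOTS is set by the scheduler to the number of cores granted
./my_parallel_app --threads ${NSLOTS}   # hypothetical application and option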

Walltime

If your job reliably finishes in much less walltime than requested, it is advisable to reduce the requested walltime, since the scheduler can then prioritise your job over others requesting, say, 10 days. However, it is more important that a job completes successfully, so it's better to overestimate substantially until you reliably know your job's runtime. For well-behaved jobs, overestimating walltime does not actually waste resources, unlike overestimating memory requirements.
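
For instance, on a Grid Engine based scheduler the walltime is typically requested with the h_rt resource; the figures below are purely illustrative.

# Generous request while the runtime is still unknown
#$ -l h_rt=240:0:0
# Once the job is known to finish reliably in about 5 hours,
# a tighter request lets the scheduler prioritise it
#$ -l h_rt=6:0:0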

Memory tuning

Asking for the right amount of memory is also important to ensure the job starts running as soon as it can. Ask for too much and the job may queue for a long time, and the whole requested allocation is reserved exclusively while the job runs; ask for too little, and the job will be terminated by the scheduler. Either way, resources are wasted.

Memory requirements for exclusive node access

For jobs using one or more full nodes, such as parallel jobs and smp jobs requesting -l exclusive=true, the memory requirement does not need to be specified, since the full memory of each node will be available.
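
A minimal sketch of such a request, using the exclusive=true resource mentioned above (the smp parallel environment name and core count are assumptions):

#$ -pe smp 12
#$ -l exclusive=true   # whole node reserved: no h_vmem request is needed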

Try an initial sensible value, based on the data size and previous experience with the application. If you aren't sure, the standard approach is to start with the default (1GB per core) and enable email notifications in your job submission script with -m e and -M example@example.com. The requested memory can easily be changed in the job script. When the job finishes, an email is sent containing the Max vmem value, which shows the memory used by the job. Depending on whether the memory was under- or over-estimated, use this information to refine the memory request for the next job.
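
A minimal sketch of such an initial run, using the options mentioned above (replace the address and application with your own):

#$ -l h_vmem=1G             # the default 1GB per core as a starting point
#$ -m e                     # send an email when the job ends
#$ -M example@example.com   # address to notify
./my_application            # hypothetical application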

Memory requests with array jobs

When running an array job, please run a single task from the array first to establish its memory requirements. Submitting large array jobs that subsequently fail by exceeding their memory request, or that use only a small fraction of the requested memory, wastes valuable cluster resources.
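
For example, with a Grid Engine array job a single task can be submitted on its own first, and the full range submitted only once the memory request has been refined (the script name and task range are illustrative):

# Run a single task to measure memory usage
qsub -t 1 array_job.sh
# After adjusting h_vmem in the script, submit the full array
qsub -t 1-1000 array_job.sh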

Analysing Job history via the stats site

Once you have jobs that run successfully, you can view details of your recent jobs on the Apocrita stats page - use your Apocrita credentials for access, then select View your jobs detail from the side menu. This will show a list of dates on which your jobs finished: select the date you're interested in, and a list of jobs and their attributes will be displayed.

At this point, the easiest tuning you can do is to compare the requested memory (Req mem) and maximum memory used (Max mem used) fields, both of which are shown in GB. If there is a discrepancy between the memory you requested and the amount you used, reduce the requested memory for the next iteration of the job. This will give your job a better chance of starting sooner than it otherwise would have done.

For example, given the following output for an SMP job running on 6 cores, the total memory used (Max vmem) is around 54GB, so the job could be run by requesting 9GB per core with -l h_vmem=9G. Requesting only the memory you need allows the scheduler to run more jobs, so requests are fulfilled more quickly.

Job 123456 (test.sh) Complete
User             = abc123
Queue            = test.q@node99
Host             = node99.apocrita
Start Time       = 09/11/2013 18:03:03
End Time         = 09/11/2013 22:59:06
User Time        = 17:22:55
System Time      = 00:45:04
Wallclock Time   = 04:56:03
CPU              = 19:03:18
Max vmem         = 53.776G
Exit Status      = 0
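
Here the per-core figure is simply the peak usage divided by the number of cores: 53.776G over 6 cores is just under 9G per core, so a subsequent submission of this job might request (the smp parallel environment name is assumed, as before):

#$ -pe smp 6
#$ -l h_vmem=9G   # 6 cores x 9GB = 54GB, just above the observed 53.776G peak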

Storage

Temporary local node storage is significantly faster than accessing GPFS. The scheduler sets environment variables specifying available temporary space:

TMPDIR=/tmp/3596165.1.serial.q
TMP=/tmp/3596165.1.serial.q

Copying datasets to these directories before use will speed up access. To take full advantage of this, you can:

# Copy all datafiles to ${TMPDIR}
cp /data/Example/data ${TMPDIR}
# Run script outputting to ${TMPDIR}
./script -i ${TMPDIR}/data -o ${TMPDIR}/output_data
# Copy output data back
cp ${TMPDIR}/output_data /data/Example/

More detail is available in the Storage section of this site.