In addition to the standard Univa Grid Engine command-line utilities, we have an extensive stats site for graphical reporting on all aspects of the various nodes and queues. You can also view summaries of your previous jobs to check that your configuration options are well suited to them.
Personal job history
You can check your personal job history (including resources requested vs. resources used) on the Personal Job History page. Values highlighted in red indicate that the job exceeded one of its requested resource limits and was killed as a result.
The key things to check here are:
- requested walltime vs. walltime used
- requested memory vs. memory used
- the exit status of the job.
A non-zero exit status means that your job produced an error, so it is important to check that your jobs exit with a status of zero.
Checking the statistics of an individual job
The qacct command can give useful resource usage information on completed jobs.
The qacct -j <jobid> command is the most useful for checking exit status, memory usage, queue time, submission command and walltime.
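As a rough illustration, the full qacct -j output can be reduced to a few key fields with awk. The function name job_summary is hypothetical, and the field names (exit_status, failed, ru_wallclock, maxvmem) are assumed from typical Grid Engine accounting output, so check them against what qacct prints on your system.

```shell
# Hypothetical helper: reads `qacct -j <jobid>` output on stdin and keeps
# only a handful of fields. The field names are assumptions based on
# typical Grid Engine accounting output and may differ on your system.
job_summary() {
    awk '
        $1 ~ /^(exit_status|failed|ru_wallclock|maxvmem)$/ {
            # Keep the field name, then print the rest of the line as its value
            name = $1
            $1 = ""
            sub(/^ +/, "")
            printf "%s=%s\n", name, $0
        }
    '
}
```

With the function defined in your shell, it could be used as: qacct -j <jobid> | job_summary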
RAM usage in qacct
The ru_maxrss field in the qacct command output displays the actual memory usage in GiB. We also provide the get_job_ram_usage -j <jobid> command to quickly see the real memory usage of a completed job.
You can also query qacct for jobs over a given period. For example, to display detailed output of every job run by user abc123 in the last 7 days:
qacct -d 7 -o abc123 -j
Omitting -j will give a summary of resources used:
$ qacct -d 7 -o abc123
OWNER    WALLCLOCK     UTIME    STIME      CPU    MEMORY     IO    IOW
======================================================================
abc123        8240  2844.621   49.198  2906.81  1352.935  2.356  0.380
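The WALLCLOCK column is assumed here to be in seconds, which can be awkward to read for long jobs. The sketch below converts it to hours; wallclock_hours is a hypothetical helper name, and it assumes the column order shown above, with OWNER first and WALLCLOCK second.

```shell
# Hypothetical helper: reads `qacct -d <days> -o <user>` summary output on
# stdin and prints each owner's total wallclock time in hours.
# Assumes OWNER is the first column and WALLCLOCK (in seconds) the second.
wallclock_hours() {
    awk '
        # Skip the header row, the ===== separator and any blank lines
        $1 != "OWNER" && $1 !~ /^=+$/ && NF >= 2 {
            printf "%s %.2f\n", $1, $2 / 3600
        }
    '
}
```

For example: qacct -d 7 -o abc123 | wallclock_hours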
Walltime is the length of time the job takes to execute. This does not include the time spent waiting in the job queue. If the job runs over the requested walltime, it will be killed by the scheduler.
Currently the maximum walltime allowed on the standard queues is 10 days. If you need more time than this, you will need to implement checkpointing in your code: saving the state of your job at regular intervals so that it can be restarted from the point at which it stopped.
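The general shape of checkpointing can be sketched in a few lines of shell. This is only an illustration under simplifying assumptions: the function name, the checkpoint file and the per-step work are hypothetical placeholders, and a real application would need to save whatever state it actually requires to resume.

```shell
# Minimal checkpoint/restart sketch. The function name, the checkpoint
# file and the "work" done per step are all hypothetical placeholders.
run_with_checkpoint() {
    checkpoint_file=$1
    total_steps=$2

    # Resume from the last completed step if a checkpoint exists
    step=0
    if [ -f "$checkpoint_file" ]; then
        step=$(cat "$checkpoint_file")
    fi

    while [ "$step" -lt "$total_steps" ]; do
        step=$((step + 1))
        # ... one unit of real work for this step would go here ...

        # Write to a temporary file first, then rename it, so that a job
        # killed mid-write does not leave a corrupt checkpoint behind
        echo "$step" > "$checkpoint_file.tmp"
        mv "$checkpoint_file.tmp" "$checkpoint_file"
    done
    echo "completed $step of $total_steps steps"
}
```

If the job is killed and resubmitted with the same checkpoint file, the loop picks up at the step after the last one recorded rather than starting again from zero.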
The 10-day limit allows for planned system maintenance and updates. It is a global setting, so exceptions cannot be made for individual jobs. National and Regional HPC clusters use much shorter walltimes, measured in hours.
Jobs running over their memory limit will be killed by the scheduler. The maximum limit is defined by the physical memory on a compute node.
If your job is killed for breaching the requested memory limit, it is important to understand why. If it is a job you have run before and it is now suddenly failing due to excessive memory usage, the most likely cause is a bug in the application. However, if it is a new job, it may require some tweaking to find the ideal memory value to request.
See the tuning page for assistance with finding the correct memory requirements for your job.
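When adjusting a memory request, it can help to see what fraction of the request a previous run actually used. The helper below is a hypothetical sketch that does only the arithmetic: it assumes you supply the used and requested values yourself, as plain numbers in the same unit.

```shell
# Hypothetical helper: reports what percentage of the requested memory a
# job actually used. Both arguments are plain numbers in the same unit
# (for example GiB); the function does no unit conversion.
mem_used_percent() {
    awk -v used="$1" -v requested="$2" 'BEGIN {
        if (requested <= 0) {
            print "requested memory must be positive"
            exit 1
        }
        printf "%.1f\n", 100 * used / requested
    }'
}
```

For example, mem_used_percent 1.5 4 reports the usage of a job that peaked at 1.5 GiB against a 4 GiB request. A result above 100 means the job used more than it requested.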
Job exit status
Cluster jobs which ran successfully will exit with code 0. Non-zero exit codes indicate there was a problem during execution and a command did not run successfully. A few common non-zero exit codes are listed below with their recommended action before job re-submission.
| Code | Error Description | Recommended Action |
|------|-------------------|--------------------|
| 1 | Application error | Miscellaneous errors, such as "divide by zero" and other impermissible operations. Check the job output file for errors, e.g. an invalid parameter. |
| 2 | Misuse of shell built-ins | Missing keyword or command, or permission problem. Check the job output file for errors, e.g. module load issues. |
| 126 | Command invoked cannot execute | You are trying to execute a command that cannot be executed. Check the output file for errors. |
| 127 | Command not found | You are trying to execute a command that cannot be found. Check the output file for errors. |
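For scripted checks across many jobs, the table above can be turned into a small lookup. This is a sketch only: describe_exit_code is a hypothetical helper name, and it covers just the codes listed here.

```shell
# Hypothetical helper: maps a job exit status to the short descriptions
# used in the table above. Codes not in the table fall through to a
# generic message.
describe_exit_code() {
    case "$1" in
        0)   echo "Success" ;;
        1)   echo "Application error" ;;
        2)   echo "Misuse of shell built-ins" ;;
        126) echo "Command invoked cannot execute" ;;
        127) echo "Command not found" ;;
        *)   echo "Unlisted exit code: check the job output file" ;;
    esac
}
```

For example, describe_exit_code 127 prints "Command not found".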
If you are unsure about an error or exit code, you may contact us for assistance.