Debugging your jobs¶
This page provides some general information on debugging jobs that are not submitting, running or completing. If you still cannot resolve the issue without assistance, please contact us, supplying all the relevant information.
Failure to submit¶
A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.
Check the following:
- The memory request has been adjusted for the number of cores requested
- Requested resources are reasonable values, e.g. h_rt is less than 10 days
- Scheduler parameters are specified as -l <param>=<value> with no spaces between parameter and value, e.g. -l h_rt=1:0:0
- You have permission to access the resources you're requesting, e.g. restricted nodes
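The checks above can be sketched as a minimal submission-script header. This is an illustrative example, not a recommendation: the resource values and the smp parallel environment are assumptions and will vary between sites.

```shell
#!/bin/bash
# Minimal submission-script sketch; resource values are illustrative.
#$ -l h_rt=1:0:0      # runtime well under the 10-day limit
#$ -l h_vmem=4G       # memory is requested per core
#$ -pe smp 4          # cores via a parallel environment
# Note: -l parameters take the form -l <param>=<value>, with no spaces.

msg="running on $(hostname) with ${NSLOTS:-1} slot(s)"
echo "$msg"
```

When run outside the scheduler the `#$` directives are ignored as comments, so the script can be syntax-checked locally before submission.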
A job can be verified with:
$ qsub -w v job_script
verification: found suitable queue(s)
Failure to run¶
Jobs may wait in the queue for a long period depending on the availability of resources; they may also have incorrect resource requests that prevent them from running. Queued jobs can be verified with:
$ qalter -w v job_id
verification: found suitable queue(s)
Failure to complete¶
A job may fail to complete for a number of reasons:
- lack of disk quota
- bad characters in the script
- insufficient resources requested, check the resource usage
Check the job output for the following:
- syntax errors in your script
- code failing to run and exiting with an error
- code failing to run because an expected file or directory did not exist
- permissions problem (can't read or write certain files)
- mismatch between the cores requested for the job and those used by the application. To avoid this, use $NSLOTS to pass the correct number of cores to the application.
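The $NSLOTS advice can be sketched as follows, assuming an OpenMP-style application; `my_app` is a hypothetical program name, and the fallback to 1 is only for running the script outside the scheduler.

```shell
#!/bin/bash
#$ -pe smp 8                     # example core request
# Take the core count from the scheduler ($NSLOTS) instead of
# hard-coding it, so the request and the application always agree.
nthreads=${NSLOTS:-1}            # falls back to 1 outside the scheduler
export OMP_NUM_THREADS=$nthreads
echo "would launch with $nthreads thread(s)"
# ./my_app --threads "$nthreads"   # my_app is a hypothetical application
```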
Job output files¶
Job output is contained in a number of output files, which should be examined closely when a job is failing.
By default the scheduler places output files in the current working directory
unless otherwise specified by the -o and -e options. The default file names
have the form <job_name>.o<job_id> for standard output and
<job_name>.e<job_id> for standard error. For array tasks these are appended
with the task id, e.g. <job_name>.o<job_id>.<task_id>. Jobs using a parallel
environment will also have <job_name>.po<job_id> and <job_name>.pe<job_id>
files. Jobs submitted with the
-j y option will have all output combined in the .o file.
Job exit status¶
All jobs should complete with an exit status of zero; even if the data from the job looks correct, a bad exit code may indicate issues. This can be checked by enabling email options in your submission script, or by checking the job statistics.
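The exit-status behaviour can be demonstrated with a short runnable sketch; `sh -c 'exit 3'` stands in for a real application that fails.

```shell
# A job script finishes with the exit status of its last command
# unless you save and re-raise it explicitly.
sh -c 'exit 3'    # stand-in for a failing application
status=$?
echo "application exited with status $status"
# In a real job script you would typically end with: exit "$status"
```

On Grid Engine systems the recorded exit status of a finished job can also be inspected after the fact with qacct -j <job_id>.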
Job in Eqw state¶
When a job dies due to errors, node crashes or similar issues, the job may be
placed in the Eqw (error) state, awaiting further action; this avoids running
hundreds of jobs that would inevitably crash.
If you have determined the cause of the error and think your job should now run
correctly the error state can be cleared using
qmod -cj <job-id>.
If you no longer require the job, you should delete it with qdel <job-id>
so that it does not remain in the queue indefinitely.
DOS / Windows newline Characters¶
DOS / Windows uses different characters from Unix to represent newlines in files. This can cause issues when a script has been written on a Windows machine and transferred to the cluster.
Incorrect newline characters can be detected with:
$ cat -v <script> | grep "\^M"
The file can then be fixed with:
$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
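The whole round trip can be demonstrated with a runnable sketch; tr is shown as a portable fallback for systems where dos2unix is not installed, and the file names are illustrative.

```shell
# Create a small script with DOS (CRLF) line endings, count the
# carriage returns, then strip them with tr.
printf 'echo hello\r\necho world\r\n' > crlf_script.sh
crlf_lines=$(grep -c $'\r' crlf_script.sh)    # number of affected lines
tr -d '\r' < crlf_script.sh > fixed_script.sh
fixed_lines=$(grep -c $'\r' fixed_script.sh)  # 0 after conversion
echo "before: $crlf_lines line(s) with CR, after: $fixed_lines"
```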
Core dumps¶
Core dumps may be useful for debugging applications. By default core dumps
inside jobs are disabled; they can be enabled via a scheduler parameter.
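The limit that governs core dumps can be inspected from inside a job with ulimit; this is a sketch only — the exact scheduler parameter for raising it is site-specific (on Grid Engine systems it is commonly a hard resource limit such as h_core, but check your site's documentation).

```shell
# Show the core-file size limit in effect for the current shell;
# inside a job this reflects whatever limit the scheduler applied.
core_limit=$(ulimit -c)
echo "core file size limit: $core_limit"
# 'unlimited' or a nonzero size enables dumps; 0 disables them.
```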