Debugging your jobs¶
This page provides some general information on debugging jobs that are not submitting, running or completing. If you still cannot resolve the issue without assistance, please contact us, supplying all the relevant information.
Failure to submit¶
A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.
Check the following:
- The memory request has been adjusted for cores
- Requested resources are a reasonable value, e.g. h_rt is less than 10 days
- Scheduler parameters are specified as -l <param>=<value>, with no spaces between parameter and value
- You have permission to access the resources you're requesting, e.g. restricted nodes
A job can be verified with:
$ qsub -w v job_script
verification: found suitable queue(s)
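As a quick local check before submitting, the "no spaces" rule from the checklist above can be tested with a small shell function. This is only a sketch: check_resource_syntax is a hypothetical helper name, and it assumes resource requests are embedded in the job script as "#$ -l" directive lines. It does not replace the scheduler's own verification.

```shell
# Hypothetical helper: report "#$ -l" lines that have spaces around "=".
# Assumes resource requests are embedded in the script as "#$ -l ..." lines.
check_resource_syntax() {
    script="$1"
    # Match directive lines where "=" is preceded or followed by whitespace
    if grep -n -E '^#\$[[:space:]]+-l[[:space:]].*([[:space:]]=|=[[:space:]])' "$script"; then
        echo "Found spaces around '=' in the lines above" >&2
        return 1
    fi
    return 0
}
```

Calling check_resource_syntax job_script prints any offending lines with their line numbers and returns a non-zero status, so it can be chained in front of qsub.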
Failure to run¶
Jobs may wait in the queue for a long period depending on the availability of resources. They may also have incorrect resource requests that prevent them from running. Queued jobs can be verified with:
$ qalter -w v job_id
verification: found suitable queue(s)
Failure to complete¶
A job may fail to complete for a number of reasons:
- lack of disk quota
- bad characters in the script
- insufficient resources requested, check the resource usage
Check the job output for the following:
- syntax errors in your script
- code failing to run and exiting with an error
- code failing to run because an expected file or directory did not exist
- permissions problem (can't read or write certain files)
- mismatch between the cores requested for the job and those used by the application. To avoid this, use $NSLOTS to provide the correct number of cores to the application.
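A minimal sketch of passing the scheduler's core count to an application is shown below. It assumes an OpenMP application that reads OMP_NUM_THREADS; job_cores is a hypothetical helper name, and the fallback to 1 core when NSLOTS is unset is an assumption made so the snippet can also be run outside a job.

```shell
# NSLOTS is set by the scheduler inside a job. Outside a job it is unset,
# so fall back to 1 core (the fallback is an assumption for local testing).
job_cores() {
    echo "${NSLOTS:-1}"
}

# OMP_NUM_THREADS is read by OpenMP applications to set their thread count
export OMP_NUM_THREADS="$(job_cores)"
echo "Using ${OMP_NUM_THREADS} core(s)"
```

Applications that take the core count as a command line option can be given "$(job_cores)" or $NSLOTS directly instead.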
Job output files¶
Job output is contained in a number of output files, which should be examined closely when a job is failing.
By default the scheduler places output files in the current working directory
unless otherwise specified by the -o or -e option.
The default file names have the form <job_name>.o<job_id> and
<job_name>.e<job_id>. For array tasks the task id is appended, e.g.
<job_name>.o<job_id>.<task_id>. Jobs using a parallel environment will
also have a <job_name>.po<job_id> file.
Jobs using the -j y option will have all output in the
<job_name>.o<job_id> file.
Job exit status¶
All jobs should complete with an exit status of zero. Even if the data from the job looks correct, a bad exit code may indicate issues. The exit status can be checked by enabling email options in your submission script, or by checking the job statistics.
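Inside a job script, it can help to report the exit status of each command as it runs, so that a failure is visible in the job output and the script itself exits non-zero. The following is a sketch only: run_step is a hypothetical wrapper name, and true stands in for a real application command.

```shell
# Hypothetical wrapper: run a command and report a non-zero exit status.
run_step() {
    "$@"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "Command '$*' failed with exit status $status" >&2
    fi
    return "$status"
}

# Example: 'true' stands in for a real application command.
# Stop the script with a non-zero status as soon as a step fails.
run_step true || exit 1
```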
When a job dies due to errors, node crashes or similar issues, it may be
placed in the Eqw state, awaiting further action, to avoid running hundreds of
jobs that will inevitably crash.
If you have determined the cause of the error and think your job should now run
correctly the error state can be cleared using
qmod -cj <job-id>.
If you no longer require the job, you should delete it with the qdel command
so that it does not remain in the queue indefinitely.
DOS / Windows newline Characters¶
DOS / Windows uses different characters from Unix to represent newlines in
files. Windows/DOS uses carriage return and line feed (\r\n) as a line ending,
while Unix systems just use line feed (\n). This can cause issues when a script
has been written on a Windows machine and transferred to the cluster.
Typical errors include the following:
line 10: $'\r': command not found
ERROR:105: Unable to locate a modulefile for 'busco/3.0 '
The carriage return before the close quote indicates presence of DOS / Windows newline characters, which can be detected with:
cat -v <script> | grep "\^M"
The file can then be fixed with:
$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
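Where dos2unix is not available, the same check and fix can be approximated with standard tools. This is a sketch under stated assumptions: has_crlf and strip_cr are hypothetical helper names, and strip_cr removes every carriage return in the file rather than only those at line ends, which is normally harmless for scripts.

```shell
# Return 0 if the file contains any carriage return characters.
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# Write a copy of the file with all carriage returns removed.
# Note: unlike dos2unix this removes every \r, not only those at line ends.
strip_cr() {
    tr -d '\r' < "$1" > "$2"
}
```

For example, has_crlf job_script && strip_cr job_script job_script.unix writes a cleaned copy only when carriage returns are present.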
Core dumps may be useful for debugging applications. By default core dumps
inside jobs are disabled, these can be enabled via the scheduler parameter
Remember that you may need to adjust this value depending on how much RAM
you have requested, keeping in mind that
h_vmem is per core.
Parallel applications may enter a state where each process is waiting for another process to send a message or release a lock on a file or resource. This results in the application ceasing to run as it waits for resources to become available, known as a deadlock. The only solution to a deadlock is adjusting the code to prevent it occurring in the first place.
Detecting deadlocks with Intel MPI¶
The deadlock condition can be detected using Intel Trace Collector. Compiling
the program with -profile=vtfs will enable this. When a deadlock occurs, the
output will look like this:
Intel(R) Trace Collector ERROR: no progress observed in any process for over 1:04 minutes, aborting application
Intel(R) Trace Collector WARNING: starting emergency trace file writing
More information is available in the Intel user guide
Monitoring jobs on nodes¶
Jobs can be monitored directly on the nodes for deeper debugging. Note that you are only permitted to ssh to the nodes for this purpose.
You can see all your processes on a node using
ssh <node> -t top -u $USER
This can also be filtered to show specific jobs or tasks:
- Press f to open the fields display
- Use the down arrow to navigate to the field to filter on
- Press space to select the field
- Press q to leave the fields display
- Press o to open the filter
- Enter the filter, where <JID> is your job id or job and task id
Only the processes that are part of that job are now displayed.
strace is a tool that lists the system calls a process makes, which allows you
to see what a process is doing. This can be useful for identifying deadlocked
processes.
strace can either invoke the command to trace or be attached to a running
process:
# Run the command `hostname` and trace
strace hostname
# Trace the currently running process 1234
strace -p 1234
Common useful arguments to strace:
- -f: Trace forked processes
- -t: Prefix each output line with a timestamp
- -v: Full versions of common calls
- -s <size>: Specify the maximum string size to print (the default is 32)
Things to look out for that suggest a deadlock are:
- A continuous stream of
- A continuous stream of