Debugging your jobs¶

This page provides some general information on debugging jobs that are not submitting, running or completing. If you still cannot resolve the issue without assistance, please contact us, supplying all the relevant information.

Failure to submit¶

A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.

Check the following:

The memory request has been adjusted for the number of cores requested
Requested resources have reasonable values, e.g. h_rt is less than 10 days
Scheduler parameters are specified as -l <param>=<value> with no spaces between parameter and value, e.g. -l h_vmem=1G
You have permission to access the resources you're requesting, e.g. restricted nodes
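As a rough illustration of adjusting the memory request for cores: h_vmem is a per-core value, so the figure passed to the scheduler is the job's total memory divided by the number of cores. The sketch below uses hypothetical values (8 GB in total across 4 cores) and rounds up to whole gigabytes.

```shell
#!/bin/sh
# Hypothetical requirements: replace with your own job's figures.
total_mem_gb=8   # memory the whole job needs, in GB
cores=4          # number of cores the job will request

# h_vmem is per core, so divide the total by the core count (rounding up).
h_vmem_gb=$(( (total_mem_gb + cores - 1) / cores ))

echo "-l h_vmem=${h_vmem_gb}G"
```

With these values the script prints -l h_vmem=2G.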

A job can be verified with:

$ qsub -w v job_script
verification: found suitable queue(s)

Failure to run¶

Jobs may wait in the queue for a long period depending on the availability of resources. They may also have incorrect resource requests that prevent them from running. Queued jobs can be verified with:

$ qalter -w v job_id
verification: found suitable queue(s)

Failure to complete¶

A job may fail to complete for a number of reasons:

lack of disk quota
bad characters in the script
insufficient resources requested (check the job's resource usage)

Check the job output for the following:

syntax errors in your script
code failing to run and exiting with an error
code failing to run because an expected file or directory did not exist
permissions problem (can't read or write certain files)
mismatch between the cores requested for the job and the cores used by the application. To avoid this, use $NSLOTS to pass the correct number of cores to the application.
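The last point can be sketched as follows. This is a minimal example, assuming a threaded application that honours the standard OpenMP variable OMP_NUM_THREADS; the application name my_app and its --threads flag are hypothetical. Outside a job $NSLOTS is unset, so the sketch falls back to one core.

```shell
#!/bin/sh
# NSLOTS is set by the scheduler inside a job; fall back to 1 elsewhere.
cores="${NSLOTS:-1}"

# Many OpenMP-based applications read their thread count from this variable.
export OMP_NUM_THREADS="$cores"

echo "Running with $cores core(s)"

# Hypothetical application and flag, shown for illustration only:
#   ./my_app --threads "$cores"
```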

If you're using software that needs a license, such as ANSYS or MATLAB, check that the license was obtained successfully.

Job output files¶

Job output is contained in a number of output files, which should be examined closely when a job is failing.

By default the scheduler places output files in the current working directory unless otherwise specified by the -e or -o option.

The default file names have the form <job_name>.o<job_id> and <job_name>.e<job_id>. For array tasks, the task id is appended, e.g. <job_name>.o<job_id>.<task_id>. Jobs using a parallel environment will also have a <job_name>.po<job_id>.<task_id> file.

Jobs using the -j y option will have all output in the <job_name>.o<job_id> file.
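To make the naming scheme concrete, the sketch below builds the default file names for a hypothetical job (name myjob, job id 254210, task id 1) and reports which of them exist and are non-empty in the current directory. All three values are placeholders.

```shell
#!/bin/sh
# Hypothetical values: substitute your own job name and ids.
job_name="myjob"
job_id=254210
task_id=1

# Default names for a non-array job.
stdout_file="${job_name}.o${job_id}"
stderr_file="${job_name}.e${job_id}"

# Array tasks have the task id appended.
task_stdout_file="${stdout_file}.${task_id}"
task_stderr_file="${stderr_file}.${task_id}"

# Report which files are present and contain output.
for f in "$stdout_file" "$stderr_file" "$task_stdout_file" "$task_stderr_file"; do
    if [ -s "$f" ]; then
        echo "$f: has content"
    else
        echo "$f: missing or empty"
    fi
done
```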

Job exit status¶

All jobs should complete with an exit status of zero. Even if the data from the job looks correct, a non-zero exit status may indicate issues. The exit status can be checked by enabling email options in your submission script, or by checking the job statistics.

Exit Status

Sometimes you can see an exit status of 0 even though your job failed. The two main causes are: a subsequent command exits successfully (for example exit or echo "finished"), or your job contains sub-processes and the main process is not alerted when a sub-process fails.
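Both causes can be reproduced in a few lines of shell. The sketch below is illustrative only: false stands in for a failing step, and set -e and wait are one possible way to surface the failure, not the only one.

```shell
#!/bin/sh
# Cause 1: the real work fails, but a later command succeeds and masks it.
masked_status=0
sh -c 'false; echo "finished"' > /dev/null || masked_status=$?

# With "set -e" the shell stops at the first failing command instead.
strict_status=0
sh -c 'set -e; false; echo "finished"' > /dev/null || strict_status=$?

# Cause 2: a background sub-process fails; its status is only seen
# if the parent waits on it explicitly.
child_status=0
sh -c 'exit 3' &
wait "$!" || child_status=$?

echo "masked=$masked_status strict=$strict_status child=$child_status"
```

The script prints masked=0 strict=1 child=3: the first run hides the failure, while the other two report it.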

Job in `Eqw` state¶

When a job dies due to errors, node crashes or similar issues, it may be placed in the Eqw state to await further action. This avoids running hundreds of jobs that would inevitably crash.

If you have determined the cause of the error and think your job should now run correctly, the error state can be cleared using qmod -cj <job-id>.

If you no longer require the job, you should delete it with the qdel command, so that it does not remain in the queue indefinitely.
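Before clearing the error state or deleting the job, it can help to see why the scheduler flagged it. The sketch below assumes a Grid Engine style qstat whose -j output includes "error reason" lines for jobs in Eqw; the helper name show_error_reason is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the scheduler's recorded error reason for a job.
# Assumes `qstat -j <job-id>` lists "error reason" lines for jobs in Eqw.
show_error_reason() {
    qstat -j "$1" | grep -i 'error reason'
}

# Usage (on the cluster):
#   show_error_reason <job-id>
#   qmod -cj <job-id>    # clear the error state once the cause is fixed
#   qdel <job-id>        # or delete the job if it is no longer needed
```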

DOS / Windows newline Characters¶

DOS / Windows uses different characters from Unix to represent newlines in files. Windows/DOS uses carriage return and line feed (\r\n) as a line ending, while Unix systems just use line feed (\n). This can cause issues when a script has been written on a Windows machine and transferred to the cluster.

Typical errors include the following:

line 10: $'\r': command not found

ERROR:105: Unable to locate a modulefile for 'busco/3.0
'

The carriage return before the closing quote indicates the presence of DOS / Windows newline characters, which can be detected with:

cat -v <script> | grep "\^M"

The file can then be fixed with:

$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
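If dos2unix is not available, the same check and fix can be approximated with standard tools. This is a sketch rather than a drop-in replacement: it builds its own sample file for illustration, and tr removes every carriage return in the file, not only those at line ends.

```shell
#!/bin/sh
# Build a small sample script with DOS / Windows line endings (illustration only).
script=$(mktemp)
printf 'echo one\r\necho two\r\n' > "$script"

# A literal carriage return character to search for.
cr=$(printf '\r')

# Count the lines that contain a carriage return.
crlf_before=$(grep -c "$cr" "$script" || true)
echo "lines containing CR before: $crlf_before"

# Strip every carriage return, writing the result to a new file.
fixed=$(mktemp)
tr -d '\r' < "$script" > "$fixed"

crlf_after=$(grep -c "$cr" "$fixed" || true)
echo "lines containing CR after: $crlf_after"

rm -f "$script" "$fixed"
```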

Core Dumps¶

Core dumps may be useful for debugging applications. By default, core dumps inside jobs are disabled; they can be enabled with the scheduler parameter -l h_core=1G.

Remember that you may need to adjust this value depending on how much RAM you have requested, keeping in mind that h_vmem is per core.
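Because h_vmem is per core, the total memory a multi-core job may use is h_vmem multiplied by the number of cores, which gives a rough upper bound on how large a core file could grow. The sketch below uses hypothetical values; whether h_core needs to cover the full total depends on your application.

```shell
#!/bin/sh
# Hypothetical request: 2 GB per core across 4 cores.
h_vmem_gb=2
cores=4

# Total memory available to the job, a rough upper bound for a core file.
total_mem_gb=$(( h_vmem_gb * cores ))

echo "Total job memory: ${total_mem_gb}G"
echo "A core limit covering all of it would be: -l h_core=${total_mem_gb}G"

# Inside a job, the shell's current core file limit can be inspected with:
#   ulimit -c
```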

Deadlocks¶

Parallel applications may enter a state where each process is waiting for another process to send a message or release a lock on a file or resource. This results in the application ceasing to run as it waits for resources to become available, known as a deadlock. The only solution to a deadlock is adjusting the code to prevent it occurring in the first place.

Monitoring jobs on nodes¶

Jobs can be monitored directly on the nodes for deeper debugging. Note that you are only permitted to ssh to the nodes for this purpose.

Using `top`¶

You can see all your processes on a node using top:

ssh <node> -t top -u $USER

This can also be filtered to show specific jobs or tasks:

Press f to open the fields display
Use the up and down arrows to navigate to CGROUPS
Press space to select the CGROUPS field
Press q to leave the fields display
Press o to open the filter
Type CGROUPS=<JID> where <JID> is your job id or job and task id, e.g. 254210 or 254210.1
Press Enter

Only processes that are part of that job are now displayed.

Using `strace`¶

strace is a tool that lists the system calls a process makes, which lets you see what the process is doing. This can be useful for identifying deadlocked processes.

strace can either invoke the command to trace or be attached to a running process:

# Run the command `hostname` and trace
strace hostname
# Trace the currently running process 1234
strace -p 1234

Common useful arguments to strace are:

-f Trace forked processes
-t Prefix each output line with a timestamp
-v Full versions of common calls
-s <size> Specify the maximum string size to print (the default is 32).

Things to look out for that suggest a deadlock are:

A continuous stream of poll resulting in Timeout

A continuous stream of sched_yield
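When the trace is long, counting these patterns in a saved log can be easier than watching the output scroll past. A minimal sketch, assuming the trace was written to a file with strace's -o option; the helper name count_deadlock_hints and the file name trace.log are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: count deadlock hints in a saved strace log.
# $1 is a file written by strace, e.g. with its -o option.
count_deadlock_hints() {
    yields=$(grep -c 'sched_yield' "$1" || true)
    timeouts=$(grep -c 'poll(.*(Timeout)' "$1" || true)
    echo "sched_yield=$yields poll_timeouts=$timeouts"
}

# Usage (process id 1234 is a placeholder):
#   strace -f -t -o trace.log -p 1234    # stop with Ctrl-C after a few seconds
#   count_deadlock_hints trace.log
```

High counts over a short trace suggest a process that is spinning or waiting rather than doing useful work, but they are only a hint, not proof of a deadlock.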