Debugging your jobs

This page provides some general information on debugging jobs that are not submitting, running or completing. If you still cannot resolve the issue without assistance, please contact us, supplying all the relevant information.

Failure to submit

A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.

Check the following:

  • The memory request has been adjusted for cores
  • Requested resources have reasonable values, e.g. h_rt is less than 10 days
  • Scheduler parameters are specified as -l <param>=<value> with no spaces between parameter and value, e.g. -l h_vmem=1G
  • You have permission to access the resources you're requesting, e.g. restricted nodes
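The spacing rule for -l requests can be checked before submitting. The sketch below is a hypothetical helper (the name check_l_syntax is not part of the scheduler) that prints any "#$" directive line in a job script where a -l request has spaces around the "=" sign. It assumes directives are embedded in the script with the usual "#$" prefix.

```shell
# Hypothetical helper: print "#$" directive lines whose -l request
# has whitespace on either side of '='. Prints nothing (and returns
# non-zero) when no such line is found.
check_l_syntax() {
    grep -nE '^#\$.*-l[[:space:]]+[A-Za-z_]+([[:space:]]+=|=[[:space:]]+)' "$1"
}
```

For example, check_l_syntax job_script prints the line number and text of each offending directive, such as "#$ -l h_vmem = 1G".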

A job can be verified with:

$ qsub -w v job_script
verification: found suitable queue(s)

More information is available on the qsub man page.

Failure to run

Jobs may wait in the queue for a long period, depending on the availability of resources. They may also have incorrect resource requests that prevent them from running. Queued jobs can be verified with:

$ qalter -w v job_id
verification: found suitable queue(s)

See the qalter man page for more information.

Failure to complete

A job may fail to complete for a number of reasons.

Check the job output for the following:

  • syntax errors in your script
  • code failing to run and exiting with an error
  • code failing to run because an expected file or directory did not exist
  • permissions problem (can't read or write certain files)
  • mismatch between the cores requested for the job and those used by the application. To avoid this, use $NSLOTS to pass the correct number of cores to the application.
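As a sketch of the $NSLOTS advice above, the fragment below reads the core count granted by the scheduler and passes it to an application. The function name job_cores is hypothetical, and the use of OMP_NUM_THREADS assumes an OpenMP application; other software may take the core count as a command-line option instead.

```shell
# Print the number of cores granted to the job.
# NSLOTS is set by the scheduler; fall back to 1 outside a job.
job_cores() {
    echo "${NSLOTS:-1}"
}

# Assumption: the application is OpenMP-based and reads this variable
export OMP_NUM_THREADS="$(job_cores)"
```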

If you're using software that needs a license, such as ANSYS, MATLAB or the Intel compiler, check that a license was obtained successfully.

Job output files

Job output is contained in a number of output files, which should be examined closely when a job is failing.

By default the scheduler places output files in the current working directory unless otherwise specified by the -e or -o option.

The default file names have the form <job_name>.o<job_id> and <job_name>.e<job_id>. For array tasks, the task id is appended, e.g. <job_name>.o<job_id>.<task_id>. Jobs using a parallel environment will also have a <job_name>.po<job_id>.<task_id> file.

Jobs using the -j y option will have all output in the <job_name>.o<job_id> file.
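The naming scheme above can be expressed as a small helper, which may be handy when collecting output from many jobs. This is only a sketch: the function name job_output_files is hypothetical, and it covers just the default .o and .e files without -j y.

```shell
# Print the default stdout and stderr file names for a job.
# Usage: job_output_files <job_name> <job_id> [task_id]
job_output_files() {
    local name="$1" id="$2" task="${3:+.$3}"   # ".<task_id>" only if given
    echo "${name}.o${id}${task}"
    echo "${name}.e${id}${task}"
}
```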

Job exit status

All jobs should complete with an exit status of zero. Even if the data from the job looks correct, a non-zero exit code may indicate issues. The exit status can be checked by enabling email options in your submission script, or by checking the job statistics.
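One way to check the job statistics is the Grid Engine accounting command qacct, assuming it is available on your cluster and the job has finished. The sketch below uses a hypothetical filter, exit_statuses, that extracts the exit_status field from qacct -j output supplied on standard input.

```shell
# Read `qacct -j <job_id>` output on stdin and print every exit_status value
# (array jobs report one value per task).
exit_statuses() {
    awk '$1 == "exit_status" { print $2 }'
}

# On the cluster (not run here): qacct -j <job_id> | exit_statuses
```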

Job in Eqw state

When a job dies due to errors, node crashes or similar issues, it may be placed in the Eqw state to await further action. This avoids running hundreds of jobs that will inevitably crash.

If you have determined the cause of the error and think your job should now run correctly, the error state can be cleared using qmod -cj <job-id>.

If you no longer require the job, you should delete it with the qdel command, so that it does not remain in the queue indefinitely.
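To find which of your jobs are in the Eqw state, the qstat listing can be filtered. The sketch below assumes the default qstat column layout, in which the job id is the first column and the state is the fifth; the function name eqw_jobs is hypothetical.

```shell
# Read `qstat` output on stdin and print the ids of jobs in the Eqw state.
# Assumes the default layout: job-ID prior name user state ...
eqw_jobs() {
    awk '$5 == "Eqw" { print $1 }'
}

# On the cluster (not run here): qstat -u "$USER" | eqw_jobs
```

Each id printed can then be passed to qmod -cj or qdel as described above.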

DOS / Windows newline Characters

DOS / Windows uses different characters from Unix to represent newlines in files. Windows/DOS uses carriage return and line feed (\r\n) as a line ending, while Unix systems just use line feed (\n). This can cause issues when a script has been written on a Windows machine and transferred to the cluster.

Typical errors include the following:

line 10: $'\r': command not found
ERROR:105: Unable to locate a modulefile for 'busco/3.0
'

The carriage return before the closing quote indicates the presence of DOS / Windows newline characters, which can be detected with:

cat -v <script> | grep "\^M"

The file can then be fixed with:

$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
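If dos2unix is not available, the same check and fix can be sketched with standard tools. The function names has_crlf and strip_cr are hypothetical; note that strip_cr removes every carriage return in the file, not only those at line ends, which is normally what is wanted for a script.

```shell
# Succeed if the file contains at least one carriage return character
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Print the file to stdout with all carriage returns removed
strip_cr() {
    tr -d '\r' < "$1"
}
```

For example, has_crlf job_script && strip_cr job_script > job_script.unix writes a converted copy only when conversion is needed.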

Core Dumps

Core dumps may be useful for debugging applications. By default, core dumps inside jobs are disabled. They can be enabled via the scheduler parameter -l h_core=1

Deadlocks

Parallel applications may enter a state where each process is waiting for another process to send a message or release a lock on a file or resource. This results in the application ceasing to run as it waits for resources to become available, known as a deadlock. The only solution to a deadlock is adjusting the code to prevent it occurring in the first place.

Detecting deadlocks with Intel MPI

The deadlock condition can be detected using Intel Trace Collector, which is enabled by compiling the program with -profile=vtfs. When a deadlock occurs, the output will look like this:

[0] Intel(R) Trace Collector ERROR: no progress observed in any process for over 1:04 minutes, aborting application
[0] Intel(R) Trace Collector WARNING: starting emergency trace file writing

More information is available in the Intel user guide.

Monitoring jobs on nodes

Jobs can be monitored directly on the nodes for deeper debugging. Note that ssh access to the nodes is permitted only for this purpose.

Using Top

You can see all your processes on a node using top:

ssh <node> -t top -u $USER

This can also be filtered to show specific jobs or tasks:

  1. Press f to open the fields display
  2. Use the up and down arrows to navigate to CGROUPS
  3. Press space to select the CGROUPS field
  4. Press q to leave the fields display
  5. Press o to open the filter
  6. Type CGROUPS=<JID> where <JID> is your job id or job and task id, e.g. 254210 or 254210.1
  7. Press Enter

Only the processes that are part of that job are now displayed.

Using strace

strace is a tool that lists the system calls a process makes, which allows you to see what the process is doing. This can be useful for identifying deadlocked processes.

strace can either invoke the command to trace or be attached to a running process:

# Run the command `hostname` and trace
strace hostname
# Trace the currently running process 1234
strace -p 1234

Common useful arguments to strace are:

  • -f Trace forked processes
  • -t Prefix each output line with a timestamp
  • -v Full versions of common calls
  • -s <size> Specify the maximum string size to print (the default is 32).

Things to look out for that suggest a deadlock are:

  • A continuous stream of poll resulting in Timeout
  • A continuous stream of sched_yield
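When the trace is long, counting these calls in a saved log can be easier than watching the stream. The sketch below assumes the trace was written to a file, for example with strace's -o option, and uses a hypothetical helper named strace_wait_summary. High counts are only a hint of a deadlock, not proof.

```shell
# Count calls in a saved strace log that may hint at a deadlock.
# Usage: strace_wait_summary <logfile>
strace_wait_summary() {
    local log="$1"
    # poll calls that returned "(Timeout)"
    echo "poll timeouts: $(grep -c 'poll(.*(Timeout)' "$log")"
    # every sched_yield call
    echo "sched_yield calls: $(grep -c 'sched_yield(' "$log")"
}
```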