Debugging your jobs

This page provides general information on debugging jobs that fail to submit, run or complete. If you still cannot resolve the issue, please contact us, supplying all the relevant information.

Failure to submit

A job may be rejected by the scheduler and fail to submit if incorrect scheduler parameters are provided.

Check the following:

  • The memory request has been adjusted for the number of cores requested
  • Requested resources have reasonable values, e.g. h_rt is less than 10 days
  • Scheduler parameters are specified as -l <param>=<value> with no spaces between parameter and value, e.g. -l h_vmem=1G
  • You have permission to access the resources you're requesting, e.g. restricted nodes
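As a sketch, a submission that follows the points above might look like this, on systems where h_vmem is requested per core (an assumption — check your site's policy), so 4 slots at 2G each covers an 8G job:

```shell
# Request 4 cores, 2G of memory per core and 24 hours of runtime.
# h_vmem being a per-core limit is an assumption -- policies vary by site.
$ qsub -pe smp 4 -l h_vmem=2G -l h_rt=24:0:0 job_script
```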

A job can be verified with:

$ qsub -w v job_script
verification: found suitable queue(s)

More information is available on the qsub man page.

Failure to run

Jobs may wait in the queue for a long period depending on the availability of resources; they may also have incorrect resource requests that prevent them from ever running. Queued jobs can be verified with:

$ qalter -w v job_id
verification: found suitable queue(s)

See the qalter man page for more information.
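If a job stays queued, the scheduler's own reasoning can usually be inspected as well; on Grid Engine-style schedulers this appears in the "scheduling info" section of:

```shell
# Show full details for a pending job, including the scheduler's
# explanation of why it has not yet been dispatched.
$ qstat -j job_id
```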

Failure to complete

A job may fail to complete for a number of reasons.

Check the job output for the following:

  • syntax errors in your script
  • code failing to run and exiting with an error
  • code failing to run because an expected file or directory did not exist
  • permissions problem (can't read or write certain files)
  • mismatch between cores requested for the job, and used by the application. To avoid this you should use $NSLOTS to provide the correct number of cores to the application.
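As a sketch, a job script can pass the scheduler's slot count straight to the application so the two always match (my_app and its --threads flag are placeholders for your own program):

```shell
#!/bin/bash
#$ -pe smp 4      # request 4 cores
#$ -l h_rt=1:0:0  # and one hour of runtime

# $NSLOTS is set by the scheduler to the number of slots actually granted,
# so the application never uses more cores than were requested.
./my_app --threads "${NSLOTS:-1}"
```

The `${NSLOTS:-1}` expansion falls back to a single core if the script is run outside the scheduler.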

If you're using software which needs a license, such as ANSYS, MATLAB or the Intel compiler, check that a license was obtained successfully.
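License failures usually leave a message in the job's error file, so a case-insensitive search is often enough to spot them (the file name below is only an example):

```shell
# Look for license-related messages in the job's error output
$ grep -i "license" my_job.e12345
```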

Job output files

Job output is contained in a number of output files, which should be examined closely when a job is failing.

By default the scheduler places output files in the current working directory unless otherwise specified by the -e or -o option.

The default file names have the form <job_name>.o<job_id> and <job_name>.e<job_id>. For array tasks the task id is appended, e.g. <job_name>.o<job_id>.<task_id>. Jobs using a parallel environment will also have a <job_name>.po<job_id>.<task_id> file.

Jobs using the -j y option will have all output in the <job_name>.o<job_id> file.
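These options can also be set in the job script itself; a minimal sketch for a Grid Engine-style scheduler (the job name and log directory are examples):

```shell
#!/bin/bash
#$ -N my_job   # job name, giving output files my_job.o<job_id> etc.
#$ -j y        # merge the error stream into the .o file
#$ -o logs/    # write output files into logs/ rather than the current directory

echo "Running as job $JOB_ID"
```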

Job exit status

All jobs should complete with an exit status of zero; even if the data from the job looks correct, a non-zero exit code may indicate issues. This can be checked by enabling email notifications in your submission script, or by checking the job statistics.
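On Grid Engine-style schedulers the exit status of a finished job is also recorded in the accounting data, so it can be checked after the fact, for example:

```shell
# Print the recorded exit status for a completed job;
# anything other than 0 suggests the job did not finish cleanly.
$ qacct -j job_id | grep exit_status
```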

Job in Eqw state

When a job dies due to errors, node crashes or similar issues, it may be placed in the Eqw (error) state and held, awaiting further action; this avoids running hundreds of jobs that would inevitably crash.

If you have determined the cause of the error and think your job should now run correctly the error state can be cleared using qmod -cj <job-id>.

If you no longer require the job, you should delete it with the qdel command, so that it does not remain in the queue indefinitely.
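A typical sequence for dealing with an Eqw job is to read the recorded error, then either clear the state or delete the job (job_id is a placeholder):

```shell
# Show the reason the scheduler recorded for the error state
$ qstat -j job_id | grep -i error

# If the cause has been fixed, clear the error state so the job can run
$ qmod -cj job_id

# Or, if the job is no longer needed, remove it from the queue
$ qdel job_id
```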

DOS / Windows newline characters

DOS / Windows uses different characters from Unix to represent newlines in files. This can cause issues when a script has been written on a Windows machine and transferred to the cluster.

Incorrect newline characters can be detected with:

$ cat -v <script> | grep "\^M"

The file can then be fixed with:

$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...

Core dumps

Core dumps may be useful for debugging applications. By default, core dumps inside jobs are disabled; they can be enabled via the scheduler parameter -l h_core=1.
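Once a job has produced a core file, it can be examined with a debugger; a sketch using gdb (the binary and core file names are examples):

```shell
# Resubmit with core dumps enabled, then inspect the resulting core file
$ qsub -l h_core=1 job_script
$ gdb ./my_app core.12345
```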