Debugging your jobs


Failure to submit

You may have chosen incorrect scheduler settings, in which case the scheduler rejects the request.

Make sure to check:

  • Your memory request has been adjusted for cores
  • You have selected a reasonable h_rt and number of cores
  • There are no errors in your job script

You can verify a job with:

$ qsub -w v job_script
verification: found suitable queue(s)

Or a queued job with:

$ qalter -w v job_id
verification: found suitable queue(s)

More information is available on the qsub and qalter man pages.
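The checks above can be combined into a small pre-submission helper. The following is a minimal sketch, not a site-provided tool: the function name presubmit_check is hypothetical, and it assumes the job script is written for bash and that qsub is on your PATH. It first asks the shell to parse the script without running it, then asks the scheduler to verify the request with the -w v option and looks for the "found suitable queue(s)" message shown above.

```shell
# presubmit_check is a hypothetical helper name, not part of the scheduler.
# Usage: presubmit_check job_script
presubmit_check() {
    # Step 1: parse the script without executing it to catch syntax errors.
    # Assumes a bash job script; use the shell your script is written for.
    if ! bash -n "$1"; then
        echo "syntax check failed: $1" >&2
        return 1
    fi
    # Step 2: ask the scheduler to validate the request without submitting it.
    verify_output=$(qsub -w v "$1" 2>&1)
    echo "$verify_output"
    case $verify_output in
        *"found suitable queue"*) return 0 ;;
        *) echo "scheduler verification failed: $1" >&2
           return 2 ;;
    esac
}
```

A return status of 1 points at the script itself, while 2 points at the resource request (for example memory, h_rt or number of cores).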


Failure to run

Your job may fail for a number of reasons. Check the job output for the following:

  • syntax errors in your script
  • the code run by your job exits with an error
  • the code failed to run because an expected file or directory did not exist
  • permissions problem (can't read or write certain files)

If you're using software that needs a license, such as ANSYS, MATLAB or the Intel compiler, check that the license was obtained successfully.
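As a rough illustration of checking job output for the problems listed above, the sketch below searches an output file for a few common error messages. The function name scan_job_output is hypothetical, and the patterns are typical shell and system messages chosen for illustration; they are not exhaustive, and license failures are not covered because the wording depends on the software.

```shell
# scan_job_output is a hypothetical helper; the patterns are common shell and
# system error messages, not an exhaustive list.
# Usage: scan_job_output output_file
scan_job_output() {
    # -n prints line numbers, -i ignores case, -E enables alternation.
    grep -n -i -E \
        'syntax error|command not found|no such file or directory|permission denied' \
        "$1"
}
```

The function prints each matching line with its line number and returns a non-zero status when nothing matches, so a silent result only means that none of these particular messages were found.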


Job output files

If the job is scheduled and runs but dies instantly, the first place to check is the output files in the job's working directory. These contain the output and error information produced by the job.

By default, the scheduler places output files in the current working directory unless otherwise specified by the -e or -o options. If you are using the -j option, all output will be in the .o file.

The default file names have the form job_name.ojob_id and job_name.ejob_id, or job_name.ojob_id.task_id and job_name.ejob_id.task_id for array job tasks. If you're using a parallel environment, there will also be job_name.pojob_id.task_id.
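To make the naming pattern concrete, here is a small sketch that prints the default output and error file names for a job. The function name job_output_files is hypothetical, and the sketch assumes the default names are in use (no -o, -e or -j options), with .o for standard output and .e for standard error.

```shell
# job_output_files is a hypothetical helper that prints the default output
# file names for a job. It assumes -o, -e and -j were not used.
# Usage: job_output_files job_name job_id [task_id]
job_output_files() {
    suffix="$2"
    # Array job tasks have the task ID appended to the job ID.
    if [ -n "${3:-}" ]; then
        suffix="$2.$3"
    fi
    echo "$1.o$suffix"   # standard output
    echo "$1.e$suffix"   # standard error
}
```

For example, job_output_files myjob 12345 prints myjob.o12345 and myjob.e12345, which can then be passed to tail or less from the job's working directory.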


Job exit status

A job should complete with an exit status of zero. A non-zero exit status indicates a problem, even if the data from your job looks correct. The exit status can be checked by enabling email options in your submission script, or by checking the job statistics.
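One way to read the exit status from the job statistics is sketched below. This assumes the scheduler's accounting command qacct is available on your cluster and that its output includes an exit_status field; the function name job_exit_status is hypothetical.

```shell
# job_exit_status is a hypothetical helper. It assumes the scheduler's
# accounting command qacct is available and reports an "exit_status" field.
# Usage: job_exit_status job_id
job_exit_status() {
    # Print the second column of every exit_status line (one per task).
    qacct -j "$1" | awk '$1 == "exit_status" { print $2 }'
}
```

Accounting records are normally written only after a job has finished, so this is likely to return nothing for a job that is still queued or running.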


Job in Eqw state

When a job dies due to filesystem errors, node crashes or similar system issues, it may be placed in the Eqw state to await further action. This avoids running hundreds of jobs that will inevitably crash.

If you have determined the cause of the error and think your job should now run correctly, the error state can be cleared using qmod -cj <job-id>.
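Before clearing the error state, it helps to find out which jobs are affected and why. The sketch below is illustrative only: the function names are hypothetical, eqw_jobs assumes the default qstat listing (two header lines, job ID in the first column, state in the fifth), and eqw_reason assumes that qstat -j prints "error reason" lines for a job in the Eqw state.

```shell
# eqw_jobs and eqw_reason are hypothetical helper names.

# List the IDs of your own jobs that are currently in the Eqw state.
# Assumes the default qstat layout: two header lines, job ID in column 1
# and state in column 5.
eqw_jobs() {
    qstat -u "${USER:-$(id -un)}" | awk 'NR > 2 && $5 == "Eqw" { print $1 }'
}

# Show why a job was put into the error state.
# Assumes qstat -j prints "error reason" lines for such a job.
# Usage: eqw_reason job_id
eqw_reason() {
    qstat -j "$1" | grep -i 'error reason'
}
```

Once the reported cause has been fixed, the job can be released with qmod -cj <job-id> as described above.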


DOS / Windows newline characters

DOS / Windows uses different characters from Unix to represent newlines in files. This can cause issues when a script has been written on a Windows machine and transferred to the cluster.

Incorrect newline characters can be detected with:

$ cat -v <script> | grep "\^M"

The file can then be fixed with:

$ dos2unix <script>
dos2unix: converting file <script> to UNIX format ...
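If dos2unix is not installed, the same check and fix can be approximated with standard tools. This is a minimal sketch under that assumption; has_crlf and strip_crlf are hypothetical helper names, and strip_crlf removes every carriage return in the file, not only those at line ends.

```shell
# has_crlf and strip_crlf are hypothetical helper names.

# has_crlf succeeds (exit status 0) if the file contains a carriage return.
# Usage: has_crlf script
has_crlf() {
    cr=$(printf '\r')
    grep -q "$cr" "$1"
}

# strip_crlf writes a copy of the file with all carriage returns removed.
# Usage: strip_crlf script fixed_script
strip_crlf() {
    tr -d '\r' < "$1" > "$2"
}
```

Because strip_crlf writes a new file, any execute permission on the original script has to be set again on the copy.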

Support

If you cannot resolve the issue with your job, please contact us, supplying all the relevant information.