Check status of submitted jobs¶
To monitor jobs you have submitted to the queue you can run the qstat command:
$ qstat
job-ID  prior   name     user   state submit/start at     queue      slots ja-task-ID
--------------------------------------------------------------------------------------
3583807 5.76204 exampleA btw999 r     01/21/2016 23:10:10 all.q@dn94     5 16
3583807 5.76204 exampleA btw999 r     01/21/2016 23:54:11 all.q@dn58     5 17
3599804 5.00001 exampleB btw999 r     01/16/2016 12:05:39 all.q@dn75     1
3599852 5.00001 exampleC btw999 r     01/16/2016 17:39:27 all.q@sm3      1
3602902 0.00000 exampleD btw863 qw    01/22/2016 10:49:47                8
3602902 0.00000 exampleE btw863 Eqw   01/22/2016 10:49:47                8
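When many jobs are listed, a per-state count can be easier to read than the full table. The following is a minimal sketch, assuming the column layout of the qstat listing above (state in the fifth field, data rows starting with a numeric job ID). The function name count_states is hypothetical and not part of the cluster tooling.

```shell
#!/bin/sh
# Count jobs per state from qstat-style output read on standard input.
# Assumption: the state is the fifth whitespace-separated field and every
# data row starts with a numeric job ID, so the header and the dashed
# rule are skipped. Job names containing spaces would break this.
count_states() {
    awk '$1 ~ /^[0-9]+$/ { n[$5]++ }
         END { for (s in n) print s, n[s] }' | LC_ALL=C sort
}

# Typical use on the cluster (not executed here):
#   qstat | count_states
```

Each output line holds a state followed by the number of rows in that state. Array jobs appear as one row per running task, so they are counted per row.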
To see the status of a particular job (e.g. job 999) you can run the
qstat command with the -j option:
qstat -j 999
To see the resources you are currently requesting you can run the
command with the
More details on the
qstat command are available in the
Checking where my job is in the queue¶
We provide some additional commands to display current activity of the cluster
and the queues in a more readable format than the standard
nodestatus will show the current core and memory usage of all of
the nodes. Note that some nodes are on queues that may not be available to all
users - these are coloured blue in the list of nodes.
gives a detailed view of jobs running on each node, plus the number of slots
used, and the latest finish time of the job based on the maximum requested
runtime. The two most commonly used options are demonstrated below.
The -N option allows you to inspect the jobs running on a selected node. For
example:
nodestatus -N dn55
Node    cores        memory
        used/total   used/available/total
dn55    (12/12)      (9/2/24)
    abc123  all.q  4008031  Thu, 22 Jun 2017 15:20:39  1
    xyz126  all.q  4012201  Mon, 26 Jun 2017 17:08:21  1
    xyz126  all.q  4012202  Tue, 27 Jun 2017 05:02:36  1
    abc985  all.q  4015392  Sat, 24 Jun 2017 16:03:05  8
    hij208  all.q  4019133  Wed, 21 Jun 2017 13:48:18  1
The -T option displays a summary of running jobs on a given
node type (e.g. nxv) or group type (serial, parallel or gpu). For example:
nodestatus -T gpu
Node    cores        memory
        used/total   used/available/total
nxg1    (32/32)      (10/3/256)
    abc123  all.q  606585  Wed, 13 Mar 2019 17:54:14  16
    abc123  all.q  606588  Wed, 13 Mar 2019 17:54:21  16
nxg2    (32/32)      (8/242/256)
    xyz126  all.q  618546  Fri, 15 Mar 2019 15:57:34  32
nxg3    (32/32)      (10/3/256)
    xyz126  all.q  605681  Wed, 13 Mar 2019 17:53:59  16
    abc985  all.q  605683  Wed, 13 Mar 2019 17:54:07  16
nxg4    (32/32)      (10/3/256)
    hij208  all.q  607750  Thu, 14 Mar 2019 03:31:36  16
    hij208  all.q  607751  Thu, 14 Mar 2019 03:31:38  16
sbg1    (32/32)      (17/131/384)
    hij208  all.q  605789  Wed, 13 Mar 2019 20:57:06  16
    hij208  all.q  605792  Thu, 14 Mar 2019 11:00:03  16
sbg2    (32/32)      (114/248/384)
    abc123  all.q  618755  Thu, 14 Mar 2019 14:43:22  16
    xyz126  all.q  617451  Wed, 13 Mar 2019 18:20:54  16
memory used is the actual RAM in use on the node at that moment. The
memory available value is the approximate amount of RAM available
for use by the scheduler. Note that the sum of these values may not equal the
total memory on the node - for example, if a job has requested memory but not
nodestatus command usage
nodestatus -h will provide a summary of available options.
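To see how many slots each user holds on the nodes listed, the job rows of the nodestatus output can be totalled. This is a rough sketch that assumes the row layout shown in the examples above (user, queue, job ID, a five-field date, then the slot count, giving nine fields in total). The function name slots_per_user is hypothetical, and the sketch does not strip any colour codes the real output may contain.

```shell
#!/bin/sh
# Sum the slots used per user from nodestatus-style output on stdin.
# Assumption: job rows have exactly nine whitespace-separated fields,
# with the user in field 1, a numeric job ID in field 3 and the slot
# count in field 9. Header rows and node summary rows have fewer
# fields and are ignored.
slots_per_user() {
    awk 'NF == 9 && $3 ~ /^[0-9]+$/ { slots[$1] += $9 }
         END { for (u in slots) print u, slots[u] }' | LC_ALL=C sort
}

# Typical use on the cluster (not executed here):
#   nodestatus -T gpu | slots_per_user
```

Each output line holds a username followed by the total number of slots in use by that user across the listed nodes.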
showqueue is another useful command that shows all of the jobs waiting to
be run, in the order of priority. It is useful for inspecting typical wait
times for different job sizes.
Note that some jobs may not be running because they are restricted by resource
quotas.
showqueue -F gives additional detail on each queued job, such as the
total RAM requested, and resource quotas. While most users won't hit core or
memory quotas, they can be inspected using the
Jobs which are coloured yellow in the 'Submission Time' field have been queuing for 24 hours. Jobs coloured in red have been queuing for 7 days.
Should your job remain in the queue for a long period of time, please inspect
the output of
showqueue -F and check your job submission carefully.
Typically, long wait times are caused by large resource requests, exclusive
node access or errors in your job submission script. Please contact
ITS Research Support with a job id
number for assistance regarding a specific queued job.
showqueue command usage
showqueue -h will provide a summary of available options.
Jobs with error statuses¶
qstat output may include jobs with a status of E. This
indicates that an error occurred - not an error within the job itself, but one
that prevented the job from being started. Typically, this may be because the
user ran out of file space.
The jobs with errors can be deleted from the queue using:
qdel <job ID>
Alternatively, if the cause of the error has been cleared and the jobs need to run, the error state can be cleared using:
qmod -cj <job ID>
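With several jobs in an error state, it can help to list their IDs first and then decide whether to delete or clear each one. The sketch below assumes the default qstat column layout (job ID in the first field, state in the fifth) and treats any state containing a capital E, such as Eqw, as an error state. The function name error_job_ids is hypothetical.

```shell
#!/bin/sh
# Print the unique IDs of jobs whose state contains "E" (for example Eqw),
# reading qstat-style output on standard input.
# Assumption: job ID is field 1 and state is field 5, as in the default
# qstat listing; only rows starting with a numeric job ID are considered.
error_job_ids() {
    awk '$1 ~ /^[0-9]+$/ && $5 ~ /E/ { print $1 }' | LC_ALL=C sort -u
}

# Typical use on the cluster (not executed here):
#   qstat | error_job_ids
# Inspect each ID with "qstat -j <job ID>" before running qdel or qmod -cj.
```

The function only prints IDs and does not modify the queue, so the decision to delete or clear a job stays manual.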
You can request email notifications of a change in status of your jobs by adding the following code to the "Grid Engine options" section of your submission script:
#$ -m bea                              # Send email at the beginning and end of the job and if aborted
#$ -M firstname.lastname@example.org   # The email address to notify
If you do not add the
-M line you will be emailed at the address
that is registered with us (usually your QMUL email address).
Check where jobs are running¶
The hosts_for_job utility script displays information about which host(s)
a serial or parallel job is running on. Passing the
-s switch will print
a list of hostnames only.
run inside a UGE job
The hosts_for_job utility must be run inside a UGE job; therefore, you
must execute the script interactively in a QLOGIN job or within your job
script for batch submissions.
For example, to print full details about a running serial job:
$ hosts_for_job
Job 457773 is running in the smp PE using 8 core(s) on host dn6
For example, to print hostnames only for a running parallel job:
$ hosts_for_job -s
nxv2 nxv3 nxv4 nxv14
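Some programs expect the host list as a single comma-separated string. The sketch below converts whitespace-separated hostnames, whether printed on one line or one per line, into that form. The function name join_hosts is hypothetical, and the exact layout of the hosts_for_job -s output is an assumption, which is why both layouts are handled.

```shell
#!/bin/sh
# Join hostnames read on standard input into one comma-separated line.
# Hostnames may be separated by spaces, tabs or newlines; empty input
# produces no output.
join_hosts() {
    awk '{ for (i = 1; i <= NF; i++) printf "%s%s", (n++ ? "," : ""), $i }
         END { if (n) print "" }'
}

# Typical use inside a job script (not executed here):
#   HOSTS=$(hosts_for_job -s | join_hosts)
```

For the four hosts in the example above, the result would be nxv2,nxv3,nxv4,nxv14.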