Monitoring jobs

Check status of submitted jobs

To monitor jobs you have submitted to the queue, run the qstat command. The example below assumes your username is abc123.

$ qstat
job-ID  prior  name     user   state submit/start at     queue        slots ja-task-ID
--------------------------------------------------------------------------------------
123456 5.76204 exampleA abc123 r     01/09/2020 23:10:10 all.q@sdx11     5   16
123456 5.76204 exampleA abc123 r     01/09/2020 23:54:11 all.q@sdx18     5   17
123450 5.10725 exampleB abc123 r     01/09/2020 12:05:39 all.q@sdx2      1
123461 5.00001 exampleC abc123 qw    01/09/2020 15:03:24                 8

When you run qstat, each of your jobs will be shown in one of the following states:

r - Job is currently running.
qw - Job has been submitted and is waiting in the queue.
hqw - Job has been submitted but is being held in the queue.
Eqw - Job has been submitted but an error is preventing the job from starting.

A job could be in hqw state because of a job dependency or a hold applied by ITS Research if there is an issue with the job. If you are unsure why your job is in hqw or Eqw state, you may contact us for further information.
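To get a quick tally of how many of your jobs are in each state, the qstat output can be piped through a small filter. The following is a minimal sketch, not a supported tool: count_job_states is a hypothetical helper name, and it assumes the default qstat layout shown above (two header lines, the state in the fifth column, job names without spaces). Running array tasks appear on separate lines, so the tally counts output lines rather than distinct job IDs.

```shell
# Hypothetical helper: tally qstat output lines by job state.
# Reads qstat output on stdin; assumes the state is the fifth column
# and that the first two lines are the header and the separator.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Usage on the cluster:
#   qstat | count_job_states
```

Applied to the example output above, this would report one line in qw and three lines in r.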

To see the status of a particular job (e.g. job 999) you can run the qstat command with the -j option:

qstat -j 999

To see the resources you are currently requesting you can run the qstat command with the -r option:

qstat -r

More detailed information about the qstat command can be found in its man page.

Jobs with error statuses

The qstat output may include jobs with a status of Eqw. This indicates an error that prevented the job from being started, rather than an error within the job itself. A common cause is that the user has run out of file space.

The jobs with errors can be deleted from the queue using:

qdel <job ID>

Alternatively, if the cause of the error has been cleared and the jobs need to run, the error state can be cleared using:

qmod -cj <job ID>
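When several jobs are affected, it can help to list the job IDs in the Eqw state before deciding whether to delete or clear them. The following is a minimal sketch, assuming the default qstat column layout (state in the fifth column); list_error_jobs is a hypothetical helper name, not a command provided on the cluster.

```shell
# Hypothetical helper: print the unique IDs of jobs in the Eqw state.
# Reads qstat output on stdin and skips the two header lines.
list_error_jobs() {
    awk 'NR > 2 && $5 == "Eqw" { print $1 }' | sort -u
}

# Usage on the cluster:
#   qstat | list_error_jobs
```

Running qstat -j <job ID> on each listed job typically gives more detail about the error, which helps when choosing between qdel and qmod -cj.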

Checking where my job is in the queue

We provide some additional commands to display current activity of the cluster and the queues in a more readable format than the standard qhost and qstat commands.

nodestatus

nodestatus shows the current core and memory usage of all of the nodes. Note that some nodes are on queues that may not be available to all users; these are coloured blue in the list of nodes. nodestatus -F gives a detailed view of the jobs running on each node, plus the number of slots used and the latest finish time of each job based on its maximum requested runtime. The examples below demonstrate two of the most commonly used options.

The -N option allows you to inspect the jobs running on a selected node. For example:

nodestatus -N sdx1

Node  cores         memory
      used/total    used/available/total

sdx1  (36/36)       (40/142/384)
      abc123        all.q           4008031    Thu, 22 Jun 2017 15:20:39   1
      xyz126        all.q           4012201    Mon, 26 Jun 2017 17:08:21   1
      xyz126        all.q           4012202    Tue, 27 Jun 2017 05:02:36   1
      abc985        all.q           4015392    Sat, 24 Jun 2017 16:03:05   8
      hij208        all.q           4019133    Wed, 21 Jun 2017 13:48:18   1

The -t and -T options display a summary of running jobs on a given node type (e.g. ddy) or group type (serial, parallel or gpu). For example:

nodestatus -T gpu

Node  cores        memory
      used/total   used/available/total

sbg2         (32/64)      (3/4)        (29/128/378)
                abc851          all.q           4334410    Fri, 15 Nov 2024 09:41:52         12
                xyz646          all.q           4358845    Thu,  7 Nov 2024 11:23:57          8
                bbb153          all.q           4261904    Tue, 12 Nov 2024 18:21:20         12
sbg3         (32/64)      (4/4)        (45/68/378)
                eee713          all.q           4348676    Wed,  6 Nov 2024 23:00:58         16
                ddd646          all.q           4358846    Thu,  7 Nov 2024 11:23:58          8
                fff646          all.q           4358847    Thu,  7 Nov 2024 14:01:12          8
sbg5         (44/48)      (4/4)        (41/14/378)
                abc211          all.q           4254573    Wed, 13 Nov 2024 04:04:37         12
                xyz851          all.q           4334410    Fri, 15 Nov 2024 14:10:14         12
                bbb249          all.q           4233686    Thu,  7 Nov 2024 11:45:05         12
                bbb153          all.q           4388545    Fri, 15 Nov 2024 15:28:42          8

The memory used is the actual RAM in use on the node at that moment. The memory available value is the approximate amount of RAM available for use by the scheduler. Note that the sum of these values may not equal the total memory on the node - for example, if a job has requested memory but not used it.
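As a worked example, the sdx1 line above reports (40/142/384): 40 used plus 142 available gives 182, leaving 202 of the 384 total that is neither in use nor available to the scheduler, for instance because it has been requested by jobs but not used. The same arithmetic can be sketched in shell; memory_gap is a hypothetical helper name, and it assumes the triple is formatted exactly as in the nodestatus output above.

```shell
# Hypothetical helper: given a "(used/available/total)" triple as printed
# by nodestatus, print total - used - available.
memory_gap() {
    echo "$1" | tr -d '()' | awk -F/ '{ print $3 - $1 - $2 }'
}

# Example: memory_gap "(40/142/384)"
```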

nodestatus command usage

nodestatus -h will provide a summary of available options.

showqueue

showqueue is another useful command that shows all of the jobs waiting to be run, in order of priority. It is useful for inspecting typical wait times for different job sizes. Note that some jobs may not be running because they are restricted by resource quotas. showqueue -F gives additional detail on each queued job, such as the total RAM requested and any resource quotas. While most users won't hit core or memory quotas, they can be inspected using the qquota command.

Jobs which are coloured yellow in the 'Submission Time' field have been queuing for 24 hours. Jobs coloured in red have been queuing for 7 days.

Should your job remain in the queue for a long period of time, please inspect the output of showqueue -F and check your job submission carefully. Typically, long wait times are caused by large resource requests, exclusive node access or errors in your job submission script. For assistance with a specific queued job, please contact ITS Research Support and include the job ID.

showqueue command usage

showqueue -h will provide a summary of available options.

Email notifications

You can request email notifications of a change in status of your jobs by adding the following code to the "Grid Engine options" section of your submission script:

#$ -m bea # Send email at the beginning and end of the job and if aborted
#$ -M my_name@qmul.ac.uk # The email address to notify

If you do not add the -M line you will be emailed at the address that is registered with us (usually your QMUL email address).
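For context, the sketch below shows where these lines might sit in a complete submission script. The resource requests (-cwd, -pe smp 1, -l h_rt=1:0:0) and the echo command are illustrative placeholders rather than recommended values; only the -m and -M lines come from the example above.

```shell
#!/bin/bash
# Grid Engine options (illustrative values; adjust to your job)
#$ -cwd                   # Run from the current working directory
#$ -pe smp 1              # Request 1 core
#$ -l h_rt=1:0:0          # Request 1 hour of runtime
#$ -m bea                 # Send email at the beginning and end of the job and if aborted
#$ -M my_name@qmul.ac.uk  # The email address to notify

# Placeholder for the commands your job runs
echo "Job started"
```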

Check where jobs are running

The hosts_for_job utility script displays information about which host(s) a serial or parallel job is running on. Passing the -s switch will print a list of hostnames only.

Run inside a UGE job

The hosts_for_job utility must be run inside a UGE job. You must therefore execute the script either interactively in a qlogin job or, for batch submissions, within your job script.

For example, to print full details about a running serial job:

$ hosts_for_job
Job 457773 is running in the smp PE using 8 core(s) on host sdx6

For example, to print hostnames only for a running parallel job:

$ hosts_for_job -s
ddy1
ddy4
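Because the -s switch prints one hostname per line, its output is easy to post-process inside a job script. The following is a minimal sketch that counts the distinct hosts a job spans; count_unique_hosts is a hypothetical helper name, and the sort -u step is a precaution in case a hostname is listed more than once.

```shell
# Hypothetical helper: count distinct hostnames read from stdin,
# one hostname per line.
count_unique_hosts() {
    sort -u | awk 'END { print NR }'
}

# Usage inside a UGE job:
#   hosts_for_job -s | count_unique_hosts
```

For the parallel job in the example above, this would print 2.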