Using GPUs

The nxg and sbg nodes contain GPU cards that can provide huge acceleration for certain types of parallel computing tasks, via the CUDA and OpenCL frameworks.

Access to GPU nodes

Access to GPU nodes is available free of charge to QM researchers. Please contact us if you would like to use these nodes, so that we can add you to the allowed user list and help you get started with your initial GPU job submission. Note that access to GPU nodes is not permitted for undergraduate and MSc students.

Applications with GPU support

There is a considerable number of scientific and analytical applications with GPU support. While some, such as Matlab and Ansys, have GPU support out of the box, others may require specific GPU-ready builds; these may appear in the module avail list with a -gpu suffix. If you require GPU support to be added to a specific application, please submit a request for a GPU build and provide some test data.
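To see which GPU-ready builds are currently installed, you can filter the module list for the -gpu suffix (a minimal sketch; note that module avail typically writes to stderr, and the exact output depends on the module system):

$ module avail 2>&1 | grep -i -- '-gpu'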

Be aware that not every GPU-capable application will run your particular code faster on a GPU. For example, CP2K only has a GPU port of the DBCSR sparse matrix library; if your code does not use this library, you will not see a performance boost.

Submitting jobs to GPU nodes

To request a GPU, use the -l gpu=<count> option in your job submission and the scheduler will automatically select a GPU node. Note that requests are handled per node, so a request for 64 cores and 2 GPUs will result in 4 GPUs in total, spread across two nodes. Examples are shown below.
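For example, a single GPU can be requested directly on the qsub command line along with cores, runtime and memory (a minimal sketch; my_gpu_job.sh is a placeholder for your own job script):

$ qsub -cwd -j y -pe smp 8 -l h_rt=1:0:0 -l h_vmem=7.5G -l gpu=1 my_gpu_job.sh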

Selecting a specific GPU type

For compatibility reasons, you may optionally request a specific GPU type. For example, CUDA version 8 predates the V100 GPU and does not support it, so -l gpu_type=kepler would select nodes with the K80 GPU instead. Conversely, nodes with the V100 GPU may be selected with -l gpu_type=volta.
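For example, to keep a CUDA 8 build on the K80 nodes, the GPU type can be added alongside the GPU request in your job script (a fragment rather than a complete script):

#$ -l gpu=1             # request 1 GPU per host
#$ -l gpu_type=kepler   # select the K80 nodes; use gpu_type=volta for the V100 nodes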

GPU Card Allocation

Ensure you set card allocation

Failure to set card allocation may result in contention with other users' jobs and may cause your job to be killed.

Requesting cards with parallel PE

If you are using the parallel parallel environment, requests will be exclusive; please ensure that you correctly set the slots and gpu values to fill the node.

Once a job starts, the assigned GPU cards are listed in the SGE_HGR_gpu environment variable as a space-separated list. To ensure correct use of the allocated GPU cards, you need to limit your computation to run only on those cards.

For CUDA, this can be done by exporting the CUDA_VISIBLE_DEVICES environment variable, which should be a comma-separated list:

$ echo $SGE_HGR_gpu
0 1
# Set CUDA_VISIBLE_DEVICES,
# this converts the space separated list into a comma separated list
$ export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}

For OpenCL, this can be done via the GPU_DEVICE_ORDINAL environment variable, which should be a comma-separated list:

$ echo $SGE_HGR_gpu
0 1
# Set GPU_DEVICE_ORDINAL,
# this converts the space separated list into a comma separated list
$ export GPU_DEVICE_ORDINAL=${SGE_HGR_gpu// /,}

Checking GPU usage

GPU usage can be checked with the nvidia-smi command e.g.:

$ nvidia-smi -l 1
Tue May  1 13:30:11 2018
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.26                 Driver Version: 375.26                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 0000:83:00.0     Off |                    0 |
| N/A   32C    P0   147W / 149W |   1211MiB / 11439MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K80           On   | 0000:84:00.0     Off |                    0 |
| N/A   31C    P8    30W / 149W |      0MiB / 11439MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     28939    C   ...pose/build/examples/openpose/openpose.bin  1207MiB |
+-----------------------------------------------------------------------------+

In this example we can see that the process is using GPU 0. The -l 1 option tells nvidia-smi to refresh and print the status every second.
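If you have already set CUDA_VISIBLE_DEVICES from SGE_HGR_gpu as shown above, the same comma-separated list can be passed to nvidia-smi so that the output is limited to the cards allocated to your job (a small sketch using nvidia-smi's -i option):

# Show only the GPUs assigned to this job, refreshing every second
$ nvidia-smi -i "$CUDA_VISIBLE_DEVICES" -l 1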

Example job submissions

Request one GPU (CUDA)

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8        # 8 cores (32 per gpu node)
#$ -l h_rt=10:0:0   # 10 hour runtime
#$ -l h_vmem=7.5G   # 7.5 * 8 = 60G
#$ -l gpu=1         # request 1 GPU per host

export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
./run_code.sh

Part node requests

When using a single graphics card, you will need to request the appropriate slots and memory on the node. We recommend requesting 7.5G per core, which may be increased to 11.5G per core when using the -l gpu_type=volta complex to ensure the job runs on the sbg nodes.
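Putting the recommendations above together, a single-GPU job pinned to the sbg nodes might look like the following sketch (run_code.sh is a placeholder for your own launcher, as in the other examples):

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8            # 8 cores
#$ -l h_rt=10:0:0       # 10 hour runtime
#$ -l h_vmem=11.5G      # 11.5G per core, as recommended for the volta nodes
#$ -l gpu=1             # request 1 GPU per host
#$ -l gpu_type=volta    # ensure the job runs on the sbg (V100) nodes

export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
./run_code.sh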

Request two GPUs on the same box (OpenCL)

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 16       # 16 cores (32 per gpu node)
#$ -l h_rt=10:0:0   # 10 hour runtime
#$ -l h_vmem=7.5G   # 7.5 * 16 = 120G
#$ -l gpu=2         # request 2 GPUs per host

export GPU_DEVICE_ORDINAL=${SGE_HGR_gpu// /,}
./run_code.sh

Request four GPUs on the same box (OpenCL)

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 32       # 32 cores (32 per gpu node)
#$ -l h_rt=10:0:0   # 10 hour runtime
#$ -l gpu=4         # request 4 GPUs per host
#$ -l exclusive     # request exclusive access to the node

export GPU_DEVICE_ORDINAL=${SGE_HGR_gpu// /,}
./run_code.sh

Request four GPUs across multiple boxes (CUDA)

Infiniband

When requesting GPUs across two nodes, make sure to include the infiniband_direct parameter.

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe parallel 64              # 64 cores (32 per nxg node)
#$ -l h_rt=10:0:0               # 10 hour runtime
#$ -l gpu=2                     # request 2 GPUs per host (2 per nxg node)
#$ -l infiniband_direct=nxg3-4  # request nxg3 & nxg4 infiniband direct

export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}
./run_code.sh