Andrena cluster

The Andrena cluster is a set of compute and GPU nodes which were purchased with a Research Capital Investment Fund to support the University's Digital Environment Research Institute.

Hardware

The cluster comprises 16 GPU nodes, each with 4 Nvidia A100 GPUs (64 GPUs in total), plus 36 compute nodes with the same specification as the Apocrita ddy nodes. The Andrena nodes are joined to Apocrita and use the same job scheduler and high-performance networking and storage.

DERI research groups may additionally make use of a portion of the 50TB DERI storage entitlement, while commonly used read-only datasets (e.g. training datasets for machine learning) can be hosted on high performance SSD storage.

Requesting access

To request access to the Andrena computational resources or storage, please contact us to discuss requirements.

Logging in to Andrena

We provide dedicated login nodes for Andrena users. The connection procedure is the same as for Apocrita, except that login.hpc.qmul.ac.uk should be replaced with andrena.hpc.qmul.ac.uk.
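As a minimal sketch of the substitution described above, the snippet below builds the SSH target for the Andrena login nodes and prints the resulting command. The username `abc123` and the helper name `andrena_target` are placeholders introduced for illustration only; use your own account name and whatever connection options you already use for Apocrita.

```shell
# Andrena login nodes, as given above.
ANDRENA_HOST="andrena.hpc.qmul.ac.uk"

# Hypothetical helper: build the SSH target for a given username.
andrena_target() {
    printf '%s@%s\n' "$1" "$ANDRENA_HOST"
}

# "abc123" is a placeholder username; substitute your own.
# The command is printed rather than run, so it can be checked first.
echo "ssh $(andrena_target abc123)"
```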

Running jobs on Andrena

Workloads are submitted using the job scheduler, which works exactly the same way as on Apocrita and is documented thoroughly on this site. If you have been approved to use Andrena, you can submit jobs from either the Andrena or Apocrita login nodes by adding the following request to the resource request section of the job script:

#$ -l cluster=andrena

For example, the whole job script might look like:

#!/bin/bash
#$ -cwd                # Run the job in the current directory
#$ -pe smp 1           # Request 1 core
#$ -l h_rt=240:0:0     # Request 10 days maximum runtime
#$ -l h_vmem=1G        # Request 1GB RAM per core
#$ -l cluster=andrena  # Ensure that the job runs on Andrena nodes

module load python
python mycode.py

Without this setting, the scheduler will try to run the job on either Apocrita or Andrena nodes, depending on availability.
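Because the placement depends on a single directive, it can be worth checking a script before submitting it. The following is a sketch of such a check, not part of the scheduler: `job_target` is a hypothetical helper that reports whether a script contains the `-l cluster=andrena` request on its own `#$` line. It assumes the directive is written exactly as shown above and would not recognise it inside a combined resource list.

```shell
# Hypothetical helper: report where the scheduler may place a job script,
# based on whether it contains a "#$ -l cluster=andrena" directive line.
job_target() {
    if grep -Eq '^#\$[[:space:]]+-l[[:space:]]+cluster=andrena([[:space:]]|$)' "$1"; then
        echo "andrena"
    else
        echo "apocrita-or-andrena"
    fi
}

# Example usage (myjob.sh is a placeholder file name):
#   job_target myjob.sh
```

A result of `apocrita-or-andrena` means the script does not pin the job to Andrena, so the scheduler is free to choose based on availability.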

GPU jobs follow a similar template to Apocrita GPU jobs, and should request 12 cores per GPU and 7.5G of RAM per core, even if the code actually uses fewer cores. Mandating these rules within the job scheduler logic avoids situations where GPUs cannot be requested because another job is using all of the cores on the node.
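The arithmetic behind these rules can be sketched as follows. The helpers `gpu_request` and `gpu_total_mem_gb` are hypothetical names introduced here; they simply scale the core count by 12 per GPU and multiply cores by 7.5G to give the total memory implied by the request. The sketch assumes a multi-GPU job scales the `-pe smp` request in the same way as a single-GPU job.

```shell
# Hypothetical helper: print the resource directives for a job using N GPUs,
# following the 12-cores-per-GPU and 7.5G-per-core rules described above.
gpu_request() {
    gpus="$1"
    cores=$((gpus * 12))
    echo "#\$ -pe smp ${cores}"
    echo "#\$ -l h_vmem=7.5G"
    echo "#\$ -l gpu=${gpus}"
}

# Hypothetical helper: total memory in GB implied by the request,
# i.e. cores x 7.5G, computed in tenths of a GB to stay in integer maths.
gpu_total_mem_gb() {
    echo $(( $1 * 12 * 75 / 10 ))
}

# Example: directives and total memory for a 2-GPU job.
gpu_request 2
gpu_total_mem_gb 2
```

Since each Andrena GPU node has 4 GPUs, a single-node request would presumably not exceed `gpu=4`, which corresponds to 48 cores.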

An example GPU job script using a conda environment might look like:

#!/bin/bash
#$ -cwd               # run the job in the current directory
#$ -j y               # merge standard error into standard output
#$ -pe smp 12         # 12 cores per GPU
#$ -l h_rt=240:0:0    # 240 hours runtime
#$ -l h_vmem=7.5G     # 7.5G RAM per core
#$ -l gpu=1           # request 1 GPU
#$ -l cluster=andrena # use the Andrena nodes

module load anaconda3
conda activate tensorflow-env
python train.py

A typical GPU job script using virtualenv will look similar. However, since the CUDA libraries are not installed as part of the pip install, you need to load the relevant cudnn module to make the cuDNN and CUDA libraries available in your virtual environment. Note that loading the cudnn module also loads a compatible cuda module.

#!/bin/bash
#$ -cwd               # run the job in the current directory
#$ -j y               # merge standard error into standard output
#$ -pe smp 12         # 12 cores per GPU
#$ -l h_rt=240:0:0    # 240 hours runtime
#$ -l h_vmem=7.5G     # 7.5G RAM per core
#$ -l gpu=1           # request 1 GPU
#$ -l cluster=andrena # use the Andrena nodes

module load python cudnn
source venv/bin/activate
python train.py