Tier 2 HPC facilities¶

We have access to a number of Tier 2 clusters, which are larger than typical institutional clusters and are particularly useful for larger workloads. If you are running multi-node parallel jobs, you may benefit from access to these. Please contact us to see if your job is appropriate and to organise access.

Who can access these clusters?¶

QMUL Academics may apply to use a Tier 2 cluster free of charge if:

they are performing predominantly EPSRC-funded research
jobs are an appropriate size for the Tier 2 service (the QMUL Apocrita service is sufficient for many users); these will usually be parallel jobs running over multiple nodes
jobs are well-tested and known to run successfully
the scope and size of the work is stated in advance as part of an application to use the cluster (or by using the initial resource allocation to determine specific resource requirements)
the work fits the designated application areas for the cluster
they notify us if they don't think they will be able to use the resource allocation within the allocation period (so that the unused hours can be allocated to other users)
they provide a brief description of their research work to be performed on the cluster and agree to our sharing that with consortium partners for reporting purposes

Once your access has been agreed and set up, we recommend that you connect to the Tier 2 resources through Apocrita, because access to some of these resources from the wider internet may be restricted. If you are familiar with SSH client configuration, the following example may be useful:

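# Replace abc123 with your own username
Host young
        HostName young.rc.ucl.ac.uk
        User abc123
        # To route the connection through Apocrita, a ProxyJump line naming
        # your Apocrita login host can be added here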

If you require help porting your code to a Tier 2 cluster, which may use a different job scheduling system or software toolchain from Apocrita, please contact us.

Project allocations¶

Typically, new projects will be granted an initial allocation for benchmarking and job sizing. After this, to obtain resource allocation for your project, you will need to provide a detailed description of your project, along with job sizes and a commitment to use the resources within the agreed time-frame.

QMUL receive an allocation to use within a given accounting period on each cluster, which is divided among the various projects according to their requirements. At the end of each accounting period the balances are reset.

Core Hours

A Core Hour is the amount of work done by a single processor core in one hour. For accounting purposes, you need to calculate the cumulative total over all cores that your job runs on. If your job runs for one hour on ten 24-core nodes, the CPU time used is 240 Core Hours. Part-used nodes are counted as using all of the cores, since jobs are granted exclusive access to nodes.
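The accounting rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not a tool provided by any of the clusters; the function names `nodes_needed` and `core_hours` are made up for this example.

```python
import math


def nodes_needed(cores_requested: int, cores_per_node: int) -> int:
    """Whole nodes required to supply the requested number of cores."""
    if cores_requested < 1 or cores_per_node < 1:
        raise ValueError("core counts must be positive")
    return math.ceil(cores_requested / cores_per_node)


def core_hours(nodes: int, cores_per_node: int, wallclock_hours: float) -> float:
    """Core Hours charged when every allocated node is counted in full."""
    if nodes < 0 or cores_per_node < 0 or wallclock_hours < 0:
        raise ValueError("inputs must not be negative")
    return nodes * cores_per_node * wallclock_hours


# The example from the text: one hour on ten 24-core nodes.
print(core_hours(nodes=10, cores_per_node=24, wallclock_hours=1))  # 240

# A part-used node still counts in full: 30 cores on 24-core nodes
# occupies 2 nodes, so one hour is charged as 48 Core Hours.
part_used = nodes_needed(cores_requested=30, cores_per_node=24)
print(core_hours(part_used, cores_per_node=24, wallclock_hours=1))  # 48
```

Multiplying the result by the number of runs you expect to perform in the accounting period gives a first estimate of the allocation to request.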

Additional resources can be requested by contacting us with your requirements. Please request only what you will realistically use within the reporting period: we can always top up your allocation later if required.

Sulis - High throughput and ensemble computing¶

Sulis is focused on high throughput and ensemble computing and is funded by EPSRC grant EP/T022108/1. The cluster has a total of 25,728 AMD compute cores and 90 NVIDIA A100 GPUs.

An overview of the Sulis system specification is as follows:

Node type	Hardware	Count	Cores	Memory
Compute	AMD EPYC 7742 (Rome)	167	128	512GB
High-memory compute	AMD EPYC 7742 (Rome)	4	128	1TB
GPU	3 x NVIDIA A100	30	128	512GB / 40GB GPU

See the Sulis Technical Specifications page for more information about system specifications.
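As a quick cross-check, the headline totals of 25,728 cores and 90 GPUs can be reproduced from the table above. The sketch below assumes that the "Cores" column gives cores per node and that "3 x NVIDIA A100" means three GPUs in each GPU node; the variable names are made up for this example.

```python
# (node type, node count, cores per node, GPUs per node), transcribed from the table.
# Assumption: the "Cores" column is per node, and each GPU node holds 3 GPUs.
SULIS_NODES = [
    ("Compute", 167, 128, 0),
    ("High-memory compute", 4, 128, 0),
    ("GPU", 30, 128, 3),
]

total_cores = sum(count * cores for _, count, cores, _ in SULIS_NODES)
total_gpus = sum(count * gpus for _, count, _, gpus in SULIS_NODES)

print(total_cores)  # 25728
print(total_gpus)   # 90
```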

To request an account on Sulis, please complete the SAFE registration form providing an institutional email address, and the public portion of your new SSH key. If you are unfamiliar with SAFE, consider reading the SAFE website guide.

An email will be sent to you containing a password for SAFE which needs to be changed on first logging in. Once you have signed up, log on to SAFE and click on "Request access" from the "Projects" menu at the top of the homepage. From the drop-down list choose su008: QMUL and click "Request".

Before we can set up your initial default resource allocation, please contact us with a detailed description of the work you will perform using Sulis.

Accessing Sulis

Access to Sulis is restricted to a limited set of IP addresses, including the QMUL network range. If you are working off-campus, you should first connect to Apocrita and then SSH to Sulis.

See the following documentation pages for more information:

To acknowledge use of Sulis, please use a statement like the following:

Calculations were performed using the Sulis Tier 2 HPC platform hosted by the
Scientific Computing Research Technology Platform at the University of Warwick.
Sulis is funded by EPSRC Grant EP/T022108/1 and the HPC Midlands+ consortium.

Young - Hub in Materials and Molecular Modelling¶

Host Institution	Physical Cores	Nodes	RAM/Node	Scheduler	Wallclock	Accounting period
UCL	46,536	582	188GB+	SGE	48hrs	3 months

Young has an optional Hyperthreading feature

Hyperthreading lets you use two virtual cores in place of each physical core, which some programs can take advantage of. It can be enabled on a per-job basis; the default is to use one thread per core as normal. See the Young Hyperthreading documentation for further information.

The charging model on Young is different from that of other Tier 2 clusters, including Thomas, the previous machine in the Hub. Young uses Gold, which is charged at 80 Gold per node-hour whether or not a job uses hyperthreading.

Young has three types of node:

standard nodes
high memory nodes
large memory nodes

See the Young Node types page for more information about available nodes.

We generally expect jobs on Young to make full and efficient use of at least one whole node. This means that the jobs you run on Young should usually scale well to use at least 40 cores. If you have previously used Apocrita for your jobs and wish to move to Young then you should test your job using full nodes on Apocrita and be sure that the job makes efficient use of them.

When applying for an account on Young, please be sure to include supporting evidence that your intended job will effectively use your allocated resources. You may want to include such things as:

references to established HPC projects that you will use
proposed job scripts for Young, if available
references to jobs on Apocrita which show efficient whole-node use
scaling analysis for your project on Apocrita or other Tier 2 services
references to jobs of colleagues or other users on Young
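For the scaling analysis mentioned in the list above, one common approach is to time the same problem at several core counts and report speedup and parallel efficiency relative to the smallest run. The sketch below shows that calculation only. The function name `strong_scaling` is made up for this example, and the timings are placeholders invented for illustration, not measurements from Apocrita or Young.

```python
def strong_scaling(timings: dict[int, float]) -> list[tuple[int, float, float]]:
    """Return (cores, speedup, efficiency) rows relative to the smallest run.

    timings maps a core count to the wallclock time of the same problem.
    """
    if not timings:
        raise ValueError("at least one timing is required")
    base_cores = min(timings)
    base_time = timings[base_cores]
    rows = []
    for cores in sorted(timings):
        speedup = base_time / timings[cores]
        # Efficiency of 1.0 means the time fell in proportion to the extra cores.
        efficiency = speedup * base_cores / cores
        rows.append((cores, speedup, efficiency))
    return rows


# Placeholder wallclock times in seconds, invented purely for illustration.
example_timings = {40: 1000.0, 80: 520.0, 160: 290.0}

for cores, speedup, efficiency in strong_scaling(example_timings):
    print(f"{cores:4d} cores  speedup {speedup:5.2f}  efficiency {efficiency:6.1%}")
```

A table like this, built from your own measured run times, is the kind of evidence that helps show a job uses whole nodes efficiently.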

Approved account requests will come with an initial allocation of up to 100,000 Gold. Please be sure to use this initial allocation to generate evidence of effective use, as well as to determine what future allocations may be required.

The command nodesforjob is useful when checking node utilisation for your job.

Remember that Gold comes from a pool split across many users from Queen Mary so we do liaise with the service managers to ensure that the resources are fairly shared. We actively check running jobs and may contact you with queries if we see that jobs are not fully using allocated resources.

Jobs do not always need to use close to 100% of allocated resources. If you have legitimate reasons for a lower utilisation please let us know when requesting resources so that we do not inconvenience you with requests for clarification.

Young is funded by EP/P020194/1 and EP/T022213/1 and is designed for materials and molecular modelling. QMUL receive 10 million Gold for each 3-month accounting period, so please ensure your request takes account of the Gold charging model described above.

To acknowledge use of Young, please use the statement provided. Please see the UCL Young Software page for information regarding available software and example job scripts.

Users are given a 250GB quota, which is shared across the home and scratch spaces. Run lquota to display the current disk usage. The maximum job size is 5120 cores; typical job sizes are between 2 and 5 nodes.

Jobs on Young are allocated whole nodes

Even if you do not request all the available cores, your job will still consume the entire node and no other jobs can run on it. You will be charged as though your job used the entire node. A job will be charged 80 Gold per node per hour regardless of the number of cores used and whether or not hyperthreading is enabled.
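The whole-node charge can be estimated with a short calculation. The sketch below simply applies the rate of 80 Gold per node per hour stated above; the function names `gold_cost` and `node_hours_for` are made up for this example and are not commands available on Young.

```python
GOLD_PER_NODE_HOUR = 80  # rate stated in the text above


def gold_cost(nodes: int, wallclock_hours: float) -> float:
    """Gold charged for a job holding whole nodes for the given wallclock time.

    The charge does not depend on the number of cores used or on hyperthreading.
    """
    if nodes < 1 or wallclock_hours < 0:
        raise ValueError("nodes must be >= 1 and wallclock_hours must be >= 0")
    return GOLD_PER_NODE_HOUR * nodes * wallclock_hours


def node_hours_for(gold: float) -> float:
    """Node-hours that a given amount of Gold pays for."""
    if gold < 0:
        raise ValueError("gold must not be negative")
    return gold / GOLD_PER_NODE_HOUR


# A 4-node job that runs for the full 48-hour wallclock limit.
print(gold_cost(nodes=4, wallclock_hours=48))  # 15360

# An initial allocation of 100,000 Gold expressed as node-hours.
print(node_hours_for(100_000))  # 1250.0
```

Comparing the two figures gives a rough idea of how many benchmark runs fit within an initial allocation.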

Young nodes are diskless

Young nodes have no local hard drives, so no $TMPDIR is available. Do not request -l tmpfs=XG in your job scripts, or your job will be rejected at submit time.

To request an account on Young, please contact us and provide the following information:

First name
Surname
QMUL username
Public SSH key (not the private key)
Software Required
Detailed description of research goals using Young

A public SSH key is required because Young does not accept password logins. We request that you create a new SSH key pair for Young rather than re-use any keys used to access Apocrita. Instructions for how to generate an SSH key pair are available here.

JADE2 - Joint Academic Data science Endeavour¶

Host Institution	Nodes	GPUs per node	Scheduler	EPSRC Grant
Oxford	63 NVIDIA DGX-MAX Q	8 NVIDIA V100 (32GB)	Slurm	EP/T022205/1

JADE2 is a GPU cluster designed for machine learning and molecular dynamics applications. The NVIDIA DGX-MAX Q system is twice the size of its predecessor, JADE.

To request an account on JADE2, please create an account, providing an institutional email address and the public portion of your SSH key. An email will be sent to you containing a password for SAFE, which needs to be changed on first logging in. Once you have signed up, log on to SAFE and click on "Request Join Project". From the drop-down list choose J2AD007, enter the Project access code (for this Project, j2adqmul21) and click "Request".

Before we can set up your initial default resource allocation, please contact us with a brief description of the work you will perform using JADE2.

To acknowledge use of JADE2, please use a statement like the following:

This project made use of time on Tier 2 HPC facility JADE2, funded by
EPSRC (EP/T022205/1)

Legacy Tier 2 Services¶

Athena (decommissioned)¶

Athena was a 512-node HPC Midlands Plus cluster hosted in Loughborough, decommissioned in April 2021. To acknowledge use of Athena, please use the following statement:

We acknowledge the use of Athena at HPC Midlands+, which was funded by the EPSRC on
grant EP/P020232/1, in this research, as part of the HPC Midlands+ consortium.

Thomas (decommissioned)¶

Thomas was a 582-node Materials and Molecular Modelling cluster hosted at UCL, decommissioned in March 2021. To acknowledge use of Thomas, please use the following statement:

We are grateful to the UK Materials and Molecular Modelling Hub for
computational resources, which is partially funded by EPSRC (EP/P020194/1).

JADE (decommissioned Dec 2021)¶

JADE was a GPU cluster designed for machine learning and molecular dynamics applications. To acknowledge use of JADE, please use the following statement:

This project made use of time on Tier 2 HPC facility JADE, funded by
EPSRC (EP/P020275/1).