Apocrita shutdown for GPFS maintenance¶
- Notice: Storage Maintenance
- Date: 28th June 2018 15:00 - 29th June 2018 15:00 (estimated)
- Affected: All Apocrita services (HPC Scheduler and Storage)
- Users affected: All Apocrita users
While we do what we can to minimise the frequency of system outages [1], on this occasion, on the advice of our storage provider IBM, we need to take the whole storage system offline to perform a full check of the system and repair a small number of damaged files [2].
This will affect the whole Apocrita system, including:
- Access to Research storage
- Login to frontend nodes
- Compute nodes, which will be unavailable and will be drained of all jobs
- Queued jobs: jobs submitted ahead of the shutdown will remain queued, and will not run, if their requested runtime exceeds the time available until shutdown.
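The rule for queued jobs comes down to simple date arithmetic. The following is a minimal sketch only: the function name is hypothetical, and it assumes a job is allowed to start when its requested runtime ends at or before the shutdown time, which may differ from the scheduler's exact behaviour.

```python
from datetime import datetime, timedelta

# Shutdown start taken from the notice above (28th June 2018, 15:00).
SHUTDOWN_START = datetime(2018, 6, 28, 15, 0)


def can_start_before_shutdown(now, requested_runtime):
    """Return True if a job starting at `now` would finish by the shutdown.

    Assumption: the boundary case (finishing exactly at the shutdown
    time) counts as acceptable; the real scheduler may be stricter.
    """
    return now + requested_runtime <= SHUTDOWN_START


# Example: 24 hours before the shutdown.
now = datetime(2018, 6, 27, 15, 0)
print(can_start_before_shutdown(now, timedelta(hours=12)))  # True
print(can_start_before_shutdown(now, timedelta(hours=48)))  # False
```

In practice this means that requesting a shorter, realistic runtime gives a job a better chance of running before the maintenance window.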
We will notify you when the service becomes available, or if we need to extend the maintenance window. GitHub Enterprise will continue to work, since it has no dependency on the Apocrita storage.
Please contact us if you have queries not covered by this email.
[1] For example, over the last few weeks we have completed an update of the operating system on the whole cluster to patch the Meltdown CPU vulnerability, plus an update of the GPFS storage client, without requiring service downtime or significant degradation of service.
[2] A small number of files (around 20) were corrupted by a feature of the storage system that is used to provide high availability in the case of a system outage. Due to the issues we encountered, we have disabled this Active File Management feature and are using other techniques instead.
New node purchases¶
A small number of new sdv nodes running Intel Skylake processors have been deployed and will form the basis of a larger tender bid being submitted shortly. As with previous tender bids, Researchers may contribute Research Grant funds to purchase additional nodes for restricted use.
Scratch space is the recommended location for temporarily storing data produced
by cluster jobs. We often see jobs failing on the cluster due to users filling
their home directory quota with data from job outputs.
We are phasing out the old /data/scratch in favour of auto-purging scratch. Files are deleted 14 or 90 days after last modification, for the weekly or monthly folder respectively. autoScratch provides much more personal storage capacity (5TB for weekly, 1TB for monthly, versus 300GB on the old service). More details are here.
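Because purging is based on last modification time, it can be useful to list files that are approaching the limit. The sketch below is illustrative only: the function name and the `warn_days` parameter are hypothetical, the directory you pass in is whichever scratch folder you use, and the real purge process may apply its rules differently.

```python
import os
import time
from pathlib import Path

# Retention periods from the text above: 14 days (weekly), 90 days (monthly).
RETENTION_DAYS = {"weekly": 14, "monthly": 90}


def files_due_for_purge(root, retention_days, warn_days=3, now=None):
    """Return files under `root` that are within `warn_days` of the
    retention limit, or already past it, judged by modification time."""
    now = time.time() if now is None else now
    cutoff = now - (retention_days - warn_days) * 86400
    due = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                if path.lstat().st_mtime <= cutoff:
                    due.append(path)
            except OSError:
                continue  # file vanished or is unreadable; skip it
    return sorted(due)
```

Calling `files_due_for_purge(some_dir, RETENTION_DAYS["weekly"])` would return the files last modified 11 or more days ago, which are the ones to copy elsewhere if they are still needed.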
Home and Group Directories¶
In recent weeks, home directories were migrated to the new storage platform. This means that quotas have grown slightly, as the underlying block size has increased.
The qmquota command will tell you how much space you are using. Note that quotas are applied on size as well as on the number of files.
Each research group gets a free 1TB of storage space on the cluster; if your group does not have one, please contact us and we can organise it for you.
We propose to remove some problematic module files on 4th April. Please check your job scripts for use of these modules:
- Python: Due to a number of issues with the module installs of Python, older versions below 3.6.3 are being removed from Apocrita (python/3.6.2-2). If your virtual environment was created with one of the listed versions, please re-create it with a fixed Python version following the instructions.
- Java: version java/1.8.0_121-oracle causes problems with mass thread spawning on the cluster and will be removed. java/1.8.0_152-oracle will remain the default version loaded.
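Checking job scripts by hand is error-prone, so a small scan can help. This is a rough sketch under stated assumptions: the function name is hypothetical, the tuple contains only the module names that appear in the notice above (extend it if other versions apply to you), and it only recognises plain `module load` or `module add` lines.

```python
import re

# Module versions named in the notice above; extend as needed.
REMOVED_MODULES = ("python/3.6.2-2", "java/1.8.0_121-oracle")

# Matches lines such as "module load python/3.6.2-2", possibly
# with several modules listed on the same line.
MODULE_LOAD = re.compile(r"^\s*module\s+(?:load|add)\s+(.+)$")


def find_removed_modules(script_text):
    """Return (line_number, module) pairs for each removed module
    that the job script loads."""
    hits = []
    for lineno, line in enumerate(script_text.splitlines(), start=1):
        code = line.split("#", 1)[0]  # drop comments and directives
        match = MODULE_LOAD.match(code)
        if not match:
            continue
        for module in match.group(1).split():
            if module in REMOVED_MODULES:
                hits.append((lineno, module))
    return hits


script = """#!/bin/bash
module load python/3.6.2-2
module load java/1.8.0_152-oracle
"""
print(find_removed_modules(script))  # [(2, 'python/3.6.2-2')]
```

Modules loaded indirectly (for example from a shell startup file) would not be found by this check.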
QMUL have access to powerful Tier 2 (formerly known as Regional) HPC resources, predominantly for EPSRC researchers. If you run multi-node parallel code (using MPI, for example), you will likely benefit from using the Tier 2 clusters.
QMUL have recently installed two IBM AC922 POWER9 servers to support research into deep learning and artificial intelligence, the first of their kind in UK HE. These servers come with a suite of customised machine learning tools such as TensorFlow and Caffe. We are currently running a pilot scheme with a select group of users, with a view to opening up to a wider audience shortly.
For the increasing number of researchers working with deep learning technologies, we also have the existing GPU nodes attached to Apocrita. TensorFlow is available, and we are working on adding more machine learning tools.
We also host a local copy of ImageNet, a database of 14 million annotated images for machine learning, in the public datasets area on Apocrita.
Short queue for short and interactive jobs¶
Please note that frontend/login nodes are for preparing and submitting your job scripts. Running computational tasks directly on the frontend nodes is forbidden, since it can impair the use of the node for the 100+ other users logged in.
In addition to the primary queue, there is a short queue designed to minimise waiting times for short jobs and interactive sessions, introduced in response to users who requested the ability to obtain qlogin sessions quickly for tests and debugging. This short queue runs on a wider selection of nodes and is automatically selected if your runtime request is 1 hour or less.
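The one-hour threshold can be checked with a little arithmetic before submitting. The sketch below is an illustration only: the function names are hypothetical, and it assumes the runtime is written either as hours:minutes:seconds or as a plain number of seconds, which may not cover every format the scheduler accepts.

```python
SHORT_QUEUE_LIMIT_SECONDS = 3600  # "1 hour or less", from the text above


def runtime_to_seconds(runtime):
    """Convert 'H:M:S' or a plain number of seconds to an integer."""
    parts = [int(p) for p in runtime.split(":")]
    if len(parts) == 1:
        return parts[0]
    if len(parts) == 3:
        hours, minutes, seconds = parts
        return hours * 3600 + minutes * 60 + seconds
    raise ValueError(f"unrecognised runtime format: {runtime!r}")


def qualifies_for_short_queue(runtime):
    """Return True if the requested runtime is one hour or less."""
    return runtime_to_seconds(runtime) <= SHORT_QUEUE_LIMIT_SECONDS


print(qualifies_for_short_queue("1:0:0"))  # True
print(qualifies_for_short_queue("1:0:1"))  # False
```

Requesting even one second over the hour would, under this reading, send the job to the primary queue instead.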