New node purchases¶
A small number of new sdv nodes running Intel Skylake processors have been deployed. A larger batch of these nodes will be added to the cluster before the end of the year.
Scratch space is the recommended location for temporarily storing data produced
by cluster jobs. We often see jobs failing on the cluster due to users filling
their home directory quota with data from job outputs.
We are phasing out the old
/data/scratch in favour of auto-purging scratch (autoScratch).
Files are deleted 14 or 90 days after last modification, for the weekly or monthly folder respectively. autoScratch provides much more personal storage capacity (5TB for weekly, 1TB for monthly) than the old service (300GB). More details are here.
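To illustrate the purge rule, the following is a minimal Python sketch that lists files whose last modification time is older than a given number of days. It only reads timestamps and is not the purge mechanism itself, which runs on the storage system. The function name `files_older_than` and the idea of checking ahead of the purge are assumptions made for this example.

```python
import os
import time

SECONDS_PER_DAY = 24 * 60 * 60


def files_older_than(root, max_age_days, now=None):
    """Return paths under root last modified more than max_age_days ago."""
    # Allow the caller to fix "now" so results are reproducible.
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    stale = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.lstat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                stale.append(path)
    return sorted(stale)
```

Calling `files_older_than(folder, 12)` on a weekly folder, for instance, would show files that are within two days of the 14-day limit, so they can be copied elsewhere before they are deleted.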
Home and Group Directories¶
During the summer, home directories were migrated to the new storage platform. This means that quotas have grown slightly as the underlying block size has increased.
The qmquota command will tell you how much space you are using.
Note that quotas are applied on size as well as the number of files.
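Because quotas cover both size and file count, it can help to see which directories hold many small files. The sketch below is a rough illustration only: `directory_usage` is a name chosen for this example, it sums apparent file sizes rather than allocated blocks, and the figures reported by qmquota remain the authoritative ones.

```python
import os


def directory_usage(root):
    """Return (file_count, total_bytes) for all files under root."""
    file_count = 0
    total_bytes = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            try:
                # lstat avoids following symbolic links out of the tree.
                info = os.lstat(os.path.join(dirpath, name))
            except OSError:
                continue  # skip files that disappear while scanning
            file_count += 1
            total_bytes += info.st_size
    return file_count, total_bytes
```

Running this on each subdirectory of a home directory gives a quick view of where the file count is concentrated.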
Each research group gets a free 1TB of storage space on the cluster; if your
group has not got one then please contact us and we can organise it for you.
We removed some problematic module files. Please check your job scripts for use of these modules:
- Python: Due to a number of issues with the module installs of Python, older versions below 3.6.3 are being removed from Apocrita (python/3.6.2-2). If your virtual environment was created with one of the listed versions, please re-create it with a fixed Python version following the instructions.
- Java: version java/1.8.0_121-oracle causes problems with mass thread spawning on the cluster and will be removed. java/1.8.0_152-oracle will remain the default version loaded.
QMUL have access to powerful Tier 2 (formerly known as Regional) HPC resources, predominantly for EPSRC researchers. If you run multi-node parallel code (using MPI, for example), you will likely benefit from using the Tier2 clusters.
QMUL installed two IBM AC922 POWER 9 servers to support research into deep learning and artificial intelligence, the first of their kind in UK HE. These servers come with a suite of customised Machine learning tools such as TensorFlow and Caffe. We are currently running a pilot scheme with a select group of users, with a view to opening up to a wider audience.
For the increasing number of researchers working with deep learning technologies, we also have the existing GPU nodes attached to Apocrita. TensorFlow is available, and we are working on adding more machine learning tools.
We have also added a local copy of ImageNet, a database of 14 million annotated images for machine learning, to the public datasets area on Apocrita.
Short queue for short and interactive jobs¶
Please note that frontend/login nodes are for preparing and submitting your job scripts. Running computational tasks directly on the frontend nodes is forbidden, since it can impair the use of the node for the 100+ other users logged in.
In addition to the primary queue, there is a queue designed to minimise waiting times for short jobs and interactive sessions, in response to users who requested the ability to quickly obtain qlogin sessions for quick tests and debugging. This short queue runs on a wider selection of nodes and is automatically selected if your runtime request is 1 hour or less.
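As an illustration of the one-hour threshold, the sketch below checks whether a runtime request would fall within the short queue limit. It assumes the runtime is written in the `H:M:S` form; the function names are hypothetical, and the actual queue selection is done by the scheduler, not by user code.

```python
# "1 hour or less", taken from the description above.
SHORT_QUEUE_LIMIT_SECONDS = 60 * 60


def runtime_to_seconds(runtime):
    """Convert an 'H:M:S' runtime string such as '1:0:0' to seconds."""
    hours, minutes, seconds = (int(part) for part in runtime.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def eligible_for_short_queue(runtime):
    """Return True if the requested runtime is one hour or less."""
    return runtime_to_seconds(runtime) <= SHORT_QUEUE_LIMIT_SECONDS
```

Under these assumptions, a request of `1:0:0` qualifies, while `1:0:1` does not.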
Apocrita shutdown for GPFS maintenance¶
Notice: Storage Maintenance
Date: 28th June 2018 15:00 - 29th June 2018 15:00 (estimated)
Affected: All Apocrita services (HPC Scheduler and Storage)
Users affected: All Apocrita users
While we do what we can to minimise the regularity of system outages [1], on this occasion, under advice from our storage provider IBM, we need to bring the whole storage system off-line to perform a full check of the system, and repair a small number of damaged files [2].
This will affect the whole Apocrita system, including:
- Access to Research storage
- Login to frontend nodes
- No compute nodes will be available and we will drain nodes of all jobs
- Jobs submitted ahead of the shutdown time will remain queued and not run if the requested runtime exceeds the time available until shutdown.
[1] For example, over the last few weeks we have completed an update of the Operating System on the whole cluster to patch the Meltdown CPU vulnerability, plus an update of the GPFS storage client, without requiring service downtime or significant degradation of service.
[2] A small number of files (around 20) were corrupted by a feature of the storage system, used to provide high availability in the case of a system outage. Due to the issues we encountered, we have disabled this Active File Management feature and are using other techniques instead.
We will notify you when the service becomes available, or if we need to extend the maintenance window. GitHub Enterprise will continue to work since this has no dependency on the Apocrita storage.
Please contact us if you have queries not covered by this email.
Notice: Storage Maintenance
Date: 9:00 2nd January 2018 - 9:00 4th January 2018
Affected: All Apocrita services (HPC Scheduler and Storage)
Users affected: All Apocrita users
We are performing some essential maintenance on the storage system of Apocrita. This will require shutting down the whole storage system and as a result, the entire HPC cluster and storage will be unavailable for the duration of the work.
This will involve:
- upgrade of storage system firmware - this provides important stability fixes, and allows any future minor releases to be applied without a full system shutdown.
- migration of user home directories to the new storage system (also requires turning off user access) - no user action will be required as a result of this task.
- minor version update to the Univa job scheduler to fix a couple of small issues and provide performance enhancements.
We will apply a reservation to all cluster nodes, so that all running jobs will be completed when we begin the update. As you approach the final date, if your requested runtime exceeds the number of days until the shutdown, your job will be added to the queue but not run. Upon commencing the update, any existing jobs in the queue will need to be deleted.
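The runtime rule above comes down to simple date arithmetic. The following Python sketch shows the comparison under the assumption that a job can start only if it would finish by the start of the maintenance window. The function name is hypothetical and the scheduler's reservation is what actually enforces this.

```python
from datetime import datetime, timedelta


def can_start_before_shutdown(now, requested_runtime, shutdown_start):
    """Return True if a job starting now would finish by shutdown_start."""
    return now + requested_runtime <= shutdown_start


# Start of the maintenance window announced above: 9:00 on 2 January 2018.
shutdown = datetime(2018, 1, 2, 9, 0)
# Hypothetical submission time, three days before the shutdown.
submitted = datetime(2017, 12, 30, 9, 0)

two_day_job = can_start_before_shutdown(submitted, timedelta(days=2), shutdown)
five_day_job = can_start_before_shutdown(submitted, timedelta(days=5), shutdown)
```

In this example the two-day request fits before the shutdown, while the five-day request would stay in the queue. Requesting a shorter runtime, where the work allows it, is one way to get a job to run before the maintenance begins.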
The updates are estimated to take 2 full days to apply - we are planning to restore service by 4th January, and will keep you informed regarding progress.
We try to keep this kind of work to a minimum and have chosen the time of year to hopefully cause least impact to users of the service.
Note that GitHub, our documentation pages and the ticket system will not be affected by this work.
Christmas Closure 2017¶
The ITS Research office closes for Christmas on 21/12/2017. Emails and support tickets will be read on 02/01/2018 when the office re-opens.
New cluster available for use¶
We are pleased to announce that the new cluster is available for general use. This has been a large project, involving the following:
- Upgraded storage - new storage controllers and extra 1PB of storage
- Cluster Operating System upgrade - now running CentOS7.3
- Job scheduler upgrade - now running Univa Grid Engine 8.5.1 - please note that your old scripts will need changing before running on the new cluster. Please read the documentation on this site for full details.
- Application rebuilds for CentOS7 - featuring latest versions of many applications
- New nxv nodes - including InfiniBand-connected nodes for parallel jobs
- GPU nodes - 4 new nxg nodes with Nvidia Tesla K80 GPUs, offering a substantial performance increase for GPU-enabled applications
- Singularity containers - utilise Linux containers to encapsulate your software for portable and reproducible science
- Documentation site rewrite - all of the pages on this site have been rewritten for use with the new UGE scheduler. Documentation for the previous SGE cluster remains available here
- Stats site - a stats site has been written for the new cluster. While both clusters are running, a landing page will be shown to give you the option of which cluster stats you require.
The new cluster is available via
login.hpc.qmul.ac.uk whilst the old cluster
is now accessible via
login-legacy.hpc.qmul.ac.uk. Please see the
logging in page for more information on connecting.
Over the coming months we will be migrating more nodes from the existing cluster, as demand requires it. We have been testing with a group of users from a variety of disciplines over the last 6 months. You are free to test your favourite applications and also run production code on the new cluster.
While we have added and tested a substantial number of applications, reaching
the full complement of applications is a work-in-progress. Please fill in the
application request form if you require an application that
has not been provided yet. In the meantime, you can temporarily access the
modules built for the older cluster by executing
module load use.sl6, but it
should be used with caution as many applications will not function correctly
since they were built with particular library versions on a different
operating system. Note that any new application requests will be built for the
new cluster only.
If you are experiencing issues, we recommend that you search this site and read the provided documentation first to see if your question is answered. Please contact us if you are still experiencing an issue relating to the HPC cluster.
Please note that we have a new reference that should be cited for any published research. Citing Apocrita correctly in your published work helps ensure continued funding and upgrades to the service. We also have an updated usage policy - please adhere to the new policy to ensure this shared computing resource runs optimally.
Announcement regarding New Storage¶
We recently added an additional petabyte of storage. To benefit from improved performance, all files need to be moved to the new storage.
We will be contacting each group to arrange for migration of their files. If you require more space your files will need to be migrated first. During migration we will need to stop activity on each fileset temporarily.
Once the migration is completed, files will continue to be available under /data on the cluster, so you will not need to modify your scripts.
Announcement regarding Midplus Consortium¶
The Midlands Plus consortium have now deployed a new 14,000 core cluster, located in Loughborough. You can hear more about this from your local institution.
QMUL have also recently purchased new hardware and storage with college funding, and are in the process of migrating to it.
With the new Midplus cluster coming online, the old Midplus arrangement has reached end-of-life. As the new QMUL cluster hardware is deployed, the old Midplus cluster is simultaneously being phased out.
This means that HPC services and storage hosted by QMUL are no longer available for Warwick, Nottingham and Birmingham Midplus users.
Note that Minerva, the parallel computing part of the original Midplus cluster based at Warwick, has already been decommissioned.