Apocrita shutdown for GPFS maintenance¶
Notice: Storage Maintenance
- Date: 28th June 2018 15:00 - 29th June 2018 15:00 (estimated)
- Affected: All Apocrita services (HPC Scheduler and Storage)
- Users affected: All Apocrita users
While we do what we can to minimise the frequency of system outages[^1], on this occasion, under advice from our storage provider IBM, we need to take the whole storage system off-line to perform a full check of the system and repair a small number of damaged files[^2].

This will affect the whole Apocrita system, including:

- Access to Research storage
- Login to frontend nodes
- No compute nodes will be available, and we will drain all jobs from the nodes
- Jobs submitted ahead of the shutdown time will remain queued and will not run if the requested runtime exceeds the time available until the shutdown.

[^1]: For example, over the last few weeks we have completed an update of the operating system on the whole cluster to patch the Meltdown CPU vulnerability, plus an update of the GPFS storage client, without requiring service downtime or significant degradation of service.

[^2]: A small number of files (around 20) were corrupted by a feature of the storage system used to provide high availability in the case of a system outage. Due to the issues we encountered, we have disabled this Active File Management feature and are using other techniques instead.
We will notify you when the service becomes available, or if we need to extend the maintenance window. GitHub Enterprise will continue to work, since it has no dependency on the Apocrita storage.
Please contact us if you have queries not covered by this email.
Notice: Storage Maintenance
- Date: 9:00 2nd January 2018 - 9:00 4th January 2018
- Affected: All Apocrita services (HPC Scheduler and Storage)
- Users affected: All Apocrita users
We are performing some essential maintenance on the storage system of Apocrita. This will require shutting down the whole storage system and as a result, the entire HPC cluster and storage will be unavailable for the duration of the work.
This will involve:
- Upgrade of the storage system firmware - this provides important stability fixes, and allows future minor releases to be applied without a full system shutdown.
- Migration of user home directories to the new storage system (this also requires turning off user access) - no user action will be required as a result of this task.
- Minor version update to the Univa job scheduler to fix a couple of small issues and provide performance enhancements.
We will apply a reservation to all cluster nodes so that all running jobs will have completed when we begin the update. As the shutdown date approaches, if your requested runtime exceeds the time remaining until the shutdown, your job will be added to the queue but will not run. Upon commencing the update, any jobs remaining in the queue will need to be deleted.
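As a rough illustration of how the requested runtime interacts with the shutdown (the submission time and the 72-hour request below are hypothetical), a job can only start if its runtime fits entirely before the maintenance window:

```shell
# Illustrative only: the scheduler will not start a job whose requested
# runtime (h_rt) would run past the maintenance shutdown.
shutdown_epoch=$(date -d "2018-01-02 09:00" +%s)   # start of maintenance
now_epoch=$(date -d "2017-12-29 09:00" +%s)        # hypothetical submission time
h_rt_seconds=$((72 * 3600))                        # job requests 72 hours

if [ $((now_epoch + h_rt_seconds)) -le "$shutdown_epoch" ]; then
    echo "job fits before the shutdown and can start"
else
    echo "job will remain queued until after maintenance"
fi
```

In this example the job fits, since 72 hours from the submission time still ends a day before the shutdown; a 5-day request would instead wait in the queue.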
The updates are estimated to take 2 full days to apply - we are planning to restore service by 4th January, and will keep you informed regarding progress.
We try to keep this kind of work to a minimum, and have chosen this time of year to cause the least impact to users of the service.
Note that GitHub, our documentation pages and the ticket system will not be affected by this work.
Christmas Closure 2017¶
The ITS Research office closes for Christmas on 21/12/2017. Emails and support tickets will be read on 02/01/2018 when the office re-opens.
New cluster available for use¶
We are pleased to announce that the new cluster is available for general use. This has been a large project, involving the following:
- Upgraded storage - new storage controllers and an extra 1 PB of storage
- Cluster operating system upgrade - now running CentOS 7.3
- Job scheduler upgrade - now running Univa Grid Engine 8.5.1 - please note that your old scripts will need changing before running on the new cluster. Please read the documentation on this site for full details.
- Application rebuilds for CentOS 7 - featuring the latest versions of many applications
- New nxv nodes - including InfiniBand-connected nodes for parallel jobs
- GPU nodes - 4 new nxg nodes with NVIDIA Tesla K80 GPUs, offering substantial performance increases for GPU-enabled applications
- Singularity containers - utilise Linux containers to encapsulate your software for portable and reproducible science
- Documentation site rewrite - all of the pages on this site have been rewritten for use with the new UGE scheduler. Documentation for the previous SGE cluster remains available here
- New stats site - a stats site has been written for the new cluster. While both clusters are running, a landing page will let you choose which cluster's stats you require.
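Because old SGE scripts will need changing for the new UGE scheduler, here is a minimal job script sketch; the module and program names are hypothetical, and the exact directives for your job should be taken from the documentation on this site:

```shell
#!/bin/bash
#$ -cwd                  # run the job from the current working directory
#$ -pe smp 1             # request a single core
#$ -l h_rt=1:0:0         # request 1 hour of runtime
#$ -l h_vmem=1G          # request 1 GB of memory

module load example_app  # hypothetical module name
./example_program        # hypothetical executable
```

Submitting such a script would typically be done with `qsub jobscript.sh`, with `qstat` used to monitor the queue.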
The new cluster is available via login.hpc.qmul.ac.uk, whilst the old cluster is now accessible via login-legacy.hpc.qmul.ac.uk. Please see the logging in page for more information on connecting.
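For example, connecting from a terminal might look like this (abc123 is a placeholder; use your own username):

```shell
ssh abc123@login.hpc.qmul.ac.uk          # new cluster
ssh abc123@login-legacy.hpc.qmul.ac.uk   # old (legacy) cluster
```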
Over the coming months we will be migrating more nodes from the existing cluster as demand requires. We have been testing with a group of users from a variety of disciplines over the last six months. You are free to test your favourite applications and also run production code on the new cluster.
While we have added and tested a substantial number of applications, reaching
the full complement of applications is a work-in-progress. Please fill in the
application request form if you require an application that
has not been provided yet. In the meantime, you can temporarily access the
modules built for the older cluster by executing `module load use.sl6`. This
should be done with caution, as many applications will not function correctly
since they were built with particular library versions on a different
operating system. Note that any new application requests will be built for the
new cluster only.
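The legacy-module workflow described above can be sketched as follows (the `example/1.0` module name is hypothetical):

```shell
module load use.sl6       # expose the module tree built for the older cluster
module avail              # legacy modules now appear in the listing
module load example/1.0   # hypothetical legacy module; may not work on CentOS 7
```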
If you are experiencing issues, we recommend that you search this site and read the provided documentation first to see if your question is answered. Please contact us if you are still experiencing an issue relating to the HPC cluster.
Please note that we have a new reference that should be cited for any published research. Citing Apocrita correctly in your published work helps ensure continued funding and upgrades to the service. We also have an updated usage policy - please adhere to the new policy to ensure this shared computing resource runs optimally.
Announcement regarding New Storage¶
We recently added an additional petabyte of storage. To benefit from the improved performance, all files need to be moved to the new storage.
We will be contacting each group to arrange migration of their files. If you require more space, your files will need to be migrated first. During migration, we will need to stop activity on each fileset temporarily.
Once the migration is completed, files will continue to be available under /data on the cluster, so you will not need to modify your scripts.
Announcement regarding Midplus Consortium¶
The Midlands Plus consortium has now deployed a new 14,000-core cluster, located in Loughborough. You can hear more about this from your local institution.
QMUL have also recently purchased new hardware and storage with college funding, and are in the process of migrating to it.
With the new Midplus cluster coming online, the old Midplus arrangement has reached end-of-life. As the new QMUL cluster hardware is deployed, the old Midplus cluster is simultaneously being phased out.
This means that HPC services and storage hosted by QMUL are no longer available for Warwick, Nottingham and Birmingham Midplus users.
Note that Minerva, the parallel computing part of the original Midplus cluster based at Warwick, has already been decommissioned.