Frequently asked questions

On this page we list some common problems experienced when using the cluster, with suggestions on how to resolve them. If you contact us asking for help, please point to any solutions listed here that you have tried.

Why do I see "warning: no suitable queues" when submitting a job?

This is a bug in the current version of the job scheduler we are using. The message appears in the common scenario where no compute resources are available to run your job immediately. The job has nevertheless been added to the queue and, provided the requested resources are valid, it will run when resources become available.

Why do I see an error like "$'\r': command not found"?

It is likely that you created the job script on a Windows machine, which uses different newline characters. The issue is easily fixed by converting your file to Unix newlines as described here. We also recommend using one of the native text editors such as vim or nano (some people find nano more intuitive for basic text editing) to edit your job scripts directly on Apocrita. Note that while vim is available natively, the nano module will need to be loaded first before you can use it.
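As a minimal sketch of the conversion, assuming GNU sed is available (as it is on most Linux systems): the file name job_script.sh is hypothetical and is created here only for demonstration. Where the dos2unix utility is installed, it can do the same job.

```shell
# Create a demo script with Windows (CRLF) line endings; the file name is hypothetical.
workdir="$(mktemp -d)"
script="$workdir/job_script.sh"
printf '#!/bin/bash\r\necho "hello"\r\n' > "$script"

# Strip the trailing carriage return from every line, editing the file in place.
sed -i 's/\r$//' "$script"

# Count the lines that still contain a carriage return (0 means the file is clean).
cr="$(printf '\r')"
remaining="$(grep -c "$cr" "$script" || true)"
echo "Lines still containing CR: $remaining"
```

Run against your own script, only the sed line is needed; the rest sets up and checks the demonstration file.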

Why do I get "ssh: connect to host port 22: Connection refused" or similar message when trying to connect?

We use a system to protect against brute-force login attacks. If you have 5 failed login attempts within 10 minutes, you will automatically be locked out for 30 minutes. It is likely you are attempting to authenticate with an incorrect password. If you receive a "connection timed out" message instead, this may be a network issue, or your ISP may be blocking access to SSH port 22. In this instance, it is worth checking whether SSH connections work to other machines you have access to, and contacting your ISP or network provider.

What does "removed environment variable LD_LIBRARY_PATH from submit environment" mean when I submit my job?

This warning is shown when the "-V" option has been added to the submission script or the submission command. For security reasons, the scheduler does not allow library paths to be forwarded from the submission environment (the frontend) to the runtime environment (the nodes running the job). If you have loaded a module before submission, the LD_LIBRARY_PATH variable will likely be set. The best practice is to avoid "-V" in jobs, and instead load all necessary modules and set environment variables within a self-contained job script, so that the job is easier for you or others to reproduce later.
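A self-contained job script along these lines might look like the sketch below. The module name gcc, the variable MY_APP_MODE and the runtime request are placeholders rather than recommendations. The "type module" guard is an addition so that the sketch also runs on machines without the module system; on the cluster the module command is expected to be available.

```shell
#!/bin/bash
# Run the job from the current working directory.
#$ -cwd
# Request 1 hour of runtime (placeholder value).
#$ -l h_rt=1:0:0
# Note: there is no "-V" here; the environment is built inside the script instead.

# Load the required modules explicitly. "gcc" is a placeholder module name.
if type module >/dev/null 2>&1; then
    module load gcc
fi

# Set any variables the program needs (hypothetical example variable).
export MY_APP_MODE="production"

echo "MY_APP_MODE=$MY_APP_MODE"
```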

What can I do when my program fails to run with an error message like "cannot open shared object file: No such file or directory"?

Usually this means that the software has been dynamically linked: at runtime, the environment needs to know where its external library dependencies are located. If the library is provided by gcc, for example, then loading the relevant gcc module will add additional directories to LD_LIBRARY_PATH for the system to search when the program is run. The ldd command will also show the shared object dependencies of a compiled file. If you are struggling to identify the missing library, please get in touch with the team and provide the steps necessary to reproduce the issue, and we will investigate for you.
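As a rough illustration of using ldd to find unresolved dependencies, the sketch below inspects an existing system binary (ls) as a stand-in for your own program. Lines marked "not found" in the ldd output are the libraries the runtime linker cannot locate.

```shell
# Use an existing system binary as a stand-in for your own program.
binary="$(command -v ls)"

# List every shared library the binary needs and where each one resolves to.
ldd "$binary" || true

# Keep only the dependencies that could not be found (empty when all resolve).
missing="$(ldd "$binary" 2>/dev/null | grep "not found" || true)"
if [ -n "$missing" ]; then
    echo "Unresolved libraries:"
    echo "$missing"
else
    echo "All shared libraries resolved for $binary"
fi
```

If a library is reported as missing, loading the module that provides it and re-running ldd should show it resolving to a path.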

How can I build my program when I see an error message like "/usr/bin/ld: cannot find -llibrary"?

The environment cannot find certain dependencies needed to build the program. Identifying the correct module that provides the library is a matter of experience, but often there will be a module with a similar-sounding name. For example, "/usr/bin/ld: cannot find -llapack" may be resolved by loading the relevant lapack module into the environment.

How can I build my program when I see an error message like "Error: C++14 standard requested but CXX14 is not defined"?

Some R packages rely on a C++ compiler which supports the C++14 language standard. This can be resolved by creating a Makevars file with the necessary information detailed here.
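As an illustrative sketch only: R reads per-user build settings from ~/.R/Makevars, and the CXX14 family of variables tells it which compiler and flags to use for C++14 code. The values below are assumptions for a GCC toolchain (with a suitable compiler already available in your environment), and MAKEVARS_DIR is a hypothetical override that lets you try the sketch in another directory. The settings are appended, so check the file for duplicates if you run it more than once.

```shell
# R reads per-user build settings from ~/.R/Makevars.
# MAKEVARS_DIR is a hypothetical override so the sketch can be tried elsewhere.
makevars_dir="${MAKEVARS_DIR:-$HOME/.R}"
mkdir -p "$makevars_dir"

# Append the C++14 settings; these values are assumptions for a GCC toolchain.
cat >> "$makevars_dir/Makevars" <<'EOF'
CXX14 = g++
CXX14STD = -std=gnu++14
CXX14FLAGS = -O2
CXX14PICFLAGS = -fPIC
EOF

# Show the resulting file.
cat "$makevars_dir/Makevars"
```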

How can I install a package or program when I get "permission denied"?

Unlike applications on a personal device, the applications on Apocrita are not installed as part of the operating system on each compute node, but are installed to shared storage mounted on all of the nodes. To install an application which is suitable for Apocrita but isn't currently provided by us, there are a couple of options.

  1. Install it locally within your own home folder or shared project
  2. Request that we install it for all users

If you are seeking to install it in your own storage space, when following instructions designed for personal devices, there may be a step that attempts to install the files into /usr/local/bin or some other space limited to administrators. You will need to specify an install location within your home folder or research project folder which you have full permissions to write into. There may also be instructions which tell you to use the sudo command to elevate privileges to administrative access. Any commands attempting to use sudo will fail due to lack of access rights.
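A minimal sketch of installing into your own space, assuming an autotools-style package: the directory $HOME/software/myapp is a hypothetical location, and the configure and make commands are shown as comments because they depend on the software you are building.

```shell
# Hypothetical install location inside your home directory.
prefix="$HOME/software/myapp"
mkdir -p "$prefix/bin"

# For an autotools-style package you would typically run (shown, not executed):
#   ./configure --prefix="$prefix"
#   make
#   make install

# Make the installed programs visible to your shell.
export PATH="$prefix/bin:$PATH"
echo "Install prefix: $prefix"
```

Other build systems have their own equivalent of the prefix option; the common point is that the target directory is one you can write to without sudo.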

Why does my job have lots of threads running but each using little CPU?

Some applications attempt to auto-detect the number of cores available to your job, but often the result is that the application attempts to run as many threads as there are processor cores on the entire compute node, rather than what you have requested for your job. Fortunately, a lot of applications also allow you to manually specify the number of cores available - where this is the case, you can provide the variable $NSLOTS which takes the value of the number of cores you requested for your job. An example is here.
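As a small sketch of passing the requested core count to a program: OMP_NUM_THREADS is the standard variable read by OpenMP-based applications, while the program name and its --threads option in the comment are hypothetical. The fallback to 1 is an assumption made so the snippet also works outside a job, where NSLOTS is not set.

```shell
# NSLOTS is set by the scheduler inside a job; default to 1 when testing elsewhere.
threads="${NSLOTS:-1}"

# Many OpenMP-based programs read the thread count from OMP_NUM_THREADS.
export OMP_NUM_THREADS="$threads"

# Other programs take the count as an option, e.g. (hypothetical program and flag):
#   ./my_program --threads "$threads"
echo "Using $threads thread(s)"
```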

Why did my job fail with a bus error?

Bus errors indicate that insufficient RAM was requested for the job. The jobstats tool is helpful in determining how much RAM your jobs used. If the failure happens very early in the job's execution, try testing with larger RAM requests on the short queue using -l h_rt=1:0:0, where the queueing time will be much shorter. When the error no longer occurs, you can run the job again on the main queues by restoring the original h_rt value.

Why did my program work fine after build, but fails when submitted as a job?

If you built your custom program after loading additional modules (for example GCC, Java, or other), you also need to load the exact versions in your job script, otherwise the job will fail due to missing libraries or headers.

Can I run a Docker container on Apocrita?

While testing Docker, we found that it is possible to escalate user privileges, which is a considerable security risk, so we (and other HPC sites) don't have the Docker software installed. However, Apptainer (previously known as Singularity) is a container solution designed for HPC services which is compatible with Docker, and you can download and run Docker containers with Apptainer.
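A brief sketch of pulling and running a Docker image with Apptainer: the ubuntu:22.04 image and the output file name ubuntu.sif are arbitrary examples. The availability check is an addition so the snippet fails gracefully where apptainer is not on your PATH; whether a module needs loading first on the cluster is not covered here.

```shell
# Example image from Docker Hub; any public Docker image reference works the same way.
image_uri="docker://ubuntu:22.04"
sif_file="ubuntu.sif"

if command -v apptainer >/dev/null 2>&1; then
    # Download the Docker image and convert it to a local SIF file.
    apptainer pull "$sif_file" "$image_uri"
    # Run a command inside the container.
    apptainer exec "$sif_file" cat /etc/os-release
else
    echo "apptainer not found; load or install it first"
fi
```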

How can I fix an "UNPROTECTED PRIVATE KEY FILE!" warning?

This warning is shown when the permissions on your hidden .ssh directory (likely in your home directory) or on your private SSH keys are too open. OpenSSH on your local machine generates it when you attempt to use the private key. To fix this, reset the permissions to the defaults on your local machine:

chmod 700 ~/.ssh
chmod 600 ~/.ssh/*

How can I fix a "Permission denied (publickey)" error?

This error is shown when you are not using your private SSH key when connecting, or when you are using the wrong private SSH key. Please also confirm that you have uploaded your public key in the correct format (QMUL users only), and that we have accepted it. In some cases this error is shown when your account is suspended. If you are sure that your account is active, please check both of the following:

  1. You are using your private SSH key when connecting.
  2. Your private SSH key is correct and has not been overwritten with a new one.