Frequently asked questions¶
On this page we list some common problems experienced when using the cluster, with suggestions on how to resolve them. If you contact us asking for help, please point to any solutions listed here that you have tried.
Why do I see "warning: no suitable queues" when submitting a job?¶
This is a bug in the current version of the job scheduler we are using. The message commonly appears when no compute resources are available to run your job immediately. The job has still been added to the queue and, provided the requested resources are valid, it will run when resources become available.
Why do I see an error like "$'\r': command not found"?¶
It is likely that you created the job script on a Windows machine, which uses
different newline characters. The issue is easily fixed by converting your file
to Unix newlines as described here.
We also recommend using one of the native text editors such as vim or nano
(some people find nano more intuitive for basic text editing) to edit your job
scripts directly on Apocrita. Note that while vim is available natively, the
nano module must be loaded before you can use it.
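The linked instructions are not reproduced here. As one possible approach, the carriage returns that Windows adds before each newline can be stripped with the standard tr utility. This is a minimal sketch: the file names are hypothetical, and the sample file is created only so that the example is self-contained.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)

# Hypothetical job script saved with Windows (CRLF) line endings.
printf '#!/bin/bash\r\necho "hello"\r\n' > "$workdir/job_windows.sh"

# Delete every carriage return, leaving Unix (LF) line endings.
tr -d '\r' < "$workdir/job_windows.sh" > "$workdir/job_unix.sh"

# Count the carriage returns that remain in the converted file.
remaining=$(tr -cd '\r' < "$workdir/job_unix.sh" | wc -c)
echo "carriage returns remaining: $remaining"
```

Writing to a new file rather than converting in place keeps the original available until you have confirmed that the converted script runs.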
Why do I get "ssh: connect to host login.hpc.qmul.ac.uk port 22: Connection refused" or similar message when trying to connect?¶
We use a system to protect against brute-force attacks. If you have 5 failed login attempts within 10 minutes, you will automatically be locked out for 30 minutes. It is likely you are attempting to authenticate with an incorrect password. If you instead receive a "connection timed out" message, this may be a network issue, or your ISP may be blocking access to SSH port 22. In this case, check whether SSH connections work to other machines you have access to, and contact your ISP or network provider.
What does "removed environment variable LD_LIBRARY_PATH from submit environment" mean when I submit my job?¶
This warning is shown when the "-V" option has been added to the submission
script or submission command. For security reasons, the scheduler does not
allow library paths to be forwarded from the submission environment (the
frontend) to the runtime environment (nodes running the job). If you have
loaded a module before submission, the LD_LIBRARY_PATH variable will likely
be set. The best practice is not to use "-V" in jobs, and instead to load all
necessary modules and set environment variables within a self-contained job
script, so that the job can be reproduced later by you or by others.
What can I do when my program fails to run with an error message like "cannot open shared object file: No such file or directory"?¶
Usually this means that the software has been dynamically linked: at runtime,
the environment needs to know where the external library dependencies are
located. If the library is provided by gcc, for example, then loading the
relevant gcc module will add additional directories to the LD_LIBRARY_PATH for
the system to search when the program is run. Additionally, the ldd command
will show the shared object dependencies of a compiled file. If you are
struggling to identify the missing library, please get in touch with the team
and provide the steps necessary to reproduce the issue, and we will investigate
for you.
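As an illustration of the ldd check, the sketch below lists a program's shared library dependencies and filters for any that cannot be resolved. The ls command is only a stand-in so that the example is self-contained; substitute the path to your own executable.

```shell
# Program to inspect. "ls" is only a stand-in here; replace it with the
# path to your own executable.
program=$(command -v ls)

# Capture the shared libraries the program needs and where each was found.
deps=$(ldd "$program" || true)
echo "$deps"

# Keep only the dependencies that could not be resolved. ldd marks
# these lines with "=> not found".
missing=$(echo "$deps" | grep '=> not found' || true)
echo "unresolved libraries: ${missing:-none}"
```

If a library is reported as not found, loading the module that provides it and running ldd again should show the dependency resolved to a path.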
How can I build my program when I see an error message like "/usr/bin/ld: cannot find -llibrary"?¶
The environment cannot find certain dependencies needed to build the program.
Identifying the correct module that provides this library will be a matter of
experience, but often there will be a module with a similar sounding name, e.g.
"/usr/bin/ld: cannot find -llapack" may be resolved by loading the relevant
lapack module into the environment.
How can I build my program when I see an error message like "Error: C++14 standard requested but CXX14 is not defined"?¶
Some R packages rely on a C++ compiler which supports the C++14 language
revision. This can be resolved by creating a Makevars file with the necessary
information detailed here.
How can I install a package or program when I get "permission denied"?¶
Contrary to how you might install an application on a personal device, the applications on Apocrita are not installed as part of the operating system on each compute node, but are installed to shared storage mounted on all of the nodes. To install an application which is suitable for Apocrita but isn't currently provided by us, there are a couple of options:
- Install it locally within your own home folder or shared project
- Request that we install it for all users
If you are seeking to install it in your own storage space, when following
instructions designed for personal devices, there may be a step that attempts
to install the files into /usr/local/bin or some other space limited to
administrators. You will need to specify an install location within your home
folder or research project folder which you have full permissions to write
into. There may also be instructions which tell you to use the sudo command to
elevate privileges to administrative access. Any commands attempting to use
sudo will fail due to lack of access rights.
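The idea of installing into a location you own can be sketched as follows. The tool name mytool and the directory layout are hypothetical, and a scratch directory stands in for your home or project folder so that the example is self-contained.

```shell
# Hypothetical install location. On the cluster this would typically be a
# directory under your home or project space, e.g. "$HOME/software/mytool".
prefix="$(mktemp -d)/software/mytool"
mkdir -p "$prefix/bin"

# Stand-in for a real application: a two-line script called "mytool".
printf '#!/bin/sh\necho "mytool is running"\n' > "$prefix/bin/mytool"
chmod 755 "$prefix/bin/mytool"

# Put the install location on PATH so the program can be run by name.
# No sudo is needed because you own the prefix directory.
export PATH="$prefix/bin:$PATH"
mytool
```

Many build systems accept an equivalent setting, for example the --prefix option of Autotools configure scripts or CMAKE_INSTALL_PREFIX for CMake. Check the installation notes of the software you are building for the option it supports.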
Why does my job have lots of threads running but each using little CPU?¶
Some applications attempt to auto-detect the number of cores available to your
job, but often the result is that the application attempts to run as many
threads as there are processor cores on the entire compute node, rather than
what you have requested for your job. Fortunately, a lot of applications also
allow you to manually specify the number of cores available. Where this is the
case, you can provide the variable $NSLOTS, which takes the value of the
number of cores you requested for your job. An example is here.
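The linked example is not reproduced here. As a minimal sketch, the value of $NSLOTS can be passed on either through an environment variable or as a command-line option. OMP_NUM_THREADS is the standard OpenMP variable; the program name my_threaded_app and its --threads option are hypothetical, so that command is only printed.

```shell
# Inside a job the scheduler sets NSLOTS to the number of cores requested.
# Outside a job it is unset, so fall back to a single core.
cores="${NSLOTS:-1}"

# Many OpenMP-based programs read their thread count from this variable.
export OMP_NUM_THREADS="$cores"

# Other programs take the thread count as an option. The program name and
# its "--threads" option are hypothetical, so the command is only printed.
echo "would run: my_threaded_app --threads $cores"
```

Check your application's documentation for the exact option or variable it uses to control its thread count.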
Why did my job fail with a bus error?¶
Bus errors indicate insufficient RAM was requested for the job. The jobstats
tool is helpful in determining how much RAM your jobs used. If the failure
happens at a very early stage of job execution, try testing with higher RAM
sizes on the short queue with -l h_rt=1:0:0, where the queueing time will be
much shorter. When the error no longer occurs, you can try the job again on
the main queues by restoring the original h_rt value.
Why did my program work fine after build, but fails when submitted as a job?¶
If you built your custom program after loading additional modules (for example GCC, Java, or others), you also need to load the same versions of those modules in your job script, otherwise the job will fail due to missing libraries or headers.
Can I run a docker container on Apocrita?¶
While testing Docker, we found that it is possible to escalate user privileges, which is a considerable security risk, so we (like other HPC sites) don't have the Docker software installed. However, Apptainer (previously known as Singularity) is a container solution designed for HPC services which is compatible with Docker, and you can download and run Docker containers with Apptainer.
How can I fix an "UNPROTECTED PRIVATE KEY FILE!" warning?¶
This error is shown when the permissions on your hidden .ssh directory
(likely in your home directory), and your private SSH keys are not secure
enough for the SSH protocol on your local machine. OpenSSH will generate
the error when you attempt to use the private key. To fix this, you will
need to reset the permissions back to the default on your local machine:
chmod 755 ~/.ssh
chmod 600 ~/.ssh/*
How can I fix a "Permission denied (publickey)" error?¶
This error is shown when you are not using your private ssh key when connecting, or you are using the wrong private ssh key. Please also confirm that you have uploaded your public key in the correct format (QMUL users only), and that we have accepted it. In some cases this error is shown when your account is suspended. If you are sure that your account is active, please check both of the following:
- You are using your private ssh key when connecting.
- Your private ssh key is correct and has not been overwritten with a new one.
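One way to check whether a private key still corresponds to a public key is to derive the public key from the private one with ssh-keygen -y and compare the two. The sketch below does this on a throwaway key pair so that it is self-contained; the file name id_demo is hypothetical, and with real keys you would point the commands at your own files.

```shell
# Scratch directory holding a throwaway key pair for the demonstration.
keydir=$(mktemp -d)
ssh-keygen -t ed25519 -N '' -q -f "$keydir/id_demo"

# Derive the public key from the private key (key type and key data only).
derived=$(ssh-keygen -y -f "$keydir/id_demo" | cut -d' ' -f1,2)

# Read the same two fields from the stored public key file.
stored=$(cut -d' ' -f1,2 "$keydir/id_demo.pub")

if [ "$derived" = "$stored" ]; then
    echo "public key matches the private key"
else
    echo "public key does NOT match the private key"
fi
```

To make sure a particular key is offered when connecting, it can be named explicitly with the -i option of ssh, for example ssh -i ~/.ssh/id_ed25519 abc123@login.hpc.qmul.ac.uk, where the key path and the username abc123 are placeholders.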