Compiling C, C++ and Fortran code¶
On Apocrita we provide a number of compilers and interpreters for popular programming languages, which you can use to build and run your own project code. Many programs and software components are also provided centrally, but you can use the compiler tools to build these for yourself as well.
This page focuses on the C, C++ and Fortran languages which are the most common compiled languages in use on the cluster. Other documentation pages exist for: Java, Julia, Python, R and Ruby.
Bare Metal vs Spack
These instructions will suffice for simple cases but, given the subject matter, such cases will be few and far between. For more portable, reproducible and shareable results, you should consider working through the documentation on custom Spack scopes and Spack environments.
Available compilers¶
A number of compiler suites, each offering C, C++ and Fortran compilers, are available on Apocrita:
- GCC
- Intel (part of Intel OneAPI)
- NVIDIA HPC SDK
Within a compiler suite the provided C compiler is a companion processor to the Fortran compiler in the sense of C interoperability.
The compilers are available via modules. One version of the GCC compilers is available without loading a module, but this is an earlier version than those offered through the module system, so it is preferable to load the module for the latest release of each compiler. Depending on your code and libraries, careful choice of compiler may provide considerable performance improvements.
Compilation should be performed as job submissions or interactively via qlogin in order not to impact the frontend nodes for other users. Code should be compiled on the same architecture of machine as it will run on, so the appropriate node selection should be applied to these job requests. You should also ensure that compilation and runtime modules match, otherwise dynamic libraries may complain about mismatched versions.
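For example, an interactive session for a short compilation might be requested along the following lines (the core count and memory here are purely illustrative; see the job submission documentation for the resource options appropriate to your work):
qlogin -pe smp 4 -l h_vmem=2G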
Loading a compiler module¶
It is generally a good idea to be specific with your compiler version. Check which modules you have loaded to be sure you have the right compiler and that there are no conflicts.
Check the available version for the GCC compiler suite:
$ module avail gcc
gcc/12.2.0
For Intel:
$ module avail intel
intel-classic/2021.10.0 intel-mkl/2024.1.0 intel-mpi/2021.12.1
intel-tbb/2021.9.0-gcc-12.2.0 intel/2023.2.4 intel/2024.1.0
Intel compiler version 2024.1.0 can be loaded with the command
module load intel/2024.1.0
You can test this by typing the command:
icx -V
This should return a short message reporting the compiler version:
Intel(R) oneAPI DPC++/C++ Compiler for applications running on Intel(R) 64,
Version 2024.1.0 Build 20240308
Copyright (C) 1985-2024 Intel Corporation. All rights reserved.
Often, you will require other libraries and headers that can be found in other modules. Unlike modules which provide many programs and tools, these library modules may be specific to a particular compiler suite. For example, for Open MPI and Intel MPI:
$ module avail openmpi
openmpi/5.0.3-gcc-12.2.0
$ module avail intel-mpi
intel-mpi/2021.12.1
Check your loaded modules with:
module list
If you don't specify a particular version, the version marked as default in the output of the module avail command will be loaded.
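For example, to pin the exact GCC release listed above rather than relying on the default, and then confirm what is loaded (a minimal sketch using the version shown by module avail):
module load gcc/12.2.0
module list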
Using the compilers¶
Each of the compiler suites provides a C, C++ and a Fortran compiler. The name of the compiler command varies with the language and the compiler suite. For convenience the compiler suite modules set consistent environment variables by which the compilers may be referenced. The compiler names and variables are given in the following table:
Language | Variable | GCC | Intel | NVIDIA |
---|---|---|---|---|
C | CC | gcc | icx | nvc |
C++ | CXX | g++ | icpx | nvc++ |
Fortran | FC | gfortran | ifx | nvfortran |
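For example, a simple build recipe written against these variables works unchanged whichever suite is loaded (a minimal sketch; hello.c and hello.f90 are placeholder source files):
module load gcc    # or intel, or nvidia-hpc-sdk
$CC -O2 -o hello_c hello.c
$FC -O2 -o hello_f hello.f90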
As an example, we shall consider the problem of Buffon's needle. If a needle is dropped on a surface of parallel lines, such that the line separation is twice the needle's length, the probability of it crossing the lines is the reciprocal of Pi. Thus, Pi can be estimated with a simple Monte-Carlo integration. We shall run 48 million trials, in order to occupy a CPU for a noticeable length of time.
buffon.f90
A possible implementation in Fortran90 might be:
program buffon
  use iso_fortran_env
  implicit none

  integer(kind=int32) :: trials, hits, i
  real(kind=real64)   :: pi_ref, h_pi, rnd, pos, cos_theta, result

  pi_ref = 4.0 * atan(1.0)
  h_pi = pi_ref / 2.0
  trials = 48E6
  hits = 0

  do i = 1, trials
    ! Random needle centre between lines at 0 and 4, and a random angle
    call random_number(rnd)
    pos = 4 * rnd
    call random_number(rnd)
    cos_theta = cos(pi_ref * rnd - h_pi)
    ! The needle crosses a line when its centre lies within cos_theta of it
    if (pos .lt. cos_theta .or. pos .gt. 4.0 - cos_theta) hits = hits + 1
  end do

  result = real(trials) / hits
  print "(a,f12.10,f8.3,a)", "Estimated Pi ", result, 100 * result / pi_ref, "%"
end program buffon
For Fortran with the GNU compilers:
$ module load gcc
$ gfortran -o buffon buffon.f90
$ time ./buffon
Estimated Pi 3.1422896385 100.022%
real 0m1.804s
user 0m1.777s
sys 0m0.004s
For Fortran with the Intel compilers:
$ module load intel
$ ifx -o buffon buffon.f90
$ time ./buffon
Estimated Pi 3.1410064697 99.981%
real 0m1.085s
user 0m1.082s
sys 0m0.001s
buffon.c
A possible implementation in C might be:
#include <stdlib.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    int hits, throws;
    float h_pi, pos, cos_theta, result;

    srand(time(NULL));
    throws = 48E6;
    h_pi = M_PI / 2;
    hits = 0;

    for (int i = 0; i < throws; i++) {
        /* Random needle centre between lines at 0 and 4, and a random angle */
        pos = (4.0 * rand()) / RAND_MAX;
        cos_theta = cos((M_PI * rand()) / RAND_MAX - h_pi);
        /* The needle crosses a line when its centre lies within cos_theta of it */
        if (pos < cos_theta || pos > 4.0 - cos_theta) hits++;
    }

    result = (float) throws / hits;
    printf("Estimated Pi %12.10f %8.3f%%\n", result, 100 * result / M_PI);
    return 0;
}
For C with the GNU compilers, remembering to link the standard maths libraries:
$ module load gcc
$ gcc -o buffon buffon.c -lm
$ time ./buffon
Estimated Pi 3.1412234306 99.988%
real 0m2.088s
user 0m2.082s
sys 0m0.002s
For C with the Intel compilers:
$ module load intel
$ icx -o buffon buffon.c
$ time ./buffon
Estimated Pi 3.1420614719 100.015%
real 0m1.499s
user 0m1.493s
sys 0m0.003s
Deprecated Intel compilers and MPI wrappers¶
Intel oneAPI has deprecated the icc
, icpc
, and ifort
compilers, but they
are still available by loading the intel/2023
module.
Using GPU nodes with OpenMP¶
On Apocrita we support offloading to GPU devices using
OpenMP with GCC. If you have access to the GPU nodes you can compile and
run appropriate OpenMP programs, such as those using the target
construct,
as described below.
OpenMP device offload with GCC compilers¶
OpenMP target offload should be automatically enabled when OpenMP compilation is selected with the -fopenmp compiler option. For example, to compile the source file offload-example.c, which uses the target construct, you can use:
module load gcc/12.2.0
gcc -fopenmp offload-example.c
The option -foffload=-lm is required to support the maths library on the target device. If you see an error message like
unresolved symbol sqrtf
collect2: error: ld returned 1 exit status
mkoffload: fatal error: x86_64-pc-linux-gnu-accel-nvptx-none-gcc returned 1 exit status
compilation terminated.
then you will need to provide this option when compiling.
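As an illustration, a minimal sketch of what such an offload-example.c might contain, using a target construct and a device-side maths call that needs -foffload=-lm (the file name and contents here are assumptions, not a provided example):

#include <math.h>
#include <stdio.h>

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    /* Offload the loop to the GPU device if one is available at run time */
    #pragma omp target teams distribute parallel for reduction(+:sum)
    for (int i = 0; i < n; i++) {
        sum += sqrtf((float) i);   /* device-side maths call, resolved via -foffload=-lm */
    }

    printf("sum = %f\n", sum);
    return 0;
}

This could then be compiled with:
module load gcc/12.2.0
gcc -fopenmp -foffload=-lm offload-example.c -o offload-example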
Although it is not necessary to compile the code on a GPU node to enable GPU offload, it is advisable to compile on the type of node on which you intend to run.
An OpenMP program compiled with offload enabled can be run in the same way as any other program. Offload happens automatically if a GPU is available when a target construct is entered.
To disable offload, so that the code within a target construct runs on the host CPU instead of the GPU device, compile the program with -foffload=disable. Equally, the code can be compiled without the -fopenmp option if OpenMP is not required.
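If you want to check at run time whether the program can see a GPU at all, the standard OpenMP runtime API can be queried from a small program compiled with -fopenmp as above (a minimal sketch, not Apocrita-specific):

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Reports how many offload devices the OpenMP runtime can see;
       0 means target regions will fall back to the host CPU */
    printf("OpenMP offload devices visible: %d\n", omp_get_num_devices());
    return 0;
}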
libgomp loader warnings on non-GPU nodes
If you run an OpenMP program with offload target regions on a node without a GPU you may see a warning like:
libgomp: while loading libgomp-plugin-nvptx.so.1: libcuda.so.1: cannot open shared object file: No such file or directory
These warnings occur because we provide a single compiler build to work on all node types. Compiling programs with -foffload=disable will not avoid such warnings. However, affected parallel regions will still run on the host CPU and the warnings can be safely ignored.
Build systems¶
Typically, software for Linux comes with a build system of one of two flavours: GNU Autotools or CMake. Each of these typically uses the Make tool at a lower level.
On Apocrita the GNU Autotools system can be used without loading a module, although it may be necessary to load an autotools-archive module to support some additional macros. To use CMake it is necessary to load a cmake module.
For a project using GNU Autotools the general steps to build are as follows:
./configure [options]
make
First one runs a configuration command, which creates a Makefile. One then runs the make command, which reads the Makefile and calls the necessary compilers, linkers and so on.
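When installing software into your own space, the install location is normally chosen at configure time with the standard --prefix option (the path here is just an example):
./configure --prefix=$HOME/software/myapp
make
make install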
CMake is similar but, as well as supporting Makefiles, it can also configure the build system using Visual Studio projects, macOS Xcode projects and more. Such projects can be identified by the presence of a CMakeLists.txt file.
GNU Autotools and CMake support out-of-tree builds: the binary and all its associated support files can be created in a directory other than the one containing the source files. This can be quite advantageous when working with a source management tool such as Git or SVN, or when building the project in several different configurations, for example for debugging or for targeting different node types.
To work with CMake with an out-of-tree build, start with creating a build directory in a different location:
$ pwd
/data/home/abc123/MySourceCode
$ mkdir ../MySourceCode_build
$ cd ../MySourceCode_build
$ cmake ../MySourceCode
Essentially, you enter the build directory and call cmake with the path to the directory containing your CMakeLists.txt file. If you wish to re-configure your build, you can use the program ccmake.
The end result is a Makefile. So to complete your build you type:
make
just as you would with the GNU Autotools setup.
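Configuration options for CMake projects are passed as -D cache variables when cmake is first run; for example, an install location and build type might be set like this (the path and values are illustrative):
cmake ../MySourceCode -DCMAKE_INSTALL_PREFIX=$HOME/software/myapp -DCMAKE_BUILD_TYPE=Release
make
make install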
Similarly, to use an out-of-tree build with GNU Autotools:
$ pwd
/data/home/abc123/MySourceCode
$ mkdir ../MySourceCode_build
$ cd ../MySourceCode_build
$ ../MySourceCode/configure
To learn more about GNU Autotools, CMake, and Makefiles, follow the links below:
- GNU Autotools - FAQ
- GNU Make - Writing Makefiles
- Makefile Wikipedia article
- CMake Webpage
- CMake Wikipedia article
Optional libraries for HPC¶
MPI¶
The Message Passing Interface is a protocol for parallel computation often used in HPC applications. On Apocrita we have the distinct implementations Intel MPI and Open MPI available.
The module system allows the user to select the implementation of MPI to be used, and the version. With Open MPI, as noted above, one must be careful to load a module compatible with the compiler suite being used.
To load the default (usually latest) Intel MPI module:
module load intel-mpi
To set up the Open MPI environment, version 5.0.3, suitable for use with the GCC compiler suite:
module load openmpi
module load gcc
For each implementation, several versions may be available. The default version is usually set to the latest release: an explicit version number is required to load a different version.
Default module for Open MPI
The Open MPI modules have a default loaded following the command module load openmpi, which is openmpi/5.0.3-gcc-12.2.0. This default module is specific to the GCC compiler suite, so to access an MPI implementation compatible with a different compiler suite a specific module name must be specified.
To build a program using MPI it is necessary for the compiler and linker to be able to find the header and library files. As a convenience, the MPI environment provides wrapper scripts to the compiler, each of which sets the appropriate flags for the compiler. The name of each wrapper script depends on the implementation and the target compiler.
Open MPI¶
For each Open MPI module, and the implementation provided by the NVIDIA compiler suite module, the wrapper scripts are consistently named for each language. These are given in the table below:
Language | Script |
---|---|
C | mpicc |
C++ | mpic++ |
Fortran | mpif90 |
buffon_mpi.f90
A possible MPI implementation in Fortran90 might be:
program buffon
  use iso_fortran_env
  use mpi
  implicit none

  integer(kind=int32) :: trials, local, hits, i
  integer(kind=int32) :: rank, mpisize, mpierr
  real(kind=real64)   :: pi_ref, h_pi, rnd, pos, cos_theta, result

  call MPI_INIT(mpierr)
  call MPI_COMM_SIZE(MPI_COMM_WORLD, mpisize, mpierr)
  call MPI_COMM_RANK(MPI_COMM_WORLD, rank, mpierr)

  pi_ref = 4.0 * atan(1.0)
  h_pi = pi_ref / 2.0

  ! Share the trials evenly between the MPI processes
  trials = 48E6 / mpisize
  local = 0

  do i = 1, trials
    call random_number(rnd)
    pos = 4 * rnd
    call random_number(rnd)
    cos_theta = cos(pi_ref * rnd - h_pi)
    if (pos .lt. cos_theta .or. pos .gt. 4.0 - cos_theta) local = local + 1
  end do

  ! Sum the per-process hit counts onto rank 0
  call MPI_Reduce(local, hits, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, mpierr)

  if (rank .eq. 0) then
    result = real(trials * mpisize) / hits
    print "(a,f12.10,f8.3,a)", "Estimated Pi ", result, 100 * result / pi_ref, "%"
  end if

  call MPI_FINALIZE(mpierr)
end program buffon
Returning to the Buffon's needle example, a Fortran MPI program may be compiled and run on Open MPI with the requested number of cores with:
$ echo $NSLOTS
4
$ module load openmpi/5.0.3-gcc-12.2.0
$ mpif90 -o buffon_mpi buffon_mpi.f90
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1422548294 100.021%
real 0m1.874s
user 0m2.408s
sys 0m0.735s
(Any detailed discussion of the MPI bindings is beyond the scope of this document, but sharing the trials across multiple processes and combining the counts with MPI_Reduce is an obvious approach. See the courses offered by HPC-UK and tier-2 facilities such as Archer2.)
The Open MPI wrapper scripts provide an option -show which details the final invocation of the compiler:
$ module load openmpi/5.0.3-gcc-12.2.0
$ mpif90 -show -o hello hello.f90
gfortran -o hello hello.f90 ...
buffon_mpi.c
A possible MPI implementation in C might be:
#include <stdlib.h>
#include <mpi.h>
#include <time.h>
#include <math.h>
#include <stdio.h>

int main(int argc, char** argv)
{
    int rank, mpisize, hits, local, throws;
    float h_pi, pos, cos_theta, result;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &mpisize);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    srand(time(NULL));

    /* Share the throws evenly between the MPI processes */
    throws = 48E6 / mpisize;
    h_pi = M_PI / 2;
    local = 0;

    for (int i = 0; i < throws; i++) {
        pos = (4.0 * rand()) / RAND_MAX;
        cos_theta = cos((M_PI * rand()) / RAND_MAX - h_pi);
        if (pos < cos_theta || pos > 4.0 - cos_theta) local++;
    }

    /* Sum the per-process hit counts onto rank 0 */
    MPI_Reduce(&local, &hits, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        result = (float) throws * mpisize / hits;
        printf("Estimated Pi %12.10f %8.3f%%\n", result, 100 * result / M_PI);
    }

    MPI_Finalize();
    return 0;
}
To compile and run with the appropriate number of cores in C on Open MPI, remembering to link the standard maths libraries:
$ echo $NSLOTS
4
$ module load openmpi/5.0.3-gcc-12.2.0
$ mpicc -o buffon_mpi buffon_mpi.c -lm
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1413266659 99.992%
real 0m1.381s
user 0m2.814s
sys 0m0.803s
No Open MPI module is provided for use with the NVIDIA compiler suite. Instead, the installed NVIDIA compiler environment provides an Open MPI implementation and the NVIDIA compiler module contains the appropriate settings:
$ module purge; module load nvidia-hpc-sdk/24.5
$ type mpif90
mpif90 is /share/apps/rocky9/general/apps/nvidia-hpc-sdk/2024_245/Linux_x86_64/24.5/comm_libs/mpi/bin/mpif90
Intel MPI¶
In contrast, the Intel MPI implementation supports both the Intel and GCC compiler suites from the same module. As with Open MPI, wrapper scripts are provided, but the wrapper script names depend on the target compiler suite as well as the language. They are given in the following table:
Language | Compiler suite | Script |
---|---|---|
C | GCC | mpicc |
C | Intel | mpiicx |
C++ | GCC | mpicxx |
C++ | Intel | mpiicpx |
Fortran | GCC | mpifc |
Fortran | Intel | mpiifx |
Compiling and running the MPI version of the Buffon's needle Fortran code for Intel MPI and Intel compilers:
$ echo $NSLOTS
4
$ module load intel intel-mpi
$ mpiifx -o buffon_mpi buffon_mpi.f90
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1396319866 99.938%
real 0m1.455s
user 0m2.376s
sys 0m0.465s
Compiling and running the MPI version of the Buffon's needle C code for Intel MPI and Intel compilers:
$ echo $NSLOTS
4
$ module load intel intel-mpi
$ mpiicx -o buffon_mpi buffon_mpi.c
$ time mpirun -np $NSLOTS ./buffon_mpi
Estimated Pi 3.1414599419 99.996%
real 0m1.171s
user 0m2.280s
sys 0m0.503s
Mixing Intel MPI with GNU compilers
In general we recommend that the Intel compilers are used with Intel MPI, and the GNU compilers with Open MPI. While mixing Intel MPI with GCC works for C:
$ module load gcc intel-mpi
$ mpicxx -o buffon_mpi buffon_mpi.c
Currently, mixing Intel MPI with GNU Fortran does not:
$ module load gcc intel-mpi
$ mpifc -o buffon_mpi buffon_mpi.f90
Deprecated MPI wrappers¶
Intel has also deprecated the MPI wrappers that go with the deprecated compilers: mpiicc, mpiicpc, mpiifort have been retired in favour of mpiicx, mpiicpx, mpiifx.
The scripts can be used as in the Open MPI example above:
$ module load intel-mpi
$ mpifc -show -o hello hello.f90
gfortran -o 'hello' 'hello.f90' ...
$ mpiifx -show -o hello hello.f90
ifx -o 'hello' 'hello.f90' ...
Matching versions of Intel MPI and Intel compiler
In general we recommend that, when using Intel MPI with the Intel compilers, you match the versions of the modules. However, there are times when it is necessary or desirable to use a different version of Intel MPI. In these cases you should load the Intel MPI module after loading the compiler module.
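For example, pairing the modules listed earlier, with the compiler loaded first (the versions are shown purely as an illustration):
module load intel/2024.1.0
module load intel-mpi/2021.12.1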
There is no support for the NVIDIA compilers in the Intel MPI implementation.
Compiling and testing¶
If make succeeds, you should see the compiler and linker invocations being printed to your screen, using the compiler you chose. If compilation completes successfully you should see a success message of some kind, and an executable will appear in your source or build directory.
Quite often, software comes with test programs you can also build. Often, the command to do this looks like the following:
make test
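The exact target varies between projects: Autotools-based projects commonly use make check, while CMake-generated builds are usually exercised with ctest from the build directory (both are conventions rather than guarantees, so check the project's own documentation):
make check
ctest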
Optimisation¶
Software optimisation comes in many forms, such as compiler optimisation, using alternative libraries, removing bottlenecks from code, algorithmic improvements, and parallelisation. Using processor-specific compiler options may reduce the portability of your compiled code, but can yield substantial improvements.
The Intel, NVIDIA and GCC compilers may give different performance depending on the libraries used and the processor optimisations applied. Benchmarking and comparing code compiled with each compiler is recommended.
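As an illustrative sketch with the Buffon's needle code, processor-specific flags might look like the following; -march=native (GCC) and -xHost (Intel) tune for the node you compile on, which is another reason to compile on the same node type as you intend to run on:
gcc -O3 -march=native -o buffon buffon.c -lm
icx -O3 -xHost -o buffon buffon.c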
Profiling tools¶
Once you have a running program that has been tested, there are several tools you can use to check the performance of your code. Some of these you can use on the cluster and some you can use on your own desktop machine.
perf¶
perf is a tool that creates a log of where your program spends its time. The report can be used as a guide to see where you need to focus your effort when optimising code. Once the program has been compiled, it should be run through the record subcommand of perf:
perf record -a -g my_program
where my_program is the name of the program to be profiled. Once the program has run, a log file is generated. This log file may be analysed with the report subcommand of perf. For example, to display the recorded samples grouped by command and shared object:
perf report --sort comm,dso
More information on perf can be found at this Profiling how-to and this extensive tutorial
valgrind¶
valgrind is a suite of tools that allow you to improve the speed and reduce the memory usage of your programs. An example command would be:
valgrind --tool=memcheck <myprogram>
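memcheck is only one tool in the suite; a call-graph profile can be collected in a similar way and summarised afterwards (a sketch, with <myprogram> as a placeholder as above):
valgrind --tool=callgrind <myprogram>
callgrind_annotate callgrind.out.*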
Valgrind is well suited to multi-threaded applications, but may not be suitable for longer-running applications due to the slowdown incurred by the profiled application. In addition, there is a graphical tool which is not offered on the cluster but will work on Linux desktops. There is also an extensive manual. However, there are serious issues when using Valgrind with modern AVX/AVX2/AVX-512 architectures together with GCC and Open MPI. If using the Intel compilers is an option, we recommend the valgrind/3.20.0-intel-oneapi-mpi-2021.12.1-oneapi-2024.1.0 module.
Python profiling tools¶
The above tools work best for compiled binaries. If you are writing code in Python, cProfile and line_profiler are useful options.
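For instance, cProfile can be run directly from the command line without modifying the script (my_script.py is a placeholder name):
python -m cProfile -s cumtime my_script.py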
Optimisations for slow-running Python code include parallelisation with multiprocessing or dask to use multiple cores efficiently, and compilers such as pythran or numba.
For more details, High Performance Python by Micha Gorelick and Ian Ozsvald is available to QMUL staff and students.