TensorFlow¶
TensorFlow is an open source library for machine learning.
Installation¶
Installing with pip (recommended)¶
Package naming and installing specific versions
For releases 1.15 and older, the CPU and GPU pip packages are separate. If using one of these older releases, we recommend installing the GPU package, because TensorFlow programs typically run much faster on a GPU than on a CPU. Researchers need to request permission to be added to the list of GPU node users.
To select a specific version, use the standard pip method, noting that other versions may have been built with different CUDA libraries. To install version 1.15, run `pip install tensorflow-gpu==1.15`. Omitting the version number installs the latest release.
The TensorFlow package may be installed using pip in a virtualenv, which uses packages from the Python Package Index. This is the officially supported installation method.
Pip GPU version¶
Recent TensorFlow versions bundle the CUDA libraries required to utilise a GPU, as long as you add `[and-cuda]` to your installation command.
Initial setup:
module load python
virtualenv tf_env
source tf_env/bin/activate
pip install 'tensorflow[and-cuda]'
This should install a number of packages called `nvidia-*` into your virtualenv, which provide the required CUDA libraries. These libraries must be loaded for TensorFlow to use GPU acceleration. Older versions may still require you to load an external CUDNN module (which will also load the corresponding CUDA module as a prerequisite).
Be sure to check for any errors in the job output, as an incorrect installation, or the wrong CUDA or CUDNN module version, will usually result in the GPU not being used.
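A quick way to confirm that the GPU libraries were picked up is to ask TensorFlow which devices it can see (a minimal sketch; run it inside the activated virtualenv on a GPU node):

```python
import tensorflow as tf

# An empty list means TensorFlow can only see the CPU, which usually
# indicates missing or mismatched CUDA/cuDNN libraries.
gpus = tf.config.list_physical_devices("GPU")
print(gpus)
```

On a node without a GPU this prints an empty list, which is expected there.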
If you have any other additional Python package dependencies, these should be installed into your virtualenv with additional `pip install` commands, or in bulk, using a requirements file.
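For example, a hypothetical `requirements.txt` for a project might contain:

```text
# Example requirements.txt (illustrative package names and version pins)
tensorflow[and-cuda]
numpy>=1.26
pandas>=2.0
```

All listed packages can then be installed in one step with `pip install -r requirements.txt`.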
Subsequent activation as part of a GPU job:
module load python
source tf_env/bin/activate
Pip CPU version¶
Most users will probably want to install the GPU version of TensorFlow as above, but if for some reason you want a CPU-only version, you can install just the `tensorflow` package instead:
pip install tensorflow
CPU-only versions will not install any `nvidia-*` packages into your virtual environment.
Installing with Conda (unofficial)¶
Anaconda and Miniconda are no longer available on Apocrita due to licensing issues. Please use Miniforge instead.
Conda GPU version¶
If you prefer to use Conda environments, the approach is slightly different as Conda supports a variety of CUDA versions and will install requirements as Conda packages within your Conda environment. Note that whilst the pip packages are officially supported by TensorFlow, the Conda packages are pulled from conda-forge.
Conda package availability and disk space
Conda tends to pull in a lot of packages, consuming more space than pip virtualenvs. Additionally, pip tends to have a wider range of third-party packages than Conda.
Please note, when installing the `tensorflow` Conda package, if the local machine has a GPU, the package resolution will install appropriate CUDA libraries for that specific GPU. If you are installing TensorFlow inside a Conda environment on a standard compute node, which contains no GPU, the installer will fall back to a best guess for CUDA libraries and usually install an older version to ensure maximum compatibility. To override this, add the `CONDA_OVERRIDE_CUDA` environment variable to your install command, stating a specific CUDA version, e.g.:
CONDA_OVERRIDE_CUDA="12" mamba install -c conda-forge tensorflow
Another option is to make sure you create your Conda environment and install TensorFlow in an interactive qlogin session on a GPU node containing the type of GPU you intend to execute your code on. This way, a GPU will be detected during the installation of TensorFlow and thus the GPU version and required CUDA packages should just install automatically. However, this is not always practical as availability of the GPU nodes can sometimes be limited due to high demand.
Initial setup:
module load miniforge
mamba create -n tensorgpu
mamba activate tensorgpu
If installing in an interactive qlogin session on a GPU node:
mamba install -c conda-forge tensorflow
If installing on the frontend node or any other node without a GPU (adjust CUDA version as required):
CONDA_OVERRIDE_CUDA="12" mamba install -c conda-forge tensorflow
Please pay attention to the output of the `mamba install -c conda-forge tensorflow` command before confirming the installation. If you want your code to run on a GPU, you will need to ensure that the required CUDA packages are going to be installed:
+ cuda-version 12.6
+ cuda-cudart_linux-64 12.6.77
+ cuda-nvrtc 12.6.85
+ cuda-nvtx 12.6.77
+ cuda-nvvm-tools 12.6.85
+ cuda-crt-tools 12.6.85
+ cuda-cupti 12.6.80
+ cuda-cudart 12.6.77
+ cuda-nvcc-tools 12.6.85
and that the version of `tensorflow` is also the GPU version:
+ tensorflow-base 2.17.0 cuda120py312hbec54f7_203
+ tensorflow-estimator 2.17.0 cuda120py312hfa0f5ef_203
+ tensorflow 2.17.0 cuda120py312h02ad488_203
GPU versions of the `tensorflow` packages will have `cuda` in the package name, alongside the CUDA version (`12.6` in the above example).
Subsequent activation as part of a GPU job:
module load miniforge
mamba activate tensorgpu
Conda CPU version¶
Most users will probably want to install the GPU version of TensorFlow as above, but if for some reason you want a CPU-only version, you can install the `tensorflow-cpu` package instead:
mamba install -c conda-forge tensorflow-cpu
CPU-only versions will look like this in the proposed installation output:
+ tensorflow-base 2.17.0 cpu_py310hfda4fce_3
+ tensorflow-estimator 2.17.0 cpu_py310heba74a3_3
+ tensorflow 2.17.0 cpu_py310h42475c5_3
+ tensorflow-cpu 2.17.0 cpu_py310h718b53a_3
Using containers¶
If you have certain requirements that are not satisfiable by pip or Conda (e.g. extra operating system packages not available on Apocrita), then it may be possible to solve this with an Apptainer container. For most requirements, the pip method above is recommended, since it is easier to maintain and add packages to a user-controlled virtualenv.
NVIDIA maintains an official repository of Docker containers in their NGC Catalog. Whilst Docker is not directly supported on Apocrita, you can use Apptainer to either pull these as is or use them as a bootstrap for your own containers.
Example jobs¶
Checking that the GPU is being used correctly
Running `ssh <nodename> nvidia-smi` will query the GPU status on a node. You can find out the node your job is using with the `qstat` command. You can also use the nvtools module to check that the GPU is being used correctly.
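For a finer-grained check from within your own code, TensorFlow can log the device each operation is placed on (a small sketch using the `tf.debugging` API):

```python
import tensorflow as tf

# Print the device (CPU or GPU) chosen for each operation as it executes.
tf.debugging.set_log_device_placement(True)

a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.matmul(a, a)  # the placement log is written to stderr
print(b)
```

If the job output shows operations placed on `/device:CPU:0` only, the GPU is not being used.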
`xla_gpu_cuda_data_dir` errors
TensorFlow may present an error similar to this:
external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:504]
Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result
in compilation or runtime failures, if the program we try to run uses
routines from libdevice.
For most apps, setting the environment variable
XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.
To rectify this error:
A) If you are using a Python virtualenv and loading a separate cudnn
module (not normally required with newer versions), add the following
environment export to your job script:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=${CUDADIR}
B) If you are using a Conda environment, add the following environment export to your job script to point to the internal CUDA directory:
export XLA_FLAGS=--xla_gpu_cuda_data_dir=${CONDA_PREFIX}
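If you prefer to keep the workaround alongside your code rather than in the job script, the same flag can be set from Python before TensorFlow is first used (a hypothetical illustration; `CONDA_PREFIX` is set automatically when a Conda environment is active):

```python
import os

# Point XLA at the CUDA libraries inside the active Conda environment,
# without overwriting a value already set in the job script.
conda_prefix = os.environ.get("CONDA_PREFIX")
if conda_prefix:
    os.environ.setdefault("XLA_FLAGS", "--xla_gpu_cuda_data_dir=" + conda_prefix)
print(os.environ.get("XLA_FLAGS"))
```

This must run before TensorFlow compiles anything with XLA, so place it at the very top of your script.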
In all examples below, the file `tf_test.py` contains the following Python code:
import tensorflow as tf
print(tf.test.gpu_device_name())
Simple GPU job using virtualenv¶
This assumes an existing virtualenv named `tf_env`, created as shown above.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=240:0:0
#$ -l h_vmem=11G
#$ -l gpu=1
module load python
source tf_env/bin/activate
python tf_test.py
Simple GPU job using Conda¶
This assumes an existing Conda env named `tensorgpu`, created as shown above.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_vmem=11G
#$ -l h_rt=240:0:0
#$ -l gpu=1
module load miniforge
mamba activate tensorgpu
python tf_test.py
CPU-only example using virtualenv¶
This assumes an existing virtualenv named `tf_env`, created as shown above.
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
module load python
source tf_env/bin/activate
python -c 'import tensorflow as tf; print(tf.__version__)'
Submit the script to the job scheduler and the TensorFlow version number will be recorded in the job output file.
Simple GPU job using a container¶
(replace `/path/to/container.sif` with the full path to your container image)
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_vmem=11G
#$ -l h_rt=240:0:0
#$ -l gpu=1
apptainer exec --nv \
/path/to/container.sif \
python -c 'import tensorflow as tf; print(tf.__version__)'
Apptainer GPU support
The `--nv` flag is required for GPU support and passes through the appropriate GPU drivers and libraries from the host to the container.
GPU machine learning example¶
This example demonstrates some real-life code which uses 1 GPU on a node. The source can be found in the references section below.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_vmem=11G
#$ -l h_rt=240:0:0
#$ -l gpu=1
module load python
source tf_env/bin/activate
python mnist_classify.py