TensorFlow¶

TensorFlow is an open source library for machine learning.

Installation¶

A more in-depth tutorial on installing and using TensorFlow on Apocrita is also available on our blog.

Installing with pip¶

Package naming and installing specific versions

For releases 1.15 and older, the CPU and GPU pip packages are separate. If you are using one of these older releases, we recommend installing the GPU package, because TensorFlow programs typically run much faster on a GPU than on a CPU. Researchers need to request permission to be added to the list of GPU node users.

To select a specific version, use the standard pip syntax of appending ==<version> to the package name, noting that different versions may have been built against different CUDA libraries. For example, to install version 1.15, run pip install tensorflow-gpu==1.15. Omitting the version number installs the latest release.
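After installing, it can be useful to confirm which version actually ended up in the active environment. The following is a minimal sketch using only the Python standard library; the helper name installed_version is hypothetical and chosen for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # Both package names are checked because releases 1.15 and older
    # ship the GPU build as a separate tensorflow-gpu package.
    for name in ("tensorflow", "tensorflow-gpu"):
        print(name, installed_version(name) or "not installed")
```

Run it with the virtualenv activated, so that the lookup reflects the packages installed there rather than those of the system Python.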

TensorFlow may be installed in a virtualenv using pip, which fetches packages from the Python Package Index (PyPI).

Loading a cuDNN module will also load the corresponding CUDA module as a prerequisite. These libraries must be loaded for TensorFlow to use GPU acceleration. Make sure to check the job output for errors, as an incorrect CUDA or cuDNN module version will usually result in the GPU not being used.

Loading a TensorRT module will also load the corresponding cuDNN module (and therefore CUDA) as a prerequisite.

Initial setup:

module load python
virtualenv tfenv
source tfenv/bin/activate
pip install tensorflow

If you have any additional Python package dependencies, install them into your virtualenv with further pip install commands or, in bulk, using a requirements file.

Subsequent activation as part of a GPU job:

module load python
module load cudnn/8.9.4-cuda12.2
source tfenv/bin/activate

Installing with conda¶

If you prefer to use conda environments, the approach is slightly different, because conda supports a variety of CUDA versions and installs the requirements as conda packages within your environment. Note that whilst the pip packages are officially supported by TensorFlow, the conda packages are pulled from conda-forge.

Conda package availability and disk space

Conda tends to pull in a lot of packages, consuming more space than pip virtualenvs. Additionally, pip tends to have a wider range of third-party packages than conda.

Note that when installing the tensorflow conda package, package resolution defaults to the GPU-enabled builds of tensorflow only if the local machine has a GPU. The frontend node contains no GPU, so if you install TensorFlow inside a conda environment there, the installer will fall back to the CPU version. To override this, set the CONDA_OVERRIDE_CUDA environment variable in your install command, stating a specific CUDA version:

CONDA_OVERRIDE_CUDA="12.2" mamba install -c conda-forge tensorflow

Another option is to create your conda environment and install TensorFlow in an interactive qlogin session on a GPU node containing the type of GPU you intend to run your code on. A GPU will then be detected during installation, so the GPU version and the required CUDA packages should be installed automatically. However, this is not always practical, as availability of the GPU nodes can be limited by high demand.

Initial setup:

module load anaconda3
mamba create -n tensorgpu
mamba activate tensorgpu

If installing in an interactive qlogin session on a GPU node:

mamba install -c conda-forge tensorflow

If installing on the frontend node or any other node without a GPU (adjust CUDA version as required):

CONDA_OVERRIDE_CUDA="12.2" mamba install -c conda-forge tensorflow

Check the output of the mamba install -c conda-forge tensorflow command carefully before confirming installation. If you want your code to run on a GPU, ensure that the required cuda-* and cudnn packages are going to be installed:

  + cuda-cudart                  12.0.107  hd3aeb46_8               conda-forge/linux-64
  + cuda-cudart_linux-64         12.0.107  h59595ed_8               conda-forge/noarch
  + cuda-nvcc-tools               12.0.76  h59595ed_1               conda-forge/linux-64
  + cuda-nvrtc                    12.0.76  hd3aeb46_2               conda-forge/linux-64
  + cuda-nvtx                     12.0.76  h59595ed_1               conda-forge/linux-64
  + cuda-version                     12.0  hffde075_2               conda-forge/noarch
  + cudnn                       8.8.0.121  h264754d_4               conda-forge/linux-64

and that the version of tensorflow is also the GPU version:

  + tensorflow                     2.15.0  cuda120py311h5cbd639_2   conda-forge/linux-64
  + tensorflow-base                2.15.0  cuda120py311h43b5e44_2   conda-forge/linux-64
  + tensorflow-estimator           2.15.0  cuda120py311hf663016_2   conda-forge/linux-64

GPU versions of the tensorflow packages have cuda in the build string (the third column), followed by the CUDA version (cuda120, meaning CUDA 12.0, in the above example). CPU-only versions have cpu in the build string instead, and look like this:

  + tensorflow                     2.15.0  cpu_py311hd3d7757_2      conda-forge
  + tensorflow-base                2.15.0  cpu_py311h6aa969b_2      conda-forge
  + tensorflow-estimator           2.15.0  cpu_py311ha26c8b9_2      conda-forge
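The build-string check described above can also be scripted. This is a minimal sketch, assuming the build strings follow the pattern seen in the listings above (cuda followed by the version digits, with the last digit being the minor version); the function name cuda_version_from_build is hypothetical.

```python
import re


def cuda_version_from_build(build):
    """Return the CUDA version encoded in a build string, or None for non-CUDA builds.

    Assumes the layout shown in the listings above, e.g. 'cuda120py311h5cbd639_2',
    where '120' is read as major version 12 and minor version 0.
    """
    match = re.match(r"cuda(\d{2,})", build)
    if match is None:
        return None
    digits = match.group(1)
    return f"{digits[:-1]}.{digits[-1]}"


if __name__ == "__main__":
    for build in ("cuda120py311h5cbd639_2", "cpu_py311hd3d7757_2"):
        version = cuda_version_from_build(build)
        print(build, "->", f"GPU build (CUDA {version})" if version else "CPU-only build")
```

The build string for an already installed package appears in the Build column of mamba list tensorflow, run with the environment activated.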

Subsequent activation as part of a GPU job:

module load anaconda3
mamba activate tensorgpu

Using containers¶

If you have certain requirements that are not satisfiable by pip or conda (e.g. extra operating system packages not available on Apocrita), then it may be possible to solve this with an Apptainer container. For most requirements, the pip method is recommended, since it is easier to maintain and add packages to a user-controlled virtualenv.

Existing TensorFlow containers can be found in the /data/containers/tensorflow directory on Apocrita; these can be customised to add the required packages.

Example jobs¶

Checking that the GPU is being used correctly

Running ssh <nodename> nvidia-smi will query the GPU status on a node. You can find out the node your job is using with the qstat command. You can also use the nvtools module to check that the GPU is being used correctly.
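If you want to record GPU utilisation rather than read it by eye, nvidia-smi can emit machine-readable CSV through its --query-gpu and --format options. The sketch below parses that output; the helper names parse_gpu_status and gpu_status are hypothetical, and the ssh call assumes you are able to log in to the node running your job.

```python
import subprocess

# Query index, utilisation (%) and memory used (MiB) for each GPU as bare CSV.
QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used",
    "--format=csv,noheader,nounits",
]


def parse_gpu_status(text):
    """Parse the CSV output of QUERY into a list of dicts, one per GPU."""
    rows = []
    for line in text.strip().splitlines():
        index, util, mem = (field.strip() for field in line.split(","))
        rows.append({"index": int(index), "util_percent": int(util), "memory_mib": int(mem)})
    return rows


def gpu_status(node):
    """Run QUERY on `node` over ssh and return the parsed result (hypothetical helper)."""
    result = subprocess.run(["ssh", node, *QUERY], capture_output=True, text=True, check=True)
    return parse_gpu_status(result.stdout)
```

A utilisation that stays at 0 while the job is running suggests that TensorFlow has fallen back to the CPU. The parser assumes numeric fields and would need adjusting if a GPU reports a value as not available.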

xla_gpu_cuda_data_dir errors

TensorFlow versions 2.8-2.11 may present an error such as this:

external/local_xla/xla/service/gpu/llvm_gpu_backend/gpu_backend_lib.cc:504] Can't find libdevice directory ${CUDA_DIR}/nvvm/libdevice. This may result in compilation or runtime failures, if the program we try to run uses routines from libdevice. For most apps, setting the environment variable XLA_FLAGS=--xla_gpu_cuda_data_dir=/path/to/cuda will work.

To rectify this error:

A) If you are using a Python virtualenv, be sure a cudnn module is loaded and add the following environment export to your job script:

export XLA_FLAGS=--xla_gpu_cuda_data_dir=${CUDADIR}

B) If you are using a conda environment, add the following environment export to your job script to point to the internal CUDA directory:

export XLA_FLAGS=--xla_gpu_cuda_data_dir=${CONDA_PREFIX}
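As an alternative to exporting the variable in the job script, the same flag can be set from Python, provided this happens before TensorFlow is imported. The following is a sketch only: the helper names are hypothetical, and preferring CUDADIR over CONDA_PREFIX when both are set is an assumption made for illustration.

```python
import os


def cuda_data_dir(env):
    """Pick the CUDA directory: CUDADIR (set by a loaded module) first, then CONDA_PREFIX."""
    for name in ("CUDADIR", "CONDA_PREFIX"):
        value = env.get(name)
        if value:
            return value
    return None


def set_xla_flags(env):
    """Append --xla_gpu_cuda_data_dir to XLA_FLAGS in `env` unless it is already present."""
    path = cuda_data_dir(env)
    current = env.get("XLA_FLAGS", "")
    if path is not None and "xla_gpu_cuda_data_dir" not in current:
        env["XLA_FLAGS"] = f"{current} --xla_gpu_cuda_data_dir={path}".strip()
    return env.get("XLA_FLAGS")


if __name__ == "__main__":
    # Call this before `import tensorflow` so the flag is in place when XLA starts.
    set_xla_flags(os.environ)
    print(os.environ.get("XLA_FLAGS", "XLA_FLAGS not set"))
```

Exporting the variable in the job script, as shown above, remains the simpler option; the Python form is mainly useful when the same script is run in both virtualenv and conda environments.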

In the examples below that run tf_test.py, the file contains the following Python code:

import tensorflow as tf
# Print the name of a GPU device (e.g. /device:GPU:0), or an empty string if none is available
print(tf.test.gpu_device_name())
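For a slightly more descriptive check, the devices visible to TensorFlow can be listed with tf.config.list_physical_devices. This is an optional variant of the test script, not a replacement for it; the helper name summarise_devices is hypothetical.

```python
def summarise_devices(devices):
    """Format device objects (anything with .name and .device_type) for the job log."""
    if not devices:
        return "No GPU visible to TensorFlow"
    return "\n".join(f"{d.device_type}: {d.name}" for d in devices)


if __name__ == "__main__":
    try:
        import tensorflow as tf
    except ImportError:
        print("TensorFlow is not installed in this environment")
    else:
        # Returns an empty list when TensorFlow cannot see a GPU.
        print(summarise_devices(tf.config.list_physical_devices("GPU")))
```

An explicit "No GPU visible" line in the job output is easier to spot than the empty line printed by the shorter script when no GPU is found.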

Simple GPU job using virtualenv¶

This assumes an existing virtualenv named tfenv created as shown above.

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 12
#$ -l h_rt=240:0:0
#$ -l h_vmem=7.5G
#$ -l gpu=1

module load python
module load cudnn/8.9.4-cuda12.2
source tfenv/bin/activate
python tf_test.py

Simple GPU job using conda¶

This assumes an existing conda env named tensorgpu created as shown above.

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 12
#$ -l h_vmem=7.5G
#$ -l h_rt=240:0:0
#$ -l gpu=1

module load anaconda3
mamba activate tensorgpu
python tf_test.py

CPU-only example using virtualenv¶

This assumes an existing virtualenv named tfenv created as shown above.

#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

module load python
source tfenv/bin/activate
python -c 'import tensorflow as tf; print(tf.__version__)'

Submit the script to the job scheduler and the TensorFlow version number will be recorded in the job output file.

Simple GPU job using a container¶

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 12
#$ -l h_vmem=7.5G
#$ -l h_rt=240:0:0
#$ -l gpu=1

apptainer exec --nv \
/data/containers/tensorflow/tensorflow-1.8-python3-ubuntu-16.04.img \
python -c 'import tensorflow as tf; print(tf.__version__)'

Apptainer GPU support

The --nv flag is required for GPU support and passes through the appropriate GPU drivers and libraries from the host to the container.

GPU machine learning example¶

This example demonstrates some real-life code that uses one GPU on a node. The source can be found in the references section below.

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 12
#$ -l h_vmem=7.5G
#$ -l h_rt=240:0:0
#$ -l gpu=1

module load python
module load cudnn/8.9.4-cuda12.2
source tfenv/bin/activate
python mnist_classify.py

References¶