TensorFlow¶
TensorFlow is an open source library for machine learning.
Installation¶
A more in-depth tutorial on installing and using TensorFlow on Apocrita is also available on our blog.
Installing with pip¶
Package naming and installing specific versions
For releases 1.15 and older, the CPU and GPU pip packages are separate. If using one of these older releases, we recommend installing the GPU package because TensorFlow programs typically run much faster on a GPU than on a CPU. Researchers need to request permission to be added to the list of GPU node users.
To select a specific version, use the standard pip method, noting that other versions may have been built with different CUDA libraries. To install version 1.15, run pip install tensorflow-gpu==1.15. Removing the version number installs the latest release.
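To confirm which of the two package names ended up in the active environment, and at what version, a short check along these lines may help. It is only a sketch using the standard library's importlib.metadata (Python 3.8 or later); the helper name installed_tensorflow_packages is made up for this example.

```python
from importlib import metadata

# The two pip package names mentioned above.
CANDIDATES = ("tensorflow", "tensorflow-gpu")


def installed_tensorflow_packages(names=CANDIDATES):
    """Return {package name: version} for each candidate that is installed."""
    found = {}
    for name in names:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # Not installed in the active environment; skip it.
            continue
    return found


if __name__ == "__main__":
    packages = installed_tensorflow_packages()
    if packages:
        for name, version in packages.items():
            print(f"{name}=={version}")
    else:
        print("No TensorFlow pip package found in this environment")
```

Run it with the virtualenv activated, so that it inspects the environment you intend to use in your jobs.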
The TensorFlow package may be installed using pip in a virtualenv, which uses packages from the Python Package Index.
Loading a cuDNN module will also load the corresponding CUDA module as a prerequisite. These libraries must be loaded for TensorFlow to use GPU acceleration. Check the job output for errors, as an incorrect CUDA or cuDNN module version will usually result in the GPU not being used.
Loading a TensorRT module will also load the corresponding cuDNN module (and therefore CUDA) as a prerequisite.
Initial setup:
module load python
virtualenv tfenv
source tfenv/bin/activate
pip install tensorflow
If you have any additional Python package dependencies, install them into your virtualenv with further pip install commands, or in bulk using a requirements file.
Subsequent activation as part of a GPU job:
module load python
module load cudnn/8.1.1-cuda11.2
source tfenv/bin/activate
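Because a missing or mismatched CUDA or cuDNN module usually shows up only as TensorFlow not using the GPU, it can be worth checking from inside the job that the libraries are visible. The sketch below assumes that the loaded modules expose their libraries through LD_LIBRARY_PATH, which is common for environment modules but is an assumption here; find_shared_libraries is a hypothetical helper, not part of any library.

```python
import glob
import os


def find_shared_libraries(prefixes=("libcudart", "libcudnn"), search_path=None):
    """Map each library prefix to the matching .so files on the search path."""
    if search_path is None:
        # Assumption: the loaded modules add their lib directories here.
        search_path = os.environ.get("LD_LIBRARY_PATH", "")
    directories = [d for d in search_path.split(os.pathsep) if d]
    found = {prefix: [] for prefix in prefixes}
    for directory in directories:
        for prefix in prefixes:
            # Escape the directory so odd characters are not treated as globs.
            pattern = os.path.join(glob.escape(directory), prefix + ".so*")
            found[prefix].extend(sorted(glob.glob(pattern)))
    return found


if __name__ == "__main__":
    for prefix, paths in find_shared_libraries().items():
        status = ", ".join(paths) if paths else "not found on LD_LIBRARY_PATH"
        print(f"{prefix}: {status}")
```

An empty result for either prefix suggests the module was not loaded in that shell, although libraries installed in system locations would not be reported by this check.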
Installing with conda¶
If you prefer to use conda environments, the approach is slightly different as conda supports a variety of CUDA versions and will install requirements as conda packages within your virtual environment. Note that while the pip packages are officially supported by TensorFlow, the conda packages are built and supported by Anaconda.
Conda package availability and disk space
Conda tends to pull in a lot of packages, consuming more space than pip virtualenvs. Additionally, pip tends to have a wider range of third-party packages than conda.
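To see how much space a given environment occupies, the directory can be measured directly. The following is a minimal sketch using only the standard library; directory_size_bytes is a hypothetical helper, and the environment directories to measure are passed on the command line because the location of conda environments depends on your configuration.

```python
import os
import sys


def directory_size_bytes(root):
    """Sum the sizes of all regular files below root, skipping symlinks."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            if os.path.islink(path):
                continue
            try:
                total += os.path.getsize(path)
            except OSError:
                # The file vanished or is unreadable; ignore it.
                pass
    return total


if __name__ == "__main__":
    # Pass one or more environment directories, e.g. the tfenv virtualenv.
    for env_dir in sys.argv[1:]:
        size_gib = directory_size_bytes(env_dir) / 1024**3
        print(f"{env_dir}: {size_gib:.2f} GiB")
```

The figure is the apparent file size, so it may differ somewhat from what quota tools report.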
Initial setup:
module load anaconda3
conda create -n tensorgpu
conda activate tensorgpu
conda install tensorflow-gpu
Subsequent activation as part of a GPU job:
module load anaconda3
conda activate tensorgpu
Using containers¶
If you have certain requirements that are not satisfiable by pip or conda (e.g. extra operating system packages not available on Apocrita), then it may be possible to solve this with an Apptainer container. For most requirements, the pip method is recommended, since it is easier to maintain and add packages to a user-controlled virtualenv.
A list of existing TensorFlow containers can be found in the /data/containers/tensorflow directory on Apocrita, which can be customised to add the required packages.
Example jobs¶
Checking that the GPU is being used correctly
Running ssh <nodename> nvidia-smi will query the GPU status on a node. You can find out the node your job is using with the qstat command.
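If you would rather record GPU utilisation from within a job than inspect it by hand, nvidia-smi can also emit machine-readable CSV. The sketch below runs on the node itself and uses nvidia-smi's --query-gpu and --format=csv,noheader,nounits options; the function names and the choice of fields are illustrative only.

```python
import subprocess

QUERY_FIELDS = ("index", "name", "utilization.gpu", "memory.used", "memory.total")


def parse_gpu_csv(text):
    """Parse nvidia-smi CSV output (no header, no units) into a list of dicts."""
    gpus = []
    for line in text.strip().splitlines():
        values = [v.strip() for v in line.split(",")]
        if len(values) != len(QUERY_FIELDS):
            continue  # Skip blank or unexpected lines.
        gpus.append(dict(zip(QUERY_FIELDS, values)))
    return gpus


def query_gpus():
    """Run nvidia-smi on the current node and return one dict per GPU."""
    command = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader,nounits",
    ]
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return parse_gpu_csv(result.stdout)


if __name__ == "__main__":
    try:
        for gpu in query_gpus():
            print(
                f"GPU {gpu['index']} ({gpu['name']}): "
                f"{gpu['utilization.gpu']}% utilised, "
                f"{gpu['memory.used']}/{gpu['memory.total']} MiB"
            )
    except (FileNotFoundError, subprocess.CalledProcessError) as error:
        # nvidia-smi is missing or failed, e.g. when run on a non-GPU node.
        print(f"Could not query GPUs: {error}")
```

A utilisation that stays at 0% while your job is running is a hint that TensorFlow has fallen back to the CPU.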
In all examples below, the file tf_test.py contains the following Python code:
import tensorflow as tf
print(tf.config.experimental.list_physical_devices(device_type="GPU"))
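An empty list printed by tf_test.py does not say why no GPU was found. As a possible extension, the sketch below also reports the TensorFlow version and whether the installed build includes CUDA support. It assumes TensorFlow 2.1 or later, where tf.config.list_physical_devices is available without the experimental prefix; gpu_summary is a hypothetical helper name.

```python
def gpu_summary():
    """Return TensorFlow's view of the GPUs, or None if TensorFlow is missing."""
    try:
        import tensorflow as tf
    except ImportError:
        return None
    # Assumption: TF 2.1+, which provides the non-experimental function name.
    gpus = tf.config.list_physical_devices("GPU")
    return {
        "version": tf.__version__,
        "built_with_cuda": tf.test.is_built_with_cuda(),
        "gpus": [gpu.name for gpu in gpus],
    }


if __name__ == "__main__":
    summary = gpu_summary()
    if summary is None:
        print("TensorFlow is not installed in this environment")
    elif not summary["gpus"]:
        print(f"No GPU visible to TensorFlow {summary['version']} "
              f"(built with CUDA: {summary['built_with_cuda']})")
    else:
        print(summary)
```

If built_with_cuda is False, the installed package is a CPU-only build, and loading different CUDA or cuDNN modules will not help.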
Simple GPU job using virtualenv¶
This assumes an existing virtualenv named tfenv created as shown above.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=240:0:0
#$ -l gpu=1
module load python
module load cudnn/8.1.1-cuda11.2
source tfenv/bin/activate
python tf_test.py
Simple GPU job using conda¶
This assumes an existing conda env named tensorgpu created as shown above.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=240:0:0
#$ -l gpu=1
module load anaconda3
conda activate tensorgpu
python tf_test.py
CPU-only example using virtualenv¶
This assumes an existing virtualenv named tfenv created as shown above.
#!/bin/bash
#$ -cwd
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
module load python
source tfenv/bin/activate
python -c 'import tensorflow as tf; print(tf.__version__)'
Submit the script to the job scheduler and the TensorFlow version number will be recorded in the job output file.
Simple GPU job using a container¶
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=240:0:0
#$ -l gpu=1
apptainer exec --nv \
/data/containers/tensorflow/tensorflow-1.8-python3-ubuntu-16.04.img \
python -c 'import tensorflow as tf; print(tf.__version__)'
Apptainer GPU support
The --nv flag is required for GPU support and passes through the appropriate GPU drivers and libraries from the host to the container.
GPU machine learning example¶
This example demonstrates some real-life code that uses one GPU on a node. The source can be found in the references section below.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=240:0:0
#$ -l gpu=1
module load python
module load cudnn/8.1.1-cuda11.2
source tfenv/bin/activate
python mnist_classify.py