AlphaFold¶

AlphaFold is an application for predicting models of protein structures.

AlphaFold is available as an Apptainer container on Apocrita.

Usage¶

AlphaFold requires a suite of supporting tools to be installed, so for reproducibility, we provide all of the tools in a container along with AlphaFold.

To run the default version of AlphaFold, simply load the alphafold module:

module load alphafold

Calling alphafold python after loading the alphafold module invokes the version of Python installed inside the container. This entry point also automatically uses any GPU cards requested by the job.
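As a quick check that the entry point can see a GPU, something like the following could be run inside a GPU job. This is a sketch only. It assumes that alphafold python forwards its arguments to the container's Python and that JAX is importable there (AlphaFold depends on it). The function name check_alphafold_gpu is made up for illustration.

```shell
# Hypothetical helper: report the devices JAX can see inside the container.
# Assumes "alphafold python" passes its arguments through to Python.
check_alphafold_gpu() {
    if ! command -v alphafold >/dev/null 2>&1; then
        echo "alphafold not found: run 'module load alphafold' first" >&2
        return 1
    fi
    alphafold python -c 'import jax; print(jax.devices())'
}

# Typical use, after "module load alphafold" in a GPU job:
# check_alphafold_gpu
```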

Example job¶

GPU job¶

To run AlphaFold from the container, we prepare a job script called alpha.qsub:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8                   # 8 cores per GPU
#$ -l h_rt=240:0:0             # 240 hours runtime
#$ -l h_vmem=11G               # 11G RAM per core
#$ -l gpu=1                    # AlphaFold only uses 1 GPU
# Approved DERI users can include the following line
#$ -l cluster=andrena

# Specify an output destination.
export OUTPUT=/data/scratch/${USER}/alphafold_out

module load alphafold

alphafold ${HOME}/alphafold_scripts/run.sh \
  --fasta_paths=${HOME}/alphafold_input/T1050.fasta

The example uses scratch storage for output, although shared project storage may be used if you have access to any.

Since there are a lot of configuration options, the options that are less likely to change are stored in a separate script called run.sh:

#!/bin/bash

# Set the destination of the AlphaFold dataset
DOWNLOAD_DIR=/data/DERI-DataSets/AlphaFold

# Recommended settings for use with AlphaFold
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

# Options for AlphaFold
cd /app/alphafold
python run_alphafold.py \
   --data_dir=${DOWNLOAD_DIR} \
   --uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
   --mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
   --bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
   --template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
   --obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
   --output_dir=${OUTPUT} \
   --benchmark=False \
   --max_template_date=2020-05-14 \
   --model_preset=multimer \
   --pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
   --uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
   --db_preset=full_dbs \
   --uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
   --use_gpu_relax=True \
   "$@"

FATAL: permission denied error

If your job crashes immediately after launch with a FATAL: permission denied error, check the permissions on run.sh and add execute permission if it is missing: chmod u+x run.sh.

You can either take a copy of this file from /data/containers/alphafold/run_VER.sh (where VER is the matching version of AlphaFold, e.g. run_2.0.0.sh or run_2.1.0.sh) or write your own, ensuring that the script has execute permission. The above example is written for the default module at the time of writing, alphafold/2.3.2. The job script expects your copy to be at ${HOME}/alphafold_scripts/run.sh. Depending on your workflow, you may wish to keep several variants of run.sh, or move some options into the job script instead, as the example does with the fasta_paths option.
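The copy step could be scripted along the following lines. This is a minimal sketch: install_run_script is a hypothetical helper, and the file name run_2.3.2.sh is an assumption inferred from the run_VER.sh naming pattern, so check what is actually present in /data/containers/alphafold first.

```shell
# Hypothetical helper: copy the site-provided run script for one AlphaFold
# version to the location the job script expects, and make it executable.
# Arguments: version, source directory (optional), destination (optional).
install_run_script() {
    local version="$1"
    local src_dir="${2:-/data/containers/alphafold}"
    local dest="${3:-${HOME}/alphafold_scripts/run.sh}"
    local src="${src_dir}/run_${version}.sh"

    if [ ! -f "$src" ]; then
        echo "No run script found at ${src}" >&2
        return 1
    fi
    # Create the destination directory, copy, then add execute permission.
    mkdir -p "$(dirname "$dest")" &&
        cp "$src" "$dest" &&
        chmod u+x "$dest"
}

# Example (assumes run_2.3.2.sh exists for the default module):
# install_run_script 2.3.2
```

Setting the execute bit at copy time avoids the permission error described above.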

The command-line interface of AlphaFold tends to change with each release, so check the changelog in the GitHub repository when moving to a new version. Flags are sometimes changed or removed, and your run command will need to be amended accordingly.
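When moving to a newer module, it can help to see exactly which flags your copy of run.sh passes, so that each one can be compared with the release notes. The following is an illustrative sketch, and list_flags is a made-up helper name. It matches any --name pattern in the file, so flags mentioned in comments are listed too.

```shell
# Hypothetical helper: list the distinct --flag names used in a run script.
# Argument: path to the script.
list_flags() {
    # -o prints only the matching text; "--" stops grep treating the
    # pattern itself as an option.
    grep -oE -- '--[a-z0-9_]+' "$1" | sort -u
}

# Example:
# list_flags "${HOME}/alphafold_scripts/run.sh"
```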

The ${DOWNLOAD_DIR} path does not need to change. It points to the 2TB dataset that AlphaFold requires.

Input data¶

You will need some input data for the job. In the above example, T1050.fasta was downloaded from the Protein Structure Prediction Center and stored in the ${HOME}/alphafold_input directory.

To run a short test job that finishes in under an hour, shorten the sequence in T1050.fasta with a text editor, for example:

>7LXT1, Bacteroides Ovatus, 779 residues|
MASQSYL
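The same shortening can be done from the command line instead of a text editor. The sketch below is one possible approach, not part of the AlphaFold tooling. The helper name truncate_fasta and the output file name in the usage comment are hypothetical. It keeps the header of the first record unchanged, so any residue count stated in the header no longer matches the shortened sequence.

```shell
# Hypothetical helper: keep the header line and the first N residues of the
# first record in a FASTA file. Arguments: input file, residue count.
truncate_fasta() {
    awk -v n="$2" '
        # Header line: print the first one, stop at any later record.
        /^>/ { if (seen) exit; seen = 1; print; next }
        # Sequence line: emit residues until n have been written.
        seen && n > 0 {
            line = substr($0, 1, n)
            print line
            n -= length(line)
        }
    ' "$1"
}

# Example: write a 7-residue test input next to the original.
# truncate_fasta T1050.fasta 7 > T1050_short.fasta
```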

Running the job¶

Check that the job script correctly specifies the output location, any input files and location of the run.sh file. Then submit the job. In the above example, this is done with qsub alpha.qsub.
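These checks could be automated before submission. The following is a minimal sketch under stated assumptions: preflight_check is a hypothetical helper, and it only tests that the input FASTA is present and non-empty and that the run script exists and is executable.

```shell
# Hypothetical pre-submission check. Arguments: FASTA file, run script.
# Prints a message for each problem found and returns non-zero if any exist.
preflight_check() {
    local fasta="$1"
    local run_script="$2"
    local status=0

    if [ ! -s "$fasta" ]; then
        echo "Input FASTA missing or empty: $fasta" >&2
        status=1
    fi
    if [ ! -f "$run_script" ]; then
        echo "Run script not found: $run_script" >&2
        status=1
    elif [ ! -x "$run_script" ]; then
        echo "Run script is not executable: $run_script" >&2
        status=1
    fi
    return "$status"
}

# Example: only submit when both checks pass.
# preflight_check "${HOME}/alphafold_input/T1050.fasta" \
#     "${HOME}/alphafold_scripts/run.sh" && qsub alpha.qsub
```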

The $@ special parameter at the end of run.sh passes any additional options given in the alpha.qsub file through to run_alphafold.py, so be sure to keep that line in.
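The effect of that line can be seen in isolation with a small stand-in. This is an illustrative sketch, not part of the AlphaFold scripts: fake_run is a made-up function that plays the role of run.sh, and the file name is invented. It prints each argument on its own line so the forwarding is visible. The parameter is written in its quoted form, "$@", which keeps an argument containing spaces as a single word.

```shell
# Stand-in for run.sh: one fixed option first, then whatever the caller adds.
# "fake_run" and the file name below are illustrative only.
fake_run() {
    # printf repeats its format for every argument, one per line.
    printf '%s\n' --db_preset=full_dbs "$@"
}

fake_run --fasta_paths="my input.fasta"
# Prints:
# --db_preset=full_dbs
# --fasta_paths=my input.fasta
```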

The AlphaFold processing workflow runs tasks on the CPU, followed by bursts of intensive GPU activity. The above example completes in around 4 hours on an A100 GPU and around 5.5 hours on a V100 GPU.

Visualising results¶

The example job produces the following output in /data/scratch/${USER}/alphafold_out/T1050:

$ ls /data/scratch/${USER}/alphafold_out/T1050
features.pkl         ranked_1.pdb        ranked_4.pdb          relaxed_model_2.pdb
relaxed_model_5.pdb  result_model_3.pkl  timings.json          unrelaxed_model_3.pdb
msas                 ranked_2.pdb        ranking_debug.json    relaxed_model_3.pdb
result_model_1.pkl   result_model_4.pkl  unrelaxed_model_1.pdb unrelaxed_model_4.pdb
ranked_0.pdb         ranked_3.pdb        relaxed_model_1.pdb   relaxed_model_4.pdb
result_model_2.pkl   result_model_5.pkl  unrelaxed_model_2.pdb unrelaxed_model_5.pdb

If you download the PDB output files to your local machine, you can visualise them using the NCBI Web-based 3D Structure Viewer, or with other tools such as PyMOL if you have a licence.
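To download the ranked models in one transfer, they could first be bundled on the cluster. This is a sketch only: bundle_ranked_models is a hypothetical helper and the archive name in the usage comment is an assumption.

```shell
# Hypothetical helper: pack the ranked_*.pdb files from one result directory
# into a single compressed archive. Arguments: result directory, archive path.
bundle_ranked_models() {
    local result_dir="$1"
    local archive="$2"
    # Subshell, so changing the positional parameters stays local.
    (
        set -- "$result_dir"/ranked_*.pdb
        # An unmatched glob stays literal, so test that the first entry exists.
        [ -e "$1" ] || { echo "No ranked_*.pdb files in $result_dir" >&2; exit 1; }
        # Replace each path with its base name, preserving order.
        for f in "$@"; do
            set -- "$@" "${f##*/}"
            shift
        done
        # -C makes tar read the names relative to the result directory.
        tar -czf "$archive" -C "$result_dir" "$@"
    )
}

# Example:
# bundle_ranked_models "/data/scratch/${USER}/alphafold_out/T1050" \
#     "${HOME}/T1050_ranked.tar.gz"
```

Storing bare file names means the archive unpacks into the current directory on the local machine.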
