AlphaFold¶
AlphaFold is an application for predicting models of protein structures.
AlphaFold is available as an Apptainer container on Apocrita.
Usage¶
AlphaFold requires a suite of supporting tools to be installed, so for reproducibility, we provide all of the tools in a container along with AlphaFold.
To run the default version of AlphaFold, load the alphafold module:
module load alphafold
Calling alphafold python after loading the alphafold module will invoke the version of Python installed inside the container. This entry point also automatically uses any GPU cards requested by the job.
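A quick way to confirm that the entry point works is sketched below. It assumes that alphafold python forwards its remaining arguments to the Python interpreter in the container, which the text above implies but does not state; the function name check_alphafold_python is hypothetical. JAX is a dependency of AlphaFold, so jax.devices() should list a GPU when one has been allocated to the session.

```shell
# Hypothetical helper: report the container's Python version and the
# devices JAX can see. Assumes arguments after "alphafold python" are
# passed straight through to the interpreter.
check_alphafold_python() {
    if ! command -v alphafold >/dev/null 2>&1; then
        echo "alphafold not found: run 'module load alphafold' first"
        return 1
    fi
    alphafold python --version
    alphafold python -c 'import jax; print(jax.devices())'
}
```

Run check_alphafold_python inside a GPU session after loading the module; on a node without a GPU, JAX is expected to report CPU devices only.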
Example job¶
GPU job¶
To run AlphaFold from the container, we prepare a job script called alpha.qsub:
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 12 # 12 cores per GPU
#$ -l h_rt=240:0:0 # 240 hours runtime
#$ -l h_vmem=7.5G # 7.5G RAM per core
#$ -l gpu=1 # AlphaFold only uses 1 GPU
# Approved DERI users can include the following line
#$ -l cluster=andrena
# Specify an output destination.
export OUTPUT=/data/scratch/${USER}/alphafold_out
module load alphafold
alphafold ${HOME}/alphafold_scripts/run.sh \
--fasta_paths=${HOME}/alphafold_input/T1050.fasta
The example uses scratch storage for output, although shared project storage may be used if you have access to any.
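As a rough check of the resource request in the job script above, the total memory follows from the per-core figure multiplied by the number of cores. The sketch below only reproduces that arithmetic; the variable names are illustrative and it assumes, as the job script comments indicate, that h_vmem is a per-core value.

```shell
# Values taken from the example job script.
cores=12             # -pe smp 12
mem_per_core_gb=7.5  # -l h_vmem=7.5G

# awk handles the floating-point multiplication that shell arithmetic cannot.
total_gb=$(awk -v c="$cores" -v m="$mem_per_core_gb" 'BEGIN { printf "%g", c * m }')
echo "Total memory request: ${total_gb}G"
```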
Since there are a lot of configuration options, those that are less likely to change are stored in a separate script called run.sh:
#!/bin/bash
# Set the destination of the AlphaFold dataset
DOWNLOAD_DIR=/data/DERI-DataSets/AlphaFold
# Recommended settings for use with AlphaFold
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
# Options for AlphaFold
cd /app/alphafold
python run_alphafold.py \
--data_dir=${DOWNLOAD_DIR} \
--uniref90_database_path=${DOWNLOAD_DIR}/uniref90/uniref90.fasta \
--mgnify_database_path=${DOWNLOAD_DIR}/mgnify/mgy_clusters_2022_05.fa \
--bfd_database_path=${DOWNLOAD_DIR}/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
--template_mmcif_dir=${DOWNLOAD_DIR}/pdb_mmcif/mmcif_files \
--obsolete_pdbs_path=${DOWNLOAD_DIR}/pdb_mmcif/obsolete.dat \
--output_dir=${OUTPUT} \
--benchmark=False \
--max_template_date=2020-5-14 \
--model_preset=multimer \
--pdb_seqres_database_path=${DOWNLOAD_DIR}/pdb_seqres/pdb_seqres.txt \
--uniprot_database_path=${DOWNLOAD_DIR}/uniprot/uniprot.fasta \
--db_preset=full_dbs \
--uniref30_database_path=${DOWNLOAD_DIR}/uniref30/UniRef30_2021_03 \
--use_gpu_relax=True \
"$@"
FATAL: permission denied error
If your job crashes immediately after launch with a FATAL: permission denied error, check the permissions of the run.sh file and add execute permission if it is missing: chmod u+x run.sh.
You can either take a copy of this file from /data/containers/alphafold/run_VER.sh (where VER is the matching version of AlphaFold, e.g. run_2.0.0.sh, run_2.1.0.sh etc.) or make your own (ensuring that the script has execute permissions). The above example is written for the default module at the time of writing, alphafold/2.3.2.
The job script expects your copy to be found at ${HOME}/alphafold_scripts/run.sh.
Depending on your workflow, you may wish to use multiple instances of run.sh, or move some options into the job script instead, as shown by the fasta_paths option, for example.
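Taking a copy of the provided script into the location the job script expects could look like the sketch below. The function name install_run_script and its optional override arguments are hypothetical, and it assumes a run_VER.sh file exists for the version you pass in.

```shell
# Hypothetical helper: copy run_<version>.sh to the path used by alpha.qsub
# and make sure the copy is executable.
install_run_script() {
    local version="$1"
    local src_dir="${2:-/data/containers/alphafold}"   # location given above
    local dest_dir="${3:-${HOME}/alphafold_scripts}"   # expected by alpha.qsub

    mkdir -p "$dest_dir" || return 1
    cp "${src_dir}/run_${version}.sh" "${dest_dir}/run.sh" || return 1
    chmod u+x "${dest_dir}/run.sh"
}
```

For the default module at the time of writing, this would be called as install_run_script 2.3.2.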
The command-line interface for AlphaFold tends to change with each release, so check the changelogs in the GitHub repository to keep up to date. Flags are sometimes changed or removed, and your run command will need to be amended accordingly.
The ${DOWNLOAD_DIR} path will not need to change. It points to a 2TB dataset that AlphaFold requires in order to run.
Input data¶
You will need some input data for the job. In the above example, T1050.fasta was downloaded from the Protein Structure Prediction Center and stored in the ${HOME}/alphafold_input directory.
If you wish to run a short test job that finishes in less than one hour, then reduce the length of T1050.fasta by editing it with a text editor, for example:
>7LXT1, Bacteroides Ovatus, 779 residues|
MASQSYL
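The same shortening can be scripted rather than done by hand. The sketch below is one possible approach, assuming a single-record FASTA file such as T1050.fasta; the function name truncate_fasta and the output file name in the usage note are hypothetical.

```shell
# Hypothetical helper: keep the header line and the first n residues.
# Assumes the input holds exactly one sequence record.
truncate_fasta() {
    local input="$1" output="$2" n="$3"
    awk -v n="$n" '
        NR == 1 && /^>/ { print; next }           # keep the header as-is
        !/^>/ { gsub(/\r/, ""); seq = seq $0 }    # join wrapped sequence lines
        END { print substr(seq, 1, n) }           # emit the first n residues
    ' "$input" > "$output"
}
```

For example, truncate_fasta T1050.fasta T1050_short.fasta 7 keeps the first seven residues, as in the hand-edited file above.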
Running the job¶
Check that the job script correctly specifies the output location, any input files and the location of the run.sh file. Then submit the job. In the above example, this is done with qsub alpha.qsub.
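The checks described above can be sketched as a small pre-submission function. This is only an illustration: the name preflight is hypothetical, and it creates the output directory if it does not already exist.

```shell
# Hypothetical helper: verify the input file, run script and output
# location before submitting. Returns non-zero if any check fails.
preflight() {
    local fasta="$1" run_script="$2" output_dir="$3"
    local status=0

    [ -s "$fasta" ] || { echo "missing or empty input: $fasta" >&2; status=1; }
    [ -x "$run_script" ] || { echo "not executable: $run_script" >&2; status=1; }
    # Side effect: the output directory is created if it is absent.
    { mkdir -p "$output_dir" 2>/dev/null && [ -w "$output_dir" ]; } \
        || { echo "cannot write to: $output_dir" >&2; status=1; }

    return "$status"
}
```

With the paths from the example, submission could then be guarded as preflight "${HOME}/alphafold_input/T1050.fasta" "${HOME}/alphafold_scripts/run.sh" "/data/scratch/${USER}/alphafold_out" && qsub alpha.qsub.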
The $@ special parameter at the end of the run.sh script expands to the arguments passed to the script, so any additional options given in the alpha.qsub file are forwarded to run_alphafold.py. Be sure to keep that line in.
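The forwarding behaviour can be seen in isolation with a stand-in for run.sh. The function run_sketch below is hypothetical and only prints its arguments, one per line, instead of launching AlphaFold.

```shell
# Stand-in for run.sh: a fixed option first, then whatever the caller adds.
# Quoting "$@" keeps each forwarded argument intact, even if it has spaces.
run_sketch() {
    printf '%s\n' "--db_preset=full_dbs" "$@"
}

# Mirrors the call in alpha.qsub: the extra option is appended at the end.
run_sketch --fasta_paths="${HOME}/alphafold_input/T1050.fasta"
```

The output lists the fixed option followed by the --fasta_paths option, which is the order in which run_alphafold.py would receive them.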
The AlphaFold processing workflow will run tasks on the CPU, followed by bursts of intensive GPU activity. For the above example, running on an A100 GPU completes in around 4 hours, and around 5.5 hours on a V100 GPU.
Visualising results¶
The example job produces the following output in /data/scratch/${USER}/alphafold_out/T1050:
$ ls /data/scratch/${USER}/alphafold_out/T1050
features.pkl  ranked_1.pdb  ranked_4.pdb         relaxed_model_2.pdb  relaxed_model_5.pdb  result_model_3.pkl  timings.json           unrelaxed_model_3.pdb
msas          ranked_2.pdb  ranking_debug.json   relaxed_model_3.pdb  result_model_1.pkl   result_model_4.pkl  unrelaxed_model_1.pdb  unrelaxed_model_4.pdb
ranked_0.pdb  ranked_3.pdb  relaxed_model_1.pdb  relaxed_model_4.pdb  result_model_2.pkl   result_model_5.pkl  unrelaxed_model_2.pdb  unrelaxed_model_5.pdb
If you download the pdb output files to your local machine, you can visualise them using the NCBI Web-based 3D Structure Viewer, or you can use another tool such as PyMOL if you have a licence.
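To make the download simpler, the ranked structures can first be collected into a single archive. The sketch below is one way to do this; the function name bundle_ranked and the archive name in the usage note are hypothetical, and only the ranked_*.pdb files are included.

```shell
# Hypothetical helper: pack the ranked_*.pdb files from a result directory
# into one gzip-compressed tar archive for transfer.
bundle_ranked() {
    result_dir="$1"
    archive="$2"

    # Collect the bare file names in the positional parameters.
    set --
    for f in "$result_dir"/ranked_*.pdb; do
        [ -e "$f" ] && set -- "$@" "$(basename "$f")"
    done

    if [ "$#" -eq 0 ]; then
        echo "no ranked_*.pdb files found in ${result_dir}" >&2
        return 1
    fi

    # -C makes tar store the files without the leading directory path.
    tar -czf "$archive" -C "$result_dir" "$@"
}
```

For the example job, this could be run as bundle_ranked "/data/scratch/${USER}/alphafold_out/T1050" "${HOME}/T1050_ranked.tar.gz", after which the archive can be copied to your local machine and unpacked there.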