PGAP
The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).
PGAP is available as a module on Apocrita.
Usage
To run the default installed version of PGAP, simply load the pgap module:
$ module load pgap
$ pgap.py --help
usage: pgap.py [-h] [-g GENOME] [-s ORGANISM] [-V] [-v]
[--taxcheck | --taxcheck-only] [--auto-correct-tax] [-l | -u]
[-r | -n] [--container-name CONTAINER_NAME]
[--container-path CONTAINER_PATH] [--ignore-all-errors]
[--no-internet] [-D path] [-o path] [-q] [--prefix PREFIX]
[--no-self-update] [-c CPUS] [-m MEMORY] [-d]
Core Usage
To ensure that PGAP uses the correct number of cores, the -c ${NSLOTS} option must be used.
Example jobs
Serial jobs
Choose resources wisely
Most PGAP testing is carried out using either 8 cores with 32 GB RAM or 16 cores with 64 GB RAM (see the official documentation for further information). Please tune all jobs to avoid wasting cluster resources.
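As a rough aid to sizing a request, the sketch below multiplies the number of cores by the memory requested per core. It assumes that memory (h_vmem) is requested per core, which is consistent with the example job on this page (8 cores at 4G each for 32 GB in total). The function name total_mem_gb is hypothetical and is not part of PGAP or the scheduler.

```shell
#!/bin/bash
# Hypothetical helper: total job memory in GB, ASSUMING that memory is
# requested per core (cores x per-core GB).
total_mem_gb() {
    local cores="$1"
    local per_core_gb="$2"
    echo $(( cores * per_core_gb ))
}

total_mem_gb 8 4     # 8 cores at 4G per core: prints 32
total_mem_gb 16 4    # 16 cores at 4G per core: prints 64
```

Both tested configurations therefore correspond to 4G per core under this assumption; only the core count changes.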
Sending anonymised usage metadata to NCBI
The examples below pass the -r flag, which reports anonymised usage metadata to NCBI. You can pass -n instead if you prefer not to report this. Please see the official FAQ for details of what is reported to NCBI when you use the -r flag.
Here is an example job running on 8 cores and 32 GB of memory:
PGAP redirects all output
PGAP redirects all output to a file called cwltool.log inside your defined output directory. Your job output files will only contain a few lines confirming that the job has started and then completed.
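Since progress is written to cwltool.log rather than to the job output files, one way to check on a job is to print the end of that log. The following is a minimal sketch; the helper name show_pgap_log and its default of 20 lines are assumptions for illustration, not part of PGAP.

```shell
#!/bin/bash
# Hypothetical helper (not part of PGAP): print the last lines of
# cwltool.log from a PGAP output directory.
show_pgap_log() {
    local outdir="$1"
    local lines="${2:-20}"            # number of lines to show (default 20)
    local log="${outdir}/cwltool.log"
    if [ ! -f "${log}" ]; then
        echo "No cwltool.log found in ${outdir}" >&2
        return 1
    fi
    tail -n "${lines}" "${log}"
}

# Example use, with a placeholder path:
# show_pgap_log /path/to/output 50
```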
Output directory must not already exist
Your nominated output directory (-o) is created by PGAP as it runs and must not exist already, otherwise PGAP will exit with:
Output directory /path/to/output exists, exiting.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_rt=24:0:0
#$ -l h_vmem=4G
module load pgap
pgap.py \
-c ${NSLOTS} \
-r \
-o /path/to/output \
-g /path/to/input.fasta \
-s '<organism name>'
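Because PGAP exits if the output directory already exists, a job script can check for this before calling pgap.py, so the problem is reported clearly in the job output file. The following is a minimal sketch; the function name output_dir_is_free and its message are hypothetical and not part of PGAP.

```shell
#!/bin/bash
# Hypothetical helper (not part of PGAP): succeed only if the path does
# not exist yet, since PGAP creates the output directory itself.
output_dir_is_free() {
    local dir="$1"
    if [ -e "${dir}" ]; then
        echo "Output directory ${dir} already exists; remove it or choose another path." >&2
        return 1
    fi
    return 0
}

# Example use in a job script, before the pgap.py command:
# output_dir_is_free /path/to/output || exit 1
```

The check uses -e rather than -d so that an existing file with the same name is also caught.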
PGAP comes with the example input ASM2732v1.annotation.nucleotide.1.fasta, and you can run a test job using this input file with the following command:
pgap.py \
-c ${NSLOTS} \
-r \
-o /path/to/output \
-g ${PGAP_INPUT_DIR}/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta \
-s 'Mycoplasmoides genitalium'
The $PGAP_INPUT_DIR environment variable is set when loading a PGAP module and automatically points to the installation directory for the loaded version. The "Mycoplasmoides genitalium" test job above completes in about 12 minutes using 8 CPU cores.