
PGAP

The NCBI Prokaryotic Genome Annotation Pipeline (PGAP) is designed to annotate bacterial and archaeal genomes (chromosomes and plasmids).

PGAP is available as a module on Apocrita.

Usage

To run the default installed version of PGAP, simply load the pgap module:

$ module load pgap
$ pgap.py --help
usage: pgap.py [-h] [-g GENOME] [-s ORGANISM] [-V] [-v]
               [--taxcheck | --taxcheck-only] [--auto-correct-tax] [-l | -u]
               [-r | -n] [--container-name CONTAINER_NAME]
               [--container-path CONTAINER_PATH] [--ignore-all-errors]
               [--no-internet] [-D path] [-o path] [-q] [--prefix PREFIX]
               [--no-self-update] [-c CPUS] [-m MEMORY] [-d]

Core Usage

To ensure that PGAP uses the correct number of cores, the -c ${NSLOTS} option must be used.
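Outside a scheduled job, NSLOTS is normally unset, so `-c ${NSLOTS}` would expand to a bare `-c`. The following is a minimal sketch of a fallback, assuming you want a single core for quick interactive checks. The `pgap_cores` helper is a hypothetical name used for illustration and is not part of PGAP.

```shell
# Hypothetical helper: print the core count PGAP should be given.
# NSLOTS is set by the scheduler inside a job; fall back to 1 otherwise.
pgap_cores() {
  printf '%s\n' "${NSLOTS:-1}"
}

# Usage inside a job script:
# pgap.py -c "$(pgap_cores)" -r -o /path/to/output -g /path/to/input.fasta -s '<organism name>'
```

Inside a job the helper simply echoes the scheduler's value, so the core count passed to PGAP still matches the request.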

Example jobs

Serial jobs

Choose resources wisely

Most PGAP testing is carried out on two configurations: 8 cores with 32 GB RAM, and 16 cores with 64 GB RAM (see the official documentation for further information). Please tune the resource requests of every job to avoid wasting cluster resources.
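The example job below requests 32 GB across 8 cores as `h_vmem=4G`, which implies that memory is requested per core. The following sketch shows that arithmetic; the `per_core_mem_gb` helper is a hypothetical name used for illustration only.

```shell
# Hypothetical helper: given total memory (GB) and a core count,
# print the per-core memory figure (GB) using integer division.
per_core_mem_gb() {
  total_gb="$1"
  cores="$2"
  echo $(( total_gb / cores ))
}

# Both tested configurations work out to the same per-core request:
echo "8 cores, 32 GB total:  h_vmem=$(per_core_mem_gb 32 8)G"
echo "16 cores, 64 GB total: h_vmem=$(per_core_mem_gb 64 16)G"
```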

Sending anonymised usage metadata to NCBI

The examples below pass the -r flag, which reports anonymised usage metadata to NCBI. If you prefer not to report this, pass -n instead. See the official FAQ for details of what is reported to NCBI when you use the -r flag.

PGAP redirects all output

PGAP redirects all output to a file called cwltool.log inside your defined output directory. Your job output files will contain only a few lines confirming that the job has started and then completed.

Output directory must not already exist

Your nominated output directory (-o) is created by PGAP as it runs and must not exist already, otherwise PGAP will exit with:

Output directory /path/to/output exists, exiting.

Here is an example job running on 8 cores and 32 GB of memory:

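#!/bin/bash
# Run from the current working directory and merge stderr into stdout
#$ -cwd
#$ -j y
# Request 8 cores for up to 24 hours
#$ -pe smp 8
#$ -l h_rt=24:0:0
# Request 4G of memory per core (32G in total)
#$ -l h_vmem=4G

module load pgap

# -c: number of cores, -r: report usage metadata to NCBI,
# -o: output directory (must not already exist),
# -g: input genome FASTA, -s: organism name
pgap.py \
  -c ${NSLOTS} \
  -r \
  -o /path/to/output \
  -g /path/to/input.fasta \
  -s '<organism name>'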
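
Because PGAP exits if the output directory already exists, re-running a job with the same -o path fails immediately. One way to avoid this is to pick an unused path before calling pgap.py. The following is a sketch only; the `fresh_output_dir` helper and its numeric-suffix scheme are assumptions made for illustration, not PGAP features.

```shell
# Hypothetical helper: print an output path that does not exist yet.
# If the requested path is free it is returned unchanged; otherwise a
# numeric suffix (_1, _2, ...) is appended until an unused path is found.
# The directory is not created here; PGAP creates it as it runs.
fresh_output_dir() {
  base="$1"
  candidate="$base"
  n=1
  while [ -e "$candidate" ]; do
    candidate="${base}_${n}"
    n=$(( n + 1 ))
  done
  printf '%s\n' "$candidate"
}

# Usage in a job script:
# outdir="$(fresh_output_dir /path/to/output)"
# pgap.py -c "${NSLOTS}" -r -o "${outdir}" -g /path/to/input.fasta -s '<organism name>'
```

Alternatively, remove or rename the old output directory by hand before resubmitting, which keeps results in a predictable location.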

PGAP comes with the example input ASM2732v1.annotation.nucleotide.1.fasta, and you can run a test job using this input file with the following command:

pgap.py \
  -c ${NSLOTS} \
  -r \
  -o /path/to/output \
  -g ${PGAP_INPUT_DIR}/test_genomes/MG37/ASM2732v1.annotation.nucleotide.1.fasta \
  -s 'Mycoplasmoides genitalium'

The $PGAP_INPUT_DIR environment variable is set when loading a PGAP module and automatically points to the installation directory for the loaded version. The "Mycoplasmoides genitalium" test job above completes in about 12 minutes using 8 CPU cores.
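
Since PGAP writes its progress to cwltool.log inside the output directory rather than to the job output files, that log is the place to look while a job such as the test above is running. The following is a minimal sketch of a helper for doing so; `pgap_log_tail` is a hypothetical name, and the default of 20 lines is an arbitrary choice.

```shell
# Hypothetical helper: show the last lines of PGAP's log for a given
# output directory, or report that the log has not been created yet.
# $1: PGAP output directory, $2: number of lines to show (default 20)
pgap_log_tail() {
  log="$1/cwltool.log"
  if [ -f "$log" ]; then
    tail -n "${2:-20}" "$log"
  else
    echo "No cwltool.log in $1 yet" >&2
    return 1
  fi
}

# Usage:
# pgap_log_tail /path/to/output 50
```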

References