Canu

A single-molecule sequence assembler for large and small genomes.

Usage

To use the default version of canu:

$ module load canu
$ canu

usage: canu [-correct | -trim | -assemble | -trim-assemble] \
            [-s <assembly-specifications-file>] \
             -p <assembly-prefix> \
             -d <assembly-directory> \
             genomeSize=<number>[g|m|k] \
            [other-options] \
            [-pacbio-raw | -pacbio-corrected | -nanopore-raw | -nanopore-corrected] \
            <files.fastq>

  The assembly is computed in the (created) -d <assembly-directory>, with most
  files named using the -p <assembly-prefix>.

  The genome size is your best guess of the genome size of what is being assembled.
  It is used mostly to compute coverage in reads.  Fractional values are
  allowed: '4.7m' is the same as '4700k' and '4700000'.

For full usage documentation, run canu --help.
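
The suffix arithmetic described in the usage text can be checked with a short shell sketch. The `genome_size_to_bases` function is a hypothetical helper written for this page, not part of canu, and it assumes the suffixes are powers of 1000 and case-insensitive (the usage text only confirms that '4.7m', '4700k' and '4700000' are equivalent).

```shell
# Hypothetical helper (not part of canu): expand a genomeSize value such as
# 4.7m into a plain base count, assuming k/m/g mean 10^3, 10^6 and 10^9.
genome_size_to_bases() {
    awk -v s="$1" 'BEGIN {
        unit = tolower(substr(s, length(s), 1))   # last character, lower-cased
        mult = 1
        if (unit == "k") mult = 1000
        else if (unit == "m") mult = 1000000
        else if (unit == "g") mult = 1000000000
        if (mult > 1) s = substr(s, 1, length(s) - 1)   # drop the suffix
        printf "%.0f\n", s * mult
    }'
}

# The three spellings from the usage text give the same number of bases.
genome_size_to_bases 4.7m
genome_size_to_bases 4700k
genome_size_to_bases 4700000
```

Each of the three calls prints 4700000, so any of these spellings can be passed to genomeSize.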

Error message - Gatekeeper detected problems in your input reads

To resolve this error, supplement each dataset with Illumina reads. Working around it by renaming the files in <assembly-directory>/correction or by setting stopOnReadQuality=false will produce an undesirable assembly.

When you run a canu command like the one above, canu creates an Apocrita submission script and submits it. The script is saved as <assembly-directory>/canu-scripts/canu.N.sh, where N is a number that canu increments internally to distinguish between Apocrita submissions.

This script contains the canu parameters passed in the original command and is submitted as a new job on Apocrita to start the canu sequence assembler. The output of that job is written to <assembly-directory>/canu-scripts/canu.N.out.
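
Because N increases with every submission, it can be handy to locate the most recent output file. The following is a minimal sketch, assuming the canu.N.out naming described above; `latest_canu_log` is a hypothetical helper, and `output` stands in for your own assembly directory.

```shell
# Hypothetical helper: print the canu.N.out file with the highest N found in
# <assembly-directory>/canu-scripts. Runs in a subshell so its variables stay local.
latest_canu_log() (
    dir="$1/canu-scripts"
    latest=""
    max=-1
    for f in "$dir"/canu.*.out; do
        [ -e "$f" ] || continue            # the glob matched nothing
        n=${f##*/canu.}                    # strip the path and the "canu." prefix
        n=${n%.out}                        # strip the ".out" suffix
        case "$n" in
            ''|*[!0-9]*) continue ;;       # skip names where N is not a number
        esac
        if [ "$n" -gt "$max" ]; then
            max=$n
            latest=$f
        fi
    done
    [ -n "$latest" ] && printf '%s\n' "$latest"
)

# Show the end of the newest log, if one exists ("output" is an assumed directory).
log=$(latest_canu_log output) && tail -n 20 "$log"
```

The function returns a non-zero status when no output file is found, so the `tail` command is skipped for an assembly that has not been submitted yet.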

Example job

Serial job

Here is an example canu command that submits an Apocrita job running with 4 cores and 8GB total memory (the default is 1 core and 4GB of memory):

$ canu gridOptions='-l h_vmem=8G -pe smp 4' -p 'Ppal' -d 'output' \
     -nanopore-raw data.fq 'genomeSize=300M' gnuplot=$(which gnuplot)

The qsub command produced will look similar to:

qsub \
      -l h_vmem=4g \
      -pe smp 1 \
      -l h_vmem=8G \
      -pe smp 4  \
      -cwd \
      -N 'canu_Ppal' \
      -j y \
      -o /data/home/abc/canu/output/canu-scripts/canu.01.out \
      /data/home/abc/canu/output/canu-scripts/canu.01.sh

Duplicate scheduler variables

The duplication of the h_vmem and smp scheduler variables in the example above can be ignored: for each variable, the later value (supplied via gridOptions) overrides the earlier default.
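
The "later value wins" rule can be illustrated with a small sketch. It only models the behaviour described above and is not the scheduler's own option parser; `effective_h_vmem` is a hypothetical helper.

```shell
# Hypothetical helper: given qsub-style arguments, print the h_vmem value
# from the last "-l h_vmem=..." pair, mimicking "the later value overrides".
effective_h_vmem() (
    value=""
    while [ $# -gt 0 ]; do
        if [ "$1" = "-l" ] && [ $# -ge 2 ]; then
            case "$2" in
                h_vmem=*) value=${2#h_vmem=} ;;   # remember the most recent value
            esac
            shift                                  # consume the argument of -l
        fi
        shift
    done
    printf '%s\n' "$value"
)

# The options from the generated qsub command above: prints 8G, not 4g.
effective_h_vmem -l h_vmem=4g -pe smp 1 -l h_vmem=8G -pe smp 4
```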

The qsub command is executed automatically, which launches a new Apocrita job. The output of that job is written to /data/home/abc/canu/output/canu-scripts/canu.01.out, which is also symlinked in the parent directory.