Seqtk

Seqtk is a fast, lightweight tool for processing sequences in FASTA or FASTQ format. It parses both formats transparently, including files compressed with gzip.

Seqtk is available as a module on Apocrita.

Usage

To run the latest installed version of Seqtk, load the seqtk module. Running seqtk without arguments prints its usage summary:

$ module load seqtk
$ seqtk

Usage:   seqtk <command> <arguments>
Version: <VERSION>

Command: seq       common transformation of FASTA/Q
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         listhet   extract the position of each het

Example jobs

The following example jobs each run on 1 core with 2GB of memory:

Convert FASTQ to FASTA

#!/bin/sh
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=2:0:0
#$ -l h_vmem=2G

module load seqtk

# Convert FASTQ to FASTA (-a forces FASTA output and discards quality values)
seqtk seq -a fastq_data.fastq.gz > fasta_data.fa
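
To make the conversion concrete, the sketch below shows what a FASTQ to FASTA conversion does to each record, using awk on a tiny hand-made file. It is an illustration only, not how Seqtk is implemented: the sample reads and file names are hypothetical, and the awk one-liner assumes strict four-line FASTQ records.

```shell
#!/bin/sh
# Illustration only: a tiny two-record FASTQ file (hypothetical data).
workdir=$(mktemp -d)
printf '@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\nIIII\n' > "$workdir/sample.fastq"

# Keep the header (line 1 of each record, with '@' replaced by '>') and
# the sequence (line 2); drop the '+' separator and the quality string.
# Assumes exactly four lines per record.
awk 'NR % 4 == 1 { print ">" substr($0, 2) }
     NR % 4 == 2 { print }' "$workdir/sample.fastq" > "$workdir/sample.fa"

# Show the resulting FASTA records
cat "$workdir/sample.fa"
```

Each four-line FASTQ record becomes a two-line FASTA record, so the number of '>' header lines in the output should equal the number of reads in the input.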

Sequence Extraction

#!/bin/sh
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=2:0:0
#$ -l h_vmem=2G

module load seqtk

# Extract the sequence regions listed in the BED file reg.bed (results go to standard output)
seqtk subseq fastq_data.fastq.gz reg.bed
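
The job above expects reg.bed to already exist. As a minimal sketch, assuming the standard BED layout (tab-separated sequence name, 0-based start and end-exclusive position), a regions file could be created and sanity-checked as follows. The sequence names and coordinates are hypothetical placeholders, and the file is written to a temporary directory rather than the job directory.

```shell
#!/bin/sh
# Hypothetical regions file: one region per line, tab-separated as
# sequence name, 0-based start, end (exclusive).
bed_dir=$(mktemp -d)
bed_file="$bed_dir/reg.bed"
printf 'read1\t0\t10\n' > "$bed_file"
printf 'read2\t5\t20\n' >> "$bed_file"

# Sanity check: every line must have three columns with start < end.
if awk -F '\t' 'NF != 3 || $2 >= $3 { bad = 1 } END { exit bad + 0 }' "$bed_file"
then
    echo "reg.bed looks valid"
else
    echo "reg.bed has malformed lines" >&2
fi
```

The sequence names in the first column must match the names in the input FASTA/Q file for any regions to be extracted.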
