Seqtk

Seqtk is a fast and lightweight tool for processing sequences in FASTA or FASTQ format. It parses both formats transparently, including files compressed with gzip.

Seqtk is available as a module on Apocrita.

Usage

To use the default installed version of Seqtk, load the seqtk module. Running seqtk without arguments lists the available commands:

$ module load seqtk
$ seqtk

Usage:   seqtk <command> <arguments>
Version: <VERSION>

Command: seq       common transformation of FASTA/Q
         size      report the number sequences and bases
         comp      get the nucleotide composition of FASTA/Q
         sample    subsample sequences
         subseq    extract subsequences from FASTA/Q
         fqchk     fastq QC (base/quality summary)
         mergepe   interleave two PE FASTA/Q files
         split     split one file into multiple smaller files
         trimfq    trim FASTQ using the Phred algorithm

         hety      regional heterozygosity
         gc        identify high- or low-GC regions
         mutfa     point mutate FASTA at specified positions
         mergefa   merge two FASTA/Q files
         famask    apply a X-coded FASTA to a source FASTA
         dropse    drop unpaired from interleaved PE FASTA/Q
         rename    rename sequence names
         randbase  choose a random base from hets
         cutN      cut sequence at long N
         gap       get the gap locations
         listhet   extract the position of each het
         hpc       homopolyer-compressed sequence
         telo      identify telomere repeats in asm or long reads

Example jobs

Serial jobs

Here is an example job running on 1 core and 2GB of memory to convert FASTQ to FASTA:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=2G

module load seqtk

# Convert FASTQ to FASTA (-a forces FASTA output and discards qualities)
seqtk seq -a fastq_data.fastq.gz > fasta_data.fa
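
A quick way to sanity-check a conversion like the one above is to compare record counts before and after. The following is a minimal sketch using only standard tools; the helper names count_fastq_records and count_fasta_records are hypothetical and not part of Seqtk, and the FASTQ count assumes the usual four lines per record.

```shell
#!/bin/bash
# Hypothetical helpers (not part of Seqtk) for checking a FASTQ-to-FASTA conversion.

# Count FASTQ records, assuming exactly 4 lines per record.
# gzip -cdf decompresses gzipped input and passes plain text through unchanged.
count_fastq_records() {
    gzip -cdf "$1" | awk 'END { printf "%d\n", NR / 4 }'
}

# Count FASTA records by counting header lines, which start with ">".
count_fasta_records() {
    grep -c '^>' "$1"
}
```

If these functions are added to the job script, calling count_fastq_records fastq_data.fastq.gz and count_fasta_records fasta_data.fa after the conversion should print the same number. Counting lines rather than "@" characters avoids miscounting, since a FASTQ quality string may itself begin with "@".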

Here is an example job running on 1 core and 2GB of memory to extract sequences:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=2G

module load seqtk

# Extract sequences in regions contained in file reg.bed
# (seqtk writes to standard output, so redirect the result to a file)
seqtk subseq fastq_data.fastq.gz reg.bed > extracted_regions.fastq
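
The region file used above is in BED format: one region per line, with tab-separated sequence name, 0-based start, and end. The sketch below writes a small reg.bed and checks that it is well-formed before it is passed to seqtk subseq. The sequence names read_1 and read_2 are placeholders and check_bed is a hypothetical helper, not a Seqtk feature.

```shell
#!/bin/bash
# Write a small BED file: sequence name, 0-based start, end (tab-separated).
# read_1 and read_2 are placeholder names; use names from your own data.
printf 'read_1\t0\t50\n'  >  reg.bed
printf 'read_2\t10\t60\n' >> reg.bed

# Hypothetical helper: succeed only if every line has three tab-separated
# fields with non-negative integer coordinates and start < end.
check_bed() {
    awk -F '\t' '
        NF != 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ || $2 + 0 >= $3 + 0 { bad = 1 }
        END { exit bad }
    ' "$1"
}

check_bed reg.bed && echo "reg.bed looks well-formed"
```

Running a check like this at the top of the job script catches space-separated or swapped coordinates before the extraction runs.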
