KMC


KMC (K-mer Counter) is a utility for counting k-mers (sequences of k consecutive symbols) in a set of reads from genome sequencing projects.

KMC is available as a module on Apocrita.

Usage

KMC takes input files in FASTA, FASTQ or multi-FASTA format and produces KMC databases.

To run the latest installed version of KMC, simply load the kmc module:

$ module load kmc
$ kmc -h
K-Mer Counter (KMC) ver. 3.0.0 (2017-01-28)
Usage:
 kmc [options] <input_file_name> <output_file_name> <working_directory>
...

Then run a counting command, for example:

kmc -k31 reads.fastq 31-mers ${TMPDIR}

KMC has a number of options for adjusting the k-mer length and limiting resource usage. Options of particular interest are:

-k<len> - k-mer length (k from 1 to 256; default: 25)
-m<size> - max amount of RAM in GB (from 1 to 1024); default: 12
-sm - use strict memory mode (memory limit from -m<n> switch will not be exceeded)
-f<a/q/m> - input in FASTA format (-fa), FASTQ format (-fq) or multi FASTA (-fm); default: FASTQ
-ci<value> - exclude k-mers occurring less than <value> times (default: 2)
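As an illustration of how these options combine, the sketch below counts 21-mers in a FASTA file under a strict 8 GB memory cap and discards k-mers seen fewer than 3 times. The file name `contigs.fasta`, the output name `21-mers` and the `run_or_print` helper are hypothetical and exist only for this example; the helper prints the command instead of running it when `kmc` or the input file is not available.

```shell
#!/bin/bash
# Hypothetical input and output names, for illustration only.
input="contigs.fasta"
output="21-mers"

# Helper (not part of KMC): run the command if kmc and the input file
# are both present, otherwise just print the command line.
run_or_print() {
    if command -v kmc >/dev/null 2>&1 && [ -f "${input}" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# -k21  : count 21-mers
# -m8   : use at most 8 GB of RAM
# -sm   : strict memory mode, so the -m limit is not exceeded
# -fa   : input is in FASTA format
# -ci3  : exclude k-mers occurring fewer than 3 times
run_or_print kmc -k21 -m8 -sm -fa -ci3 "${input}" "${output}" "${TMPDIR}"
```

The option values here are arbitrary; choose a k-mer length and memory limit that suit your data and the resources requested for the job.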

${TMPDIR} as working_directory

The scheduler sets ${TMPDIR} to a directory on the node's local disk, which is significantly faster than GPFS when used as the <working_directory>.

As a result, jobs run faster. For example, processing a 3.1GiB file takes an average of 16.46 seconds with a working directory on GPFS, but only 10.92 seconds with ${TMPDIR}, roughly a third less time.
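When a script may also be run outside the scheduler, ${TMPDIR} might not be set. The following is a minimal sketch, assuming a fallback to a directory created with `mktemp -d` is acceptable; the `workdir` variable name is illustrative.

```shell
#!/bin/bash
# Prefer the scheduler-provided ${TMPDIR}; otherwise create a private
# temporary directory (an assumed fallback, not site policy).
if [ -n "${TMPDIR:-}" ] && [ -d "${TMPDIR}" ]; then
    workdir="${TMPDIR}"
else
    workdir="$(mktemp -d)"
fi

echo "KMC working directory: ${workdir}"
```

The chosen directory can then be passed as the third argument, for example `kmc -k31 reads.fastq 31-mers "${workdir}"`.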

Example jobs

Serial job

Here is an example job running on 12 cores with 12GB of RAM.

#!/bin/bash
#$ -l h_vmem=1G     # 1G * 12 slots = 12G
#$ -pe smp 12
#$ -l h_rt=0:5:0    # 5 mins, runs with this input file (3.1GiB) < 1 min
#$ -cwd
#$ -j y

module load kmc

kmc -k31 reads.fastq 31-mers ${TMPDIR}
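The job script above requests 12 slots but leaves the thread count to KMC. The sketch below shows one way the final command could be tied to the request. It assumes that your KMC version accepts `-t<value>` for the total number of threads (check `kmc -h`) and that the scheduler exports `NSLOTS` for parallel environment jobs; the fallback of 1 thread and the guard around the command are additions for illustration.

```shell
#!/bin/bash
input="reads.fastq"

# NSLOTS is assumed to be set by the scheduler for "-pe smp" jobs;
# fall back to a single thread when it is not.
threads="${NSLOTS:-1}"

if command -v kmc >/dev/null 2>&1 && [ -f "${input}" ]; then
    # -t: threads matched to the requested slots (assumed option, see lead-in)
    # -m12: 12 GB, matching h_vmem=1G * 12 slots in the job script
    kmc -k31 -t"${threads}" -m12 "${input}" 31-mers "${TMPDIR}"
else
    echo "Skipping: kmc or ${input} not available (would use ${threads} threads)"
fi
```

Matching the thread count to the slots avoids running more threads than the job has cores allocated on a shared node.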
