Taiyaki

Taiyaki is research software for training models for basecalling Oxford Nanopore reads.

Taiyaki is available from GitHub. You can install Taiyaki in your home directory, scratch or lab storage.

Installation

Installation should be carried out on a node with GPU hardware.

First, request an interactive session on a GPU node:

qlogin -pe smp 8 -l gpu=1

Change to the directory you want to install Taiyaki into.

Next load modules for CUDA and Python3:

module load cuda && module load python

Then use git to clone the Taiyaki repository:

git clone https://github.com/nanoporetech/taiyaki.git

And finally:

cd taiyaki && make install

You should see scrolling text indicating that installation is in progress, and eventually you should see:

To activate your new environment: source venv/bin/activate

Usage

Since Taiyaki is written in Python 3, you will need to load the Python 3 module to use it. You will also need to load the CUDA module you used to build Taiyaki, and you should ensure you are running on the same GPU type you built against. Assuming your username is "abc123" and you installed Taiyaki in scratch, you would activate the virtual environment with something like the following:

module load python/3.6.3
source /data/scratch/abc123/taiyaki/venv/bin/activate

Taiyaki doesn't appear to have online help, but the following example may be useful:

Training a Modified Base Model.

Use resources responsibly!

For commands that accept --jobs, make sure you pass the number of cores that you have requested; the best way to do this is to use the ${NSLOTS} environment variable, which the scheduler sets inside a job. Note that the commands which accept --jobs do not make use of GPUs, while commands which accept --device should only be used on GPU nodes. See the examples below.
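When trying a --jobs command interactively, outside a scheduled job, ${NSLOTS} is not set; a defaulting parameter expansion keeps the same command line working in both contexts. This is a small general-purpose sketch, not something from the Taiyaki documentation:

```shell
#!/bin/bash
# Inside a job the scheduler sets NSLOTS to the number of granted
# cores; in an interactive shell it may be unset, so default to 1.
JOBS="${NSLOTS:-1}"
echo "${JOBS}"
# e.g. generate_per_read_params.py --jobs "${JOBS}" reads > modbase.tsv
```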

Licensing

Taiyaki is licensed under the Oxford Nanopore Technologies Public License.

Example jobs

The following examples are taken from the Modified Base Model walkthrough linked above. The serial job runs the scripts which do not make use of GPU resources; the GPU job does the actual machine learning. You may wish to investigate Job Holds so you can submit both jobs at once and have the scheduler hold the GPU job until the first job has completed.
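A job hold can be set up at submission time with qsub's -hold_jid option; -terse makes qsub print only the job id so it can be captured. The script names serial.sh and gpu.sh are placeholders for the two job scripts below:

```shell
#!/bin/bash
# Submit the serial preparation job first and capture its job id
# (-terse restricts qsub's output to the id alone).
JOB_ID=$(qsub -terse serial.sh)

# Submit the GPU job with a hold on the serial job: the scheduler
# keeps it queued until the first job has completed.
qsub -hold_jid "${JOB_ID}" gpu.sh
```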

Serial job

Here is an example job running on 1 core (note use of ${NSLOTS}):

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

# Load Python module
module load python

# Activate Taiyaki venv
source /data/scratch/abc123/taiyaki/venv/bin/activate

# Generate Parameters
generate_per_read_params.py --jobs ${NSLOTS} reads > modbase.tsv

# Prepare Reads
prepare_mapped_reads.py --jobs ${NSLOTS} --mod Z C 5mC --mod Y A 6mA reads modbase.tsv modbase.hdf5  r941_dna_minion.checkpoint modbase_references.fasta

GPU job

Here is an example GPU job. Note use of CUDA_VISIBLE_DEVICES:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 8
#$ -l h_vmem=8G
#$ -l h_rt=1:0:0
#$ -l gpu=1
#$ -l gpu_type=kepler

# Load Python & CUDA modules
module load python
module load cuda

# Export CUDA_VISIBLE_DEVICES
export CUDA_VISIBLE_DEVICES=${SGE_HGR_gpu// /,}

# Activate Taiyaki venv
source /data/scratch/abc123/taiyaki/venv/bin/activate

# Train modified base model
train_mod_flipflop.py --device 0 --mod_factor 0.01 --outdir training mGru_cat_mod_flipflop.py modbase.hdf5
train_mod_flipflop.py --device 0 --mod_factor 1.0 --outdir training2 training/model_final.checkpoint modbase.hdf5

# Basecall
basecall.py --device 0 --modified_base_output basecalls.hdf5 reads training2/model_final.checkpoint > basecalls.fa
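The ${SGE_HGR_gpu// /,} expansion in the GPU job converts the space-separated list of granted GPU ids that the scheduler places in SGE_HGR_gpu into the comma-separated form CUDA_VISIBLE_DEVICES expects. For example (the value here is illustrative):

```shell
#!/bin/bash
# SGE_HGR_gpu holds the granted GPU ids separated by spaces; the
# ${var// /,} expansion replaces every space with a comma.
SGE_HGR_gpu="0 1"
echo "${SGE_HGR_gpu// /,}"   # prints 0,1
```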
