Public / Shared Data available on Apocrita

To prevent duplication of data and save valuable research time, we provide local copies of some widely used public datasets.

QMUL staff can contact us to request corrections, updates or the addition of new datasets to this repository.
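
As a rough illustration of reusing a shared copy instead of downloading your own, the sketch below resolves a dataset directory and fails early if it is missing. The function name, the root path and the dataset name are hypothetical placeholders, not the real locations on Apocrita; substitute the location given for the dataset you need.

```python
from pathlib import Path


def resolve_shared_dataset(shared_root: str, dataset: str) -> Path:
    """Return the directory of a shared dataset, or raise if it is absent.

    Both arguments are hypothetical placeholders: use the real location
    published for the dataset you want to read.
    """
    path = Path(shared_root) / dataset
    if not path.is_dir():
        # Fail early rather than silently falling back to a fresh download.
        raise FileNotFoundError(f"No shared copy found at {path}")
    return path


# Illustrative call only; the path and name below are made up.
# data_dir = resolve_shared_dataset("/path/to/shared/data", "example_dataset")
```

Failing early like this makes a job stop immediately when a location is mistyped, instead of quietly creating a duplicate copy in personal storage.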

Datasets available

Name and Location on Apocrita
Description

Blast databases
Standard set of databases for BLAST (Basic Local Alignment Search Tool)

Combined Annotation Dependent Depletion
CADD is a tool for scoring the deleteriousness of single nucleotide variants as well as insertion/deletion variants in the human genome.

The Conserved Domain Database is a resource for the annotation of functional units in proteins

GATK Bundle
Standard files for working with human resequencing data with the GATK

Galaxy hg datasets
Reference genomes for use with Galaxy

Illumina Genomes
Ready-To-Use Reference Sequences and Annotations

ImageNet 2012
Annotated image database for Machine Learning, 2012 version

ImageNet 2021
Annotated image database for Machine Learning, 2021 version. Full and resized images

MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organisation)
A curated collection of labelled classical music in raw format.

Whole Genome Shotgun projects are genome assemblies of incomplete genomes

NR Protein sequences
Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq

Protein data for a subset of commonly used model organisms, downloaded from NCBI

A dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription.

The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt knowledgebase

Database of protein sequence and functional