Public / Shared Data available on Apocrita

To avoid duplicating data and to save valuable research time, we provide local copies of some widely used public datasets.

QMUL staff can contact us to request corrections, updates or the addition of new datasets to this repository.

Datasets available

| Name | Location on Apocrita | Description |
| --- | --- | --- |
| BLAST databases | /data/PublicDataSets/shared_dbs | Standard set of databases for BLAST (Basic Local Alignment Search Tool) |
| CADD | /data/PublicDataSets/genomes/Homo_sapiens/CADD | Combined Annotation Dependent Depletion (CADD) is a tool for scoring the deleteriousness of single nucleotide variants and insertion/deletion variants in the human genome |
| CDD | /data/PublicDataSets/CDD | The Conserved Domain Database is a resource for the annotation of functional units in proteins |
| GATK Bundle | /data/PublicDataSets/GATKbundle | Standard files for working with human resequencing data with the GATK |
| Galaxy hg datasets | /data/PublicDataSets/galaxy | Reference genomes for use with Galaxy |
| Illumina Genomes | /data/PublicDataSets/genomes | Ready-to-use reference sequences and annotations |
| ImageNet 2012 | /data/PublicDataSets/ImageNet-2012/ | Annotated image database for machine learning, 2012 version |
| ImageNet 2021 | /data/PublicDataSets/ImageNet-2021/ | Annotated image database for machine learning, 2021 version. Full and resized images |
| MAESTRO | /data/PublicDataSets/MAESTRO | MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organisation) |
| MusicNet | /data/PublicDataSets/musicnet | A curated collection of labelled classical music in raw format |
| NCBI WGS | /data/PublicDataSets/shared_dbs/wgs | Whole Genome Shotgun projects are genome assemblies of incomplete genomes |
| NR Protein sequences | /data/PublicDataSets/shared_dbs/nr | Non-redundant protein sequences from GenPept, Swissprot, PIR, PRF, PDB, and NCBI RefSeq |
| Prot_RefSeq | /data/PublicDataSets/shared_dbs/prot_refseq | Protein data for a subset of commonly used model organisms, downloaded from NCBI |
| Slakh | /data/PublicDataSets/slakh2100/ | A dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription |
| UniRef50 | /data/PublicDataSets/shared_dbs/uniref50 | The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt knowledgebase |
| UniProt | /data/PublicDataSets/shared_dbs/uniprot | Database of protein sequence and functional information |
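
As a minimal sketch, the following Python snippet checks which of the locations listed above are visible from the machine it runs on, which can be a useful sanity check before pointing a job at a shared dataset. The paths are copied from the listing above. The function name `find_available`, the dictionary labels and the selection of datasets are illustrative choices for this example and are not part of any Apocrita tooling.

```python
from pathlib import Path

# Paths copied from the listing above; the keys are labels chosen for this sketch.
DATASETS = {
    "Blast databases": "/data/PublicDataSets/shared_dbs",
    "CADD": "/data/PublicDataSets/genomes/Homo_sapiens/CADD",
    "GATK Bundle": "/data/PublicDataSets/GATKbundle",
    "ImageNet 2012": "/data/PublicDataSets/ImageNet-2012/",
    "MAESTRO": "/data/PublicDataSets/MAESTRO",
}


def find_available(datasets):
    """Return {name: bool}, True where the location is an existing directory."""
    return {name: Path(location).is_dir() for name, location in datasets.items()}


if __name__ == "__main__":
    for name, present in find_available(DATASETS).items():
        status = "found" if present else "not visible from this machine"
        print(f"{name:<20} {status}")
```

Reading the data in place from these locations, rather than copying it into personal or project storage, is what avoids the duplication mentioned at the top of this page.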