Public / Shared Data available on Apocrita
To prevent duplication of data and save valuable research time, we provide local copies of some widely used public datasets.
QMUL staff can contact us to request corrections, updates or the addition of new datasets to this repository.
Datasets available
Name and Location on Apocrita | Description |
---|---|
Blast databases /data/PublicDataSets/shared_dbs | Standard set of databases for BLAST (Basic Local Alignment Search Tool) |
CADD /data/PublicDataSets/genomes/Homo_sapiens/CADD | Combined Annotation Dependent Depletion (CADD) is a tool for scoring the deleteriousness of single nucleotide variants and insertion/deletion variants in the human genome. |
CDD /data/PublicDataSets/CDD | The Conserved Domain Database is a resource for the annotation of functional units in proteins |
GATK Bundle /data/PublicDataSets/GATKbundle | Standard files for working with human resequencing data with the GATK |
Galaxy hg datasets /data/PublicDataSets/galaxy | Reference genomes for use with Galaxy |
Illumina Genomes /data/PublicDataSets/genomes | Ready-to-use reference sequences and annotations |
ImageNet 2012 /data/PublicDataSets/ImageNet-2012/ | Annotated image database for machine learning, 2012 version |
ImageNet 2021 /data/PublicDataSets/ImageNet-2021/ | Annotated image database for machine learning, 2021 version. Full and resized images |
MAESTRO /data/PublicDataSets/MAESTRO | MAESTRO (MIDI and Audio Edited for Synchronous TRacks and Organisation) |
MusicNet /data/PublicDataSets/musicnet | A curated collection of labelled classical music in raw format. |
NCBI WGS /data/PublicDataSets/shared_dbs/wgs | Whole Genome Shotgun projects are genome assemblies of incomplete genomes |
NR Protein sequences /data/PublicDataSets/shared_dbs/nr | Non-redundant protein sequences from GenPept, Swiss-Prot, PIR, PRF, PDB, and NCBI RefSeq |
Prot_RefSeq /data/PublicDataSets/shared_dbs/prot_refseq | Protein data for a subset of commonly used model organisms, downloaded from NCBI |
Slakh /data/PublicDataSets/slakh2100/ | A dataset of multi-track audio and aligned MIDI for music source separation and multi-instrument automatic transcription. |
UniRef50 /data/PublicDataSets/shared_dbs/uniref50 | The UniProt Reference Clusters (UniRef) provide clustered sets of sequences from the UniProt knowledgebase |
UniProt /data/PublicDataSets/shared_dbs/uniprot | Database of protein sequence and functional information |
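
Before pointing a job at one of these locations, it can help to confirm that the directory is visible from the node you are working on and to see what it contains. The following is a minimal sketch, not an official tool: the helper name `dataset_entries` is hypothetical, and it assumes only that the paths in the table above are readable from your session.

```python
from pathlib import Path

# Root of the shared datasets, taken from the table above.
PUBLIC_ROOT = Path("/data/PublicDataSets")


def dataset_entries(relative_path, root=PUBLIC_ROOT):
    """Return the sorted names found directly inside a dataset directory.

    `dataset_entries` is a hypothetical helper written for this example.
    Raises FileNotFoundError if the directory is not visible from this node.
    """
    dataset_dir = Path(root) / relative_path
    if not dataset_dir.is_dir():
        raise FileNotFoundError(f"dataset directory not found: {dataset_dir}")
    return sorted(entry.name for entry in dataset_dir.iterdir())


if __name__ == "__main__":
    try:
        # List the top level of the shared BLAST databases directory.
        for name in dataset_entries("shared_dbs"):
            print(name)
    except FileNotFoundError as err:
        print(err)
```

Reading the data in place, rather than copying it into your own storage, is what avoids the duplication mentioned at the top of this page.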