Entrez¶

Entrez is the text-based search and retrieval system used at the National Center for Biotechnology Information (NCBI) for all of its major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, and Taxonomy.

Entrez is available as a module on Apocrita.

Usage¶

To run the default installed version of Entrez, simply load the entrez module:

module load entrez

The Entrez suite includes several tools, such as einfo, efetch, elink, esearch and esummary.

To retrieve a list of all valid Entrez databases, execute:

einfo -dbs

To retrieve information about a specific database, execute:

einfo -db <DB>

To retrieve formatted data records from a specific database, execute:

efetch -db <DB> -id <ID>

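The output format depends on the database. A specific format can be requested with the -format option, for example -format fasta for sequence records.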

To view document summaries for specific data records, execute:

elink -db <DB> -target <TARGET> -id <ID> | esummary

To view document summaries matching a text query, execute:

esearch -db <DB> -query "<TEXT>" | esummary

In the above two examples, esummary is used to display the document summaries. Without this final stage of the pipeline, elink and esearch return only a reference to the matching records rather than the summaries themselves.
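The same piping pattern can be used with other tools in the suite. As a minimal sketch, the function below pipes esearch into efetch and saves the matching records to a file in FASTA format. The function name fetch_fasta and its three arguments are illustrative assumptions and not part of Entrez; it also assumes the chosen database holds sequence records, for which -format fasta is meaningful.

```shell
#!/bin/bash
# Hypothetical helper (not part of Entrez): search a database and save
# the matching records in FASTA format.
# Usage: fetch_fasta <DB> "<TEXT>" <OUTPUT_FILE>
fetch_fasta() {
    local db="$1"
    local query="$2"
    local out="$3"

    # Stop early if the Entrez tools are not available,
    # e.g. because "module load entrez" has not been run.
    if ! command -v esearch >/dev/null 2>&1 || ! command -v efetch >/dev/null 2>&1; then
        echo "Entrez tools not found; run 'module load entrez' first" >&2
        return 127
    fi

    # esearch passes a reference to the matching records down the pipe;
    # efetch then retrieves the records themselves in the requested format.
    esearch -db "$db" -query "$query" | efetch -format fasta > "$out"
}
```

After loading the module, it could be called as, for example, fetch_fasta nuccore 'opsin gene conversion' opsin.fasta, where opsin.fasta is an arbitrary output file name.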

Example jobs¶

Serial jobs¶

Here is an example job running on 1 core with 1GB of memory, to retrieve Taxonomy data from NCBI:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

module load entrez

elink -db nuccore -target taxonomy -id '1234' | esummary

Here is an example job running on 1 core with 1GB of memory, to view NCBI documents which match a text query:

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

module load entrez

esearch -db nuccore -query 'opsin gene conversion' | esummary
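When a job needs to look up more than one record, the elink pipeline shown earlier can be wrapped in a loop. The following is a sketch only: the function name summarise_ids, the ID list file and the summary_<ID>.out naming scheme are all assumptions made for illustration. Standard input is redirected for elink as a precaution, in case the tool reads from it and consumes the rest of the ID list.

```shell
#!/bin/bash
# Hypothetical helper (not part of Entrez): write a taxonomy summary for
# every nuccore ID listed, one per line, in a file.
# Usage: summarise_ids <ID_FILE> <OUTPUT_DIR>
summarise_ids() {
    local id_file="$1"
    local out_dir="$2"
    local id

    mkdir -p "$out_dir"

    # The "|| [ -n "$id" ]" part handles a last line with no trailing newline.
    while IFS= read -r id || [ -n "$id" ]; do
        # Skip blank lines in the ID list.
        if [ -z "$id" ]; then
            continue
        fi

        # Same pipeline as the taxonomy job, run once per ID. Standard input
        # is redirected so elink cannot swallow the remaining lines of the list.
        elink -db nuccore -target taxonomy -id "$id" < /dev/null \
            | esummary > "$out_dir/summary_${id}.out"
    done < "$id_file"
}
```

In a job script, the function would be defined after the module load entrez line and then called as, for example, summarise_ids ids.txt summaries, where ids.txt and summaries are placeholder names.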
