Entrez

Entrez is the text-based search and retrieval system used at the National Center for Biotechnology Information (NCBI) for all of the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, and Taxonomy.

Entrez is available as a module on Apocrita.

Usage

To run the latest installed version of Entrez, simply load the entrez module:

module load entrez

The Entrez suite includes several tools, among them einfo, efetch, elink, esearch and esummary.
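
Before running any of the commands below, it can help to confirm that the tools are actually on the PATH, for example after loading the module. The following is a minimal sketch rather than part of the Entrez suite: the function name check_entrez_tools is hypothetical, and the tool list is simply the one given above.

```shell
# Hypothetical helper: report whether the Entrez tools named above are on PATH.
check_entrez_tools() {
    missing=""
    for tool in einfo efetch elink esearch esummary; do
        # command -v succeeds if the name resolves to a command.
        if ! command -v "$tool" >/dev/null 2>&1; then
            missing="$missing $tool"
        fi
    done
    if [ -n "$missing" ]; then
        echo "Missing:$missing (has the entrez module been loaded?)" >&2
        return 1
    fi
    echo "All Entrez tools found"
}
```

Calling check_entrez_tools straight after module load entrez in a job script gives an early, readable failure if the module did not load.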

  • To retrieve a list of all valid Entrez databases, execute:
einfo -dbs
  • To retrieve information about a specific database, execute:
einfo -db <DB>
  • To retrieve formatted data records from a specific database, execute:
efetch -db <DB> -id <ID>

The output format can be selected with the -format option; the formats available depend on the database.

  • To view document summaries for specific data records, execute:
elink -db <DB> -target <TARGET> -id <ID> | esummary
  • To view document summaries matching a text query, execute:
esearch -db <DB> -query "<TEXT>" | esummary

In the two examples above, the output of elink or esearch is piped to esummary to display the document summaries. Without this pipe, only the database record IDs are returned.

Example jobs

Here are two example jobs, each requesting 1 core and 1GB of memory.

Retrieving Taxonomy data from NCBI

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

module load entrez

elink -db nuccore -target taxonomy -id '1234' | esummary
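
The job above looks up a single ID. If several IDs are needed, one possible approach is to read them from a file and run the same pipeline for each. This is only a sketch under stated assumptions: the function name taxonomy_for_ids and the one-ID-per-line input file are hypothetical, and only the elink and esummary invocation is taken from the job above.

```shell
# Hypothetical helper: run the elink | esummary pipeline from the job above
# for every ID listed in a file (one ID per line).
taxonomy_for_ids() (
    id_file="$1"
    # Read the IDs on file descriptor 3 so that commands inside the loop
    # cannot consume the list through standard input.
    while IFS= read -r id <&3 || [ -n "$id" ]; do
        # Skip blank lines and comment lines starting with '#'.
        case "$id" in ''|'#'*) continue ;; esac
        echo "== $id =="
        elink -db nuccore -target taxonomy -id "$id" | esummary
    done 3< "$id_file"
)
```

In a job script this could be called as taxonomy_for_ids ids.txt after module load entrez, where ids.txt is an illustrative file name.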

Viewing NCBI documents which match a text query

#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G

module load entrez

esearch -db nuccore -query 'opsin gene conversion' | esummary
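
When several text queries are needed, they could be run within one job and each result saved to its own file. The following is a rough sketch, not a documented Apocrita or Entrez feature: the function name run_queries, the output directory and the query_N.out file naming are all hypothetical, and only the esearch and esummary pipeline comes from the job above.

```shell
# Hypothetical helper: run each text query against one database and save
# the document summaries for each query to a separate file.
run_queries() (
    db="$1"
    outdir="$2"
    shift 2
    mkdir -p "$outdir"
    n=0
    for query in "$@"; do
        n=$((n + 1))
        # Same pipeline as the job above, with the output redirected to a file.
        esearch -db "$db" -query "$query" | esummary > "$outdir/query_${n}.out"
    done
    echo "Wrote $n result file(s) to $outdir"
)
```

For example, run_queries nuccore results 'opsin gene conversion' would write the summaries for that query to results/query_1.out; further queries can be appended as extra arguments.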
