Entrez
Entrez is the text-based search and retrieval system used at the National Center for Biotechnology Information (NCBI) for all of its major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, and Taxonomy.
Entrez is available as a module on Apocrita.
Usage
To run the default installed version of Entrez, simply load the entrez module:
module load entrez
Several tools are included in the Entrez suite, including einfo, efetch, elink, esearch and esummary.
- To retrieve a list of all valid Entrez databases, execute:
einfo -dbs
- To retrieve information about a specific database, execute:
einfo -db <DB>
- To retrieve formatted data records from a specific database, execute:
efetch -db <DB> -id <ID>
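The default output format depends on the database; use the -format option to request a specific one, for example -format fasta for sequence records.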
- To view document summaries for records in a target database that are linked to specific data records, execute:
elink -db <DB> -target <TARGET> -id <ID> | esummary
- To view document summaries matching a text query, execute:
esearch -db <DB> -query "<TEXT>" | esummary
In the two examples above, esummary is used to display the document summaries. Without this final stage of the pipeline, only the database record IDs are returned.
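The search-and-summarise pipeline above can be wrapped in a small shell function so that the database and query are checked before anything is sent to NCBI. This is only a minimal sketch: the function name entrez_summaries and the DRY_RUN variable are hypothetical and are not part of the Entrez suite.

```shell
#!/bin/bash
# Hypothetical helper (not part of Entrez): wraps the
# esearch | esummary pipeline shown above.
# Usage: entrez_summaries <DB> <TEXT>
# Set DRY_RUN=1 to print the pipeline instead of contacting NCBI.
entrez_summaries() {
    local db="$1" query="$2"

    # Refuse to run with a missing database name or query text.
    if [ -z "$db" ] || [ -z "$query" ]; then
        echo "usage: entrez_summaries <DB> <TEXT>" >&2
        return 2
    fi

    # Dry run: show the command that would be executed.
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf 'esearch -db %s -query "%s" | esummary\n' "$db" "$query"
        return 0
    fi

    esearch -db "$db" -query "$query" | esummary
}
```

For example, DRY_RUN=1 entrez_summaries nuccore 'opsin gene conversion' prints the pipeline without running it, which can be handy for checking quoting before submitting a job.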
Example jobs
Serial jobs
Here is an example job running on 1 core and 1GB of memory, to retrieve Taxonomy data from NCBI.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
module load entrez
elink -db nuccore -target taxonomy -id '1234' | esummary
Here is an example job running on 1 core and 1GB of memory, to view NCBI documents which match a text query.
#!/bin/bash
#$ -cwd
#$ -j y
#$ -pe smp 1
#$ -l h_rt=1:0:0
#$ -l h_vmem=1G
module load entrez
esearch -db nuccore -query 'opsin gene conversion' | esummary
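When a single serial job needs to run several text queries, the esearch and esummary pipeline can be repeated in a loop. The following is a sketch under stated assumptions, not a recommended implementation: the function name summarise_queries, the one-query-per-line input file and the query_N.xml output names are all hypothetical. The query file is read on a separate file descriptor, and esearch is given /dev/null as standard input, so that the pipeline cannot consume the remaining queries.

```shell
#!/bin/bash
# Hypothetical helper: run one esearch | esummary pipeline per line of a
# query file and save each result to its own file.
# Usage: summarise_queries <DB> <QUERY_FILE> <OUTPUT_DIR>
summarise_queries() {
    local db="$1" query_file="$2" out_dir="$3"
    local n=0 query

    mkdir -p "$out_dir" || return 1

    # Read queries on file descriptor 3 so that standard input stays
    # free; the final test also handles a last line with no newline.
    while IFS= read -r query <&3 || [ -n "$query" ]; do
        [ -z "$query" ] && continue    # skip blank lines
        n=$((n + 1))
        esearch -db "$db" -query "$query" < /dev/null \
            | esummary > "$out_dir/query_${n}.xml"
    done 3< "$query_file"

    # Report how many queries were processed.
    echo "$n"
}
```

Inside a job script, this might be called after module load entrez as summarise_queries nuccore queries.txt results, where queries.txt and results are placeholder names.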