NCBI Resource Guide

Each link in this Resource Guide leads to a brief description of the resource on this page, then to the resource itself. A graphical Site Map and an Alphabetical Quicklinks Table provide direct links to resources and bypass the descriptions.



RESOURCES BY CATEGORY

About NCBI
programs and services, contact information, NCBI handbook, news (what's new, NCBI News, announcements e-mail lists, RSS feeds), exhibit schedule, postdoctoral fellowships, organizational structure, resource statistics, site search

GenBank
overview, submit sequences, submit genomes, sample record, GenBank divisions, statistics, release notes, international collaboration, FTP GenBank

Molecular Databases
nucleotides, proteins, structures, genes, gene expression, taxonomy

Literature Databases
PubMed, PubMedCentral, Journals, OMIM, Books, Citation Matcher

Genomes and Maps
organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs), and organism-specific resources, such as: human, mouse, rat, cow, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles

Tools
Entrez, LinkOut, My NCBI, BLAST, nucleotide sequence analysis, protein sequence analysis, 3-D structure display and similarity searching, genome analysis, gene expression

Research at NCBI
Computational Biology Branch (CBB), senior investigators in PubMed, seminar schedule, postdoctoral fellowships

Software Engineering
IEB home page, NCBI ToolBox, R&D projects, ASN.1

Education
news, science primer, books, glossaries, tutorials, courses, additional resources

FTP Site
download databases, genomes, and software, NCBI Software ToolBox
ALPHABETICAL INDEX
with links to resource descriptions
(To bypass descriptions, use the Alphabetical Quicklinks Table.)
About NCBI | GenBank sample record | Plant Genomes
Announcements | Genes | Protein Sequences
ASN.1 | Genes and Disease | PubChem
BankIt | Genomes (data, projects, submissions) | PubMed
BLAST | GENSAT | PubMed Central
BLink | GEO | RefSeq
Books | Glossaries | Research at NCBI
Cancer Chromosomes | Handbook | Retroviruses
CCDS | HIV Interactions | SAGEmap
CDART | HTGs | Science Primer
CDD | HomoloGene | Seminars
CGAP | Human Genome Resources | Sequin
Clones | Human-Mouse Homology Maps | Site Search
Cn3D | Journals | SKY/M-FISH & CGH Database
Coffee Break | LinkOut | Software Engineering
COGs | Malaria | Splign
Computational Biology Branch | Map Viewer | Statistics
Data Submissions | MeSH | Structures
dbEST | MGC | Submit Data
dbGSS | Microbial Genomes | Taxonomy
dbMHC | MMDB | Tools
dbSNP | Model Maker | TPA
dbSTS | Mutation Databases | Trace Archive
Education | My NCBI | UniGene
e-PCR | NCBI Home | UniSTS
Entrez | NCBI News | VAST
Entrez Utilities | Nucleotide Sequences | VecScreen
Expression | OMIM | Viruses
FTP | OMSSA | WGS
GenBank | ORF Finder | What's New

   indicates a resource which has become available in the last 12 months.  

About NCBI Overview

About NCBI - The science behind our resources. An introduction for researchers, educators and the public. Includes a Science Primer, with plain language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics.
Programs and Services - basic research, databases and software, outreach and education
Contact Information - postal address, phone, e-mail addresses for various services
Exhibit Schedule - NCBI exhibits at upcoming conferences
NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other.
Organizational Structure - functions of the three NCBI branches: Computational Biology Branch (CBB), Information Engineering Branch (IEB), and Information Resources Branch (IRB)
Board of Scientific Counselors - advises the NIH Director, the Deputy Director for Intramural Research, the NLM Director, and the NCBI Director about the intramural research and development programs of the NCBI.
Postdoctoral Fellowships - general information, application procedure
Statistics for NCBI Resources - A page listing statistics that are available for selected NCBI resources, including number of records present in various databases, number of genomes available at NCBI and statistics for the individual genomes, and server usage.
Site Search - Search the NCBI web site and display results in various formats. The default Homepage view sorts NCBI pages based on the number of other NCBI pages that link to them. The NCBI Site Search function is part of the Entrez system (described below). Therefore, the search features described in the Entrez help document also apply to the site search function.
News and Announcements
  • What's New - recently released resources and enhancements to existing resources.
  • NCBI News - announcements about new resources, enhancements to existing resources, staff publications, tutorials, FAQs.
  • NCBI Announcements Email Lists - Receive announcements about changes and updates to a variety of NCBI services. In addition to a general NCBI-announce list, topic-specific e-mail lists are available for BLAST, GenBank, dbSNP, Genomes, LinkOut, RefSeq, Sequin, and Entrez Utilities (for making WWW Links to Entrez). Follow the link to the NCBI Announcements Email Lists page to see a complete list of available topics. Information on how to subscribe is provided.
  • NCBI RSS Feeds - Receive announcements about various NCBI services using an RSS (Really Simple Syndication) feed reader. RSS feeds are available for resources such as Bookshelf, HomoloGene, PubMed Central, PubMed New and Noteworthy, Probe Database, and UniGene. Follow the link to the NCBI RSS Feeds page to see a complete list of available topics. Additional information about RSS is provided in a short series of FAQs.

GenBank Overview

General Information (sample record, release notes, GenBank divisions, statistics),   Submissions (general, special categories, other data types),   International Collaboration,   FTP GenBank
 

General Information

What is GenBank? - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described below), which also includes EMBL and DDBJ. GenBank is updated daily in NCBI search systems, and a full release is issued on the FTP site on approximately the 15th of every February, April, June, August, October, and December. Each full release contains all the data present in GenBank as of the cutoff date specified in the release notes (described below). The FTP site also provides daily cumulative and non-cumulative update files (more about the FTP site below).
Sample Record - detailed description of each field in a GenBank record.
Includes, for example, information about accession number formats, sequence identifiers (GI number and accession.version), a listing of GenBank divisions, and more. Describes some commonly annotated biological features, such as CDS, and provides links to documents that list and define the complete set of biological features that can be annotated on sequence records. Includes a link to a sequence revision history tool that can be used to track changes that have occurred to the sequence data in a record.  Also lists the Entrez search field(s) that can be used to search each part of a sequence record.
GenBank Divisions - summary of GenBank divisions, including abbreviations, full spellings, information about what the GenBank divisions are, and what they are not. (This information is part of the GenBank sample record, described above.)
Access GenBank - through Entrez Nucleotides. Search by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is below. Use BLAST for sequence similarity searches against GenBank and other databases. An option to download the GenBank full release and updates via FTP is also available.
Growth Statistics (graph) - see also Release Notes sections 2.2.6 (per division statistics), 2.2.7 (per organism statistics), 2.2.8 (growth of GenBank). For statistics on other NCBI databases, please see the page that summarizes sources of Statistics for NCBI Resources.
GenBank Release Notes - A document that accompanies each full release (described in "What is GenBank?", above) of the GenBank database. The release notes describe the format and content of the flat files that comprise the release. They also include notices of recent and upcoming changes, information about GenBank divisions, growth statistics, citing GenBank, and more.
Genetic Codes - synopsis of 17 genetic codes; used to ensure correct translation of coding sequences in GenBank records.
GenBank Bionet Newsgroup - A moderated list that includes announcements of new GenBank releases, recent and upcoming changes, and discussion among subscribers. For information on how to subscribe by e-mail, see the NCBI Announcements Email Lists page.

GenBank Submissions
General Information
 
In addition to GenBank, there are other databases at NCBI to which a variety of data types can be submitted (third party annotations (TPA), variation, expression, MHC data, SKY/M-FISH/CGH data, traces).
 
Submission Software Programs
  • BankIt - WWW submission tool for one or few submissions, designed to make the submission process quick and easy.  (BankIt also automatically uses VecScreen to identify segments of nucleic acid sequence which may be of vector, adapter, or linker origin to combat the problem of vector contamination in GenBank.)
  • Sequin - submission software program for one or many submissions, long sequences, complete genomes, alignments, population/phylogenetic/mutation studies. Can be used as a stand-alone application or in a TCP/IP-based "network aware" mode, with links to other NCBI resources and software such as Entrez.  (Use VecScreen prior to submission).  To receive announcements about updates to the Sequin submission software, see the NCBI Announcements Email Lists page.

Special Types of Submissions to GenBank
Genomes,   Alignments,   ESTs,   GSSs,   HTGs,   STSs,   WGS
 
  • Submission of complete genomes and other large sequence records - Recent enhancements to Sequin make it convenient for genome sequencing centers to annotate their records with Sequin and submit the resulting ASN.1 file to GenBank. After the Sequin files are prepared, large genomes should be submitted by FTP; write to genomes@ncbi.nlm.nih.gov to obtain an FTP account. Records smaller than 350 kb can be sent by e-mail to gb-sub@ncbi.nlm.nih.gov.

    More information about submitting genomes and other large sequence records is provided on the following pages: GenBank submissions, Sequin, tabular layout for submitting annotated features, bacterial genome submission guidelines.

    In addition, sequencing centers can register a sequencing project with NCBI prior to the submission of any data. This can be done through a Genome project submission form. For each registered project, NCBI will create a sequencing project page that describes the project, links out to genome-specific resources, and provides a focal point for the addition of links to NCBI resources such as Map Viewer and genomic BLAST. Projects can be listed publicly or remain unlisted, and sequences may be held until publication (the default), released immediately, or made available for BLAST searches only. The form can also be used to set up an FTP site for the upload of data to NCBI, or to specify a URL to be used by NCBI for download of project or sequence data. (See Fall 2003/Winter 2004 issue of NCBI News for more information.)
  • ESTs - expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments.
  • GSSs - genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others.
  • HTGs - high throughput genome sequences from large scale genome sequencing centers; unfinished (phase 0, 1, 2) and finished (phase 3) sequences. (Note that contigs assembled from draft and finished human HTG sequences are accessible from the Map Viewer, described below.)
  • STSs - sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents.
  • WGS - data from Whole Genome Shotgun (WGS) sequencing projects can be submitted to GenBank. The data can contain annotations and an entire project is updated as sequencing progresses. WGS submissions are given accession numbers in the format of four letters followed by eight digits, e.g., XXXX00000000. The four letters are a stable project_ID, which does not change as the project is updated. The first two digits represent the version number, which corresponds to a particular project update. The last six digits represent an individual contig within the WGS project. For example, if a project's assigned accession number is XXXX00000000, then that project's first assembly version would be XXXX01000000, and the first contig of that version would be XXXX01000001. (more...)
    The nucleotide data from WGS projects go into the appropriate organismal GenBank Divisions and the BLAST wgs database. The protein translations of annotated coding sequences go into the BLAST protein nr database. In addition, quality data from many WGS projects are submitted to the Trace Archive (described in the ResourceGuide section on Nucleotide Sequence Databases).
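The WGS accession layout described above (four-letter project ID, two-digit version, six-digit contig number) can be illustrated with a short sketch. This is a minimal example based only on the format as stated in this guide; the names `WgsAccession`, `parse_wgs_accession`, and `format_wgs_accession` are hypothetical and are not part of any NCBI tool.

```python
import re
from typing import NamedTuple

# Format as described in the text: four letters followed by eight digits.
# The digit groups are split 2 + 6 (version, then contig).
WGS_PATTERN = re.compile(r"^([A-Z]{4})(\d{2})(\d{6})$")


class WgsAccession(NamedTuple):
    project_id: str  # stable four-letter project prefix
    version: int     # assembly version (first two digits)
    contig: int      # contig number within that version (last six digits)


def parse_wgs_accession(accession: str) -> WgsAccession:
    """Split a WGS-style accession into project ID, version, and contig."""
    match = WGS_PATTERN.match(accession.strip().upper())
    if match is None:
        raise ValueError(f"not a WGS-style accession: {accession!r}")
    project_id, version, contig = match.groups()
    return WgsAccession(project_id, int(version), int(contig))


def format_wgs_accession(acc: WgsAccession) -> str:
    """Rebuild the accession string from its three parts."""
    return f"{acc.project_id}{acc.version:02d}{acc.contig:06d}"
```

With the placeholder accession used in the text, `parse_wgs_accession("XXXX01000001")` yields project ID `XXXX`, version 1, contig 1, and a version-0, contig-0 value formats back to the project-level accession `XXXX00000000`.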
Other Types of Data Submissions
(Other NCBI databases, separate from GenBank, to which data can be submitted)
  • Third Party Annotations (TPA) - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. Additional information about the TPA database is provided below.

International Nucleotide Sequence Database Collaboration

GenBank, DDBJ, EMBL - Overview of collaborative projects and links to home pages. The GenBank, DDBJ (DNA Data Bank of Japan), and EMBL (European Molecular Biology Laboratory) databases share data on a daily basis and are therefore equivalent. The record formats and search systems might differ among the databases, but the accession numbers, sequence data, and annotations are the same in all of them. E.g., you can retrieve the record with accession number U12345 from GenBank, DDBJ, or EMBL and it will contain the same sequence data, references, etc. in all three databases.
DDBJ/EMBL/GenBank Feature Table - feature table formats and standards used in the annotation of sequence records by the collaborating databases; makes possible sharing of data; includes detailed appendices such as:
  • biological features reference key (alphabetical list also available)
  • feature qualifiers
  • IUPAC abbreviations for nucleotides
  • IUPAC abbreviations for amino acids
FTP GenBank and Daily Updates

    GenBank flat file format - see sample GenBank record and detailed description in GenBank release notes; download most recent full release (described above) and daily cumulative or non-cumulative update files.
    ASN.1 format - Abstract Syntax Notation 1, an International Standards Organization (ISO) data representation format; download most recent full release (described above) and daily cumulative or non-cumulative update files.  (more on ASN.1)
    FASTA format - definition line followed by sequence data only (example); see readme file for database descriptions, including nt.Z (daily updated non-redundant BLAST nucleotide database, contains GenBank+EMBL+DDBJ+PDB sequences, but no EST, STS, GSS, or HTGS sequences), nr.Z (daily updated non-redundant proteins), est.Z, gss.Z, htg.Z, sts.Z, and others.
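As an illustration of the FASTA layout mentioned above (a definition line followed by sequence data only), the following is a minimal reader sketch. It assumes the usual convention that a definition line begins with ">"; the function name `read_fasta` is hypothetical, and the sketch does not handle the compressed `.Z` files themselves.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (definition_line, sequence) pairs from FASTA-formatted lines.

    The leading ">" is removed from each definition line, and sequence
    lines belonging to one record are joined into a single string.
    """
    header = None
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because the function accepts any iterable of lines, it can be fed an open text file handle or a list of strings, and it processes one record at a time rather than loading a whole database file into memory.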


    Molecular Databases Overview

    Nucleotide Sequences,   Protein Sequences,   Structures,   Genes,   Expression,   Taxonomy
     

    Nucleotide Sequence Databases

    Entrez Nucleotides - combines data from a number of source databases, including GenBank, RefSeq, TPA, and PDB. Data can be searched by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez below. For retrieval of large data sets, Batch Entrez (described below) is available.
    GenBank - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described above), which also includes EMBL and DDBJ. A sample record, which provides a detailed description of each field in a GenBank record, is also available. A variety of sequence records exist in GenBank, such as characterized genes that have been well-studied and annotated, batch produced sequences (ESTs, GSSs, STSs), high throughput genomic sequences, complete genomes, and more. Additional information about GenBank is given in the GenBank Overview section of this guide.
    RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Nucleotide sequence records have accessions: NT_123456, NM_123456, NC_123456, NG_123456, XM_123456, XR_123456 (more info about accession numbers and access). Additional details about RefSeq are provided in the NCBI Handbook, which is available online in the Entrez Books database.
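The accession format described above (two letters, an underscore, and six digits) lends itself to a simple check. The sketch below covers only the nucleotide prefixes listed in this entry and only the six-digit form described here; the function name `refseq_nucleotide_prefix` is hypothetical, and version suffixes are not handled.

```python
import re
from typing import Optional

# Nucleotide prefixes exactly as listed in the RefSeq entry above.
REFSEQ_NUCLEOTIDE_PREFIXES = {"NT", "NM", "NC", "NG", "XM", "XR"}

# Two letters, an underscore, and six digits, per the description above.
REFSEQ_PATTERN = re.compile(r"^([A-Z]{2})_(\d{6})$")


def refseq_nucleotide_prefix(accession: str) -> Optional[str]:
    """Return the two-letter prefix if the accession looks like a RefSeq
    nucleotide record in the format described above, otherwise None."""
    match = REFSEQ_PATTERN.match(accession.strip())
    if match and match.group(1) in REFSEQ_NUCLEOTIDE_PREFIXES:
        return match.group(1)
    return None
```

For example, `refseq_nucleotide_prefix("NM_123456")` returns `"NM"`, while a GenBank-style accession such as `U12345` returns `None`.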
    Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site.
    Third Party Annotation (TPA) database - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page.
    Note:  Although TPA records are derived from DDBJ/EMBL/GenBank, TPA is actually a separate database. Therefore, TPA records are not present in the GenBank FTP files, but will be available in separate FTP files.

    The TPA database uses an accession format similar to that of GenBank records (e.g., two letters followed by six digits) and is organized into similar divisions. (A list of GenBank divisions is given in the GenBank Sample Record. Some divisions, such as EST, GSS, and HTG, are present in GenBank but will not be present in TPA.)

    TPA records can be easily recognized because the definition lines begin with the letters "TPA", and they contain "Third Party Annotation; TPA" in the Keywords field. This is illustrated in a sample TPA record, BK000627.

    TPA records can be retrieved from Entrez Nucleotides (described above). To only see data from TPA, use the "Index" mode to select "tpa" from the Properties search field, or simply add the command AND tpa[prop] to your query.

    Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. An announcement and additional information about the TPA database is provided in section 1.4.5, "Third-Party Annotation and Consensus Sequences (TPA)" of the GenBank 133.0 release notes.
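The TPA restriction described above (adding AND tpa[prop] to a query) can be sketched as a small query-building helper. Only the `tpa[prop]` term comes from this guide. The function names, the E-utilities ESearch address, and the `db=nucleotide` value are assumptions made for illustration, and the sketch only builds a URL string without contacting any server.

```python
from urllib.parse import urlencode

# Assumption: the Entrez Utilities ESearch endpoint; not stated in this guide.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def restrict_to_tpa(query: str) -> str:
    """Append the TPA property filter described in the text to a query."""
    query = query.strip()
    if not query:
        return "tpa[prop]"
    # Parentheses keep the original query grouped before the AND is applied.
    return f"({query}) AND tpa[prop]"


def tpa_search_url(query: str, db: str = "nucleotide") -> str:
    """Build (but do not fetch) a search URL limited to TPA records.

    The database name is an assumed value for Entrez Nucleotides.
    """
    params = {"db": db, "term": restrict_to_tpa(query)}
    return ESEARCH_URL + "?" + urlencode(params)
```

For instance, `restrict_to_tpa("kinase")` produces `(kinase) AND tpa[prop]`, which can also be typed directly into the Entrez Nucleotides search box.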
    dbEST - database of expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments.
    Note: EST sequences are available from two sources: dbEST and the EST division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    dbGSS - database of genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others.
    Note: GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    dbMHC - Provides a platform where the human leukocyte antigen (HLA) community can submit, edit, view, and exchange Major Histocompatibility Complex (MHC) data. The MHC database is fully integrated with other NCBI resources, as well as with the International Histocompatibility Working Group (IHWG) Web site, and provides links to the IMmunoGeneTics HLA (IMGT/HLA) database. Additional details are available in the NCBI Handbook.
    dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation. dbSNP includes polymorphism data that are experimentally derived or computationally derived, as well as hybrid data determined by aligning an experimentally derived molecule to genomic sequence data. Currently, dbSNP comprises four general classes of submissions: (a) The SNP Consortium (TSC) - candidate SNPs identified by sequencing, using either the reduced representation shotgun strategy or alignment of random reads to genomic sequence; (b) Overlaps - candidate SNPs identified in sequence overlaps between individual BACs or PACs; (c) ESTs - SNPs identified in EST clusters, including those identified by the Cancer Genome Anatomy Project (described below); (d) Other - SNPs identified after screening larger numbers of chromosomes, including many with alleles of lower frequency (1%-20%). (data submission instructions) To receive announcements about updates and new features to dbSNP, see the NCBI Announcements Email Lists page.
    Note: Although dbSNP is a separate database from GenBank, SNP records include cross-references to GenBank records.  
    dbSTS - database of sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents.
    Note: STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ.  (data submission instructions...)
    UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson laboratory's MGD map).

    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page.
    Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community.
    Trace Archive - a repository of the raw sequence traces generated by large sequencing projects. It allows retrieval of both the sequence file and the underlying data which generated the file. In the case of projects that rely on a Whole Genome Shotgun (WGS) strategy, the Trace Archive will be the sole source of raw sequence data. (More information about WGS projects is provided in the ResourceGuide section on special types of submissions to GenBank/WGS.) NCBI will be exchanging data regularly with the Ensembl Trace Server. The Trace Archive can be searched by using MegaBLAST (described below), or by entering a term in the search box at the top of the Trace Archive Page. (data submission instructions...)
    Assembly Archive - links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram.
    UniVec - a database that can be used to quickly identify segments within nucleic acid sequences which may be of vector origin. Screening using UniVec is efficient because a large number of redundant sub-sequences have been eliminated to create a database that contains only one copy of every unique sequence segment from a large number of vectors. The VecScreen tool, described below (under sequence analysis tools), can be used to compare a query sequence against the UniVec database in order to identify possible vector contamination.
    Genomes - Resources in the Genomes and Maps section contain the nucleotide sequences for a variety of genomes. Examples of the genomes available include:   >1000 organisms in Entrez Genome, human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles.
    Nucleotide Sequence Analysis - various tools are available for analyzing nucleotide sequences and are described below.

    Protein Sequence Databases

    Entrez Proteins - search protein sequence records (from GenPept + RefSeq + Swiss-Prot + PIR + RPF + PDB) by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez below. For retrieval of large data sets, Batch Entrez (described below) is available. Entrez proteins also includes BLink ("BLAST Link"), a feature which displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. More information about BLink is provided below.
    RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Protein sequence records have accessions: NP_123456 or XP_123456 (more info about accession numbers and access).
    FTP GenPept - download the "relxxx.fsa_aa.gz" file. The filename stands for "Release number XXX FASTA formatted amino acid translations". The translations are extracted from GenBank/EMBL/DDBJ records that are annotated with one or more CDS features.
    Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described below). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases".
    PROW - Protein Resources on the Web - short authoritative guides on the approximately 200 human CD cell-surface molecules. Peer-reviewed; provides approximately 20 standardized categories of information (biochemical function, ligands, etc.) for each CD antigen.
    Protein Sequence Analysis - various tools are available for analyzing protein sequences and are described below.
    Proteomes
    • Entrez Genome - provides ProtTable and TaxTable for various organisms. The ProtTable provides a summary of protein coding regions in a genome, and provides links to the corresponding nucleotide and protein sequences in FASTA format. The TaxTable, also referred to as the "distribution of BLAST protein homologs by taxa," summarizes the results of BLAST analyses done for the proteins, and displays the relationship of the organism to others through a color-coded graphical summary. (Additional information about Entrez Genome is provided below.)
    • FTP Genome Proteins - download an *.faa file (FASTA formatted amino acid sequences) and *.ptt file (protein table) for various organisms from the genbank/genomes directory of the FTP site; see the readme file for more information. Protein tables can also be viewed in Entrez Genome, as noted above.
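    Both the GenPept release file and the per-organism *.faa files described above are FASTA formatted. As a minimal sketch, assuming each record begins with a ">" header line followed by one or more sequence lines, such a file could be read with the Python standard library. The function names and the sample data are hypothetical and are not part of any NCBI toolkit.

```python
import gzip
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def read_fasta_file(path: str) -> Iterator[Tuple[str, str]]:
    """Open a plain or gzip-compressed FASTA file and yield its records."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        yield from read_fasta(handle)


# Hypothetical two-record example, not real GenPept data.
SAMPLE = ">prot1 example protein\nMKT\nAYI\n>prot2 another example\nMSL\n"
records = list(read_fasta(io.StringIO(SAMPLE)))
```

    Reading record by record keeps memory use low, which matters for a full release file; read_fasta_file is intended for a downloaded *.faa or *.gz file.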

    Structure Databases

    Structure Home - general information about the NCBI Structure Group and its research projects, as well as access to the Molecular Modeling Database (MMDB) and related tools to search and display structures.
    MMDB: Molecular Modeling Database - a database of three-dimensional biomolecular structures derived from X-ray crystallography and NMR spectroscopy. MMDB is a subset of three-dimensional structures obtained from the Brookhaven Protein Data Bank (PDB), excluding theoretical models. MMDB reorganizes and validates the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. Its data specification includes a description of a biopolymer's spatial structure, a description of how it is organized chemically, and a set of pointers linking the two. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction. MMDB records are stored in ASN.1 format and can be displayed with the Cn3D, Rasmol, or Kinemage viewers. In addition, similar structures within the database have been identified using VAST, and new structures can be compared against the database using VAST Search.
    3D Domains Database - compact structural domains identified automatically in MMDB, Entrez's macromolecular three-dimensional structure database. These domains are identified by searching for breakpoints in the structure between major secondary structure elements so that the ratio of intra- to inter-domain contacts falls above a set threshold. 3D Domains are the units of comparison for structure neighbor ("related structures") calculations using the VAST algorithm.
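    The contact-ratio criterion in the entry above can be illustrated with a small sketch. It assumes a single chain split into two candidate domains at one breakpoint, and a precomputed list of residue-residue contacts; the function name, the contact list, and the way contacts are defined are all hypothetical, and the actual MMDB procedure is not described in this guide.

```python
from typing import Iterable, Tuple


def contact_ratio(contacts: Iterable[Tuple[int, int]], split_at: int) -> float:
    """Ratio of intra- to inter-domain contacts for a split at `split_at`.

    Residues numbered below `split_at` form one candidate domain and the
    remaining residues form the other (an illustrative assumption).
    """
    intra = 0
    inter = 0
    for i, j in contacts:
        if (i < split_at) == (j < split_at):
            intra += 1  # both residues lie in the same candidate domain
        else:
            inter += 1  # the contact crosses the breakpoint
    return float("inf") if inter == 0 else intra / inter


# Hypothetical residue contacts for a ten-residue chain.
example_contacts = [(1, 2), (2, 3), (3, 8), (8, 9), (9, 10), (7, 8)]
ratio = contact_ratio(example_contacts, split_at=5)
```

    A breakpoint would be accepted only if this ratio exceeds the chosen threshold; the threshold value itself is not given here.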
    Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described above). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures.
    PubChem - contains the chemical structures of small organic molecules and information on their biological activities. It is intended to support the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. PubChem's chemical structure database may be searched on the basis of descriptive terms, chemical properties, and structural similarity. When possible, PubChem's chemical structure records are linked to other NCBI databases, including the PubMed scientific literature database and NCBI's protein 3D structure database. PubChem also contains the results of high-throughput biological screening experiments. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system.
    • PubChem Substance - Primary data that NCBI obtains from the various public depositories. The PubChem Substance database contains approximately 13 million records as of October 2006, provided by various sources, including DTP/NCI, NIAID, ChemIDplus, NIST, NIST webbook, MOLI/NCI, ChemBank, MMDB, KEGG, and more. Substance information includes chemical structures, synonyms, registration IDs, descriptions, related URLs, and database cross-reference links to PubMed, protein 3D structures, and biological screening results.
    • PubChem Compound - A database made by NCBI and derived from PubChem Substance. It is a non-redundant view of the chemically validated substances in PubChem Substance. There is one PubChem Compound record for each unique substance, and for each unique substance component. There can be multiple PubChem Substance records associated with one PubChem Compound record. PubChem Compound contains all standardized structures, mixture components, and precalculated structure neighboring links. Compound information includes structure, compound property information (molecular weight, formula, xLogP, count of the rotatable bonds, H bond donor, H bond acceptor, etc.), and structure description (SMILES, IUPAC name, InChI).
    • PubChem BioAssay - The assay database consists of deposited bioactivity data and descriptions of bioactivity assays used for screening of the chemical substances contained in PubChem Substance, including descriptions of the conditions and the readouts (bioactivity levels) specific to the screening procedure. The assay database includes DTP/NCI's 710 million lines of in vitro and in vivo data covering fields that range from cancer and HIV to many others.
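    The many-to-one relationship between PubChem Substance and PubChem Compound records described above can be sketched as a grouping step. This is an illustration only: the Substance class, its fields, and the use of a plain string as a stand-in for a standardized structure are assumptions, and PubChem's real standardization is far more involved.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Substance:
    sid: int            # hypothetical substance identifier
    source: str         # hypothetical depositor name
    structure_key: str  # stand-in for a standardized structure (assumption)


def group_into_compounds(substances: List[Substance]) -> Dict[str, List[int]]:
    """Map each unique structure key to the substance IDs that share it."""
    compounds: Dict[str, List[int]] = defaultdict(list)
    for sub in substances:
        compounds[sub.structure_key].append(sub.sid)
    return dict(compounds)


# Hypothetical deposits: two depositors submit the same structure.
deposits = [
    Substance(1, "Depositor A", "CCO"),
    Substance(2, "Depositor B", "CCO"),
    Substance(3, "Depositor A", "c1ccccc1"),
]
compounds = group_into_compounds(deposits)
```

    Each key in the result plays the role of one non-redundant compound record, with the list holding the substance records associated with it.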
    Structure-Related Tools - in addition to the structure databases described above, NCBI offers several tools:
    • Cn3D - "See in 3-D," a structure and sequence alignment viewer for NCBI databases. It allows viewing of 3-D structures and sequence-structure or structure-structure alignments. Cn3D can work as a helper application to your browser, or as a client-server application that retrieves structure records from MMDB (described above) directly over the internet. The Cn3D home page provides access to information on how to install the program, a tutorial to get started, and a comprehensive help document.
    • CD-Search - The Conserved Domain Search Service (CD-Search) can be used to identify the conserved domains present in a protein sequence. CD-Search uses RPS-BLAST (described above) to compare a query sequence against position-specific score matrices that have been prepared from conserved domain alignments present in the Conserved Domain Database (CDD) (described above). Hits can be displayed as a pairwise alignment of the query sequence with a representative domain sequence, or as a multiple alignment. Alignments are also mapped to known 3-dimensional structures, and can be displayed using Cn3D (described above). In the Cn3D display, residues in sequence alignments are variously colored, based on their degree of conservation.
    • VAST - Vector Alignment Search Tool - a computer algorithm developed at NCBI and used to identify similar protein 3-dimensional structures. The "structure neighbors" for every structure in MMDB are pre-computed and accessible via links on the MMDB Structure Summary pages. These neighbors can be used to identify distant homologs that cannot be recognized by sequence comparison alone.
    • VAST Search - structure-structure similarity search service. Compares 3D coordinates of a newly determined protein structure to those in the MMDB/PDB database. VAST Search computes a list of structure neighbors that you may browse interactively, viewing superpositions and alignments by molecular graphics.

    Genes

    Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes. Entrez Gene is the successor to LocusLink (described below).
    GeneRIF - Gene References into Function (GeneRIFs) provide a simple mechanism to allow scientists to add to the functional annotation of loci described in Entrez Gene. They appear as annotated bibliographies in Entrez Gene records, and consist of brief statements on gene function with links to the corresponding PubMed records (example: human MLH1). The GeneRIF help page describes the simple steps needed to submit information. GeneRIFs are also added to the Entrez Gene records by the MEDLINE Indexing Staff of the National Library of Medicine. GeneRIFs are currently available for a subset of organisms in Entrez Gene, and will be provided for the loci of other organisms as the development of Entrez Gene continues.
    LocusLink - LocusLink was discontinued as of March 1, 2005. It provided a foundation for what is now Entrez Gene and was described in several articles (Pruitt KD, Maglott DR (2001); Pruitt KD, Katz KS, Sicotte H, Maglott DR (2000)). It contained data for a number of species such as human, mouse, rat, zebrafish, nematode, fruit fly, cow, sea urchin, African clawed frog, HIV-1, and a few other model and commonly studied organisms. Data for these organisms (and from the ongoing collaboration among the groups listed above) are now available in the Entrez Gene database (described above), which is the successor to LocusLink. The major differences between LocusLink and Entrez Gene are scope of data and search interface. Entrez Gene contains data from all organisms with RefSeq genome records. (RefSeq is described in the Molecular Databases/Nucleotide Sequences section of this guide.) Entrez Gene also uses the Entrez search system, and therefore offers helpful functions such as Preview/Index, History, and LinkOut that are available for other Entrez databases. The Entrez Gene help document includes numerous tips for previous users of LocusLink.
    Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site.
    UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page.
    HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page.
    Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases".
    AceView (Acembly) - AceView offers an integrated view of the human, nematode and Arabidopsis genes reconstructed by co-alignment of all publicly available mRNAs and ESTs on the genome sequence. The goals are to offer a reliable up-to-date resource on the genes and their functions and to stimulate further validating experiments at the bench. AceView carefully computes co-alignment and clustering of experimental cDNA sequences; no prediction is involved. The resulting AceView genes and their alternative variants are analyzed in terms of expression, intron-exon structure, alternative features, regulation and neighbor relationships. The protein products are analyzed for completeness, their best covering clones are identified, and the proteins are searched for motifs, membership in a protein family, conservation in evolution, closest homologues in other species and signals for subcellular localization. The genes are presented in the context of biological annotations gathered from various sources. AceView can be queried by meaningful words or sentences as well as by most standard identifiers.

    Expression

    Gene Expression Omnibus (GEO) - a gene expression and hybridization array data repository, as well as a curated, online resource for gene expression data browsing, query and retrieval. GEO was the first fully public high-throughput gene expression data repository, and became operational in July 2000. Many types of gene expression data, from platforms such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter), and serial analysis of gene expression (SAGE), are accepted, accessioned, and archived as public data sets. GEO data can be accessed through several search and browsing tools on the GEO home page, Entrez (via Entrez GEO Profiles and Entrez GDS (GEO DataSets)), and the FTP site. The Tools/Gene Expression section of this file provides information about data visualization and exploration capabilities available in GEO.
    GENSAT - The Gene Expression Nervous System Atlas, or GENSAT, project aims to map the expression of genes in the central nervous system of the mouse, using both in situ hybridization and transgenic mouse techniques. The GENSAT database contains a series of images related to gene expression experiments. The images are indexed on a number of fields relevant to biological discovery. Search criteria include gene names, gene symbols, gene aliases and synonyms, mouse ages, and imaging protocols. The GENSAT project is a collaboration among the National Institute of Neurological Disorders and Stroke (NINDS), Rockefeller University, St. Jude Children's Research Hospital, and NCBI.
    Expression-Related Tools - in addition to the GEO database, described above, NCBI offers several tools:
    • SAGEmap - Serial Analysis of Gene Expression, or SAGE, is an experimental technique designed to quantitatively measure gene expression. SAGEmap is an online tool to compare computed gene expression profiles between SAGE libraries generated by the Cancer Genome Anatomy Project (CGAP, described under human genome/cancer research) and submitted by others through the Gene Expression Omnibus (GEO, described above). SAGEmap also includes a comprehensive analysis of SAGE tags in human GenBank records, in which a UniGene identifier is assigned to each human sequence that contains a SAGE tag. Data can be retrieved by tag, by sequence, by UniGene cluster ID and by library name. When retrieving data by sequence or UniGene cluster ID, follow a SAGE tag's hotlink to find out its expression level in different SAGE libraries, and how it is represented in the rest of the sequences in GenBank. Retrieving data by library name takes one to GEO, where all SAGEmap data has been stored by library. Analytical tools include xProfiler, which compares gene expression between SAGE libraries of your choice as well as uploaded data. More information about the additional analytical capabilities of the SAGEmap resource is provided in the tools/gene expression section of this file.
    • CGAP - Cancer Genome Anatomy Project - interdisciplinary program to identify the human genes expressed in different cancerous states, based on cDNA (EST) libraries, and to determine the molecular profiles of normal, precancerous, and malignant cells. Collaboration among the National Cancer Institute, the NCBI, and numerous research labs. Additional information about CGAP is provided in the tools/gene expression section of this file. Related resources are described in the human genome/cancer research section.
    • UniGene DDD - Digital Differential Display - an online tool to compare computed gene expression profiles between selected cDNA libraries. Using a statistical test, genes whose expression levels differ significantly from one tissue to the next are identified and shown to the user. Additional information about UniGene is in the molecular databases/genes section.
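    The UniGene DDD entry above says only that a statistical test is used to find genes whose expression differs between libraries. As a minimal sketch, assuming a two-sided Fisher's exact test on EST counts (the choice of test is an assumption, since this guide does not name it), the comparison for one gene between two libraries could look like this. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    a = ESTs for the gene in library 1, b = all other ESTs in library 1,
    c = ESTs for the gene in library 2, d = all other ESTs in library 2.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability that x of the gene's ESTs fall in library 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table no more likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= observed * (1 + 1e-9)
    )


# Hypothetical counts: 8 of 10 ESTs in library 1 versus 1 of 6 in library 2.
p_value = fisher_exact_two_sided(8, 2, 1, 5)
```

    A small p-value suggests that the gene's share of ESTs differs between the two libraries. Real libraries contain thousands of ESTs and many genes are tested at once, so a correction for multiple testing would also be needed.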

    Taxonomy

    NCBI Taxonomy Database Home - general information about the Taxonomy project, including taxonomic resources and a list of outside curators collaborating with NCBI taxonomists. The NCBI Taxonomy Database contains the names and lineages of >160,000 organisms, both living and extinct, that are represented in the genetic databases with at least one nucleotide or protein sequence. New organisms are added to the database as sequence data are deposited for them. The purpose of the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the sequence databases.
    Taxonomy Browser - The search bar on the Taxonomy home page allows you to browse the NCBI taxonomy database. Enter the scientific or common name of a species (e.g., Canis familiaris or dog) or a higher taxon (e.g., Canidae) to view that organism or taxon's lineage; retrieve the available nucleotide, protein, structure, and genome records; and browse up and down the taxonomic tree. (Tip: For the broadest search results, select the "token set" option in the search bar, which searches for any string, whether in the beginning, middle, or end of a word.) Entrez also provides an interface for browsing the taxonomy database, and offers features such as the Common Tree function, which allows you to build a tree for your own selection of organisms or taxa.
    Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available:
    • organism report - sorts the BLAST hits according to species, so that all of the hits to the same organism will appear together
    • lineage report - gives a simplified view of the relationships between the organisms, according to their classification in the taxonomy database. This report is "focused" on the organism which yielded the strongest BLAST hit. It answers the question, "how closely are the organisms in the BLAST hit list related to the query sequence according to the taxonomy database?"
    • taxonomy report - provides a more detailed report about the relationships among all of the organisms found in the BLAST hit list, including a summary of the taxa that are represented, the number of species and subspecies, and the number of BLAST hits at each node in the taxonomic hierarchy.
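    The grouping behind the organism report described above can be sketched in a few lines. This is an illustration under stated assumptions, not the Taxonomy BLAST implementation: hits are represented as hypothetical (accession, organism, score) tuples, and the bit score is assumed to be the measure of similarity used for ordering.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Hit = Tuple[str, str, float]  # (subject accession, organism, bit score)


def organism_report(hits: List[Hit]) -> List[Tuple[str, List[Tuple[str, float]]]]:
    """Group hits by organism and order organisms by their strongest hit."""
    by_org: Dict[str, List[Tuple[str, float]]] = defaultdict(list)
    for accession, organism, score in hits:
        by_org[organism].append((accession, score))
    # Within each organism, list the strongest hit first.
    for org_hits in by_org.values():
        org_hits.sort(key=lambda hit: hit[1], reverse=True)
    # Order organisms by the score of their best hit.
    return sorted(by_org.items(), key=lambda item: item[1][0][1], reverse=True)


# Hypothetical hit list.
example_hits = [
    ("A1", "Homo sapiens", 250.0),
    ("B7", "Mus musculus", 310.0),
    ("A2", "Homo sapiens", 120.0),
]
report = organism_report(example_hits)
```

    The lineage and taxonomy reports would additionally need each organism's position in the taxonomic tree, which is outside the scope of this sketch.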
    TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared.
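    The plotting step described for TaxPlot can be sketched as follows: for each protein in the reference genome, keep the best alignment score against each of the two comparison genomes and use the pair as a point. The hit format, the function name, and the use of 0.0 for "no hit" are assumptions made for illustration; the guide does not specify which score TaxPlot plots.

```python
from typing import Dict, Iterable, Tuple

# (reference protein, target genome, alignment score): hypothetical precomputed hits
BlastHit = Tuple[str, str, float]


def taxplot_points(
    hits: Iterable[BlastHit], genome_x: str, genome_y: str
) -> Dict[str, Tuple[float, float]]:
    """Return one (x, y) point per reference protein.

    x is the best score against genome_x and y the best score against
    genome_y; a protein with no hit in one genome gets 0.0 on that axis.
    """
    best: Dict[str, Tuple[float, float]] = {}
    for protein, genome, score in hits:
        if genome not in (genome_x, genome_y):
            continue  # hit belongs to a genome that is not being compared
        x, y = best.get(protein, (0.0, 0.0))
        if genome == genome_x:
            x = max(x, score)
        else:
            y = max(y, score)
        best[protein] = (x, y)
    return best


# Hypothetical hits for three reference proteins.
example_hits = [
    ("p1", "A", 50.0),
    ("p1", "A", 80.0),
    ("p1", "B", 30.0),
    ("p2", "B", 60.0),
    ("p3", "C", 99.0),
]
points = taxplot_points(example_hits, "A", "B")
```

    Points near the diagonal correspond to proteins that are about equally similar in both comparison genomes, while points far from it are more similar to one genome than the other. Proteins with no hit in either genome are omitted by this sketch.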

    Literature Databases Overview

    PubMed - A database of citations and abstracts for biomedical literature. These citations are from MEDLINE and additional life science journals. PubMed also includes links to many sites providing full text articles and other related resources. PubMed is accessible through the Entrez search and retrieval system (described below).
    • Journals Database - allows you to look up journals that are cited in any of the Entrez databases, including PubMed. Journals can be searched using the journal title, MEDLINE or ISO abbreviation, ISSN, or the NLM Catalog ID.
    • MeSH - The Medical Subject Headings (MeSH) database is NLM's controlled vocabulary used for indexing articles for MEDLINE/PubMed. MeSH terminology provides a consistent way to retrieve information that may use different terminology for the same concepts.
    PubMed Central - a digital archive of biomedical and life sciences journal literature, including clinical medicine and public health, managed by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It is not a journal publisher. Access to PubMed Central (PMC) is free and unrestricted.
    OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases.
    Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results.
    HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. RefSeq protein sequence records serve as anchors for collecting published information about interactions between HIV-1 and human proteins. Each HIV Interactions database record lists an HIV protein and the human proteins with which it has been found to interact. In turn, the Entrez Gene record for each human protein contains annotated HIV-1 Interactions bibliographies, which consist of brief statements on protein interactions with links to the corresponding PubMed records and sequence data. The HIV Interactions database is a collaborative project among the developers of RefSeq and Entrez Gene, and is similar in concept to GeneRIF. In contrast to GeneRIFs for single genes, however, the publications cited in the HIV Interactions Database contain statements about binding between two proteins rather than statements about the function of a single gene.

    Genomes and Maps Overview
    organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs), and organism-specific resources, such as: human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles

    Organism Collections

    Genomic Biology - An introduction to the field of genomic biology, with links to the genome resources pages for major organisms and organism groups, as well as links to additional NCBI genome resources.
    Entrez Genome - sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff. In addition, the Map Viewer, a software component of Entrez Genome, provides views of integrated chromosome maps for a variety of organisms (see additional information about the Map Viewer below).
    Information about submitting genome data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. After data from complete genomes are submitted, they are made available in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). Entrez Nucleotide also provides access to the records for complete genomes/chromosomes, but the default view of those records in the Nucleotide database is GenBank format, whereas the default view in Entrez Genome is a graphical overview. A companion database, Entrez Genome Project, is described below.
    Entrez Genome Project - a companion database to Entrez Genome (described above). The actual data from genome sequencing projects are contained in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). The Genome Project database, on the other hand, provides an umbrella view of the status of each genome project, links to project data in the other Entrez databases, and links to a variety of other NCBI and external resources associated with a given genome project. A genome project's status can be complete or in-progress, and the project can include large-scale sequencing, assembly, annotation, and mapping efforts. New genome sequencing projects can be registe