| NCBI Resource Guide |
| Each link in this Resource Guide leads to a brief description of the resource on this page, then to the resource itself. A graphical Site Map and an Alphabetical Quicklinks Table provide direct links to resources and bypass the descriptions. |
| About NCBI | Overview |
|
| About NCBI - The science behind our resources. An introduction for researchers, educators and the public. Includes a Science Primer, with plain language introductions to bioinformatics, genome mapping, molecular modeling, SNPs, ESTs, microarray technology, molecular genetics, pharmacogenomics, and phylogenetics. |
| Programs and Services - basic research, databases and software, outreach and education |
| Contact Information - postal address, phone, e-mail addresses for various services |
| Exhibit Schedule - NCBI exhibits at upcoming conferences |
| NCBI Handbook - an online book, written by NCBI staff, that discusses the many resources available at NCBI. Each chapter is devoted to one service; after a brief overview on using the resource, there is an account of how the resource works, including topics such as how data are included in a database, database design, query processing, and how the different resources relate to each other. |
| Organizational Structure - functions of the three NCBI branches: Computational Biology Branch (CBB), Information Engineering Branch (IEB), and Information Resources Branch (IRB) |
| Board of Scientific Counselors - advises the NIH Director, the Deputy Director for Intramural Research, the NLM Director, and the NCBI Director about the intramural research and development programs of the NCBI. |
| Postdoctoral Fellowships - general information, application procedure |
| Statistics for NCBI Resources - A page listing statistics that are available for selected NCBI resources, including number of records present in various databases, number of genomes available at NCBI and statistics for the individual genomes, and server usage. |
| Site Search - Search the NCBI web site and display results in various formats. The default Homepage view sorts NCBI pages based on the number of other NCBI pages that link to them. The NCBI Site Search function is part of the Entrez system (described below). Therefore, the search features described in the Entrez help document also apply to the site search function. |
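The default ordering described above (pages sorted by the number of other NCBI pages that link to them) can be illustrated with a minimal sketch. The page names and link map below are hypothetical, and the function is only an illustration of inbound-link counting, not the actual Site Search implementation.

```python
def rank_by_inbound_links(links):
    """Sort pages by how many *other* pages link to them (most linked first).

    `links` maps each page name to the list of pages it links to.
    All names here are hypothetical examples.
    """
    inbound = {page: 0 for page in links}
    for source, targets in links.items():
        # Count each linking page once, and ignore self-links.
        for target in set(targets):
            if target != source:
                inbound[target] = inbound.get(target, 0) + 1
    # Ties are broken alphabetically so the result is deterministic.
    return sorted(inbound, key=lambda page: (-inbound[page], page))


example_links = {
    "home": ["blast", "entrez"],
    "blast": ["entrez"],
    "entrez": ["home"],
}
ranking = rank_by_inbound_links(example_links)
```

Here `entrez` ranks first because two other pages link to it, while `blast` and `home` each have one inbound link.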
| News and Announcements |
|
|
|
|
|
| GenBank | Overview |
|
General Information (sample record, release notes, GenBank divisions, statistics), Submissions (general, special categories, other data types), International Collaboration, FTP GenBank |
| General Information |
|
| What is GenBank? - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described below), which also includes EMBL and DDBJ. GenBank is updated daily in NCBI search systems, and a full release is issued on the FTP site on approximately the 15th of every February, April, June, August, October, and December. Each full release contains all the data present in GenBank as of the cutoff date specified in the release notes (described below). The FTP site also provides daily cumulative and non-cumulative update files (more about the FTP site below). |
| Sample Record - detailed description of each field in a GenBank record. Includes, for example, information about accession number formats, sequence identifiers (GI number and accession.version), a listing of GenBank divisions, and more. Describes some commonly annotated biological features, such as CDS, and provides links to documents that list and define the complete set of biological features that can be annotated on sequence records. Includes a link to a sequence revision history tool that can be used to track changes that have occurred to the sequence data in a record. Also lists the Entrez search field(s) that can be used to search each part of a sequence record. |
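As a small illustration of the accession.version identifier mentioned above, the sketch below splits such an identifier into its accession and its version number. The function name is hypothetical, and the example identifier reuses the U12345 accession that this guide uses elsewhere as an example.

```python
def split_accession_version(identifier):
    """Split an 'accession.version' string into (accession, version).

    Returns version None when the identifier carries no '.version' suffix.
    """
    accession, separator, version = identifier.strip().partition(".")
    if not separator:
        return accession, None
    if not version.isdigit():
        raise ValueError(f"version is not a number: {identifier!r}")
    return accession, int(version)


accession, version = split_accession_version("U12345.1")
```

The accession stays constant for a record, while the version suffix distinguishes successive revisions of its sequence data.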
| GenBank Divisions - summary of GenBank divisions, including abbreviations, full spellings, information about what the GenBank divisions are, and what they are not. (This information is part of the GenBank sample record, described above.) |
| Access GenBank - through Entrez Nucleotides. Search by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is below. Use BLAST for sequence similarity searches against GenBank and other databases. An option to download the GenBank full release and updates via FTP is also available. |
| Growth Statistics (graph) - see also Release Notes sections 2.2.6 (per division statistics), 2.2.7 (per organism statistics), 2.2.8 (growth of GenBank). For statistics on other NCBI databases, please see the page that summarizes sources of Statistics for NCBI Resources. |
| GenBank Release Notes - A document that accompanies each full release (described in "What is GenBank?", above) of the GenBank database. The release notes describe the format and content of the flat files that comprise the release. They also include notices of recent and upcoming changes, information about GenBank divisions, growth statistics, citing GenBank, and more. |
| Genetic Codes - synopsis of 17 genetic codes; used to ensure correct translation of coding sequences in GenBank records. |
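To show why the choice of genetic code matters for translating a coding sequence, here is a minimal sketch that translates the same DNA with two code tables. The two amino acid strings are the standard code and the vertebrate mitochondrial code written in TCAG codon order; the function and variable names are illustrative assumptions, and only two of the genetic codes are included.

```python
from itertools import product

# Codons in TCAG order: TTT, TTC, TTA, TTG, TCT, ...
CODONS = ["".join(bases) for bases in product("TCAG", repeat=3)]

# One amino acid letter per codon, in the same order ('*' marks a stop codon).
STANDARD = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
VERTEBRATE_MITO = "FFLLSSSSYY**CCWWLLLLPPPPHHQQRRRRIIMMTTTTNNKKSS**VVVVAAAADDEEGGGG"


def translate(dna, table=STANDARD):
    """Translate a DNA coding sequence codon by codon using `table`.

    Codons containing characters other than T, C, A, G become 'X';
    a trailing partial codon is ignored.
    """
    code = dict(zip(CODONS, table))
    dna = dna.upper()
    return "".join(
        code.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


standard_protein = translate("ATGTGGTGA", STANDARD)
mito_protein = translate("ATGTGGTGA", VERTEBRATE_MITO)
```

The codon TGA is a stop in the standard code but encodes tryptophan (W) in the vertebrate mitochondrial code, so the two translations of the same sequence differ.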
| GenBank Bionet Newsgroup - A moderated list that includes announcements of new GenBank releases, recent and upcoming changes, and discussion among subscribers. For information on how to subscribe by e-mail, see the NCBI Announcements Email Lists page. |
| GenBank Submissions |
|
| General Information |
|
|
In addition to GenBank, there are other databases at NCBI to which a variety of data types can be submitted (third party annotations (TPA), variation, expression, MHC data, SKY/M-FISH/CGH data, traces). |
| Submission Software Programs |
|
|
|
| Special Types of Submissions to GenBank |
|
|
Genomes, Alignments, ESTs, GSSs, HTGs, STSs, WGS |
|
|
|
|
|
|
| Other Types of Data Submissions (Other NCBI databases, separate from GenBank, to which data can be submitted) |
|
|
|
|
|
|
|
| International Nucleotide Sequence Database Collaboration |
|
| GenBank, DDBJ, EMBL - Overview of collaborative projects and links to home pages. The GenBank, DDBJ (DNA Data Bank of Japan), and EMBL (European Molecular Biology Laboratory) databases share data on a daily basis and are therefore equivalent. The record formats and search systems might differ among the databases, but the accession numbers, sequence data, and annotations are the same in all of them. E.g., you can retrieve the record with accession number U12345 from GenBank, DDBJ, or EMBL and it will contain the same sequence data, references, etc. in all three databases. |
| DDBJ/EMBL/GenBank Feature Table - feature table formats and standards used in the annotation of sequence records by the collaborating databases; makes possible sharing of data; includes detailed appendices such as:
|
| FTP GenBank and Daily Updates |
|
| GenBank flat file format - see sample GenBank record and detailed description in GenBank release notes; download most recent full release (described above) and daily cumulative or non-cumulative update files. |
| ASN.1 format - Abstract Syntax Notation 1, an International Standards Organization (ISO) data representation format; download most recent full release (described above) and daily cumulative or non-cumulative update files. (more on ASN.1) |
| FASTA format - definition line followed by sequence data only (example); see readme file for database descriptions, including nt.Z (daily updated non-redundant BLAST nucleotide database, contains GenBank+EMBL+DDBJ+PDB sequences, but no EST, STS, GSS, or HTGS sequences), nr.Z (daily updated non-redundant proteins), est.Z, gss.Z, htg.Z, sts.Z, and others. |
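The FASTA layout described above (a definition line followed by sequence data only) can be read with a few lines of code. The following is a minimal sketch, assuming the definition line starts with ">" and the sequence may be wrapped across several lines; the function name and sample records are hypothetical.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (definition_line, sequence) tuples.

    The leading '>' is removed from each definition line, and wrapped
    sequence lines are joined into one string.
    """
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


sample = ">seq1 hypothetical example\nACGT\nTTGA\n>seq2\nGG\n"
records = parse_fasta(sample)
```

For the large database files listed above, a streaming reader that yields one record at a time would be preferable to loading the whole text into memory.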
| Molecular Databases | Overview |
|
Nucleotide Sequences, Protein Sequences, Structures, Genes, Expression, Taxonomy |
| Nucleotide Sequence Databases |
|
| Entrez Nucleotides - combines data from a number of source databases, including GenBank, RefSeq, TPA, and PDB. Data can be searched by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is given below. For retrieval of large data sets, Batch Entrez (described below) is available. |
| GenBank - a database of nucleotide sequences from >160,000 organisms. Records that are annotated with coding region (CDS) features also include amino acid translations. GenBank belongs to an international collaboration of sequence databases (described above), which also includes EMBL and DDBJ. A sample record, which provides a detailed description of each field in a GenBank record, is also available. A variety of sequence records exist in GenBank, such as characterized genes that have been well-studied and annotated, batch produced sequences (ESTs, GSSs, STSs), high throughput genomic sequences, complete genomes, and more. Additional information about GenBank is given in the GenBank Overview section of this guide. |
| RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Nucleotide sequence records have accessions: NT_123456, NM_123456, NC_123456, NG_123456, XM_123456, XR_123456 (more info about accession numbers and access). Additional details about RefSeq are provided in the NCBI Handbook, which is available online in the Entrez Books database. |
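The accession format stated above (two letters, an underscore, and six digits) lends itself to a simple pattern check. The sketch below is an illustration only: the function name is hypothetical, the nucleotide prefixes are the ones listed above, and the protein prefixes (NP, XP) are the ones given in the Protein Sequence Databases section of this guide.

```python
import re

# Two uppercase letters, an underscore, and six digits, as described above.
REFSEQ_PATTERN = re.compile(r"([A-Z]{2})_(\d{6})")

NUCLEOTIDE_PREFIXES = {"NT", "NM", "NC", "NG", "XM", "XR"}
PROTEIN_PREFIXES = {"NP", "XP"}


def classify_refseq_accession(accession):
    """Return 'nucleotide', 'protein', or 'unknown' for a RefSeq-style
    accession, or None if the string does not match the format at all."""
    match = REFSEQ_PATTERN.fullmatch(accession.strip())
    if match is None:
        return None
    prefix = match.group(1)
    if prefix in NUCLEOTIDE_PREFIXES:
        return "nucleotide"
    if prefix in PROTEIN_PREFIXES:
        return "protein"
    return "unknown"


kind = classify_refseq_accession("NM_123456")
```

A GenBank-style accession such as U12345 has no underscore and therefore returns `None`, which gives a quick way to tell RefSeq records apart from other records in mixed search results.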
Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site. |
Third Party Annotation (TPA) database - a database of experimentally supported annotations on assemblies of sequences already present in DDBJ/EMBL/GenBank. Whereas DDBJ/EMBL/GenBank contains primary sequence data and corresponding annotations submitted by the laboratories that did the sequencing, the TPA database contains third-party assemblies of primary data with experimentally supported annotation that has been published in a peer-reviewed scientific journal. Details about how to submit data, as well as examples of what can and cannot be submitted to TPA, are provided on the TPA home page. Note: Although TPA records are derived from DDBJ/EMBL/GenBank, TPA is actually a separate database. Therefore, TPA records are not present in the GenBank FTP files, but will be available in separate FTP files. |
| dbEST - database of expressed sequence tags; short, single pass read cDNA (mRNA) sequences. Also includes cDNA sequences from differential display experiments and RACE experiments. Note: EST sequences are available from two sources: dbEST and the EST division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
| dbGSS - database of genome survey sequences; short, single pass read genomic sequences, exon trapped sequences, cosmid/BAC/YAC ends, others. Note: GSS sequences are available from two sources: dbGSS and the GSS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
| dbMHC - Provides a platform where the human leukocyte antigen (HLA) community can submit, edit, view, and exchange Major Histocompatibility Complex (MHC) data. The MHC database is fully integrated with other NCBI resources, as well as with the International Histocompatibility Working Group (IHWG) Web site, and provides links to the IMmunoGeneTics HLA (IMGT/HLA) database. Additional details are available in the NCBI Handbook. |
| dbSNP - database of single nucleotide polymorphisms, small-scale insertions/deletions, polymorphic repetitive elements, and microsatellite variation. dbSNP includes polymorphism data that are experimentally derived or computationally derived, as well as hybrid data determined by the alignment of an experimentally derived molecule to genomic sequence data. Currently, dbSNP comprises four general classes of submissions: (a) The SNP Consortium (TSC) - candidate SNPs identified by sequencing, using either the reduced representation shotgun strategy or alignment of random reads to genomic sequence; (b) Overlaps - candidate SNPs identified in sequence overlaps between individual BACs or PACs; (c) ESTs - SNPs identified in EST clusters, including those identified by the Cancer Genome Anatomy Project (described below); (d) Other - SNPs identified after screening larger numbers of chromosomes; these include many with alleles of lower frequency (1%-20%). (data submission instructions) To receive announcements about updates and new features to dbSNP, see the NCBI Announcements Email Lists page. Note: Although dbSNP is a separate database from GenBank, SNP records include cross-references to GenBank records. |
| dbSTS - database of sequence tagged sites; short sequences that are operationally unique in the genome, used to generate mapping reagents. Note: STS sequences are available from two sources: dbSTS and the STS division of GenBank. The sequences and accession numbers in both sources are the same but the record formats differ. (data submission instructions...) |
| UniSTS - a unified, non-redundant view of sequence tagged sites (STSs). UniSTS integrates marker and mapping data from a variety of public resources. If two or more markers have different names but the same primer pair, a single STS record is presented for the primer pair and all the marker names are shown. Each UniSTS record displays the primer sequences, product size, mapping information, and cross references to Entrez Gene, dbSNP, RHdb, GDB, MGD, and the Map Viewer. The marker report also lists GenBank and RefSeq records that contain the primer sequences, as determined by Electronic PCR (e-PCR). Data sources include dbSTS, RHdb, GDB, various human maps (Genethon genetic map, Marshfield genetic map, Whitehead RH map, Whitehead YAC map, Stanford RH map, NHGRI chr 7 physical map, WashU chrX physical map), various mouse maps (Whitehead RH map, Whitehead YAC map, Jackson laboratory's MGD map). |
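The rule described above (markers with different names but the same primer pair are presented as a single STS record listing all the names) amounts to grouping markers by primer pair. The sketch below illustrates that idea only; the marker names and primer sequences are invented, and the function is not the UniSTS implementation.

```python
from collections import defaultdict


def merge_markers_by_primers(markers):
    """Group marker names by (forward, reverse) primer pair.

    `markers` is an iterable of (name, forward_primer, reverse_primer).
    Primer sequences are compared case-insensitively.
    """
    grouped = defaultdict(list)
    for name, forward, reverse in markers:
        key = (forward.upper(), reverse.upper())
        if name not in grouped[key]:
            grouped[key].append(name)
    return dict(grouped)


# Hypothetical markers: the first two share a primer pair.
example_markers = [
    ("MARKER_A", "ACGTACGT", "TTGGCCAA"),
    ("MARKER_B", "acgtacgt", "ttggccaa"),
    ("MARKER_C", "GGGTTTAA", "CCCAAATT"),
]
sts_records = merge_markers_by_primers(example_markers)
```

Each key of `sts_records` stands for one non-redundant STS record, and its value lists every marker name reported for that primer pair.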
| UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page. |
| HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page. |
| Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community. |
| Trace Archive - a repository of the raw sequence traces generated by large sequencing projects. It allows retrieval of both the sequence file and the underlying data that generated the file. In the case of projects that rely on a Whole Genome Shotgun (WGS) strategy, the Trace Archive will be the sole source of raw sequence data. (More information about WGS projects is provided in the Resource Guide section on special types of submissions to GenBank/WGS.) NCBI will be exchanging data regularly with the Ensembl Trace Server. The Trace Archive can be searched by using MegaBLAST (described below), or by entering a term in the search box at the top of the Trace Archive page. (data submission instructions...) |
Assembly Archive - links the raw sequence information found in the Trace Archive with assembly information found in publicly available sequence repositories (GenBank/EMBL/DDBJ). The Assembly Viewer allows a user to see the multiple sequence alignments as well as the actual sequence chromatogram. |
| UniVec - a database that can be used to quickly identify segments within nucleic acid sequences which may be of vector origin. Screening using UniVec is efficient because a large number of redundant sub-sequences have been eliminated to create a database that contains only one copy of every unique sequence segment from a large number of vectors. The VecScreen tool, described below (under sequence analysis tools), can be used to compare a query sequence against the UniVec database in order to identify possible vector contamination. |
| Genomes - Resources in the Genomes and Maps section contain the nucleotide sequences for a variety of genomes. Examples of the genomes available include: >1000 organisms in Entrez Genome, human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, eukaryotic organelles. |
| Nucleotide Sequence Analysis - various tools are available for analyzing nucleotide sequences and are described below. |
| Protein Sequence Databases |
|
| Entrez Proteins - search protein sequence records (from GenPept + RefSeq + Swiss-Prot + PIR + RPF + PDB) by accession number, author name, organism, gene/protein name, and a variety of other text terms. Additional information about Entrez is given below. For retrieval of large data sets, Batch Entrez (described below) is available. Entrez Proteins also includes BLink ("BLAST Link"), a feature that displays the results of BLAST searches that have been done for every protein sequence in the Entrez Proteins data domain. To access it, follow the BLink link displayed beside any hit in the results of an Entrez Proteins search. More information about BLink is provided below. |
| RefSeq - NCBI database of Reference Sequences. Curated, non-redundant set including genomic DNA contigs, mRNAs and proteins for known genes, mRNAs and proteins for gene models, and entire chromosomes. Accession numbers have the format of two letters, an underscore bar, and six digits. Protein sequence records have accessions: NP_123456 or XP_123456 (more info about accession numbers and access). |
| FTP GenPept - download the "relxxx.fsa_aa.gz" file. The filename stands for "Release number XXX FASTA formatted amino acid translations". The translations are extracted from GenBank/EMBL/DDBJ records that are annotated with one or more CDS features. |
| Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described below). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures. |
| HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases". |
| PROW - Protein Resources on the Web - short authoritative guides on the approximately 200 human CD cell-surface molecules. Peer-reviewed; provides approximately 20 standardized categories of information (biochemical function, ligands, etc.) for each CD antigen. |
| Protein Sequence Analysis - various tools are available for analyzing protein sequences and are described below. |
| Proteomes |
|
|
|
| Structure Databases |
|
| Structure Home - general information about the NCBI Structure Group and its research projects, as well as access to the Molecular Modeling Database (MMDB) and related tools to search and display structures. |
| MMDB: Molecular Modeling Database - a database of three-dimensional biomolecular structures derived from X-ray crystallography and NMR spectroscopy. MMDB is a subset of three-dimensional structures obtained from the Brookhaven Protein DataBank (PDB), excluding theoretical models. MMDB reorganizes and validates the information in a way that enables cross-referencing between the chemistry and the three-dimensional structure of macromolecules. Its data specification includes a description of a biopolymer's spatial structure, a description of how it is organized chemically, and a set of pointers linking the two. By integrating chemical, sequence, and structure information, MMDB is designed to serve as a resource for structure-based homology modeling and protein structure prediction. MMDB records are stored in ASN.1 format and can be displayed with the Cn3D, Rasmol, or Kinemage viewers. In addition, similar structures within the database have been identified using VAST, and new structures can be compared against the database using VASTsearch. |
| 3D Domains Database - compact structural domains identified automatically in MMDB, Entrez's macromolecular three-dimensional structure database. These domains are identified by searching for breakpoints in the structure between major secondary structure elements so that the ratio of intra- to inter-domain contacts falls above a set threshold. 3D Domains are the units of comparison for structure neighbor ("related structures") calculations using the VAST algorithm. |
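The contact-ratio criterion mentioned above can be illustrated with a toy calculation. This is a rough sketch under strong assumptions, not the method used to build the 3D Domains Database: residues are numbered along a single chain, contacts are given as residue-number pairs, a candidate breakpoint splits the chain into two parts, and secondary structure elements are ignored. All names and numbers are hypothetical.

```python
def contact_ratio(contacts, breakpoint):
    """Ratio of intra-domain to inter-domain contacts for one breakpoint.

    Residues numbered below `breakpoint` form one candidate domain and the
    remaining residues form the other. Returns infinity when no contact
    crosses the breakpoint.
    """
    intra = inter = 0
    for i, j in contacts:
        if (i < breakpoint) == (j < breakpoint):
            intra += 1  # both residues on the same side
        else:
            inter += 1  # contact crosses the breakpoint
    return float("inf") if inter == 0 else intra / inter


def accepts_breakpoint(contacts, breakpoint, threshold):
    """True if the ratio for this breakpoint falls above `threshold`."""
    return contact_ratio(contacts, breakpoint) > threshold


# Hypothetical contact list: two compact clusters joined by one contact.
example_contacts = [(1, 2), (2, 3), (6, 7), (7, 8), (3, 6)]
ratio = contact_ratio(example_contacts, breakpoint=5)
```

With these invented contacts, splitting at residue 5 leaves four contacts inside the two halves and one contact between them, so the split is accepted for any threshold below 4.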
| Conserved Domain Database (CDD) - a collection of sequence alignments and profiles representing protein domains conserved in molecular evolution. It includes domains from Smart and Pfam, as well as domains contributed by NCBI researchers. It also includes alignments of the domains to known 3-dimensional protein structures in the MMDB database (described above). CDD can be used to identify conserved domains in a protein query sequence, using the CD-Search service (described below). In addition, the CDART tool (described below) uses CDD and RPS-BLAST (described below) to retrieve proteins with similar domain architectures. |
| PubChem - contains the chemical structures of small organic molecules and information on their biological activities. It is intended to support the Molecular Libraries and Imaging component of the NIH Roadmap Initiative. PubChem's chemical structure database may be searched on the basis of descriptive terms, chemical properties, and structural similarity. When possible, PubChem's chemical structure records are linked to other NCBI databases, including the PubMed scientific literature database and NCBI's protein 3D structure database. PubChem also contains the results of high-throughput biological screening experiments. PubChem is organized as three linked databases within the Entrez/PubMed information retrieval system. |
|
|
|
| Structure-Related Tools - in addition to the structure databases described above, NCBI offers several tools: |
|
|
|
|
| Genes |
|
| Entrez Gene - Entrez Gene provides a gene-based view of the data from a wide range of genomes. It supplies key connections in the nexus of map, sequence, expression, structure, functional, and homology data. Each record represents a single gene from a given organism. The minimum set of data in a gene record includes a unique identifier or GeneID assigned by NCBI, a preferred symbol, and any of sequence information, map information, or official nomenclature from an authority list. In addition, a gene record can also include expression, structure, functional, and homology data, when available. Entrez Gene includes data from all organisms that have RefSeq genome records (with NC_* accessions, see more info above), and can also include data from recognized genome-specific databases that provide NCBI with information about genes (preferably with defining sequence) or mapped phenotypes. Entrez Gene is the successor to LocusLink (described below). |
GeneRIF - Gene References into Function (GeneRIFs) provide a simple mechanism to allow scientists to add to the functional annotation of loci described in Entrez Gene. They appear as annotated bibliographies in Entrez Gene records, and consist of brief statements on gene function with links to the corresponding PubMed records (example: human MLH1). The GeneRIF help page describes the simple steps needed to submit information. GeneRIFs are also added to the Entrez Gene records by the MEDLINE Indexing Staff of the National Library of Medicine. GeneRIFs are currently available for a subset of organisms in Entrez Gene, and will be provided for the loci of other organisms as the development of Entrez Gene continues. |
LocusLink - LocusLink was discontinued as of March 1, 2005. It provided a foundation for what is now Entrez Gene and was described in several articles (Pruitt KD, Maglott DR (2001), Pruitt KD, Katz KS, Sicotte H, Maglott DR (2000)). It contained data for a number of species such as human, mouse, rat, zebrafish, nematode, fruit fly, cow, sea urchin, African clawed frog, HIV-1, and a few other model and commonly studied organisms. Data for these organisms (and from the ongoing collaboration among the groups listed above) are now available in the Entrez Gene database (described above), which is the successor to LocusLink. The major differences between LocusLink and Entrez Gene are scope of data and search interface. Entrez Gene contains data from all organisms with RefSeq genome records. (RefSeq is described in the Molecular Databases/Nucleotide Sequences section of this guide.) Entrez Gene also uses the Entrez search system, and therefore offers helpful functions such as Preview/Index, History, and LinkOut that are available for other Entrez databases. The Entrez Gene help document includes numerous tips for previous users of LocusLink. |
| Consensus CoDing Sequence (CCDS) Database - The CCDS project is a collaborative effort to identify a core set of human protein coding regions that are consistently annotated and of high quality. The long term goal is to support convergence towards a standard set of gene annotations on the human genome. The collaborators include the National Center for Biotechnology Information (NCBI, Map Viewer), European Bioinformatics Institute (EBI, Ensembl), University of California, Santa Cruz (UCSC, Genome Browser), and Wellcome Trust Sanger Institute (WTSI, Vega). They identify the position of protein-coding regions of genes that are (1) annotated consistently on the human genome by all of the participating centers and (2) supported by transcript evidence, use of canonical splice sites, and other quality assurance measures. Additional information about the curation, process flow, and quality testing is available on the CCDS web site. |
| UniGene - ESTs and full-length mRNA sequences organized into clusters that each represent a unique known or putative gene within the organism from which the sequences were obtained. UniGene clusters are annotated with mapping and expression information when possible (e.g., for human), and include cross-references to other resources. Sequence data can be downloaded by cluster through the UniGene web pages, or the complete data set can be downloaded from the repository/UniGene directory of the FTP site. In addition, UniGene DDD (described below) can be used to show differential expression of genes between cDNA libraries. The organisms represented in UniGene are listed on the UniGene home page. |
| HomoloGene - a gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs. Curated orthologs are incorporated from a variety of sources via Entrez Gene. Organisms represented are listed on the HomoloGene home page. |
| Mammalian Gene Collection (MGC) - The NIH Mammalian Gene Collection (MGC) is a trans-NIH initiative that seeks to identify and sequence a representative full open reading frame (FL-ORF) clone for each human, mouse, and rat gene. The MGC project entails the production of cDNA libraries and sequences, database and repository development, as well as the support of research for improved library construction, sequencing, and analytic technologies. All the resources generated by the MGC are publicly accessible to the biomedical research community. |
| HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. It provides annotated bibliographies of published reports of protein interactions, with links to the corresponding PubMed records and sequence data. More information about this database is provided under "Literature Databases". |
| AceView (Acembly) - AceView offers an integrated view of the human, nematode and Arabidopsis genes reconstructed by co-alignment of all publicly available mRNAs and ESTs on the genome sequence. The goals are to offer a reliable up-to-date resource on the genes and their functions and to stimulate further validating experiments at the bench. AceView carefully computes co-alignment and clustering of experimental cDNA sequences; no prediction is involved. The resulting AceView genes and their alternative variants are analyzed in terms of expression, intron-exon structure, alternative features, regulation and neighbor relationships. The protein products are analyzed for completeness, their best covering clones are identified, and the proteins are searched for motifs, membership in a protein family, conservation in evolution, closest homologues in other species and signals for subcellular localization. The genes are presented in the context of biological annotations gathered from various sources. AceView can be queried by meaningful words or sentences as well as by most standard identifiers. |
| Expression |
|
| Gene Expression Omnibus (GEO) - a gene expression and hybridization array data repository, as well as a curated, online resource for gene expression data browsing, query and retrieval. GEO was the first fully public high-throughput gene expression data repository, and became operational in July 2000. Many types of gene expression data, from platforms such as spotted microarray (microarray), high-density oligonucleotide array (HDA), hybridization filter (filter) and serial analysis of gene expression (SAGE), are accepted, accessioned, and archived as a public data set. GEO data can be accessed through several search and browsing tools on the GEO home page, Entrez (via Entrez GEO Profiles and Entrez GDS (GEO DataSets)), and the FTP site. The Tools/Gene Expression section of this guide provides information about data visualization and exploration capabilities available in GEO. |
| GENSAT - The Gene Expression Nervous System Atlas, or GENSAT, project aims to map the expression of genes in the central nervous system of the mouse, using both in situ hybridization and transgenic mouse techniques. The GENSAT database contains a series of images related to gene expression experiments. The images are indexed on a number of fields relevant to biological discovery. Search criteria include gene names, gene symbols, gene aliases and synonyms, mouse ages, and imaging protocols. The GENSAT project is a collaboration among the National Institute of Neurological Disorders and Stroke (NINDS), Rockefeller University, St. Jude Children's Research Hospital, and NCBI. |
| Expression-Related Tools - in addition to the GEO database, described above, NCBI offers several tools: |
|
|
|
| Taxonomy |
|
| NCBI Taxonomy Database Home - general information about the Taxonomy project, including taxonomic resources and a list of outside curators collaborating with NCBI taxonomists. The NCBI Taxonomy Database contains the names and lineages of >160,000 organisms, both living and extinct, that are represented in the genetic databases with at least one nucleotide or protein sequence. New organisms are added to the database as sequence data are deposited for them. The purpose of the taxonomy project at NCBI is to build a consistent phylogenetic taxonomy for the sequence databases. |
| Taxonomy Browser - The search bar on the Taxonomy home page allows you to browse the NCBI taxonomy database. Enter the scientific or common name of a species (e.g., Canis familiaris or dog) or a higher taxon (e.g., Canidae) to view that organism or taxon's lineage; retrieve the available nucleotide, protein, structure, and genome records; and browse up and down the taxonomic tree. (Tip: For the broadest search results, select the "token set" option in the search bar, which searches for any string, whether in the beginning, middle, or end of a word.) Entrez also provides an interface for browsing the taxonomy database, and offers features such as the Common Tree function, which allows you to build a tree for your own selection of organisms or taxa. |
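The difference between a word-start match and the "token set" behavior described in the tip above can be illustrated with a minimal Python sketch. The name list and both function names are hypothetical and exist only for illustration; this is not how the Taxonomy Browser itself is implemented.

```python
# Hypothetical sample of taxon names; the real database is far larger.
NAMES = ["Canis familiaris", "Canis lupus", "Canidae", "Felis catus", "dog"]


def prefix_search(query, names):
    """Return names in which at least one word starts with the query."""
    q = query.lower()
    return [n for n in names if any(w.startswith(q) for w in n.lower().split())]


def token_set_search(query, names):
    """Return names in which the query occurs anywhere inside a word
    (beginning, middle, or end), mimicking the tip's description."""
    q = query.lower()
    return [n for n in names if any(q in w for w in n.lower().split())]


if __name__ == "__main__":
    print(prefix_search("ani", NAMES))
    print(token_set_search("ani", NAMES))
```

Under these assumptions, a fragment such as "ani" finds nothing as a word prefix but matches every name containing that string, which is why the substring-style option gives the broadest results.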
| Taxonomy BLAST - an implementation of Gapped BLAST (2.x) that groups hits by source organism, according to information in NCBI's Taxonomy database. Species are listed in order of sequence similarity to the query sequence, with the strongest match listed first. Three report views are available: |
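The grouping and ordering described for Taxonomy BLAST can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `(subject_id, organism, score)` tuple layout and the function name `group_hits_by_organism` are hypothetical, and the real tool works from BLAST output and the Taxonomy database rather than from a hand-built list.

```python
from collections import defaultdict


def group_hits_by_organism(hits):
    """Group hits by source organism and order organisms by best score.

    `hits` is an iterable of (subject_id, organism, score) tuples;
    this layout is an assumption made for the sketch.
    Returns a list of (organism, [(subject_id, score), ...]) pairs,
    strongest organism first, with each organism's hits sorted by score.
    """
    groups = defaultdict(list)
    for subject_id, organism, score in hits:
        groups[organism].append((subject_id, score))

    # Within each organism, list the strongest hit first.
    for members in groups.values():
        members.sort(key=lambda m: m[1], reverse=True)

    # Order organisms by the score of their strongest hit.
    return sorted(groups.items(), key=lambda kv: kv[1][0][1], reverse=True)


if __name__ == "__main__":
    example = [
        ("a", "Mus musculus", 200),
        ("b", "Homo sapiens", 350),
        ("c", "Mus musculus", 120),
        ("d", "Homo sapiens", 90),
    ]
    for organism, members in group_hits_by_organism(example):
        print(organism, members)
```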
| TaxPlot - a tool for 3-way comparisons of genomes on the basis of the protein sequences they encode. To use TaxPlot, one selects a reference genome to which two other genomes are compared. Pre-computed BLAST results are then used to plot a point for each predicted protein in the reference genome, based on the best alignment with proteins in each of the two genomes being compared. |
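The TaxPlot comparison described above amounts to computing, for each reference protein, its best alignment score against each of the two other genomes and using the pair as plot coordinates. The Python sketch below shows that idea under stated assumptions: the `(ref_protein, target_genome, score)` hit layout, the function names, and the choice of 0.0 for proteins with no hit are all hypothetical, not taken from TaxPlot itself.

```python
def best_scores(hits, genome):
    """Best alignment score per reference protein against one genome.

    `hits` is an iterable of (ref_protein, target_genome, score) tuples,
    an assumed stand-in for pre-computed BLAST results.
    """
    best = {}
    for ref_protein, target_genome, score in hits:
        if target_genome == genome and score > best.get(ref_protein, 0.0):
            best[ref_protein] = score
    return best


def taxplot_points(reference_proteins, hits, genome_a, genome_b):
    """Return one (protein, x, y) point per reference protein.

    x is the best score against genome_a, y the best against genome_b;
    proteins with no hit in a genome get 0.0 (an assumption of this sketch).
    """
    best_a = best_scores(hits, genome_a)
    best_b = best_scores(hits, genome_b)
    return [
        (protein, best_a.get(protein, 0.0), best_b.get(protein, 0.0))
        for protein in reference_proteins
    ]


if __name__ == "__main__":
    hits = [
        ("p1", "A", 50.0),
        ("p1", "A", 80.0),
        ("p1", "B", 30.0),
        ("p2", "B", 60.0),
    ]
    print(taxplot_points(["p1", "p2"], hits, "A", "B"))
```

Points near the diagonal correspond to proteins that align about equally well to both comparison genomes, while points far from it indicate a closer match in one genome than in the other.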
| Literature Databases | Overview |
|
| PubMed - A database of citations and abstracts for biomedical literature. These citations are from MEDLINE and additional life science journals. PubMed also includes links to many sites providing full text articles and other related resources. PubMed is accessible through the Entrez search and retrieval system (described below). |
|
|
|
| PubMed Central - a digital archive of biomedical and life sciences journal literature, including clinical medicine and public health, managed by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM). It is not a journal publisher. Access to PubMed Central (PMC) is free and unrestricted. |
| OMIM - Online Mendelian Inheritance in Man - continuously updated catalog of human genes and genetic disorders, with links to associated literature references, sequence records, maps, and related databases. |
| Entrez Books - In collaboration with book publishers, the NCBI is adapting textbooks for the web and linking them to PubMed, the biomedical bibliographic database. The idea is to provide background information to PubMed, so that users can explore unfamiliar concepts found in PubMed search results. |
| HIV Interactions - The HIV-1, Human Protein Interaction Database contains information about known interactions of HIV-1 proteins with proteins from human hosts. RefSeq protein sequence records serve as anchors for collecting published information about interactions between HIV-1 and human proteins. Each HIV Interactions database record lists an HIV protein and the human proteins with which it has been found to interact. In turn, the Entrez Gene record for each human protein contains annotated HIV-1 Interactions bibliographies, which consist of brief statements on protein interactions with links to the corresponding PubMed records and sequence data. The HIV Interactions database is a collaborative project among the developers of RefSeq and Entrez Gene, and is similar in concept to GeneRIF. In contrast to GeneRIFs for single genes, however, the publications cited in the HIV Interactions Database contain statements about binding between two proteins rather than statements about the function of a single gene. |
| Genomes and Maps | Overview |
|
| Organism collections (including Entrez Genome, Entrez Genome Project, Map Viewer, Entrez Gene, UniGene, HomoloGene, and COGs) and organism-specific resources, such as human, mouse, rat, zebrafish, Drosophila, nematode, plant genomes, yeast, malaria, microbial genomes, viruses, viroids, plasmids, and eukaryotic organelles |
| Organism Collections |
|
| Genomic Biology - An introduction to the field of genomic biology, with links to the genome resources pages for major organisms and organism groups, as well as links to additional NCBI genome resources. |
| Entrez Genome - sequence and map data from the whole genomes of over 1000 organisms. The genomes represent both completely sequenced organisms and those for which sequencing is in progress. All three main domains of life - bacteria, archaea, and eukaryota - are represented, as well as many viruses, phages, viroids, plasmids, and organelles. Entrez Genome provides graphical overviews of complete genomes/chromosomes, and the ability to explore regions of interest in progressively greater detail. ProtTables and TaxTables are provided for organisms on which analyses have been done by NCBI staff. In addition, the Map Viewer, a software component of Entrez Genome, provides views of integrated chromosome maps for a variety of organisms (see additional information about the Map Viewer below). Information about submitting genome data from complete genomes is provided in the Resource Guide section on Submission of complete genomes. After data from complete genomes are submitted, they are made available in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). Entrez Nucleotide also provides access to the records for complete genomes/chromosomes, but the default view of those records in the Nucleotide database is GenBank format, whereas the default view in Entrez Genome is a graphical overview. A companion database, Entrez Genome Project, is described below. |
| Entrez Genome Project - a companion database to Entrez Genome (described above). The actual data from genome sequencing projects are contained in Entrez Genome (as complete genomes or chromosomes) and Entrez Nucleotide (as chromosome or genome fragments such as contigs). The Genome Project database, on the other hand, provides an umbrella view of the status of each genome project, links to project data in the other Entrez databases, and links to a variety of other NCBI and external resources associated with a given genome project. A genome project's status can be complete or in-progress, and the project can include large-scale sequencing, assembly, annotation, and mapping efforts. New genome sequencing projects can be registe |