Chuming Chen, Peter B McGarvey, Hongzhan Huang, Cathy H Wu.
Abstract
High-throughput "omics" technologies bring new opportunities for biological and biomedical researchers to ask complex questions and gain new scientific insights. However, the voluminous, complex, and context-dependent data maintained in heterogeneous and distributed environments, together with the lack of well-defined data standards and standardized nomenclature, pose a major challenge that requires advanced computational methods and bioinformatics infrastructure for integration, mining, visualization, and comparative analysis to facilitate data-driven hypothesis generation and biological knowledge discovery. In this paper, we present the challenges in high-throughput "omics" data integration and analysis, introduce a protein-centric approach for systems integration of large and heterogeneous high-throughput "omics" data including microarray, mass spectrometry, protein sequence, protein structure, and protein interaction data, and use a scientific case study to illustrate how one can use varied "omics" data from different laboratories to make useful connections that could lead to new biological knowledge.
Year: 2010; PMID: 20369061; PMCID: PMC2847380; DOI: 10.1155/2010/423589
Source DB: PubMed; Journal: Adv Bioinformatics; ISSN: 1687-8027
Commonly used molecular biology databases for functional analysis of gene and protein expression data.
| Database name | Database content | Data access and analysis support | URL |
|---|---|---|---|
| UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, UniProt Archive (UniParc) | UniProt protein sequences and functional information, comprehensive and non-redundant database that contains most of the publicly available protein sequences in the world | Text search; BLAST sequence similarity search; Sequence alignment; Batch retrieval; Database ID mapping; FTP download | |
| NCBI Reference Sequence (RefSeq) | Non-redundant collection of richly annotated DNA, RNA, and protein sequences | Entrez query access; Searching Nucleotide or Protein; Searching Genome; BLAST; FTP download; Sequence homology searches and retrieval | |
| GenBank, EMBL, DDBJ | Genetic sequence database, an annotated collection of all publicly available DNA sequences | Database query; Phylogenetics; Genome Analyses; FTP download | |
| UniGene | Non-redundant set of eukaryotic gene-oriented clusters of transcript sequences, together with information on protein similarities, gene expression, cDNA clone reagents, and genomic location | Entrez query; Library browse; Digital Differential Display; FTP download | |
| FlyBase | | Aberration Maps; Batch download; BLAST; Chromosome Maps; Coordinate Converter; CytoSearch; GBrowse; ID Converter; ImageBrowse; Interactions Browser; QueryBuilder; TermLink; FTP download | |
| Mouse Genome Database (MGD) | Gene characterization, nomenclature, mapping, gene homologies among mammals, sequence links, phenotypes, allelic variants and mutants, and strain data | Genes & Markers Query; Sequence Query; MouseBLAST; Graphical Map Tools; Mouse Genome Browser; Batch Query; MGI Web Service | |
| Saccharomyces Genome Database (SGD) | Genetic and molecular biological information about Saccharomyces cerevisiae | Search Gene function information and Protein information; Specialized Gene and Sequence Searches; Search Yeast Literature; BLAST; Batch download; Pattern Matching; Genome Restriction Analysis; PDB Homology Query; Yeast Protein Motif Query; Yeast Biochemical Pathways; Gene Expression Connection | |
| WormBase | Data repository for Caenorhabditis elegans | Gene, Phenotype, Protein, and Genetics Search; Microarray Expression download and Pattern search; Ontology Search | |
| The Arabidopsis Information Resource (TAIR) | The genetic and molecular biology information resource about Arabidopsis thaliana | Synteny Viewer; MapViewer; Pattern Matching; Motif Analysis; Bulk Data Retrieval; Chromosome Map Tool; Restriction Analysis | |
| NCBI Taxonomy | Names of all organisms that are represented in the genetic databases with at least one nucleotide or protein sequence | Browse; Retrieve and FTP download | |
| UniProt Taxonomy | UniProt taxonomy database, which integrates taxonomy data compiled in the NCBI database and data specific to the UniProt Knowledgebase | Query the database by keywords (species name) or NCBI taxonomic identifier | |
| Gene Expression Omnibus (GEO) | Public repository for high-throughput microarray experimental data | Search by accession number; Search Entrez GEO DataSets or Entrez GEO Profiles with keywords; Visualize cluster heat map images; Retrieve other genes with similar expression patterns; Retrieve chromosomally closest 20 genes; FTP download | |
| CleanEx | Expression reference database that facilitates joint analysis and cross-dataset comparisons | Search by ID, Gene symbol and target ID; List expression datasets; Text search in expression dataset description lines; Extract all features of common genes between datasets; Experiment pools comparison; Batch retrieval; FTP download | |
| SOURCE | Functional genomics resource for human, mouse and rat to facilitate the analysis of large sets of data using genome-scale experimental approaches | Search by CloneID, Database Accession, Gene name/Symbol, UniGene ClusterID, Probe ID, and Entrez GeneID; Batch retrieval | |
| ArrayExpress | Public repository for well-annotated data from array-based platforms, including gene expression, comparative genomic hybridization (CGH) and chromatin-immunoprecipitation (ChIP) experiments, tiling arrays, and so forth | Web-based query interface; REST and Web-services access; FTP download; Web-based online microarray analysis tool (Expression Profiler) | |
| Global Proteome Machine Database (GPMDB) | Global Proteome Machine Database, which utilizes the information obtained by GPM servers to aid in peptide validation as well as protein coverage patterns | Search by protein description keywords and data set keywords | |
| PRoteomics IDEntifications Database (PRIDE) | Public data repository for proteomics data | Search by PRIDE Experiment accession number and Protein accessions; Browse experiments by project name or categories such as species, tissue, cell type, GO terms and disease; Ontology Lookup Service (OLS); Protein Identifier Cross Reference (PICR) service; Database on Demand (DOD) | |
| Peptidome | Public repository that archives and freely distributes tandem mass spectrometry peptide and protein identification data | Search by Accession, Author, Description, MeSH Terms, Organism, Peptide Count, Platform, Protein Count, Protein GI, Publication Date, Search Engine, Spectra Count, Submitter Institute, Title, Update Date | |
| PeptideAtlas | Database of peptides identified by tandem mass spectrometry proteomics experiments | Search by Protein/Gene Name, Protein/Gene ID, Protein/Gene Symbol, Accession, Refseq, Sequence and Peptide Accession; Browse Peptides; Browse Proteins; FTP download | |
| Swiss-2DPAGE | Annotated 2D gel electrophoresis database containing data on proteins identified on various 2D PAGE and SDS-PAGE reference maps | Search by description, accession number, author, spot serial number, experimental pI/Mw range and experimental identification methods; Retrieve all the protein entries identified on a given reference map; Compute estimated location on reference maps for a user-entered sequence; FTP download | |
| Kyoto Encyclopedia of Genes and Genomes (KEGG) | Integrated database resource consisting of 16 main databases, broadly categorized into systems information, genomic information, and chemical information | Access by KEGG object identifier; KEGG Web Services and KEGG FTP download; Pathway Mapping; Brite Mapping; KegHier for browsing and searching functional hierarchies in KEGG BRITE; KegArray for analysis of transcriptome data (gene expression profiles) and metabolome data (compound profiles) | |
| BioCyc | Microbial pathway/genome databases | Visualize individual metabolic pathways; View the complete metabolic map of an organism; Genome browsing capabilities and comparative analysis tools | |
| Online Mendelian Inheritance in Man (OMIM) | A catalog of human genetic and genomic phenotypes | Entrez search at basic, advanced, or complex Boolean levels; Browse entries; Build query; Combine search results; Store search results in Clipboard; FTP download | |
| HapMap | Resource for human genetic variation | Browse data; Bulk data download; HapMart, a data mining tool for retrieving data from the HapMap database | |
| Gene Ontology (GO) | Controlled vocabulary of terms describing the Biological process, Cellular component, and Molecular function of genes and gene products, together with annotation data | Tools include Browsers, Microarray tools, Annotation tools, Mapping to other databases; FTP download in flat file, MySQL or RDF XML format | |
| IntAct | Protein-protein interaction data | Browse by UniProt Taxonomy, Gene Ontology, InterPro Domain, Reactome Pathway, Chromosomal Location, and mRNA expression; FTP download in PSI-MI and PSI-MI TAB format | |
| Database of Interacting Proteins (DIP) | Database of experimentally determined interactions between proteins, with annotations generated by curators or computational methods | Search by protein entry, BLAST, Motif, Article and pathBLAST; Data analysis services include Expression Profile Reliability Index, Paralogous Verification, and Domain Pair Verification | |
| RESID | Collection of annotations and structures for protein pre-, co- and post-translational modifications | Web-based search interface; FTP download of database entries in XML format, and associated files containing XML DTD, graphic images, and molecular models | |
| Phosphosite | Database of phosphorylation sites and other post-translational modifications | Search by Protein, Sequence, or Reference; Browse MS data by Disease, Cell Line, and Tissue | |
| Protein Data Bank (PDB) | Database of experimentally determined structures of proteins, nucleic acids, and complex assemblies | Web-based search and browsing interface; File download via HTTP and FTP services in PDB, mmCIF, and PDBML/XML format | |
| Structural Classification of Proteins (SCOP) | Comprehensive ordering of all proteins of known structure according to their evolutionary and structural relationships | Keyword-based search | |
| CATH | Protein domain structures database | Search by ID/Keywords and FASTA sequence; BLAST; Cathedral and SSAP servers for querying and analyzing CATH data; FTP download | |
| Molecular Modeling Database (MMDB) | Database of 3D structures | Search by UID/text term, protein sequence and 3D coordinates; FTP download | |
| PDBsum | Summaries and analyses of PDB structures | Search by text or sequence; Browse by Highlights, List of PDB codes, Het Groups, Ligands, Enzymes, ProSite and Species; Download data files for protein names, protein sequences, protein annotations, Enzymes, Het Groups, and Ligands | |
| Protein Structure Model Database (Modbase) | Annotated comparative protein structure models and related resources | Search by model or sequence similarity and properties | |
| PIRSF | Family/superfamily classification of whole proteins | Batch retrieval using UniProtKB AC, PIRSF ID, Pfam ID, COG ID, EC Number, GO ID, KEGG Pathway ID, PDB ID; PIRSF scan by sequence or UniProtKB identifier; FTP download | |
| UniProt Reference Clusters (UniRef) | UniProt non-redundant reference clusters | Searches on various attributes of the UniRef clusters, including UniRef cluster ID, protein names, organism names and database identifiers; Direct web access in HTML, XML and FASTA format; FTP download in XML format | |
| Pfam | Protein domain families, each represented by multiple sequence alignments and hidden Markov models (HMMs) | Search by Sequence, Functional similarity, Keyword, Domain, DNA, and Taxonomy; Browse by Families, Clans, Proteomics; FTP download | |
| InterPro | Integrated resource of protein families, domains, and functional sites | Text search; SRS text search; InterPro Scan; InterPro BioMart; Web services; FTP download | |
| Protein ANalysis THrough Evolutionary Relationships (PANTHER) Classification System | Gene products organized by biological function | Search; Browse; Batch search; Gene expression data analysis; Evolutionary analysis of coding SNPs; HMM sequence scoring; FTP download | |
| Simple Modular Architecture Research Tool (SMART) | Resource for protein domain identification and the analysis of protein domain architectures | Sequence analysis; Architecture analysis; Domain detection | |
Figure 1: The PIR ID mapping service maps a set of NCBI GI numbers to UniProt accession numbers.
Database identifiers supported by PIR ID mapping service.
| From | To |
|---|---|
| FLY ID, GenBank AC, Genpept AC, GI Number, IPI ID, MGI ID, NREF ID, PIR-PSD ID, PIR-PSD AC, Refseq AC, SGD ID, TIGR ID, UniParc AC, UniProtKB AC, UniProtKB ID | FLY ID, GenBank AC, Genpept AC, GI Number, IPI ID, MGI ID, NREF ID, PIR-PSD ID, PIR-PSD AC, Refseq AC, SGD ID, TIGR ID, UniParc AC, UniProtKB AC, UniProtKB ID, UniRef50, UniRef90, UniRef100 |
| BLOCKS ID, COG ID, Pfam ID, PIRSF ID, PRINTS ID, PROSITE ID, UniRef50, UniRef90, UniRef100 | BLOCKS ID, COG ID, Pfam ID, PIRSF ID, PRINTS ID, PROSITE ID |
| BIND ID, EC Number, GO ID, KEGG Pathway ID, RESID ID | BIND ID, EC Number, GO ID, KEGG Pathway ID, RESID ID |
| Taxon Group ID, Taxon ID | Taxon Group ID, Taxon ID |
| Entrez Gene ID, OMIM ID, PDB ID, PubMed ID, Gene Name | Entrez Gene ID, OMIM ID, PDB ID, PubMed ID, Gene Name |
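
The batch ID mapping idea behind the table above can be sketched in a few lines. This is only an illustration over an in-memory lookup table, not the PIR service itself: the function names `build_index` and `map_ids` are hypothetical, and the identifiers are placeholders rather than real database records.

```python
from collections import defaultdict


def build_index(pairs):
    """Index (source_id, target_id) pairs; one source may map to many targets."""
    index = defaultdict(list)
    for source_id, target_id in pairs:
        if target_id not in index[source_id]:
            index[source_id].append(target_id)
    return index


def map_ids(query_ids, index):
    """Return (mapped, unmapped) for a batch of query identifiers."""
    mapped, unmapped = {}, []
    for query_id in query_ids:
        targets = index.get(query_id)  # .get() avoids creating empty entries
        if targets:
            mapped[query_id] = list(targets)
        else:
            unmapped.append(query_id)
    return mapped, unmapped


# Placeholder identifiers (assumption): stand-ins for GI numbers and accessions.
PAIRS = [("GI_0001", "ACC_A"), ("GI_0001", "ACC_B"), ("GI_0002", "ACC_C")]
mapped, unmapped = map_ids(["GI_0001", "GI_0002", "GI_0003"], build_index(PAIRS))
```

Reporting unmapped identifiers separately, rather than dropping them silently, makes it easier to see how much of a submitted list was lost in the mapping step.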
Figure 2: Overview of the PIR iProClass data warehouse.
Figure 3: iProClass data warehouse batch retrieval tool web form and result page. (1) Retrieval Box: shows the user's query IDs and allows the user to perform a new retrieval; (2) Display Options: allows the user to choose the columns to be displayed; (3) Save Results As: the output can be saved to the user's local computer. The results will be saved for selected entries or, if no proteins are selected, for all entries; (4) Analyze: BLAST, FASTA, Pattern Match, Multiple Alignment, and Domain Display: retrieved entries can be further analyzed using the sequence analysis programs available on the results page; (5) Results Display: search results are displayed in a table; (6) GO Slim: shows smaller versions of the Gene Ontology (GO) that contain a subset of the terms in the whole GO and give a broad overview of the ontology content without the detail of the specific fine-grained terms; (7) Show match list: shows a table mapping the user's query IDs to the UniProtKB/UniParc IDs.
Figure 4: iProClass data warehouse peptide match tool web form and result page. (1) Query Peptide: clicking this box toggles display of the query peptide; (2) Save Results As: the output can be saved to the user's local computer. The results will be saved for selected entries or, if no proteins are selected, for all entries; (3) Analyze: BLAST, FASTA, Pattern Match, Multiple Alignment, and Domain Display: retrieved entries can be further analyzed using the sequence analysis programs available on the results page; (4) Results Display: search results are displayed in a table.
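
A peptide match of the kind shown in Figure 4 can be approximated as an exact substring search over protein sequences. The sketch below is an assumption about the general technique, not the tool's actual implementation; `find_peptide` is a hypothetical name and the sequences are made-up toy strings.

```python
def find_peptide(peptide, proteins):
    """Return {protein_id: [1-based start positions]} for exact peptide matches."""
    if not peptide:
        raise ValueError("peptide must be a non-empty string")
    hits = {}
    for protein_id, sequence in proteins.items():
        positions = []
        start = sequence.find(peptide)
        while start != -1:
            positions.append(start + 1)  # report 1-based residue positions
            start = sequence.find(peptide, start + 1)  # allow overlapping matches
        if positions:
            hits[protein_id] = positions
    return hits


# Toy sequences (assumption): not taken from any database.
PROTEINS = {
    "P_demo1": "MKTAYIAKQRMKTAY",
    "P_demo2": "GGGSSGGG",
    "P_demo3": "AAKTAYA",
}
hits = find_peptide("KTAY", PROTEINS)
```

A linear scan like this is adequate for a handful of sequences; a production-scale service would more plausibly rely on a precomputed index, which is outside the scope of this sketch.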
NIAID biodefense proteomics resource catalog summary.
| Organism | PRC | Data type | SOPs | No. of proteins in Master Protein Directory (MPD) | No. of reagents in Master Reagent Directory (MRD) | No. of proteins in Complete Predicted Proteome (CPP) |
|---|---|---|---|---|---|---|
| | Caprion Proteomics Inc. | Mass spectrometry | 23 | 4963 | - | 6070 |
| | Einstein Biodefense Proteomic Research Center | Mass spectrometry | 4 | 609 | Antibodies (68) | - |
| | Myriad Genetics | Protein interaction | 4 | 62 | Clone (4379) | 4629 |
| | Pacific Northwest National Laboratory | Mass spectrometry | 2 | 2958 | Antibodies (1) | 187 |
| | Scripps Research Institute | Protein structure | 5 | 6 | Clone (7) | - |
| | Einstein Biodefense Proteomic Research Center | Mass spectrometry | 5 | 6678 | Antibodies (101) | - |
| | Myriad Genetics | Protein interaction | 2 | 33 | Clone (315) | 254 |
| | Pacific Northwest National Laboratory | Mass spectrometry | 2 | 2973 | - | - |
| | Harvard Institute of Proteomics | Clone | 4 | 3731 | Bacteria (627), Clone (7172) | 11208 |
| | Myriad Genetics | Protein interaction | 5 | 75 | Clone (9900) | 5966 |
| | Harvard Institute of Proteomics | Clone | 3 | 5342 | Clone (5344) | 16686 |
| | University of Michigan | Mass spectrometry | - | 5851 | Bacteria (22), ArrayChip (1) | 16686 |
| | University of Michigan | Microarray | 2 | 6378 | - | 16686 |
| | Myriad Genetics | Protein interaction | 5 | 84 | Clone (7884) | 16686 |
| | Pacific Northwest National Laboratory | Mass spectrometry | - | 2061 | Bacteria (38) | - |
| | Pacific Northwest National Laboratory | Protein interaction | - | 3 | - | 4532 |
| | Pacific Northwest National Laboratory | Mass spectrometry | 12 | 3753 | - | 4532 |
| | Pacific Northwest National Laboratory | Microarray | - | 653 | - | 4532 |
Figure 5: Protein-centric query across multiple data types in the NIAID Biodefense Master Protein Directory. (a) A search for Bacillus anthracis proteins with interaction, microarray, and mass spectrometry data yields 47 bacterial proteins; (b) ID mapping merges database identifiers from up to six different databases; (c) Different data types are displayed; (d) Each data set is assigned an experiment identifier, and hyperlinks provide additional information about the experiments.
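
The protein-centric query in Figure 5(a) amounts to selecting proteins that have evidence of every requested data type. A minimal sketch follows, assuming each protein record simply lists experiment identifiers per data type; the record layout, the function name `proteins_with_all`, and all identifiers are hypothetical.

```python
def proteins_with_all(evidence, required_types):
    """Return sorted protein IDs that have at least one experiment of every required type."""
    return sorted(
        protein_id
        for protein_id, by_type in evidence.items()
        if all(by_type.get(data_type) for data_type in required_types)
    )


# Toy evidence table (assumption): protein -> data type -> experiment identifiers.
EVIDENCE = {
    "prot_A": {"interaction": ["exp_1"], "microarray": ["exp_2"], "mass_spec": ["exp_3"]},
    "prot_B": {"interaction": ["exp_1"], "mass_spec": ["exp_3"]},
    "prot_C": {"microarray": ["exp_2"], "mass_spec": []},
}
selected = proteins_with_all(EVIDENCE, ["interaction", "microarray", "mass_spec"])
```

Keying everything on a single protein identifier is what makes this intersection trivial, which is the practical payoff of merging identifiers from several databases first.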
Figure 6: A single experiment, Myriad_Bac_07, contains interactions between 497 Bacillus anthracis and human proteins determined by yeast two-hybrid analysis. Using the customizable interface, we can download the UniRef_90 [99] identifiers for the human proteins and use them to retrieve the homologous mouse proteins with microarray and mass spectrometry data from mouse macrophages infected with Bacillus anthracis. See text for details. (a) The customizable interface can be used to display UniRef_90 identifiers of proteins; (b) A list of human proteins used to retrieve the homologous mouse proteins with microarray and mass spectrometry data from mouse macrophages infected with Bacillus anthracis.
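
The retrieval step described in Figure 6 can be pictured as a join on cluster identifiers: human and mouse proteins that share a UniRef90 cluster are treated as homologs. The sketch below assumes each protein carries exactly one cluster ID; `homologs_by_cluster` and every identifier shown are hypothetical placeholders.

```python
from collections import defaultdict


def homologs_by_cluster(human_proteins, mouse_proteins):
    """Pair each human protein with the mouse proteins that share its cluster ID."""
    mouse_by_cluster = defaultdict(list)
    for mouse_id, cluster_id in mouse_proteins.items():
        mouse_by_cluster[cluster_id].append(mouse_id)
    return {
        human_id: sorted(mouse_by_cluster.get(cluster_id, []))
        for human_id, cluster_id in human_proteins.items()
    }


# Placeholder protein and cluster identifiers (assumption).
HUMAN = {"human_1": "cluster_X", "human_2": "cluster_Y", "human_3": "cluster_Z"}
MOUSE = {
    "mouse_1": "cluster_X",
    "mouse_2": "cluster_X",
    "mouse_3": "cluster_Z",
    "mouse_4": "cluster_W",
}
homologs = homologs_by_cluster(HUMAN, MOUSE)
```

Human proteins with no mouse member in their cluster come back with an empty list, so the caller can see which proteins have no cross-species data to draw on.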
Figure 7: Pathogen-host Y2H protein interaction data from Figures 5 and 6 were loaded into Cytoscape and combined with microarray and mass spectrometry data from other experiments in which Bacillus anthracis was used to infect mouse macrophages. Only data that showed a significant increase or decrease in expression in the original experiments were loaded. For human interacting proteins, data from homologous mouse proteins were used. (a) The Bacillus anthracis-human protein interaction network. Triangles are Bacillus anthracis proteins, and squares are human proteins. Red indicates that an increase in expression was observed in either a microarray or a proteomic experiment, and green indicates a decrease in expression; (b) A list of sixteen proteins involved in eight pathogen-host interactions where a human protein showed a significant decrease in expression upon infection. Of the eight interactions, three were with pathogen proteins that showed an increase in expression upon infection.
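
The filtering behind Figure 7(b) can be sketched as an overlay of expression changes on interaction edges. This is an illustrative reconstruction under simple assumptions (one "up" or "down" label per protein, edges as pathogen-host pairs), not the Cytoscape workflow itself; the function name and all identifiers are hypothetical.

```python
def host_decrease_interactions(edges, expression):
    """Keep pathogen-host edges whose host protein decreased.

    Each result is (pathogen, host, pathogen_increased).
    """
    selected = []
    for pathogen, host in edges:
        if expression.get(host) == "down":
            selected.append((pathogen, host, expression.get(pathogen) == "up"))
    return selected


# Toy network and expression calls (assumption): "up"/"down" mark significant changes,
# and proteins absent from EXPRESSION showed no significant change.
EDGES = [("ba_1", "hs_1"), ("ba_2", "hs_2"), ("ba_3", "hs_3")]
EXPRESSION = {"ba_1": "up", "hs_1": "down", "hs_2": "up", "hs_3": "down"}
result = host_decrease_interactions(EDGES, EXPRESSION)
```

Counting the entries whose third element is true gives the subset analogous to the interactions in which the pathogen protein also increased.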