Kai Xia, Dong Dong, Jing-Dong J Han.
Abstract
BACKGROUND: Although protein-protein interaction (PPI) networks have been explored by various experimental methods, the maps so built are still limited in coverage and accuracy. To further expand the PPI network and to extract more accurate information from existing maps, studies have been carried out to integrate various types of functional relationship data. A frequently updated database of computationally analyzed potential PPIs to provide biological researchers with rapid and easy access to analyze original data as a biological network is still lacking.
Year: 2006 PMID: 17112386 PMCID: PMC1661597 DOI: 10.1186/1471-2105-7-508
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Figure 1. A brief description of the integration by the probabilistic model. Seven heterogeneous dataset types are gathered and evaluated against the gold standard positive (GSP, all the annotated protein-protein interactions from HPRD) and the gold standard negative (GSN, possible protein pairs between the proteins on the plasma membrane and those in the nucleus). The potential of forming a protein-protein interaction is scored as the likelihood ratio (LR) for protein pairs to be true positive interactions versus true negative interactions, according to the GSP and GSN. Each interaction is assigned an LR within a data type. When evidence arises from more than one dataset within a data type, the maximal LR among the datasets is used for a gene pair. The LRs given by different data types are then integrated by the Naïve Bayes model, which generates the final prediction score for a potential PPI by multiplying all the LRs from the seven distinct data types. Lastly, the final integrated network with an acceptable confidence level for each interaction is presented.
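The integration rule in this caption (maximum LR within a data type, product of LRs across data types) can be illustrated with a minimal Python sketch. The function name `integrate_lrs`, the dictionary layout and the LR values are hypothetical and chosen only for illustration; they are not taken from the IntNetDB implementation.

```python
from math import prod


def integrate_lrs(evidence):
    """Combine per-dataset LRs for one protein pair.

    evidence maps a data type name to the list of LRs reported by
    the individual datasets of that type (hypothetical layout).
    """
    # Within a data type, keep only the maximal LR among its datasets.
    per_type = {dtype: max(lrs) for dtype, lrs in evidence.items() if lrs}
    # Across data types, multiply the LRs (Naive Bayes assumption of
    # conditional independence). With no evidence the product is 1.
    return prod(per_type.values())


# Made-up LR values for a single protein pair.
pair_evidence = {
    "PPI": [4.0, 8.0],    # two PPI datasets: the maximum, 8.0, is used
    "expression": [2.0],
    "GO": [0.5],          # an LR below 1 lowers the combined score
}
final_lr = integrate_lrs(pair_evidence)
print(final_lr)
```

A pair without evidence in any data type receives a combined LR of 1 in this sketch, which leaves its odds of interaction unchanged.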
Figure 2. Assessing the performance of each dataset in predicting the human protein-protein interactions. A. Large-scale protein-protein interaction (PPI) datasets from model organisms and human. The datasets SC5 and DM1 are binned by the confidence level given by the original studies: LC, low confidence; HC, high confidence. B-D. Phenotypic datasets from model organisms. The fly genome-wide RNAi dataset is evaluated by the arithmetic difference in phenotypic values between a pair of genes (B). The phenotypic similarity of yeast genes upon knock-out is evaluated by cosine distance for the discrete values (C) and PCC for the continuous values (D). E. Yeast genetic interaction datasets. Yeast genetic interactions are grouped by the number of shared neighbors between a pair of genetically interacting genes. F. Large-scale human gene expression datasets. Gene pairs are binned by the Pearson Correlation Coefficient (PCC) between the expression profiles of the pair. The purple, yellow and blue curves are derived from three different expression datasets [57-59]. G. Domain-domain interaction (DDI) score. The DDI score of a domain pair is assigned to a pair of proteins containing the domains. If different scores exist for a pair of proteins because of different interacting domain pairs, the maximum of the scores is assigned to the pair. The protein pairs are grouped according to their DDI scores. H. Smallest number of shared biological processes (SSBP) from yeast (SC), worm (CE), fruit fly (DM), mouse (MM) and human GO annotations. Gene pairs are binned by the smallest number of shared GO annotations between a pair of genes. The LR of being GSP versus GSN is then calculated and plotted for gene pairs within each bin for each organism. I. Gene context analysis to predict PPIs. Three types of in silico prediction results are evaluated (gene fusion, gene co-occurrence and gene neighborhood).
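The per-bin evaluation described in this caption can be sketched in Python as follows. The sketch assumes the usual definition of the LR for a bin, namely the fraction of GSP pairs falling in the bin divided by the fraction of GSN pairs falling in the same bin; the caption does not spell out the formula, so this definition, the function names and the example scores are all assumptions made for illustration.

```python
from bisect import bisect_right
from collections import Counter


def bin_index(value, edges):
    """Return the index of the bin that value falls into, given sorted edges."""
    return bisect_right(edges, value)


def likelihood_ratios(gsp_scores, gsn_scores, edges):
    """Compute an LR per bin from scores of GSP and GSN pairs.

    Assumed definition: LR(bin) = P(bin | GSP) / P(bin | GSN).
    Bins with no GSN pairs get None instead of a division by zero.
    """
    gsp_counts = Counter(bin_index(s, edges) for s in gsp_scores)
    gsn_counts = Counter(bin_index(s, edges) for s in gsn_scores)
    lrs = {}
    for b in range(len(edges) + 1):
        p = gsp_counts[b] / len(gsp_scores)
        n = gsn_counts[b] / len(gsn_scores)
        lrs[b] = p / n if n > 0 else None
    return lrs


# Made-up expression PCC values for gold standard pairs, one edge at 0.5.
gsp = [0.9, 0.8, 0.1, 0.7]
gsn = [0.1, 0.2, 0.0, 0.9]
print(likelihood_ratios(gsp, gsn, edges=[0.5]))
```

The same function applies to any binned score in the figure (DDI score, SSBP, number of shared genetic interaction neighbors) once suitable bin edges are chosen.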
Figure 3. TP/FP ratios (true positives versus false positives) at different LLR cutoffs or sensitivities, estimated by 10-fold cross-validation. The TP/FP ratios and the sensitivity (TP/(TP+FN)) are calculated for different LLR (log2 LR) cutoffs. Each dot on the curves represents an average of ten cross-validations at a particular LLR cutoff. A. TP/FP ratios versus LLR cutoffs. A resolution of 44% false positives (TP/FP>1) corresponds to an LLR cutoff of 7.0 and 180,010 predicted interactions. Predicted interactions of higher confidence (larger TP/FP ratios) can be obtained by selecting LLR cutoffs higher than 7.0. B. TP/FP versus sensitivity. The TP/FP ratio indicates the accuracy at a certain resolution, while the sensitivity defines the ability of a test to detect true positives. Hence a tradeoff exists between the accuracy and the sensitivity of a prediction.
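The two quantities plotted in this figure can be computed for a single test fold as in the Python sketch below. It uses plain counts of gold standard pairs at or above the cutoff and omits the averaging over ten folds; whether the original analysis rescales the counts in any way is not stated in the caption, so the sketch should be read as an assumption. The function name and the scored pairs are hypothetical.

```python
def evaluate_cutoff(scored_pairs, cutoff):
    """Return (TP/FP ratio, sensitivity) at one LLR cutoff.

    scored_pairs is a list of (llr, is_positive) tuples, where
    is_positive is True for GSP pairs and False for GSN pairs.
    A pair is predicted as interacting when llr >= cutoff.
    """
    tp = sum(1 for llr, pos in scored_pairs if llr >= cutoff and pos)
    fp = sum(1 for llr, pos in scored_pairs if llr >= cutoff and not pos)
    fn = sum(1 for llr, pos in scored_pairs if llr < cutoff and pos)
    ratio = tp / fp if fp else float("inf")
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    return ratio, sensitivity


# Made-up held-out pairs: (LLR, belongs to GSP?)
fold = [(9.0, True), (8.0, True), (7.5, False),
        (6.0, True), (3.0, False), (1.0, True)]

for cutoff in (7.0, 2.0):
    print(cutoff, evaluate_cutoff(fold, cutoff))
```

On this toy fold, lowering the cutoff from 7.0 to 2.0 raises the sensitivity while lowering the TP/FP ratio, which is the tradeoff described in panel B.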
Improvements over the previous integration analysis for human PPI predictions
| Feature | Rhodes | This study |
| --- | --- | --- |
| Gold standard positive (HPRD version) | Aug, 2004 | Sep, 2005 (about 10,000 more interactions) |
| Number of data types integrated | 4 (PPI, GO, Microarray, DDI) | 7 (PPI, GO, Microarray, DDI, Phenotypic, Genetic, Gene context) |
| Number of datasets integrated | 13 | 27 |
| Information from model organisms | Only PPI data are used | All available large-scale data from model organisms |
| Datasets for each data type: gene expression | 5 | 3 |
| Datasets for each data type: PPI | 4 yeast, 1 worm and 1 fly | 5 yeast, 1 worm, 2 fly and 2 human |
| Datasets for each data type: GO | Human | Yeast, worm, fly, mouse and human |
| Datasets for each data type: phenotypic | None | 2 yeast and 1 fly datasets |
| Datasets for each data type: genetic interaction | None | 2 yeast datasets |
| Datasets for each data type: DDI | InterPro annotation | Domain-domain interaction score dataset |
| Datasets for each data type: gene context | None | 3 yeast datasets |
| Method to validate the results | One simple test using the old HPRD as training set and the updated HPRD as test set | 10-fold cross-validation |
| Size of integrated network at a TP/FP ratio of 1 | 38,379 interactions among 5,791 proteins | 180,010 interactions among 9,901 proteins |
| Data query | One-time spreadsheet download | Online query options with selectable data types and confidence level; data can be downloaded as spreadsheets or network graphs |
| Network visualization and analysis | Supports only a single-gene search, with no download or analysis options | Network cluster extraction, visualization, drill-down and download options |
Figure 4. Overview of IntNetDB. A. The IntNetDB web interface. The default example is the network among human p53 (encoded by the TP53 gene) and its potential interaction partners. B. IntNetDB search results. For the query genes, the network graph view shows all predicted PPIs and the data types supporting each PPI, where the '-' and '+' signs stand for 'absent' and 'present', respectively. A present call corresponds to LLR>0 for the data type shown. C. Visualization of the predicted PPI network. Clicking the hyperlink on a node or an edge leads to more detailed information about that node or interaction (insets). D. Graphical representation of highly connected subgraphs. The subgraphs are extracted from the entire network by the MCODE algorithm. The two enlarged subgraphs correspond to the troponin-related complex and the proteasome complex. The color of an edge denotes the evidence type supporting the predicted PPI. 'Multiple evidences' are said to support a predicted interaction when more than one data type has LLR>1 (LR>2) for that edge. The color of a node is assigned according to the GO term of the gene.
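The two thresholds named in this caption (a present call at LLR>0, and 'multiple evidences' when more than one data type has LLR>1) can be made concrete with a short Python sketch. The function name, the input layout and the LR values are hypothetical; the sketch only restates the caption's rules and does not reproduce the IntNetDB code.

```python
from math import log2


def summarize_edge(lr_by_type):
    """Derive present/absent calls and the multiple-evidence flag for one edge.

    lr_by_type maps a data type name to its LR (must be > 0) for the edge.
    """
    llr = {dtype: log2(lr) for dtype, lr in lr_by_type.items()}
    # Present ('+') when LLR > 0, i.e. LR > 1; absent ('-') otherwise.
    calls = {dtype: "+" if value > 0 else "-" for dtype, value in llr.items()}
    # Multiple evidence: more than one data type with LLR > 1, i.e. LR > 2.
    multiple = sum(1 for value in llr.values() if value > 1) > 1
    return calls, multiple


# Made-up LRs for two predicted interactions.
print(summarize_edge({"PPI": 8.0, "GO": 4.0, "expression": 0.5}))
print(summarize_edge({"PPI": 8.0, "GO": 1.5}))
```

In the second example the GO evidence is called present (LR of 1.5 gives LLR>0) but does not count toward multiple evidence, because its LLR does not exceed 1.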