| Literature DB >> 20689580 |
Tao Huang1, Ping Wang, Zhi-Qiang Ye, Heng Xu, Zhisong He, Kai-Yan Feng, Lele Hu, Weiren Cui, Kai Wang, Xiao Dong, Lu Xie, Xiangyin Kong, Yu-Dong Cai, Yixue Li.
Abstract
Non-synonymous SNPs (nsSNPs), also known as Single Amino acid Polymorphisms (SAPs) account for the majority of human inherited diseases. It is important to distinguish the deleterious SAPs from neutral ones. Most traditional computational methods to classify SAPs are based on sequential or structural features. However, these features cannot fully explain the association between a SAP and the observed pathophysiological phenotype. We believe the better rationale for deleterious SAP prediction should be: If a SAP lies in the protein with important functions and it can change the protein sequence and structure severely, it is more likely related to disease. So we established a method to predict deleterious SAPs based on both protein interaction network and traditional hybrid properties. Each SAP is represented by 472 features that include sequential features, structural features and network features. Maximum Relevance Minimum Redundancy (mRMR) method and Incremental Feature Selection (IFS) were applied to obtain the optimal feature set and the prediction model was Nearest Neighbor Algorithm (NNA). In jackknife cross-validation, 83.27% of SAPs were correctly predicted when the optimized 263 features were used. The optimized predictor with 263 features was also tested in an independent dataset and the accuracy was still 80.00%. In contrast, SIFT, a widely used predictor of deleterious SAPs based on sequential features, has a prediction accuracy of 71.05% on the same dataset. In our study, network features were found to be most important for accurate prediction and can significantly improve the prediction performance. Our results suggest that the protein interaction context could provide important clues to help better illustrate SAP's functional association. This research will facilitate the post genome-wide association studies.Entities:
Mesh:
Substances:
Year: 2010 PMID: 20689580 PMCID: PMC2912763 DOI: 10.1371/journal.pone.0011900
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1The curve of IFS.
(A) The IFS curve with a step width of 5. The highest accuracy was achieved with 261 features, which suggest the optimal feature set should have more than 256 and less than 266 features; (B) The IFS curve between index 256 and 265. Refine the accuracy around S, by calculating accuracies using feature sets S, S. The highest accuracy of IFS was 83.27% using 263 features. These 263 features formed the optimal feature set.
Figure 2The number of each type of features in the optimal feature set.
The feature with the biggest contribution is KEGG enrichment scores, one kind of the network features. Another kind of the network features, betweenness, was also important. This suggests that if a protein does not interact with biologically important proteins, then its mutation may not cause severe damage.
Figure 3The number of each type of AAFactor features in the optimal feature set.
Factor 3 is the most important one and it relates to molecular size or volume with high factor coefficients for bulkiness, residue volume, average volume of a buried residue, side chain volume, and molecular weight.