Mingjun Wang, Xing-Ming Zhao, Kazuhiro Takemoto, Haisong Xu, Yuan Li, Tatsuya Akutsu, Jiangning Song.
Abstract
Single amino acid variants (SAVs) are the most abundant form of known genetic variation associated with human disease. Successful prediction of the functional impact of SAVs from sequence can thus improve our understanding of why a given SAV is associated with a particular disease. In this work, we constructed a high-quality structural dataset containing 679 protein structures with 2,048 SAVs by collecting human genetic variant data from multiple resources and dividing the variants into two categories: disease-associated and neutral. We built a two-stage random forest (RF) model, termed FunSAV, to predict the functional effect of SAVs by combining sequence, structure and residue-contact network features with additional features that were not explored in previous studies. Importantly, a two-step feature selection procedure was proposed to select the most informative features that contribute to the prediction of disease association of SAVs. In cross-validation experiments on the benchmark dataset, FunSAV achieved an area under the curve (AUC) of 0.882, which is competitive with, and in some cases better than, existing tools including SIFT, SNAP, PolyPhen2, PANTHER, nsSNPAnalyzer and PhD-SNP. The source code of FunSAV and the datasets can be downloaded at http://sunflower.kuicr.kyoto-u.ac.jp/sjn/FunSAV.
Year: 2012 PMID: 22937107 PMCID: PMC3427247 DOI: 10.1371/journal.pone.0043847
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Overview of the FunSAV method for predicting the functional effect of SAVs.
Features used by FunSAV are derived from the amino acid sequence of the protein, its 3D structure, and network properties calculated from a representation of the protein structure as a residue-residue contact network. A full list of the extracted features is given in Table 1. After feature selection, the features that distinguish disease-associated from neutral SAVs are analyzed statistically and used as input to construct the RF models. Prediction performance is evaluated by both 5-fold cross-validation and independent tests.
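To illustrate what a residue-residue contact network looks like in practice, here is a minimal sketch in plain Python. It is not the FunSAV implementation: the 7 Å Cα-Cα distance cutoff, the toy coordinates and the function names are all assumptions made for this example, and only two of the network measures (degree and closeness) are shown.

```python
import math
from collections import deque


def build_contact_network(ca_coords, cutoff=7.0):
    """Link residues i and j when their C-alpha atoms lie within `cutoff` angstroms.

    The 7.0 A cutoff is an assumption for this sketch, not a value from the paper.
    """
    n = len(ca_coords)
    adjacency = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(ca_coords[i], ca_coords[j]) <= cutoff:
                adjacency[i].add(j)
                adjacency[j].add(i)
    return adjacency


def degree(adjacency, node):
    """Number of residues in direct contact with `node`."""
    return len(adjacency[node])


def closeness(adjacency, node):
    """Reachable residues divided by the sum of shortest-path lengths (BFS)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        u = queue.popleft()
        for v in adjacency[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy input: four residues on a straight line, 5 A apart (made-up coordinates).
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
network = build_contact_network(coords)
print(degree(network, 1), closeness(network, 1))
```

In this toy chain only sequential neighbours are in contact, so the interior residues have higher degree and closeness than the two ends.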
Table 1. Features used in this study, categorized into nine major types: sequence or sequence-derived features, structure features, residue-contact network features, computed scores, database annotations, solvent exposure features, coevolutionary features, solvent accessibilities and conservation score.
| Feature type | Annotation |
| Sequence or sequence-derived features | Mutation residue and residue neighbors within the window size |
| | Wild-type residue and mutant residue |
| | PSSM (PSI-BLAST) |
| | Mass weight change upon mutation |
| | Aggregation properties (TANGO) |
| | SCRATCH (SSpro) score |
| | PSIPRED score |
| | DISOPRED score |
| | PSIC score |
| Structure features | B-factor |
| | α-helix or β-bend or coil (DSSP) |
| | ACC (number of water molecules in contact with this residue *10) (DSSP) |
| | Disulfide bond and residue distance in the 3D structure |
| | KAPPA: virtual bond angle (bend angle) defined by the three Cα atoms of residues I−2, I, I+2 (DSSP) |
| | ALPHA: virtual torsion angle (dihedral angle) defined by the four Cα atoms of residues I−1, I, I+1, I+2 (DSSP) |
| | TCO: cosine of the angle between C=O of residue I and C=O of residue I−1 (DSSP) |
| | X-CA Y-CA Z-CA: echo of Cα atom coordinates (DSSP) |
| | Number of H-bonds (HBPLUS) |
| | Metal-binding residue and the 3D distance |
| | Hydrogen bond (DSSP) |
| | Dihedral angle, Cα atom coordinates (DSSP) |
| | Distance between the SAV and the origin of the coordinates |
| Network features | Degree, closeness, status, hubscore, clustering coefficient, cyclic coefficient, constraint, betweenness, eigenvector, cocitation, coreness, eccentricity |
| Computed scores | SIFT score |
| | PolyPhen2 score |
| | SNAP score |
| | PANTHER |
| | nsSNPAnalyzer |
| | PhD-SNP |
| Annotations from database | Functional region annotation from UniProt |
| | Sequence distance between SAV and functional region |
| | 3D distance between SAV and functional region |
| | Pfam family annotation from Pfam |
| Solvent exposure features | Solvent exposure features calculated by biopython |
| Coevolutionary features | MI, MIp, MIr and Kai value |
| Solvent accessibilities | Solvent accessibility calculated by NACCESS |
| Conservation score | Conservation score |
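As an illustration of the coevolutionary feature group, the sketch below computes plain mutual information between two columns of a multiple sequence alignment. It assumes that MI in Table 1 denotes this standard quantity; the corrected variants (MIp, MIr) and the Kai value are not covered, and the function name and toy columns are made up for the example.

```python
import math
from collections import Counter


def mutual_information(column_i, column_j):
    """Mutual information (in bits) between two equally long alignment columns."""
    if len(column_i) != len(column_j):
        raise ValueError("alignment columns must have the same length")
    n = len(column_i)
    freq_i = Counter(column_i)
    freq_j = Counter(column_j)
    freq_ij = Counter(zip(column_i, column_j))
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Toy columns (one character per aligned sequence), invented for illustration.
covarying = mutual_information("AAGG", "LLVV")    # residues change together
independent = mutual_information("AAGG", "LVLV")  # no association
print(covarying, independent)
```

Columns whose residues change together across the alignment score high, while unrelated columns score near zero.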
Table 2. Abbreviations of the 15 features in the final selected set.
| Feature name | Residue Position | Abbreviation |
| The non-polar side chain solvent accessibility calculated by NACCESS | V8 | NAC_npa_V8 |
| Conservation score | V8 | Con_V8 |
| SSpro | V8 | SSpro_V8 |
| Mass weight change | – | MW_ch |
| PSSM | V160 | PSSM_V160 |
| B-factor | V7 | B_factor_V7 |
| Coevolutionary feature MI | V8 | Co_MI_V8 |
| Exposure feature HSEBD | V8 | HSEBD_V8 |
| Exposure feature RD | V8 | RD_V8 |
| Exposure feature HSEBU | V9 | HSEBU_V9 |
| Exposure feature CN | V9 | CN_V9 |
| Network feature Status | V1 | Status_V1 |
| Network feature Closeness | V7 | Closeness_V7 |
| Network feature Status | V9 | Status_V9 |
| Network feature Status | V7 | Status_V7 |
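MW_ch is the only selected feature in Table 2 without a residue position. A minimal sketch of how such a value could be computed is shown below, assuming it is the difference in molecular weight between the mutant and the wild-type amino acid. The sign convention (mutant minus wild type) and the function name are assumptions; the weights are standard average molecular weights of the free amino acids in Da, and the difference is the same whether free or residue masses are used.

```python
# Average molecular weights of the 20 standard free amino acids, in Da.
AMINO_ACID_WEIGHT = {
    "A": 89.09, "R": 174.20, "N": 132.12, "D": 133.10, "C": 121.16,
    "E": 147.13, "Q": 146.15, "G": 75.07, "H": 155.16, "I": 131.17,
    "L": 131.17, "K": 146.19, "M": 149.21, "F": 165.19, "P": 115.13,
    "S": 105.09, "T": 119.12, "W": 204.23, "Y": 181.19, "V": 117.15,
}


def mass_change(wild_type, mutant):
    """Mutant minus wild-type molecular weight (sign convention assumed here)."""
    return AMINO_ACID_WEIGHT[mutant] - AMINO_ACID_WEIGHT[wild_type]


# Glycine to tryptophan is the largest possible gain; leucine to isoleucine is zero.
print(round(mass_change("G", "W"), 2))
print(mass_change("L", "I"))
```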
Figure 2. The relative importance and ranking of the optimal feature groups, as evaluated by the mean MDGI Z-Score.
Each bar represents the mean MDGI Z-Score of the corresponding feature group. NACCESS: solvent accessibilities calculated by NACCESS [50]; exposure: solvent exposure features calculated by the biopython package [51]; network: residue-contact network features calculated by the JUNG library available at http://jung.sourceforge.net/; PSSM: PSSM features calculated by PSI-BLAST [28]; co-evolution: coevolutionary features including MIr, MIp, MI and Kai value; DSSP_ACC: the number of water molecules in contact with the residue of interest, extracted from DSSP [39]; conserve_score: conservation score defined in the Feature extraction section; SSpro: solvent accessibility calculated by the SSpro program [30]; MW_change: mass weight change upon mutation; B_factor: the temperature factor extracted from the PDB file; DISOPRED: native disorder predicted by DISOPRED [31].
Figure 3. Comparison of the mean values and standard deviations of the 15 optimal features of disease-associated and neutral SAVs.
“*” denotes a P-value between 0.01 and 0.05, “**” a P-value between 2.2e-16 and 0.01, and “***” a P-value below 2.2e-16. See Table 2 for the feature abbreviations.
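The star notation in this caption can be written as a small lookup, sketched below. The caption does not say which bin the boundary values 0.01 and 2.2e-16 belong to, so the strict inequalities used here are an assumption, as is the function name.

```python
def significance_stars(p_value):
    """Map a P-value to the star notation used in the Figure 3 caption.

    Boundary handling (strict '<') is an assumption; the caption leaves it open.
    """
    if p_value < 2.2e-16:
        return "***"
    if p_value < 0.01:
        return "**"
    if p_value < 0.05:
        return "*"
    return ""  # not significant at the 0.05 level


for p in (0.03, 1e-5, 1e-20, 0.2):
    print(p, repr(significance_stars(p)))
```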
Figure 4. Effect of the removal or inclusion of the 15 individual optimal features on the prediction performance of the first-stage FunSAV classifier.
Performance was evaluated using the Matthews correlation coefficient (MCC). A: MCC of the classifier trained on each individual feature alone; B: decrease in MCC when the corresponding feature is removed. See Table 2 for the feature abbreviations.
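The two analyses in this figure follow a generic pattern that can be sketched independently of the classifier. In the sketch below, `evaluate` is a hypothetical callback standing in for "train a classifier on this feature subset and return its MCC"; the toy scoring function, the feature names `f1` to `f3` and their numbers are invented purely so the example runs.

```python
def single_feature_scores(features, evaluate):
    """Panel A analogue: score of a model trained on each feature alone."""
    return {f: evaluate([f]) for f in features}


def removal_drops(features, evaluate):
    """Panel B analogue: score lost when one feature is left out of the full set."""
    baseline = evaluate(features)
    return {
        f: baseline - evaluate([g for g in features if g != f])
        for f in features
    }


# Hypothetical stand-in for real training and evaluation; values are made up.
TOY_CONTRIBUTION = {"f1": 0.30, "f2": 0.10, "f3": 0.05}


def toy_evaluate(feature_subset):
    return sum(TOY_CONTRIBUTION[f] for f in feature_subset)


features = list(TOY_CONTRIBUTION)
alone = single_feature_scores(features, toy_evaluate)
drops = removal_drops(features, toy_evaluate)
print(alone)
print(drops)
```

With a real classifier the two views need not agree: a feature can score poorly alone yet cause a large drop when removed, if it complements the others.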
Table 3. Prediction performance of the first-stage and two-stage FunSAV classifiers in comparison with six other prediction tools.
| Classifier | MCC | ACC | SEN | SPE | PRE | AUC |
| SNAP | 0.426 | 0.680 | 0.932 | 0.441 | 0.612 | 0.740 |
| SIFT | 0.475 | 0.734 | 0.806 | 0.665 | 0.695 | 0.807 |
| PolyPhen2 | 0.512 | 0.745 | 0.879 | 0.618 | 0.685 | 0.838 |
| nsSNPAnalyzer | 0.334 | 0.665 | 0.546 | 0.778 | 0.699 | 0.662 |
| PANTHER | 0.500 | 0.749 | 0.776 | 0.724 | 0.727 | 0.816 |
| PhD-SNP | 0.350 | 0.676 | 0.653 | 0.697 | 0.671 | 0.675 |
| First-stage classifier | 0.535 | 0.767 | 0.772 | 0.763 | 0.755 | 0.824 |
| PolyPhen2+SIFT+SNAP+nsSNPAnalyzer+PANTHER+PhD-SNP | 0.513 | 0.757 | 0.802 | 0.708 | 0.748 | 0.831 |
| Two-stage classifier | 0.598 | 0.799 | 0.797 | 0.801 | 0.792 | 0.882 |
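For reference, the threshold-based measures in the table above (MCC, ACC, SEN, SPE, PRE) can all be derived from a confusion matrix using their standard definitions, as sketched below. The counts in the usage example are invented for illustration and are not taken from the FunSAV benchmark; the function names are likewise assumptions.

```python
import math


def _ratio(numerator, denominator):
    """Safe division that returns 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification measures from confusion-matrix counts."""
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "MCC": _ratio(tp * tn - fp * fn, mcc_denominator),
        "ACC": _ratio(tp + tn, tp + fp + tn + fn),
        "SEN": _ratio(tp, tp + fn),  # sensitivity (recall)
        "SPE": _ratio(tn, tn + fp),  # specificity
        "PRE": _ratio(tp, tp + fp),  # precision
    }


# Made-up counts: 100 disease-associated and 100 neutral variants.
metrics = classification_metrics(tp=80, fp=20, tn=80, fn=20)
print(metrics)
```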
Figure 5. ROC curves of nine classifiers based on 5-fold cross-validation tests.
Results are evaluated based on the benchmark dataset (A) and independent test dataset (B).
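The area under a ROC curve can be computed without drawing the curve, using its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The sketch below uses that formulation; the labels, scores and function name are made up for illustration and do not reproduce any curve in the figure.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    `labels` holds 1 for disease-associated and 0 for neutral (an encoding
    assumed for this sketch); ties between scores count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy predictions: one positive is ranked below one negative.
auc = roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1])
print(auc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a sort-based rank computation scales better on large datasets.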
Figure 6. Prediction examples of the functional effect of SAVs in two proteins by FunSAV.
(A, B) All-atom, (C, D) surface and (E, F) network representations of the proteins hATR (PDB ID: 2IDX, chain A) and PAF-AH (PDB ID: 3D59, chain A), respectively. Red denotes disease-associated variants and green denotes neutral variants. 3D structures were rendered using PyMol [71] and network graphs were drawn using Cytoscape [72].
Figure 7. Example of a false-negative FunSAV prediction of the functional effect of an SAV in the Noggin protein.
(A) All-atom, (B) surface and (C) network representations of the Noggin protein. Red denotes the disease-associated variant. 3D structures were rendered using PyMol [71] and network graphs were drawn using Cytoscape [72].