Jiaying Lai, Jordan Yang, Ece D Gamsiz Uzun, Brenda M Rubenstein, Indra Neil Sarkar.
Abstract
SUMMARY: Single amino acid variations (SAVs) are a primary contributor to variations in the human genome. Identifying pathogenic SAVs can provide insights into the genetic architecture of complex diseases. Most approaches for predicting the functional effects or pathogenicity of SAVs rely on either sequence or structural information. This study presents 〈Lai Yang Rubenstein Uzun Sarkar〉 (LYRUS), a machine learning method that uses an XGBoost classifier to predict the pathogenicity of SAVs. LYRUS incorporates five sequence-based, six structure-based and four dynamics-based features. Uniquely, LYRUS includes a newly proposed sequence co-evolution feature called the variation number. LYRUS was trained using a dataset that contains 4363 protein structures corresponding to 22 639 SAVs from the ClinVar database, and tested using the VariBench testing dataset. Performance analysis showed that LYRUS achieved performance comparable to that of current variant effect predictors (VEPs). LYRUS's performance was also benchmarked against six Deep Mutational Scanning (DMS) datasets for PTEN and TP53.
Year: 2021 PMID: 35036922 PMCID: PMC8754197 DOI: 10.1093/bioadv/vbab045
Source DB: PubMed Journal: Bioinform Adv ISSN: 2635-0041
Features used for SAV pathogenicity prediction
| Feature name | Description | Type |
|---|---|---|
| Variation number | Sequence position conservation score calculated using orthologs. Variation numbers employed in the model are scaled using min-max normalization for each amino acid sequence | SEQ |
| Δ | Change in evolutionary statistical energy computed by EVmutation | SEQ |
| Functional impact score (FIS) | Predicted magnitude of the effects of amino acid substitutions, weighted by the relative frequency of disease-causing and neutral amino acid substitutions, computed by FATHMM | SEQ |
| ΔPSIC | Difference of PSIC scores for two amino acid residue variants computed by PolyPhen-2 | SEQ |
| Wild-type PSIC | PSIC score for wild-type amino acid residue computed by PolyPhen-2 | SEQ |
| | Folding free energy difference computed by FoldX | STR |
| SASA | Solvent accessible surface area computed by FreeSASA | STR |
| Mutant SSF | Knowledge-based potential for mutant amino acid variants computed by MAESTRO | STR |
| Active site value | Calibrated probability of being a ligand-binding residue computed by P2Rank. Assigned 1 if the probability is >0.5 | STR |
| Mutant reference energy | Unfolded-state reference energies for mutant amino acid variants computed by PyRosetta | STR |
| ΔReference energy | Difference between unfolded-state reference energies for two amino acid variants computed by PyRosetta | STR |
| MSD | Mean squared displacements of computed by | DYN |
| Mechanical stiffness | Measurement of the mechanical resistance of residues to external pulling forces computed by | DYN |
| Effectiveness | The ability of a residue to transmit mechanical deformation signals when subjected to a unit perturbation computed by | DYN |
| Sensitivity | The ability of a residue to sense mechanical deformation signals when subjected to a unit perturbation computed by | DYN |
Note: Fifteen features belonging to three different categories are used. Each feature calculation requires either an amino acid sequence or PDB file, or both. SEQ, sequence-based feature; STR, structure-based feature; DYN, dynamics-based feature.
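Two of the table entries describe simple transformations: the per-sequence min-max scaling of variation numbers and the thresholding of the active site probability. The following is a minimal sketch of both, not the authors' code. The function names and input values are hypothetical, and two behaviours are assumptions not stated in the table: a constant sequence is mapped to zeros, and probabilities at or below 0.5 are assigned 0.

```python
def min_max_scale(values):
    """Scale one sequence's variation numbers to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant sequence maps to all zeros to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def binarize_active_site(probability, threshold=0.5):
    """Return 1 when the ligand-binding probability exceeds the threshold.

    Assumption: residues at or below the threshold are assigned 0.
    """
    return 1 if probability > threshold else 0


# Hypothetical per-position variation numbers for a single protein sequence.
raw_variation_numbers = [3, 1, 7, 5]
scaled = min_max_scale(raw_variation_numbers)
```

Scaling within each sequence rather than across the whole dataset keeps the feature comparable between proteins whose raw values span different ranges.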
Fig. 1. Feature validations. (a) A comparison of variation number histograms for the pathogenic and neutral SAVs. Among the 22 639 selected SAVs, 9743 SAVs were neutral and 10 564 SAVs were pathogenic. The mean variation number for the pathogenic SAVs was 0.12 while the mean variation number for the neutral SAVs was 0.32. (b) The XGBoost model was trained 15 times, each time excluding one feature from training. The results are shown with the accuracy, balanced accuracy and F1 scores calculated using the best model plotted on the y-axis.
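The ablation in panel (b) can be sketched as a leave-one-feature-out loop scored with the three reported metrics. The code below shows only the bookkeeping and the metric definitions. The model training step is omitted, the feature identifiers are hypothetical shortened names, the labels are toy values, and the encoding 1 = pathogenic, 0 = neutral is an assumption.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (assumed 1 = pathogenic, 0 = neutral)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn


def accuracy(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + fp + tn + fn)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


def f1_score(y_true, y_pred):
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def leave_one_out_feature_sets(features):
    """Map each excluded feature to the list of features kept for that run."""
    return {left_out: [f for f in features if f != left_out] for left_out in features}


# Hypothetical feature identifiers; the full model uses 15 features.
features = ["variation_number", "fis", "sasa"]
runs = leave_one_out_feature_sets(features)

# Toy labels and predictions standing in for one trained model's output.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]
scores = (accuracy(y_true, y_pred),
          balanced_accuracy(y_true, y_pred),
          f1_score(y_true, y_pred))
```

In a full run, each entry of `runs` would be used to train one classifier, and a feature whose removal lowers the scores the most would be read as the most informative.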
Fig. 2. ROC and PR curves of LYRUS from 10-fold cross-validation with the ClinVar dataset. (a) ROC curves for each fold of the 10-fold cross-validation, with the mean and the standard deviation. (b) PR curves for each fold of the 10-fold cross-validation, with the mean and the standard deviation.
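As an illustration of how the 10 folds behind these curves could be formed, the sketch below splits sample indices into k folds while keeping the class ratio roughly equal in each. Stratification, the round-robin assignment, and the fixed seed are assumptions; the source states only that 10-fold cross-validation was used.

```python
import random


def stratified_k_fold(labels, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for k folds.

    Assumption: folds are stratified by label; the source does not say so.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]

    # Group sample indices by class label.
    by_class = {}
    for idx, label in enumerate(labels):
        by_class.setdefault(label, []).append(idx)

    # Shuffle within each class, then deal indices to folds round-robin.
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train_idx, test_idx


# Toy labels: 30 pathogenic (1) and 20 neutral (0) samples.
labels = [1] * 30 + [0] * 20
splits = list(stratified_k_fold(labels, k=10))
```

Each sample appears in exactly one test fold, so the per-fold curves are computed on disjoint held-out sets.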
Fig. 3. Comparison of ROC and PR curves of LYRUS and 12 other VEPs using the VariBench_selected dataset. (a) ROC curves for the 13 VEPs. LYRUS has an AUROC of 0.891, which is the fifth highest among all the VEPs. (b) PR curves for the 13 VEPs.
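The AUROC used to rank the predictors equals the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen neutral one. A minimal pairwise sketch of that definition follows. The labels and scores are toy values, not data from the study, and the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(y_true, scores):
    """Area under the ROC curve via pairwise comparison; ties count as half."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    if not positives or not negatives:
        raise ValueError("AUROC needs at least one sample of each class")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = pathogenic, 0 = neutral (assumed encoding).
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 3 of 4 positive-negative pairs are ordered correctly
```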
Fig. 4. VEPs benchmarked against three PTEN DMS datasets. (a) Spearman’s correlation (absolute value) between pten(a) DMS results and 15 VEPs. The top three performing predictors are: PROVEAN, REVEL and SuSPect. (b) Spearman’s correlation (absolute value) between pten(b) DMS results and 15 VEPs. The top three performing predictors are: SuSPect, PROVEAN and LYRUS. (c) Spearman’s correlation (absolute value) between pten(highqual_b) DMS results and 15 VEPs. The top three performing predictors are: SuSPect, PROVEAN and MutationAssessor.
Fig. 5. VEPs benchmarked against three TP53 DMS datasets. (a) Spearman’s correlation (absolute value) between p53(wt_nutlin) DMS results and 12 VEPs. The top three performing predictors are: FATHMM, LYRUS and M-CAP. (b) Spearman’s correlation (absolute value) between p53(null_nutlin) DMS results and 12 VEPs. The top three performing predictors are: FATHMM, LYRUS and MVP. (c) Spearman’s correlation (absolute value) between p53(null_etoposide) DMS results and 12 VEPs. The top three performing predictors are: MVP, FATHMM and LYRUS.
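The DMS benchmarks in Figs. 4 and 5 report the absolute value of Spearman's correlation, which makes predictors comparable regardless of whether a higher score means more or less damaging. A self-contained sketch of that statistic follows, using average ranks for ties. The input values are hypothetical, and the helper names are not taken from the study.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences with non-zero variance."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def abs_spearman(x, y):
    """Absolute Spearman correlation: Pearson correlation of the ranks."""
    return abs(pearson(average_ranks(x), average_ranks(y)))


# Hypothetical DMS fitness scores and predictor outputs for four variants.
dms_scores = [0.1, 0.5, 0.9, 1.3]
predictor_scores = [0.8, 0.6, 0.3, 0.2]
rho = abs_spearman(dms_scores, predictor_scores)
```

Here the predictor ranks the variants in exactly the reverse order of the DMS scores, so the signed correlation is negative while the absolute value is at its maximum.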