Jing-Bo Zhou1, Yao Xiong1, Ke An1, Zhi-Qiang Ye1,2, Yun-Dong Wu1,2,3.
Abstract
MOTIVATION: Despite lacking a folded structure, intrinsically disordered regions (IDRs) of proteins play versatile roles in various biological processes, and many nonsynonymous single nucleotide variants (nsSNVs) in IDRs are associated with human diseases. The continuous accumulation of nsSNVs resulting from the wide application of next-generation sequencing (NGS) has driven the development of disease-association prediction methods for decades. However, the performance of these methods on nsSNVs in IDRs remains inferior, possibly because nsSNVs from structured regions dominate the training data. A disease-association predictor built specifically for nsSNVs in IDRs, with better performance, is therefore highly desirable.
Year: 2020 PMID: 32756939 PMCID: PMC7755418 DOI: 10.1093/bioinformatics/btaa618
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. The pipeline of building the IDR disease nsSNV predictor. DVs and NVs represent disease-associated and neutral nsSNVs, respectively
Summary of the training and testing datasets
| Dataset | Number of disease nsSNVs | Number of neutral nsSNVs | Number of IDRs | Number of proteins |
|---|---|---|---|---|
| Training | 2496 | 2531 | 2821 | 2390 |
| Independent testing | 297 | 262 | 313 | 262 |
| Third-party testing | 2897 | 2897 | 2914 | 2562 |
The 17 selected optimal features
| # | Feature name | Description |
|---|---|---|
| 1 | b9_eva_nal_w | Weighted number of sequences in the alignment based on BLAST against UniRef90 with E-value of 10E-45 |
| 2 | b9_all_rwt | Proportion of wild-type residue at the nsSNV site in the alignment (UniRef90, E-value: default) |
| 3 | b9_eva_rmt_w | Weighted proportion of mutant residue at the nsSNV site in the alignment (UniRef90, E-value: 10E-45) |
| 4 | b9_eva_ree | Relative entropy based on the alignment (UniRef90, E-value: 10E-45) |
| 5 | b1_eva_naa_w | Weighted number of residues at the nsSNV site in the alignment (UniRef100, E-value: 10E-75) |
| 6 | b1_hum_nmt_w | Weighted number of mutant residues at the nsSNV site in the alignment (UniRef100 human, E-value: default) |
| 7 | b1_nhu_pwm_w | Weighted position weight matrix score based on the alignment (UniRef100 non-human, E-value: default) |
| 8 | b1_hum_ree | Relative entropy based on the alignment (UniRef100 human, E-value: default) |
| 9 | b1_nhu_ree | Relative entropy based on the alignment (UniRef100 non-human, E-value: default) |
| 10 | pos_spo | SPOT-Disorder score of the wild-type residue at the nsSNV site |
| 11 | pro_Prec | Estimated probability that a gene is a recessive disease gene |
| 12 | pro_RVIS_ExAC | ExAC-based RVIS score |
| 13 | pro_GDI_Phred | Phred-scaled GDI score |
| 14 | pro_Essential_gene | Gene essentiality |
| 15 | hww_9 | Sum of Wimley–White hydropathy index of neighboring residues with a window of 9 |
| 16 | hwo_d | Difference of octanol–water free energy transfer index |
| 17 | hwo_3 | Sum of octanol–water free energy transfer index of neighboring residues with a window of 3 |
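Several of the features above are simple functions of an alignment column or of a residue-level scale. The following is a minimal sketch of how such quantities could be computed, not the authors' implementation: the uniform background, the pseudocount, the centered window that includes the variant site, the mutant-minus-wild-type direction of the difference, and the values in `TOY_SCALE` are all assumptions made for illustration.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical per-residue scale; NOT the Wimley-White or octanol-water values.
TOY_SCALE = {aa: float(i) for i, aa in enumerate(AMINO_ACIDS)}


def relative_entropy(column, background=None, pseudocount=1.0):
    """Relative entropy (in bits) of one alignment column versus a background.

    Assumes a uniform background when none is given; gaps and unknown
    characters in the column are ignored.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    counts = Counter(res for res in column if res in background)
    total = sum(counts.values()) + pseudocount * len(background)
    score = 0.0
    for aa, q in background.items():
        p = (counts[aa] + pseudocount) / total
        score += p * math.log2(p / q)
    return score


def window_sum(sequence, position, scale, window):
    """Sum a residue scale over a window centered on a 0-based position.

    Assumes the central residue is included and the window is clipped
    at the sequence ends.
    """
    half = window // 2
    start = max(0, position - half)
    end = min(len(sequence), position + half + 1)
    return sum(scale[res] for res in sequence[start:end])


def scale_difference(wild_type, mutant, scale):
    """Scale value of the mutant residue minus that of the wild-type residue."""
    return scale[mutant] - scale[wild_type]
```

Under these assumptions, a fully conserved column scores higher than a mixed one, which is the behavior a conservation feature such as `b9_eva_ree` would be expected to capture.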
Performance comparison on the independent testing dataset and third-party testing dataset
| Method | Independent testing: ACC | Independent testing: MCC | Independent testing: F1 score | Independent testing: AUC | Third-party: ACC | Third-party: MCC | Third-party: F1 score | Third-party: AUC |
|---|---|---|---|---|---|---|---|---|
| Predictors without protein-/gene-level features | ||||||||
| SIFT | 0.793 | 0.584 | 0.803 | 0.863 | 0.660 | 0.321 | 0.642 | 0.723 |
| pph2-HumDiv | 0.773 | 0.542 | 0.791 | 0.854 | 0.637 | 0.279 | 0.611 | 0.693 |
| pph2-HumVar | 0.782 | 0.572 | 0.781 | 0.872 | 0.666 | 0.353 | 0.604 | 0.745 |
| PhD-SNP | 0.816 | 0.664 | 0.799 | – | 0.673 | 0.412 | 0.552 | – |
| MutationAssessor | 0.773 | 0.559 | 0.770 | 0.856 | 0.663 | 0.341 | 0.614 | 0.755 |
| fathmm-U | 0.753 | 0.518 | 0.745 | 0.831 | 0.615 | 0.252 | 0.514 | 0.665 |
| PROVEAN | 0.785 | 0.590 | 0.774 | 0.862 | 0.636 | 0.310 | 0.522 | 0.675 |
| PANTHER-PSEP | 0.805 | 0.598 | 0.845 | 0.858 | 0.527 | 0.068 | 0.496 | 0.649 |
| Eigen | 0.801 | 0.556 | 0.707 | 0.840 | 0.671 | 0.357 | 0.610 | 0.735 |
| PMut2017 | 0.834 | 0.696 | 0.826 | 0.924 | 0.772 | 0.365 | 0.452 | 0.765 |
| MutPred2 | 0.812 | 0.657 | 0.795 | 0.906 | 0.634 | 0.344 | 0.466 | 0.761 |
| LIST | 0.736 | 0.495 | 0.790 | 0.904 | 0.703 | 0.437 | 0.749 | 0.809 |
| Predictors containing protein-/gene-level features | ||||||||
| fathmm-W | 0.801 | 0.615 | 0.795 | 0.889 | 0.842 | 0.686 | 0.835 | 0.898 |
| PON-P2 | 0.835 | 0.670 | 0.860 | 0.929 | 0.803 | 0.614 | 0.822 | 0.896 |
| REVEL | 0.825 | 0.637 | 0.701 | 0.915 | 0.744 | 0.555 | 0.662 | 0.908 |
| CADD | 0.725 | 0.448 | 0.672 | 0.787 | 0.675 | 0.352 | 0.658 | 0.729 |
| RF-based model | 0.857 | 0.716 | 0.861 | 0.927 | | | 0.853 | 0.926 |
| XGBoost-based model | 0.859 | 0.719 | 0.863 | | 0.856 | 0.713 | 0.854 | |
| LightGBM-based model | | | | 0.931 | | | | |
Both PolyPhen2 and fathmm have two versions, so this table contains 16 lines for the 14 general-purpose predictors.
No AUC was calculated for PhD-SNP due to lack of continuous prediction scores.
The best value in each column is underlined.
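For reference, the four metrics in the table can be computed as in the sketch below. It uses the standard definitions of ACC, MCC, F1 score and AUC and treats disease-associated nsSNVs as the positive class, which is an assumption; the function names and the example counts in the usage note are hypothetical and are not taken from the paper.

```python
import math


def classification_metrics(tp, fp, tn, fn):
    """ACC, MCC and F1 score from confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "MCC": mcc, "F1": f1}


def auc(pos_scores, neg_scores):
    """AUC as the probability that a positive outscores a negative.

    Ties count as one half. This needs continuous scores, which is why
    a predictor that only outputs class labels has no AUC.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))
```

As a usage example with made-up counts, `classification_metrics(40, 10, 40, 10)` gives an ACC of 0.8, an F1 score of 0.8 and an MCC of 0.6.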
Fig. 2. The feature importance based on RF. The importance is defined as the average gain of splits that use the feature in RF
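The "average gain" definition in the caption can be illustrated with a small sketch. It assumes the splits of a fitted tree ensemble are available as `(feature, gain)` pairs; that input format, the function name and the gain values used in the example are hypothetical, and real libraries expose this information through their own interfaces.

```python
from collections import defaultdict


def average_gain_importance(splits):
    """Mean gain per feature over all splits that use that feature.

    `splits` is an iterable of (feature_name, gain) pairs, one per split
    node across all trees in the ensemble.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for feature, gain in splits:
        totals[feature] += gain
        counts[feature] += 1
    return {feature: totals[feature] / counts[feature] for feature in totals}


# Made-up gains for two of the selected features.
example = [("b9_eva_ree", 4.0), ("b9_eva_ree", 2.0), ("pro_Prec", 1.0)]
importance = average_gain_importance(example)
```

Averaging, rather than summing, keeps a feature that is used in many low-gain splits from outranking one that is used rarely but in highly informative splits.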
Fig. 3. The boxplots of four features in IDR and OR nsSNVs. (A) The relative entropy at the nsSNV site. (B) The score estimating the probability that a gene is a recessive disease gene. (C) The hydrophobicity difference between the mutant and wild-type residue at the nsSNV site. (D) The SPOT-Disorder score of the wild-type residue at the nsSNV site