Ha Young Kim, Woosung Jeon, Dongsup Kim.
Abstract
The development of an accurate and reliable variant effect prediction tool is important for research in human genetic diseases. A large number of predictors have been developed towards this goal, yet many of them suffer from the problem of data circularity. Here we present MTBAN (Mutation effect predictor using the Temporal convolutional network and the Born-Again Networks), a method for predicting the deleteriousness of variants. We apply a form of knowledge distillation known as Born-Again Networks (BAN) to a previously developed deep autoregressive generative model, mutationTCN, to achieve improved performance in variant effect prediction. As the model is fully unsupervised and trained only on the evolutionarily related sequences of a protein, it does not suffer from the data circularity that is common across supervised predictors. When evaluated on a test dataset consisting of deleterious and benign human protein variants, MTBAN shows outstanding predictive ability compared to other well-known variant effect predictors. We also offer a user-friendly web server to predict variant effects using MTBAN, freely accessible at http://mtban.kaist.ac.kr. To our knowledge, MTBAN is the first variant effect prediction tool based on a deep generative model that provides a user-friendly web server for predicting the deleteriousness of variants.
Year: 2021 PMID: 34580383 PMCID: PMC8476491 DOI: 10.1038/s41598-021-98693-3
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. MTBAN model structure. We implemented BAN with mutationTCN as both the teacher and the student network. In the first step, only the teacher network is trained, with the loss function being the label loss (red arrow), which refers to the cross entropy loss between the input sequence and the softmax output distribution of the teacher network. In the second step, only the student network is trained, with the loss being the sum of the label loss (red arrow) and the teacher loss (blue arrow). Here, the label loss refers to the cross entropy loss between the input sequence and the softmax output of the student network. The teacher loss refers to the cross entropy loss between the softmax output of the student network and the “softened” output distribution of the teacher network.
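The student loss described in the caption can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the toy three-letter alphabet, and the use of a softmax temperature to "soften" the teacher output are assumptions made here, since the caption does not say how the softening is performed.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits; temperature > 1 flattens the distribution."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted, eps=1e-12):
    """H(target, predicted) = -sum_k target[k] * log(predicted[k])."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

def one_hot(index, size):
    return [1.0 if k == index else 0.0 for k in range(size)]

def student_loss(student_logits, teacher_logits, residues, temperature=2.0):
    """Return (total, label_loss, teacher_loss), averaged over sequence positions.

    student_logits / teacher_logits: one list of logits per position.
    residues: index of the observed residue at each position (the "label").
    temperature: ASSUMED softening mechanism for the teacher output.
    """
    label_loss = 0.0
    teacher_loss = 0.0
    for s_logits, t_logits, residue in zip(student_logits, teacher_logits, residues):
        student_probs = softmax(s_logits)
        soft_teacher = softmax(t_logits, temperature)
        # Label loss: input sequence (one-hot) vs. student softmax output.
        label_loss += cross_entropy(one_hot(residue, len(s_logits)), student_probs)
        # Teacher loss: softened teacher distribution vs. student softmax output.
        teacher_loss += cross_entropy(soft_teacher, student_probs)
    n = len(residues)
    return (label_loss + teacher_loss) / n, label_loss / n, teacher_loss / n

# Toy usage: two positions over a hypothetical 3-letter alphabet.
total, label, teacher = student_loss(
    student_logits=[[2.0, 0.5, 0.1], [0.2, 1.5, 0.3]],
    teacher_logits=[[1.8, 0.7, 0.2], [0.1, 1.9, 0.4]],
    residues=[0, 1],
)
```

In the first training step, only the label-loss term would be used, with the teacher's own softmax output in place of the student's.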
Test datasets used for evaluation, with the number of deleterious and benign variants in each dataset.
| References | Dataset | Description | ND | NB |
|---|---|---|---|---|
| Grimm et al. | HumVar | Disease-causing mutations from UniProtKB and common single nucleotide polymorphisms with major allele frequency > 1% | 1230 | 1230 |
| | Total | | 1230 | 1230 |
| Mahmood et al. | UniFun | Deleterious and benign variants in UniProt which are derived from functional assays | 25 | 25 |
| | BRCA1-DMS | Deleterious and benign variants derived from deep mutational scanning experiment measuring homology-directed DNA repair and tumor suppression activity | 41 | 41 |
| | TP53-TA | Deleterious and benign variants derived from transactivation assay | 413 | 413 |
| | Total | | 479 | 479 |
| Total | | | 1709 | 1709 |
ND stands for the number of deleterious variants, and NB stands for the number of benign variants.
Figure 2. ROC curves and precision-recall curves for MTBAN and other predictors on the test dataset. (a) MTBAN achieved a ROC-AUC (Receiver Operating Characteristic Area Under Curve) of 0.883, which is the highest among 12 variant effect predictors. (b) MTBAN achieved a PR-AUC (Precision-Recall Area Under Curve) of 0.878, outperforming all other variant effect predictors.
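For readers unfamiliar with the two summary statistics in this caption, the sketch below shows one common way to compute them from predictor scores and binary labels (1 = deleterious, 0 = benign). It is an illustration only: the function names and toy data are hypothetical, score ties are ignored in the precision-recall estimate, and the source does not state which estimator was used for PR-AUC, so average precision is an assumption.

```python
def roc_auc(scores, labels):
    """ROC-AUC as the probability that a randomly chosen deleterious variant
    scores higher than a randomly chosen benign one (ties count as 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Step-wise estimate of PR-AUC: the mean of the precision values
    measured at each rank where a deleterious variant is retrieved."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits

# Hypothetical toy data: higher score = more likely deleterious.
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_roc = roc_auc(toy_scores, toy_labels)            # 8 of 9 pairs ranked correctly
toy_ap = average_precision(toy_scores, toy_labels)   # (1/1 + 2/2 + 3/4) / 3
```

The pairwise loop is quadratic in the number of variants, which is acceptable for a test set of this size; a rank-based formulation would scale better.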
Performances of MTBAN and other predictors on the test dataset consisting of 1709 deleterious and 1709 benign variants.
| Predictor | ROC-AUC | PR-AUC | Accuracy | MCC | Precision | Specificity | Sensitivity | F-score | NPV |
|---|---|---|---|---|---|---|---|---|---|
| MTBAN | **0.883** | **0.878** | | | 0.739 | 0.686 | 0.887 | | 0.859 |
| mutationTCN | 0.873 | 0.87 | 0.763 | 0.548 | 0.706 | 0.624 | 0.902 | 0.792 | 0.865 |
| SIFT | 0.856 | 0.861 | 0.77 | 0.55 | 0.728 | 0.671 | 0.868 | 0.792 | 0.833 |
| MutationAssessor | 0.855 | 0.849 | 0.763 | 0.535 | 0.722 | 0.686 | 0.843 | 0.778 | 0.819 |
| PolyPhen-2 | 0.853 | 0.856 | 0.759 | 0.537 | 0.703 | 0.637 | 0.885 | 0.783 | 0.851 |
| fathmm-MKL | 0.844 | 0.812 | 0.743 | 0.518 | 0.681 | 0.567 | | 0.782 | |
| phyloPᵃ | 0.836 | 0.838 | 0.753 | 0.532 | | | 0.602 | 0.71 | 0.693 |
| DANN | 0.814 | 0.775 | 0.753 | 0.51 | 0.722 | 0.68 | 0.825 | 0.77 | 0.794 |
| phastConsᵇ | 0.789 | 0.829 | 0.749 | 0.506 | 0.711 | 0.657 | 0.84 | 0.77 | 0.803 |
| GERP++ | 0.778 | 0.74 | 0.714 | 0.435 | 0.757 | 0.795 | 0.635 | 0.69 | 0.684 |
| MPC | 0.772 | 0.762 | 0.68 | 0.369 | 0.73 | 0.772 | 0.591 | 0.653 | 0.644 |
| GenoCanyon | 0.742 | 0.748 | 0.657 | 0.323 | 0.626 | 0.53 | 0.783 | 0.696 | 0.708 |
Since the score cutoffs for phyloP, DANN, phastCons, GERP++, MPC, and GenoCanyon were not provided by dbNSFP, we computed the cutoffs for each predictor using the Humsavar database (release 03/2021) as described in the “Methods” section. The highest values for each evaluation metric are indicated in bold.
ROC-AUC, Receiver Operating Characteristic Area Under Curve; PR-AUC, Precision-Recall Area Under Curve; MCC, Matthews Correlation Coefficient; NPV, Negative Predictive Value.
ᵃ PhyloP100way_vertebrate from dbNSFP.
ᵇ PhastCons100way_vertebrate from dbNSFP.
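The threshold-dependent metrics in the table above all derive from the four confusion-matrix counts obtained after applying a score cutoff. The sketch below uses their standard definitions; the function name and the example counts are hypothetical and are not taken from the paper.

```python
import math

def classification_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp: deleterious variants predicted deleterious
    fp: benign variants predicted deleterious
    tn: benign variants predicted benign
    fn: deleterious variants predicted benign
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    npv = tn / (tn + fn)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "mcc": mcc,
        "precision": precision,
        "specificity": specificity,
        "sensitivity": sensitivity,
        "f_score": f_score,
        "npv": npv,
    }

# Hypothetical counts for a balanced set of 100 deleterious and 100 benign variants.
example = classification_metrics(tp=90, fp=30, tn=70, fn=10)
```

On a class-balanced test set in which every variant receives a prediction, accuracy equals the mean of sensitivity and specificity, which gives a quick consistency check across the columns.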