Literature DB >> 33492342

Calibrating variant-scoring methods for clinical decision making.

Silvia Benevenuta, Emidio Capriotti, Piero Fariselli.

Abstract

Identifying pathogenic variants and annotating them is a major challenge in human genetics, especially for non-coding variants. Several tools have been developed and used to predict the functional effect of genetic variants. However, the calibration assessment of the predictions has received little attention. Calibration refers to the idea that if a model predicts a group of variants to be pathogenic with a probability P, the same fraction P of true positives is expected in the observed set. For instance, a well-calibrated classifier should label the variants such that, among the ones to which it gave a probability value close to 0.7, approximately 70% actually belong to the pathogenic class. Poorly calibrated algorithms can be misleading and potentially harmful for clinical decision-making.

Supplementary information: Supplementary data are available at Bioinformatics online.
© The Author(s) 2021. Published by Oxford University Press.


Year:  2021        PMID: 33492342      PMCID: PMC8023678          DOI: 10.1093/bioinformatics/btaa943

Source DB:  PubMed          Journal:  Bioinformatics        ISSN: 1367-4803            Impact factor:   6.937


1 Introduction

One of the main challenges in human genetics is to predict the functional effect of genetic variants (Capriotti et al., 2019; Lappalainen et al., 2019). Knowing whether or not a variant is potentially pathogenic can lead to better diagnoses and more effective treatment strategies, with a significant impact in clinical settings. In their day-to-day practice, physicians can rely on different tools to estimate the impact of a variant, but selecting the most appropriate one is not an easy task. A common strategy for identifying the most reliable method is to read articles reporting the results of critical assessment experiments. Unfortunately, in many cases the evaluation metrics do not include model calibration (Cheng et al., 2019; Drubay et al., 2018). Even tools with good discrimination power may be unreliable if they are uncalibrated (Van Calster et al., 2016, 2019). Calibration refers to the concept that if we take a group of variants, all predicted by a 'calibrated' model to be pathogenic with a probability score P (e.g. 0.7), the fraction of truly pathogenic variants in that group is expected to be exactly P (70% true positives in the observed set). A recent review found that calibration is assessed far less often than discrimination (Christodoulou et al., 2019), which is problematic since poor calibration can make predictions misleading (Van Calster and Vickers, 2015). Because of its high impact on the interpretability of predictions, calibration has been called the Achilles heel of predictive analytics (Shah et al., 2019; Van Calster et al., 2019). The TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) guidelines for prediction modelling studies recommend reporting calibration performance (Collins et al., 2015). When predictions are used to support diagnostic and prognostic decision-making, calibration is even more relevant, as observed for cancer prediction models (Yala et al., 2019).

In this article, we evaluate the calibration of state-of-the-art methods for scoring the impact of variants: CADD (Kircher et al., 2014), DANN (Quang et al., 2015), Eigen (Ionita-Laza et al., 2016), DeepSea (Zhou and Troyanskaya, 2015), FATHMM-MKL (Shihab et al., 2015) and PhD-SNPg (Capriotti and Fariselli, 2017). We computed the calibration curves and the Brier scores (Brier, 1950) of the six tools on a dataset of 2066 single-nucleotide variants, both coding and non-coding. We observed that the top classifiers are not necessarily well calibrated and may lead to an incorrect interpretation of the functional effect of a genetic variant.

2 AUC performance of the different predictors

One of the most commonly used metrics for assessing the performance of classification methods is the AUC-ROC (Area Under the Receiver Operating Characteristic Curve). The ROC curve is the plot of the true-positive rate (TPR) against the false-positive rate (FPR) at various threshold settings, and it illustrates the discrimination ability of a binary classifier as its discrimination threshold varies (Supplementary Materials Section S5). An ideal classifier would have an AUC-ROC of 1, while a completely random classifier would have an AUC-ROC of 0.5. AUC is an efficient way to reject tools that fail to differentiate between pathogenic and benign variants. From Figure 1A, we can see that all predictors perform quite well as discriminators on the selected dataset (all the predictions of the methods are reported in Supplementary File S1). However, none of the predictors has been validated for its calibration. Using an ill-calibrated classifier could lead to an incorrect interpretation of the functional effect of a genetic variant (its effect could be over- or under-estimated).
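As an illustration only (a minimal sketch on hypothetical data, not the study's benchmark), the AUC-ROC and the points of the ROC curve can be computed with scikit-learn:

    import numpy as np
    from sklearn.metrics import roc_auc_score, roc_curve

    # Hypothetical stand-in data: 1 = pathogenic, 0 = benign;
    # `scores` plays the role of one predictor's outputs.
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=2066)
    scores = np.clip(0.4 * y_true + rng.normal(0.3, 0.25, size=2066), 0.0, 1.0)

    fpr, tpr, thresholds = roc_curve(y_true, scores)  # ROC curve points
    auc = roc_auc_score(y_true, scores)               # area under the curve
    print(f"AUC-ROC: {auc:.3f}")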
Fig. 1.

(A) ROC curves of PhD-SNPg, FATHMM-MKL, CADD, DANN and Eigen on the complete dataset (both coding and non-coding variants). DeepSea has been evaluated only on the subset of non-coding variants, since it was developed only to score them. AUCs for coding and non-coding variants are reported in Supplementary Table S2. True- and false-positive rates are defined in Supplementary Materials. (B) Calibration curves of the predictors on coding and non-coding variants. CADD and Eigen scores have been transformed using a sigmoid, 1/(1 + exp(-A*x + B)). The best parameters were A = 1, B = 2.5 for CADD and A = 1, B = 1.63 (coding) / 0.05 (non-coding) for Eigen (coding and non-coding variants were transformed separately, since Eigen provides two different sets of scores)

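As a small sketch of the transformation described in the caption (the functional form 1/(1 + exp(-A*x + B)) is our reading of the formula, and the raw score values are hypothetical examples):

    import numpy as np

    def sigmoid_transform(x, A, B):
        # 1 / (1 + exp(-A*x + B)): maps a raw score x to a value in (0, 1)
        return 1.0 / (1.0 + np.exp(-A * x + B))

    # Hypothetical raw scores, mapped with the parameters from the caption
    raw_cadd = np.array([0.5, 2.0, 5.0])
    raw_eigen = np.array([-0.3, 0.4, 1.2])
    cadd_prob = sigmoid_transform(raw_cadd, A=1.0, B=2.5)
    eigen_prob_coding = sigmoid_transform(raw_eigen, A=1.0, B=1.63)
    eigen_prob_noncoding = sigmoid_transform(raw_eigen, A=1.0, B=0.05)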

3 Calibration evaluation

A standard way to examine whether or not a predictor is calibrated is to plot its calibration curve or to compute its Brier score (Supplementary Materials). The calibration curve shows whether the predicted probabilities agree with the observed class frequencies: if the curve lies on the diagonal, the predictor is perfectly calibrated and requires no further investigation, while deviation from the diagonal indicates miscalibration. The Brier score is a numerical value ranging from zero (perfect calibration) to one (totally uncalibrated). To evaluate the calibration of a method that returns a probability score, we compared its outputs with the observed class frequency. For Eigen and CADD, which provide only raw scores, we transformed their outputs using an optimal sigmoid function (Fig. 1B). From Figure 1A and B, we observed that, despite showing similar AUCs, the tested tools have significantly different calibration curves. Indeed, PhD-SNPg is the best-calibrated method, while DeepSea and DANN produced the least-calibrated predictions. However, all the presented methods can be calibrated using isotonic regression, which transforms the output of an uncalibrated classifier into a well-calibrated one (Niculescu-Mizil and Caruana, 2005). The effect of this transformation is reported in Table 1 (and Supplementary Fig. S4), where the isotonic-regression mapping is computed using a 10-fold cross-validation procedure. Cross-validation is necessary to evaluate the calibration on never-seen-before data (with at least 500-1000 datapoints). Sigmoid calibration, although it requires very few datapoints, was less effective, and not all the methods could be calibrated with it (Supplementary Figs S6 and S7).
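As an illustration (a minimal sketch, not the authors' code), both diagnostics can be computed with scikit-learn; y_true and probs below are hypothetical stand-ins for the benchmark labels and one tool's probability outputs:

    import numpy as np
    from sklearn.calibration import calibration_curve
    from sklearn.metrics import brier_score_loss

    # Hypothetical labels (1 = pathogenic) and predicted probabilities
    rng = np.random.default_rng(0)
    y_true = rng.integers(0, 2, size=2066)
    probs = np.clip(0.5 * y_true + rng.normal(0.25, 0.2, size=2066), 0.0, 1.0)

    # Observed positive fraction per bin vs. mean predicted probability per bin;
    # a calibrated predictor gives prob_true close to prob_pred in every bin.
    prob_true, prob_pred = calibration_curve(y_true, probs, n_bins=10)

    # Brier score: mean squared difference between predictions and outcomes
    bs = brier_score_loss(y_true, probs)
    print(f"Brier score: {bs:.3f}")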
Table 1.

Brier scores of the methods on the dataset

Predictor     BS Coding    BS Non-coding    BS All
PhD-SNPg      0.10/0.10    0.03/0.03        0.07/0.07
DANN          0.24/0.09    0.27/0.05        0.25/0.07
FATHMM        0.17/0.15    0.07/0.04        0.14/0.12
DeepSea       -            0.43/0.08        -
Eigen (a)     0.14/0.07    0.06/0.04        0.11/0.06
CADD (a)      0.06/0.05    0.04/0.03        0.05/0.05

Note: Brier scores (BS) of the methods, reported as before/after isotonic calibration.

(a) Uncalibrated scores for Eigen and CADD are obtained after sigmoid transformation.


4 Conclusion

Despite showing comparable AUCs, different methods may have significantly different calibration curves. Usually, the AUC is taken as the only criterion to assess the validity of a model; thus, a model may be chosen without checking its calibration, and its scores may nonetheless be interpreted as a measure of the 'pathogenicity' of the variants. This assumption can lead to an incorrect interpretation of the functional effects and of their probabilistic meaning. According to our analysis, we suggest that users select a method based on both classification and calibration performance. In particular, for end-users who do not want to post-process the predictor outputs, we suggest PhD-SNPg as an off-the-shelf method that is both accurate and naturally calibrated (Fig. 1). For developers and expert users who prefer other tools (such as CADD or FATHMM), we recommend calibrating the predictor before use. The calibration can be performed with suitable software such as the scikit-learn (Pedregosa et al., 2011) calibration suite, which transforms the predictor outputs as shown in Supplementary Materials (Supplementary Figs S4, S14-S19).
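As a minimal sketch of such a recalibration (our illustration, not the paper's exact implementation), the 10-fold out-of-fold isotonic mapping described in Section 3 could be written with scikit-learn as follows; scores and labels are assumed to be NumPy arrays of predictor outputs and 0/1 pathogenicity labels:

    import numpy as np
    from sklearn.isotonic import IsotonicRegression
    from sklearn.model_selection import KFold

    def isotonic_recalibrate(scores, labels, n_splits=10):
        # Fit the isotonic mapping on the training folds and apply it to the
        # held-out fold, so every calibrated score is produced on
        # never-seen-before data.
        calibrated = np.empty_like(scores, dtype=float)
        kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in kf.split(scores):
            iso = IsotonicRegression(out_of_bounds="clip")
            iso.fit(scores[train_idx], labels[train_idx])
            calibrated[test_idx] = iso.predict(scores[test_idx])
        return calibrated

When the uncalibrated model itself, rather than precomputed scores, is available, scikit-learn's CalibratedClassifierCV offers the same isotonic and sigmoid options with built-in cross-validation.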
References (18 in total; first 10 shown)

1.  A calibration hierarchy for risk models was defined: from utopia to empirical data.

Authors:  Ben Van Calster; Daan Nieboer; Yvonne Vergouwe; Bavo De Cock; Michael J Pencina; Ewout W Steyerberg
Journal:  J Clin Epidemiol       Date:  2016-01-06       Impact factor: 6.437

2.  Calibration of risk prediction models: impact on decision-analytic performance.

Authors:  Ben Van Calster; Andrew J Vickers
Journal:  Med Decis Making       Date:  2014-08-25       Impact factor: 2.583

3.  DANN: a deep learning approach for annotating the pathogenicity of genetic variants.

Authors:  Daniel Quang; Yifei Chen; Xiaohui Xie
Journal:  Bioinformatics       Date:  2014-10-22       Impact factor: 6.937

4.  A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models.

Authors:  Evangelia Christodoulou; Jie Ma; Gary S Collins; Ewout W Steyerberg; Jan Y Verbakel; Ben Van Calster
Journal:  J Clin Epidemiol       Date:  2019-02-11       Impact factor: 6.437

5.  A benchmark study of scoring methods for non-coding mutations.

Authors:  Damien Drubay; Daniel Gautheret; Stefan Michiels
Journal:  Bioinformatics       Date:  2018-05-15       Impact factor: 6.937

6. (Review) Genomic Analysis in the Age of Human Genome Sequencing.

Authors:  Tuuli Lappalainen; Alexandra J Scott; Margot Brandt; Ira M Hall
Journal:  Cell       Date:  2019-03-21       Impact factor: 41.582

7.  A Deep Learning Mammography-based Model for Improved Breast Cancer Risk Prediction.

Authors:  Adam Yala; Constance Lehman; Tal Schuster; Tally Portnoi; Regina Barzilay
Journal:  Radiology       Date:  2019-05-07       Impact factor: 11.105

8.  An integrative approach to predicting the functional effects of non-coding and coding sequence variation.

Authors:  Hashem A Shihab; Mark F Rogers; Julian Gough; Matthew Mort; David N Cooper; Ian N M Day; Tom R Gaunt; Colin Campbell
Journal:  Bioinformatics       Date:  2015-01-11       Impact factor: 6.937

9.  A spectral approach integrating functional genomic annotations for coding and noncoding variants.

Authors:  Iuliana Ionita-Laza; Kenneth McCallum; Bin Xu; Joseph D Buxbaum
Journal:  Nat Genet       Date:  2016-01-04       Impact factor: 38.330

10. (Review) Integrating molecular networks with genetic variant interpretation for precision medicine.

Authors:  Emidio Capriotti; Kivilcim Ozturk; Hannah Carter
Journal:  Wiley Interdiscip Rev Syst Biol Med       Date:  2018-12-12
Cited by (3 in total)

1.  Evaluating the relevance of sequence conservation in the prediction of pathogenic missense variants.

Authors:  Emidio Capriotti; Piero Fariselli
Journal:  Hum Genet       Date:  2022-01-31       Impact factor: 5.881

2.  TADA-a machine learning tool for functional annotation-based prioritisation of pathogenic CNVs.

Authors:  Jakob Hertzberg; Stefan Mundlos; Martin Vingron; Giuseppe Gallone
Journal:  Genome Biol       Date:  2022-03-01       Impact factor: 13.583

3.  An objective framework for evaluating unrecognized bias in medical AI models predicting COVID-19 outcomes.

Authors:  Hossein Estiri; Zachary H Strasser; Sina Rashidian; Jeffrey G Klann; Kavishwar B Wagholikar; Thomas H McCoy; Shawn N Murphy
Journal:  J Am Med Inform Assoc       Date:  2022-07-12       Impact factor: 7.942

