Clinical significance of genetic variation in hypertrophic cardiomyopathy: comparison of computational tools to prioritize missense variants.

Pedro Barbosa^1,2, Marta Ribeiro^3, Maria Carmo-Fonseca^2, Alcides Fonseca^1,4.

Abstract

Hypertrophic cardiomyopathy (HCM) is a common heart disease associated with sudden cardiac death. Early diagnosis is critical to identify patients who may benefit from implantable cardioverter defibrillator therapy. Although genetic testing is an integral part of the clinical evaluation and management of patients with HCM and their families, in many cases the genetic analysis fails to identify a disease-causing mutation. This is in part due to difficulties in classifying newly detected rare genetic variants as well as variants-of-unknown-significance (VUS). Multiple computational algorithms have been developed to predict the potential pathogenicity of genetic variants, but their relative performance in HCM has not been comprehensively assessed. Here, we compared the performance of 39 currently available prediction tools in distinguishing between high-confidence HCM-causing missense variants and benign variants, and we developed an easy-to-use tool to perform variant prediction benchmarks based on annotated VCF files (VETA). Our results show that tool performance increases after HCM-specific calibration of thresholds. After excluding potential biases due to type I circularity, we identified ClinPred, MISTIC, FATHMM, MPC and MetaLR as the five best-performing tools in discriminating HCM-associated variants. We propose combining these tools in order to prioritize unknown HCM missense variants that should be closely followed up in the clinic.

Keywords: computational pathogenicity prediction; genetic testing; hypertrophic cardiomyopathy; missense variant interpretation; prediction tool comparison; variants-of-unknown-significance

Year: 2022 PMID: 36061567 PMCID: PMC9433717 DOI: 10.3389/fcvm.2022.975478

Source DB: PubMed Journal: Front Cardiovasc Med ISSN: 2297-055X

Introduction

Familial hypertrophic cardiomyopathy (HCM) is the most common inherited heart disease and one of the leading causes of sudden cardiac death in younger people (1) and athletes (2). The estimated prevalence of HCM is at least 1 in 500 individuals in the general population (3). Access to more sensitive imaging methods and advanced genetic testing has improved the diagnostic rate, and a more recent study revealed that 1 in 200 people may be affected (4). In some cases, sudden cardiac death is the first manifestation of HCM, particularly in younger individuals (5–7). Thus, it is important to implement prevention strategies that involve screening, monitoring and counseling HCM patients and their families (8). HCM is morphologically characterized by increased left ventricular wall thickness in the absence of abnormal loading conditions (9). In the majority of familial HCM patients, the disease is caused by mutations in any of the following eight sarcomeric genes: MYBPC3, MYH7, TNNT2, TPM1, MYL2, MYL3, TNNI3, and ACTC1 (10). Although advances in high-throughput sequencing led to an exponential increase in the number of genes proposed to be associated with HCM, in many cases there is no robust evidence supporting a causative link between these additional genes and the disease (10). Nevertheless, screening of extended gene panels is recommended, including genes associated with other disorders such as inherited metabolic and neuromuscular diseases that may mimic the clinical features of HCM (11). Genetic testing has become an integral part of the clinical evaluation and management of patients with HCM (12, 13). Detection of a mutation known to be causative of the disease in the index patient is followed by family genetic cascade testing in order to identify which family members do or do not carry the mutation. This makes it possible to rule out disease risk in non-carrier individuals and to implement primary prevention strategies in individuals with a pre-symptomatic genetic diagnosis (9).
With contemporary disease management, approximately two thirds of patients with HCM have a normal life expectancy without significant morbidity, while a subset requires symptomatic therapies for heart failure (14). In some patients, however, no causative mutation is identified. A recent systematic review and meta-analysis revealed a mutation detection rate of 33–43% in adult HCM cohorts and 52–78% in pediatric HCM cohorts (15). The detection rates for adult cohorts with a positive family history of HCM were significantly higher compared with apparently sporadic cases, whereas in pediatric cohorts the detection rate was similar irrespective of family history (15). Approximately 40% of HCM patients were reported as presenting a non-familial subtype for which the underlying mechanism remains unknown (16, 17). In addition to ambiguous gene associations, difficulties in classifying variants in “core” HCM genes limit the impact of genetic testing in clinical practice (4). Determining which genetic variants detected in HCM-associated genes are pathogenic relies on a set of functional (molecular) and clinical criteria that have been defined by the American College of Medical Genetics and Genomics and the Association for Molecular Pathology (ACMG/AMP) (18–20). In many cases, available information is insufficient to classify a variant as benign/likely benign or pathogenic/likely pathogenic. This results in a large group of so-called “variants-of-unknown-significance” (VUS), the interpretation of which is extremely challenging. The advent of whole-exome and whole-genome databases revealed that many variants previously associated with cardiomyopathies were more likely benign, as their population frequencies were incompatible with the prevalence of disease (21, 22). These observations prompted the development of disease-specific approaches to assist decisions on which variants should be considered in clinical practice.
Rigorous curation efforts that assess all available lines of evidence for HCM-association are now available, such as the SHaRe registry (https://theshareregistry.org/), which comprises genetic data and cardiac morphofunctional parameters for >9,000 HCM patients. When a VUS or a previously unseen genetic variant is identified in an HCM-causing gene, computational approaches can aid by making a prediction of potential pathogenicity. Over the last decade, many different algorithms and tools have been proposed, but their performance is not consistent across different independent benchmarks (23–31). Additionally, most of these studies evaluate prediction tools on datasets that incorporate variants from multiple disease phenotypes, which may compromise their performance for a specific disease (27, 32, 33). In this study, we used three distinct datasets comprising high-confidence HCM-causing missense variants and we developed a dedicated computational framework (VETA) to perform a comprehensive analysis of currently available prediction algorithms. We found that ClinPred, MISTIC, FATHMM, MPC and MetaLR are the five tools that most accurately and reliably distinguish between benign and HCM-causing missense variants.

Methods

Prediction scores

When available, pre-computed prediction scores were obtained directly from each tool website. Alternatively, dbNSFP v4.0b1 (34) was used. The UCSC genome browser was used to access conservation scores (35). Because some tools do not provide scores for the latest genome build, the GRCh37 version was used to include a more comprehensive number of tools in the analysis. Annotation of VCF scores was performed with Ensembl Variant Effect Predictor (VEP) v105 (36) using custom plugins or with vcfanno v0.3.3 (37). Reference tool thresholds were obtained from different sources according to the following priority. First, we followed a recent computational approach that calibrates missense variants' thresholds to different levels of pathogenicity evidence according to the ACMG/AMP guidelines (33). Whenever reported in this dataset, we used the least conservative value in the threshold range with “Supporting” evidence for Pathogenicity (PP3). Alternatively, we used the threshold value provided in the original publication, or indicated by the authors elsewhere (e.g., online repository for the tool data). If not available, we included threshold values indicated in other studies that use the tool (e.g., benchmark paper).

Datasets

ClinVar HCM

The ClinVar v20220403 database (38) was used. To select HCM-associated variants, we filtered the dataset using a combination of three disease ontologies by keeping variants with any of the following identifiers: MedGen (39) (C3495498, C0949658); OMIM (40) (192600); Mondo Disease Ontology (41) (0005045, 0024573). Additionally, we excluded all variants with zero-star review status, classified as “Uncertain_significance” or with conflicting interpretations of pathogenicity. Finally, missense variants were selected by inspecting the “Consequence” field of VEP annotations. The resulting HCM ClinVar dataset consisted of 471 missense variants (278 Pathogenic/Likely_pathogenic, 193 Benign/Likely_benign; Supplementary Table S1). For the variant distribution analysis depicted in Figure 1, the VCF field GENEINFO was used to assign the gene name. Genes with <5 variants were grouped as “Other”. Variant categories were extracted from the VCF MC field. For the cases where MC was empty, the Consequence field from VEP annotations was used instead. “inframe_deletion” and “inframe_insertion” ontologies were generalized to “Inframe indel”. Splice site and “intron_variant” annotations were combined into one single category “Splice site/Intron”.
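The selection rules above can be illustrated with a minimal sketch. It assumes each ClinVar record has already been parsed into a dictionary; the keys `disease_ids`, `review_status`, `clnsig` and `consequence` are hypothetical names chosen for illustration, and the zero-star review status strings are assumed to follow the ClinVar VCF spelling. It is not the authors' implementation.

```python
# Minimal sketch of the ClinVar HCM selection. The record layout is hypothetical;
# a real pipeline would parse these values from the annotated ClinVar VCF.
HCM_IDS = {
    "MedGen:C3495498", "MedGen:C0949658",
    "OMIM:192600",
    "MONDO:0005045", "MONDO:0024573",
}
# Review statuses assumed to correspond to zero stars.
ZERO_STAR = {
    "no_assertion_criteria_provided",
    "no_assertion_provided",
    "no_interpretation_for_the_single_variant",
}
# Uncertain and conflicting classifications are excluded by omission.
KEPT_LABELS = {
    "Pathogenic", "Likely_pathogenic", "Pathogenic/Likely_pathogenic",
    "Benign", "Likely_benign", "Benign/Likely_benign",
}


def keep_variant(rec):
    """Return True if a record passes the HCM, review-status, label and missense filters."""
    if not HCM_IDS & set(rec["disease_ids"]):
        return False
    if rec["review_status"] in ZERO_STAR:
        return False
    if rec["clnsig"] not in KEPT_LABELS:
        return False
    # VEP joins multiple consequence terms with "&".
    return "missense_variant" in rec["consequence"].split("&")
```

Applying `keep_variant` to every record and counting the survivors by label would give the pathogenic and benign totals for a dataset built this way.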

Figure 1

Distribution of HCM-associated variants (Pathogenic/Likely pathogenic) with a review status of > 1 star in ClinVar (N = 768). (A) Number and proportion of overall variants per gene. (B) Number and proportion of overall variants per category. (C) Category of variants located in the MYH7 gene. (D) Category of variants located in the MYBPC3 gene.

SHaRe cohort

Data was directly accessed from (17). Variants with Pathogenic/Likely pathogenic assignments that were absent from the ClinVar dataset were selected. Records with allele mismatch after running Ensembl VEP were excluded. The majority of remaining variants refer to either missense or splice region variants. Missense variants were selected as described above. After this selection, 93 HCM-associated missense variants were considered for further analysis (Supplementary Table S2).

Walsh_2017

This dataset was generated from results in (22). In this study, the following genes showed a significant excess of rare variation in patients compared to control individuals: ACTC1, FHL1, GLA, MYBPC3, MYH7, MYL2, MYL3, PRKAG2, TNNI3, TNNT2, TPM1. We selected the variants considered Pathogenic/Likely_pathogenic, excluding those with conflicting interpretations between different labs. A final set of 103 missense variants was obtained (Supplementary Table S3).

gnomAD

gnomAD v2.1 (42) was used to identify benign variants for a balanced evaluation of the SHaRe and Walsh_2017 datasets. Common variants were selected based on a minor allele frequency (MAF) threshold of 0.001 (MAF > 0.1%). We restricted our analysis to variants located in 62 genes associated with cardiomyopathies (https://www.ncbi.nlm.nih.gov/gtr/tests/509149/). From a total of 110,762 variants, 709 missense variants were further selected. Hits in the TTN gene were discarded to avoid over-representation of missense variants from a single gene. Finally, we checked that the remaining 356 variants did not overlap with any variant from the other datasets, or with variants from the whole ClinVar database with any of the following assignments: “Pathogenic”, “Likely pathogenic”, “Pathogenic/Likely Pathogenic”, “Uncertain significance”, or “Conflicting interpretations of pathogenicity”. The final set comprised 220 variants, which were randomly split into two non-overlapping groups of 100 variants, called benign_set1 (used to compare with the SHaRe HCM dataset; Supplementary Table S2) and benign_set2 (used to compare with the Walsh_2017 HCM dataset; Supplementary Table S3).
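As a rough illustration of the benign set construction, the sketch below applies the frequency, gene and consequence filters and then draws two non-overlapping groups. The record keys (`af`, `gene`, `consequence`), the function names and the fixed seed are assumptions made for this example. The text does not say how variants beyond the 2 × 100 drawn were handled, so here they are simply left unused.

```python
import random


def is_candidate(rec, panel_genes, maf_threshold=0.001):
    """Common missense variant in a panel gene, excluding TTN (hypothetical record keys)."""
    return (
        rec["af"] > maf_threshold
        and rec["gene"] in panel_genes
        and rec["gene"] != "TTN"
        and rec["consequence"] == "missense_variant"
    )


def split_benign(variants, group_size=100, seed=0):
    """Randomly draw two non-overlapping groups of `group_size` variants."""
    rng = random.Random(seed)  # fixed seed only so the sketch is reproducible
    picked = rng.sample(list(variants), 2 * group_size)
    return picked[:group_size], picked[group_size:]
```

Because `random.sample` draws without replacement, the two returned groups cannot share a variant as long as the input contains no duplicates.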

Performance metrics

For each tool, a confusion matrix was constructed that measures the number of True Positives (TP), True Negatives (TN), False Positives (FP) and False Negatives (FN). TP refers to the number of pathogenic variants that a tool correctly predicts to be pathogenic (e.g. above the reference threshold). TN is the number of benign variants that a tool correctly predicts to be benign; FP indicates the number of benign variants that a tool predicts to be pathogenic; FN is the number of pathogenic variants that a tool predicts to be benign. Tools were ranked using a small variation of the Matthews correlation coefficient, $\mathrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)(TP+FN)(TN+FP)(TN+FN)}}$. To account for the magnitude of missing predictions, MCC values were normalized to range from 0 to 1, $\mathrm{normalized\ MCC} = \frac{\mathrm{MCC} + 1}{2}$, and weighted by the fraction of variants for which a tool gives predictions (tool coverage). Throughout the manuscript we call this metric weighted normalized MCC, which corresponds to weighted norm MCC = coverage × normalized MCC. In addition, tools were ranked based on Receiver Operating Characteristic (ROC) curves, which, as opposed to weighted normalized MCC, evaluate performance at multiple threshold values. ROC curves were created by plotting the Sensitivity (also known as True Positive Rate or Recall) against 1 − Specificity (also known as False Positive Rate) at several different thresholds. Sensitivity = TP/(TP + FN); Specificity = TN/(TN + FP). For each tool, scores were transformed based on their rank so that they ranged between 0 and 1 (in the minority of tools where the values below a threshold are considered pathogenic, we inverted the signal accordingly). The area under the ROC curve (auROC) was used as the summary statistic.
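The weighted normalized MCC can be sketched in a few lines. The example assumes the standard MCC definition and the usual (MCC + 1)/2 rescaling to the 0 to 1 range; missing predictions are represented as `None`, and the function names are illustrative rather than taken from VETA.

```python
import math


def confusion(scores, labels, threshold, higher_is_pathogenic=True):
    """Count TP, TN, FP, FN, skipping variants without a score (None)."""
    tp = tn = fp = fn = 0
    for score, is_pathogenic in zip(scores, labels):
        if score is None:
            continue
        if higher_is_pathogenic:
            predicted = score > threshold
        else:
            predicted = score < threshold  # e.g. tools where low scores are damaging
        if predicted and is_pathogenic:
            tp += 1
        elif predicted:
            fp += 1
        elif is_pathogenic:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


def mcc(tp, tn, fp, fn):
    """Standard Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def weighted_norm_mcc(scores, labels, threshold, higher_is_pathogenic=True):
    """coverage * (MCC + 1) / 2, where coverage is the fraction of scored variants."""
    tp, tn, fp, fn = confusion(scores, labels, threshold, higher_is_pathogenic)
    coverage = (tp + tn + fp + fn) / len(scores)
    return coverage * (mcc(tp, tn, fp, fn) + 1) / 2
```

Under this definition, a tool that classifies every scored variant correctly but scores only three out of four variants obtains 0.75 rather than 1.0, which is how missing predictions pull a tool down the ranking.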

Automated analysis using VETA

Most of the analyses in the manuscript were performed with Variant prEdiction Tools evAluation (VETA), a general tool we developed to benchmark variant predictors. Briefly, VETA takes annotated VCF files from Ensembl VEP (36) as input (in this analysis we specifically set --hgvs, --per_gene, --pick_order ccds,canonical,biotype,rank, --no_intergenic and --gencode_basic) and automatically compares tool performance at different levels. It allows predictions to be evaluated according to the variant type (e.g., SNVs, indels), variant location (e.g., exons, introns) and scope of the tool (e.g., separate analysis for missense and splicing predictors). In addition, VETA is particularly suited to deal with ClinVar data since it incorporates methods to filter variants according to review status and/or the desired phenotype. Furthermore, VETA is able to inspect whether reference thresholds are appropriate, and allows combination of scores from multiple tools to create meta-predictors using standard Machine Learning algorithms. By default, VETA has native support for more than 50 predictors, but it also allows users to include custom tools through a configuration file. Detailed documentation is available at https://github.com/PedroBarbosa/VETA, where instructions for easy installation are provided. Of note, VETA depends on cyvcf2 (43) for VCF parsing, on hgvs (44) for parsing HGVS expressions, on Scikit-learn for ROC curve analysis (45) and on seaborn for plot generation (46). Importantly, VETA does not run any prediction tool but rather receives as input VCF files annotated with prediction scores.

Threshold analysis

To evaluate whether published reference thresholds were appropriate, we measured the performance of each tool using a set of 100 threshold values uniformly distributed over the observed range of scores. The best thresholds were obtained based on the F-Beta formula, $F_\beta = (1+\beta^2) \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2 \cdot \mathrm{Precision} + \mathrm{Recall}}$, where Precision = TP/(TP + FP). This is similar to the commonly used F1 score, but allows the balance between precision and recall to be weighted using the β parameter (when β = 1, it is equal to the F1 score). In this study, we used β = 0.5, 1 and 1.5. For each β value, the threshold that maximized the Fβ function was selected. Higher β values favor sensitivity over precision, which translates into higher recall rates at the cost of increasing false positives. Conversely, lower β values favor precision, at the cost of increasing false negatives. We additionally performed a bootstrapping procedure to evaluate how reliable the adjusted threshold is. For each tool, we generated 1,000 bootstrap samples of the same size as the data sample and with the same ratio of pathogenic/benign variants as in the original dataset. For each bootstrap sample we derived the best threshold (as described above). Then, we computed the 0.025 and 0.975 quantiles of the distribution of the bootstrap sample statistic (distribution of best thresholds). We used these values to interrogate in which threshold range 95% of the bootstrap sample statistic lies, and how wide/narrow this interval is with respect to the adjusted threshold originally obtained.
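The threshold scan and bootstrap can be sketched as follows. This is an illustration under stated assumptions rather than the VETA code: it uses the standard F-beta definition written in terms of TP, FP and FN, assumes higher scores mean more pathogenic, resamples pathogenic and benign variants separately to preserve their ratio, and takes quantiles with a simple nearest-rank rule.

```python
import random


def fbeta(tp, fp, fn, beta):
    """F-beta from counts: (1 + b^2) TP / ((1 + b^2) TP + b^2 FN + FP)."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def best_threshold(scores, labels, beta=1.0, n_steps=100):
    """Scan n_steps evenly spaced thresholds over the score range and keep the best."""
    lo, hi = min(scores), max(scores)
    best_t, best_f = lo, -1.0
    for i in range(n_steps):
        t = lo + (hi - lo) * i / (n_steps - 1)
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y)
        f = fbeta(tp, fp, fn, beta)
        if f > best_f:  # ties keep the lowest threshold
            best_t, best_f = t, f
    return best_t


def bootstrap_interval(scores, labels, beta=1.0, n_boot=1000, seed=0):
    """2.5% and 97.5% quantiles of the best threshold over stratified bootstrap samples."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    thresholds = []
    for _ in range(n_boot):
        boot_pos = rng.choices(pos, k=len(pos))  # sampling with replacement
        boot_neg = rng.choices(neg, k=len(neg))
        boot_labels = [1] * len(boot_pos) + [0] * len(boot_neg)
        thresholds.append(best_threshold(boot_pos + boot_neg, boot_labels, beta))
    thresholds.sort()
    return thresholds[int(0.025 * (n_boot - 1))], thresholds[int(0.975 * (n_boot - 1))]
```

A reference threshold that falls outside the interval returned by `bootstrap_interval` would be a candidate for recalibration on the disease-specific data.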

Results

The majority of HCM-associated variants annotated in ClinVar are missense

Among algorithms developed to assess the likelihood of pathogenicity of rare variants, two main categories are generally considered: those that predict whether a missense change (i.e., a base change that alters the encoded amino acid) is damaging to the resultant protein function or structure and those that predict whether there is an effect on splicing (18). To determine the relative contribution of missense variation to HCM, we analyzed all HCM-associated variants annotated in the ClinVar database that are classified as Pathogenic/Likely pathogenic with a review status of > 1 star (Figure 1). As previously described (47), the most frequently mutated genes are MYBPC3 and MYH7 (Figure 1A). Among all HCM-associated variants, missense variants are the most frequent (Figure 1B). However, the prevalence of missense variants differs depending on the affected gene. Over 90% of annotated variants in the MYH7 gene are missense (Figure 1C), whereas in the MYBPC3 gene missense variants are less than 20% (Figure 1D).

Comprehensive review of computational tools to predict clinical significance of missense variants

Having shown that missense variants are a frequent cause of HCM, we next performed a comprehensive review of computational tools that predict the clinical significance of this type of genetic change (Table 1). A subset of existing computational tools relies on features such as amino acid or nucleotide conservation, the location and context within the protein sequence, and the biochemical consequence of the amino acid substitution (see references in Table 1, “protein predictors”). Other methods estimate the probability that a particular nucleotide belongs to a conserved element irrespective of its location in the genome, and therefore are not restricted to variation in exons of protein coding genes (see references in Table 1, “conservation scores”). Another class consists of tools that integrate genome-wide features to predict variant effects irrespective of the variant category (Table 1, “Consequence-agnostic predictors”). Tools specifically designed for cardiac diseases were also included (Table 1, “disease-specific predictors”).

Table 1

Prediction tools analyzed in this study.

Category	Tool	Threshold
Protein predictors	SIFT (48)	<0.01 (33)
	MutPred (49)	>0.5*
	PolyPhen-2 HDIV (50)	>0.978 (33)
	PolyPhen-2 HVAR (50)	>0.978 (33)
	Mutation Assessor (51)	>1.935 (52)
	Condel (53)	>0.98 (53)
	VEST4 (54)	>0.764 (33)
	MutationTaster2 (55)	>0.5 (52)
	FATHMM (56)	<-4.14 (33)
	PROVEAN (57)	<-2.5 (52)
	MetaSVM (25)	>0.5 (25)
	MetaLR (25)	>0.5 (25)
	M-CAP (58)	>0.025 (58)
	REVEL (59)	>0.644 (33)
	MPC (60)	>1.360 (33)
	MTR (61)	<0.5*
	PrimateAI (62)	>0.790 (33)
	ClinPred (63)	>0.5 (63)
	MISTIC (64)	>0.5 (63)
	cVEP (65)	>0.5**
	MVP (66)	>0.7 (63)
	VARITY (67)	>0.75 (67)
	MutFormer (68)	>0.5*
	EVE (69)	>0.5***
	MutScore (70)	>0.5*
Conservation scores	phastCons (71)	>0.99 (28)
	phyloP (72)	>7.367 (33)
	SiPhy (73)	>12.7 (25)
	GERP (74)	>4.4 (25)
	CDTS (75)	<10 (75)
Consequence-agnostic predictors	GWAVA (76)	>0.4 (77)
	FATHMM-MKL (78)	>0.5 (52)
	DANN (79)	>0.9 (80)
	Eigen (81)	>1 (58)
	ReMM (82)	>0.984 (83)
	CAPICE (84)	>0.02 (84)
	CADD (85)	>25.3 (33)
Disease-specific predictors	CardioVAI (86)	>2 (86)
	CardioBoost (32)	>0.9 (32)

*If a reference threshold was not found, the decision boundary was set to 0.5 for tools with a score range between 0 and 1.

**cVEP outputs categorical labels (e.g. Pathogenic, Likely_benign). We transformed the categories into numerical predictions for benchmarking as follows: Benign: 0; Likely_benign: 0.25; Likely_pathogenic: 0.75; Pathogenic: 1. VUS classifications were treated as NaN. Since these transformations represent artificial numeric predictions, this tool was only used in the first comparison, where tools are evaluated according to reference cut-offs. Downstream analyses (e.g. best threshold analysis, ROC curves) did not include cVEP.

***For EVE, we tested the benchmarks using the categorical classifications at three different uncertainty thresholds (20, 82, 87). We transformed the categorical classifications as we did for cVEP. In the end, none of these annotations improved classification compared with using the raw EVE numeric score. For the initial performance assessment, we set the EVE threshold to 0.5, as defined in *.
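The label-to-score transformation described in the table notes for cVEP amounts to a small lookup. The sketch below uses the four values given in the notes and maps anything else (such as a VUS call) to NaN; the names `CVEP_TO_SCORE` and `cvep_score` are illustrative only.

```python
import math

# Numeric values taken from the table notes; anything else becomes NaN.
CVEP_TO_SCORE = {
    "Benign": 0.0,
    "Likely_benign": 0.25,
    "Likely_pathogenic": 0.75,
    "Pathogenic": 1.0,
}


def cvep_score(label):
    """Map a categorical cVEP label to its artificial numeric prediction."""
    return CVEP_TO_SCORE.get(label, math.nan)
```

With a decision boundary of 0.5, both pathogenic categories land above the threshold and both benign categories below it.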

Threshold optimization for HCM-associated variants

To compare the performance of the different tools in classifying pathogenic and benign missense variants, we used three distinct high-confidence HCM test datasets that are mostly based on expert-reviewed clinical and functional evidence (Figure 2). We first assessed performance on the three datasets using the threshold recommended by each tool (Figure 3). The results show that ClinPred, CAPICE, cVEP, MISTIC, MetaLR, REVEL and MutScore consistently ranked among the best tools with weighted normalized MCC values >0.80 (Figures 3A–C). These tools scored with high sensitivity and specificity (approximately 80% or higher) and provided predictions for the vast majority (>90%) of the variants analyzed. Other tools, such as VEST4, ranked worse despite scoring with relatively high sensitivity and specificity because they failed to provide predictions for many of the variants analyzed. A subset of tools, including cardiac-specific methods, failed to predict predominantly benign variants (Supplementary Figure S1).

Figure 2

Workflow of the study. The number of variants in each dataset is shown.

Figure 3

Performance of prediction tools in classifying HCM missense variants using fixed thresholds for ClinVar (A), SHaRe (B) and Walsh_2017 (C) datasets. For each dataset, the numbers of pathogenic/likely pathogenic (N pos) and benign/likely benign (N neg) variants are indicated. Tools were ranked according to the weighted normalized MCC (weighted_norm_mcc).

In contrast, ROC curves (calculated only from scored variants) revealed overall excellent performance, with several tools with auROC scores above 0.9 (Figures 4A–C). This analysis highlights how much results can change depending on the selected metric. For example, CardioVAI, which was ranked in an intermediate position using MCC-based values for the recommended thresholds (see Figure 3), appears as one of the best in the ROC curve analysis (auROC ≥ 0.95). An overall comparison of auROC (from multi-threshold analysis) and MCC (from fixed threshold analysis) scores reveals performance differences within each tool clearly favoring auROC, which indicates that reference thresholds of several methods may be suboptimal for HCM (Figure 4D).

Figure 4

Performance of prediction tools in classifying HCM missense variants using ROC curve analysis for ClinVar (A), SHaRe (B) and Walsh_2017 (C) datasets. For each dataset, the numbers of pathogenic/likely pathogenic (N pos) and benign/likely benign (N neg) variants are indicated. Tools were ranked according to the area under the ROC curve (auROC). The number (n) of variants predicted by each tool is indicated. Tools with more than 50% of missing predictions were not included. (D) Differences in the metrics when evaluating with auROC and weighted normalized MCC. For comparison, auROC values were weighted by the fraction of variants predicted by each tool.

This observation prompted us to carry out a threshold analysis to find values that best discriminate the high-quality pathogenic HCM variants. We merged the three datasets, leading to 867 variants (474 pathogenic, 393 benign). Using the F-Beta (β) score at different values of β, we derived new thresholds that weight precision and recall differently (Table 2). Next, we evaluated tool performance using the adjusted thresholds on each dataset independently (Figure 5). The results confirm improved performance. Yet, no major change was observed for the top-ranked tools. Notably, the performance of top tools remained similar when different weights were given to precision and recall. Although different types of errors were introduced using different thresholds, these results suggest that the top tools are similarly sensitive to type I (more false positives at β of 1.5) and type II errors (more false negatives at β of 0.5).

Table 2

Adjusted thresholds that maximize performance for HCM variants at different levels of importance given to precision and recall.

Tool*	Reference threshold	Threshold_beta_0.5**	Threshold_beta_1**	Threshold_beta_1.5**
ClinPred	0.5	0.52 (0.366, 0.832)	0.41 (0.242, 0.533)	0.37 (0.123, 0.476)
CAPICE	0.02	0.61 (0.116, 0.619)	0.06 (0.016, 0.078)	0.02 (0.009, 0.058)
MISTIC	0.5	0.67 (0.6, 0.781)	0.543 (0.499, 0.611)	0.514 (0.395, 0.544)
REVEL	0.644	0.596 (0.533, 0.679)	0.441 (0.407, 0.565)	0.353 (0.323, 0.473)
MPC	1.36	1.047 (0.808, 1.281)	0.717 (0.672, 0.878)	0.675 (0.572, 0.774)
MetaLR	0.5	0.629 (0.547, 0.658)	0.509 (0.346, 0.606)	0.26 (0.212, 0.491)
MutScore	0.5	0.75 (0.714, 0.818)	0.741 (0.512, 0.757)	0.501 (0.38, 0.582)
FATHMM	−4.14	−2.119 (−2.156, −1.062)	−1.078 (−1.185, −0.759)	−0.947 (−1.137, −0.209)
PrimateAI	0.79	0.693 (0.641, 0.733)	0.577 (0.552, 0.645)	0.524 (0.504, 0.583)
CADD	25.3	24.12 (22.915, 24.742)	23.04 (22.055, 23.348)	22.32 (21.336, 23.066)
VARITY	0.75	0.427 (0.375, 0.597)	0.348 (0.243, 0.428)	0.23 (0.169, 0.299)
Provean	−2.5	−2.582 (−2.932, −2.182)	−2.268 (−2.423, −1.507)	−1.484 (−2.215, −0.946)
MutFormer	0.5	0.99 (0.96, 0.998)	0.98 (0.954, 0.993)	0.98 (0.664, 0.988)
Condel	0.468	0.79 (0.624, 0.866)	0.59 (0.484, 0.671)	0.47 (0.463, 0.561)
MTR	0.5	0.783 (0.746, 0.817)	0.883 (0.815, 0.917)	0.916 (0.883, 0.931)
CardioVAI	2	2.53 (2.502, 2.852)	2.53 (2.502, 2.852)	2.53 (1.515, 2.837)
DANN	0.9	0.991 (0.964, 0.997)	0.991 (0.956, 0.994)	0.959 (0.903, 0.99)
MVP	0.7	0.891 (0.873, 0.924)	0.862 (0.818, 0.893)	0.793 (0.778, 0.861)
Sift	0.001	0.0 (0.001, 0.038)	0.05 (0.003, 0.096)	0.13 (0.043, 0.196)
Eigen	1	4.282 (2.973, 5.641)	2.531 (2.275, 3.155)	2.337 (1.384, 2.777)
SiPhy	12.17	11.823 (10.859, 13.018)	10.398 (9.507, 11.931)	7.264 (6.245, 10.729)
phyloP	7.367	7.012 (4.39, 7.122)	3.558 (0.76, 4.512)	0.105 (0.022, 1.068)
Polyphen2HVAR	0.978	0.65 (0.359, 0.861)	0.24 (0.156, 0.463)	0.02 (0.012, 0.265)
FATHMM-MKL	0.5	0.978 (0.948, 0.986)	0.959 (0.58, 0.964)	0.464 (0.436, 0.881)
ReMM	0.984	0.98 (0.943, 0.989)	0.88 (0.84, 0.95)	0.341 (0.313, 0.886)
Polyphen2HDIV	0.978	0.94 (0.511, 0.957)	0.5 (0.021, 0.565)	0.02 (0.0, 0.064)
GERP	4.4	3.401 (2.602, 3.601)	2.583 (2.203, 3.388)	2.232 (0.057, 2.679)
MutationAssessor	1.935	2.462 (2.429, 2.828)	1.106 (0.915, 2.258)	0.905 (0.085, 1.122)
M-CAP	0.025	0.181 (0.127, 0.332)	0.131 (0.074, 0.162)	0.131 (0.051, 0.142)
CDTS	10	26.756 (6.601, 37.161)	62.386 (37.155, 85.472)	86.139 (62.899, 94.343)
phastCons	0.99	0.7 (0.532, 1.0)	0.54 (0.256, 0.809)	0.001 (0.001, 0.692)
MetaSVM	0.5	0.106 (0.015, 0.363)	0.011 (0.012, 0.135)	0.011 (0.012, 0.109)
EVE	0.5	0.291 (0.264, 0.515)	0.252 (0.146, 0.295)	0.127 (0.062, 0.2)
VEST4	0.764	0.662 (0.484, 0.733)	0.504 (0.385, 0.59)	0.445 (0.369, 0.522)
GWAVA	0.5	0.294 (0.239, 0.475)	0.07 (0.071, 0.296)	0.07 (0.07, 0.215)
MutationTaster2	0.5	0.99 (0.742, 0.999)	0.99 (0.228, 0.992)	0.23 (0.033, 0.987)

95% percentile values of the bootstrap distribution are also displayed. Mutpred and CardioBoost were not included since they did not predict the minimum number of variants (N = 50) in the minority class required by VETA for threshold analysis.

*Tool names in bold represent those that display minimally useful predictive power (> 0.70 weighted normalized MCC) across the different datasets (Figure 5).

**Numbers in bold represent cases for which the reference threshold lies outside the 95% percentile values of the bootstrap distribution of adjusted thresholds.

Figure 5

Performance of prediction tools using adjusted thresholds on each dataset (ClinVar, SHaRe, and Walsh_2017). Optimized thresholds at Beta = 0.5 minimize the false positives (variants predicted as pathogenic that are benign). Optimized thresholds at Beta = 1 give the same importance to false positives and false negatives. Optimized thresholds at Beta = 1.5 minimize the false negatives (variants predicted as benign that are pathogenic). Tools highlighted in blue were selected as the best by averaging the ranks between the three datasets.

Contrasting with the top tools, an improvement of overall predictions was observed for the middle-ranked tools after threshold optimization (Figure 5). For example, VARITY, FATHMM and MTR had a >10% increase of the weighted normalized MCC values for almost all the dataset/threshold combinations. For the lowest performing tools, threshold optimization has no significant effect (e.g. GWAVA, MetaSVM, GERP; Figures 4, 5). In addition, tools with a large fraction of missing predictions rank poorly, regardless of the threshold optimization (e.g. Mutpred, VEST4, CardioBoost; Figures 3, 5).

Addressing circularity

Circularity is a critical issue to be considered when assessing performance metrics (88). In this regard, we interrogated whether some of the variants present in our evaluation datasets had been previously used for tool training (Type I circularity). We restricted circularity analysis to the best performing tools. The rank of each tool across all datasets was averaged, and the 10 tools with the lowest average rank value with adjusted thresholds at β = 1 were selected. These included ClinPred, MISTIC, CAPICE, REVEL, MetaLR, MPC, MutScore, PrimateAI, FATHMM and CADD (Figure 5). For some tools, circularity could not be properly addressed because the training datasets were not explicitly available. This was the case for the pathogenic sets of REVEL, FATHMM and MISTIC (in this case, partially), which used HGMD (89) variants undisclosed for licensing reasons. For other unavailable sets (such as the benign variants of FATHMM and MPC, and both benign and pathogenic datasets of ClinPred), we tried to replicate the data generation process following the methods of each publication, but we cannot ensure that the original sets were accurately reproduced. All VCF files generated (except for CADD, which was downloaded directly from the website) are available at https://github.com/PedroBarbosa/paper_HCM_benchmark. Finally, we generated a new test dataset termed “excluded_training_top_tools set”, where variants present in the training of top-performant tools were excluded (Supplementary Table S4). As expected, performance decreased after correcting for type I circularity, with no tool achieving a weighted normalized MCC of 0.9 (Figure 6A). Nevertheless, four tools display metric values above 0.85 (ClinPred, MISTIC, MPC, FATHMM). Notably, these tools still perform better than most of the others, including those that were not controlled for circularity (Figure 6A).
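The construction of a training-free test set can be sketched as a simple set difference on variant identity. The snippet below is an illustration only: it assumes variants are compared by chromosome, position, reference and alternate allele on the same genome build, and the function names and coordinates are hypothetical rather than taken from the study's pipeline.

```python
def variant_key(chrom, pos, ref, alt):
    """Normalise a variant to a comparable key (assumes the same genome
    build and the same allele representation in every file)."""
    c = str(chrom)
    if c.lower().startswith("chr"):
        c = c[3:]
    return (c, int(pos), ref.upper(), alt.upper())

def exclude_training_variants(test_variants, training_sets):
    """Drop every test variant that appears in the training set of at
    least one tool. `training_sets` maps tool name -> list of variants."""
    seen = set()
    for variants in training_sets.values():
        seen.update(variant_key(*v) for v in variants)
    return [v for v in test_variants if variant_key(*v) not in seen]

# Invented coordinates, for illustration only.
test_set = [("chr1", 100, "A", "G"), ("1", 200, "C", "T"), ("2", 300, "G", "A")]
training = {
    "tool_x": [("1", 100, "a", "g")],
    "tool_y": [("chr2", 300, "G", "A"), ("3", 400, "T", "C")],
}
filtered = exclude_training_variants(test_set, training)
```

Here only the second test variant survives, because the other two match a training variant once chromosome prefixes and allele case are normalised. In practice, multi-allelic sites and indel left-alignment would need additional handling that this sketch omits.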

Figure 6

Performance of prediction tools after addressing circularity issues. Tools were ranked using the weighted normalized MCC on three test datasets (A–C). (A) Variants identified as present in the training sets of the tools highlighted in blue were removed from the merged ClinVar, SHaRe, and Walsh_2017 datasets. (B) HCM ClinVar variants submitted after the tools highlighted in blue were developed. (C) Variants in the whole ClinVar irrespective of disease context. The tools selected as best-performers for HCM are highlighted in red (bold).

As an alternative approach to address circularity, we analyzed specifically ClinVar variants that were reported after the tools under assessment were developed (“ClinVar_recent set”, Supplementary Table S5). Despite the very low number of variants and the partial overlap with the “excluded_training_top_tools” set, this dataset has the advantage of better controlling for bias favoring any tool for which no training data is available. The results show weighted normalized MCC values above 0.90 for FATHMM and MISTIC, and values above 0.80 for MetaLR, REVEL, CADD, ClinPred, and MPC (Figure 6B). Additional tools with evaluation scores above 0.80 include M-CAP, MVP and MutFormer, which partially use HGMD variants in their training datasets and therefore could not be controlled for potential biases.

Having addressed type I circularity issues, we next identified the five best-performing tools by measuring the average of the tools' ranks in Figures 6A,B, considering each dataset size (i.e., more weight was given to the “excluded_training_top_tools” set). The resulting list includes ClinPred, MISTIC, FATHMM, MPC and MetaLR. We further observed a high correlation between the predictions of these best-performing tools (Supplementary Figure S2).

We next asked how the best tools in discriminating between HCM-associated and benign missense variants perform in classifying missense variants irrespective of the disease context. We selected missense variants in the whole ClinVar database and we excluded variants that were used in the training of the tools for which we addressed circularity. The resulting dataset consists of 25,971 missense variants (9,651 Pathogenic/Likely_pathogenic, and 16,320 Benign/Likely_benign, Supplementary Table S6). The analysis was performed with the previously recommended thresholds (as indicated in Table 1). Overall, the MCC scores are lower compared to the HCM datasets (pval = 0.036, one-sided Wilcoxon signed-rank test), highlighting the value of disease-specific analysis. Notably, a subset of the best tools selected for HCM (namely, ClinPred and MISTIC) still scored with MCC values > 0.80 (Figure 6C).
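The selection of the five best tools by a dataset-size-weighted average of ranks, as described above, can be illustrated with a short sketch. The exact weighting scheme is not spelled out in the text, so the version below assumes weights proportional to the number of variants in each dataset; the tool names, metric values and dataset sizes are invented for illustration.

```python
def rank_tools(metric_by_tool):
    """Rank 1 is the tool with the highest metric value.
    Ties are broken by insertion order (a simplification)."""
    ordered = sorted(metric_by_tool, key=metric_by_tool.get, reverse=True)
    return {tool: i + 1 for i, tool in enumerate(ordered)}

def weighted_mean_rank(metrics_by_dataset, dataset_sizes):
    """Average each tool's rank across datasets, weighting every dataset
    by its share of the total number of variants (assumed scheme)."""
    total = sum(dataset_sizes[d] for d in metrics_by_dataset)
    combined = {}
    for dataset, metrics in metrics_by_dataset.items():
        weight = dataset_sizes[dataset] / total
        for tool, rank in rank_tools(metrics).items():
            combined[tool] = combined.get(tool, 0.0) + rank * weight
    return combined

# Invented values: the larger dataset dominates the final ordering.
metrics = {
    "set_A": {"T1": 0.88, "T2": 0.86, "T3": 0.80},
    "set_B": {"T1": 0.80, "T2": 0.92, "T3": 0.85},
}
sizes = {"set_A": 300, "set_B": 100}
mean_ranks = weighted_mean_rank(metrics, sizes)
best_first = sorted(mean_ranks, key=mean_ranks.get)
```

With these toy numbers T1 ends up ahead of T2 even though T2 wins on the smaller dataset, which is the intended effect of giving more weight to the larger set.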

High-confidence prioritization of HCM-associated VUS

Finally, we used the five best-performing tools (ClinPred, MISTIC, FATHMM, MPC and MetaLR) to inspect a non-redundant set of HCM-associated variants classified as VUS in the SHaRe (N = 103) and Walsh_2017 (N = 14) datasets (Supplementary Table S7). For those variants annotated in ClinVar, we confirmed they remain classified as VUS (as of April 2022). The results show that the majority (81%) of the variants are predicted to be pathogenic by more than 50% of the tools. In particular, 63 variants were predicted to be pathogenic by all the tools, most of them located in the MYH7 gene (Figure 7). We additionally inspected predictions on MYH7 VUS made by CardioVAI, which considers the ClinGen Expert Panel adaptation of ACMG/AMP guidelines for MYH7 variants (87). The results are consistent with the pathogenic predictions of the top 5 tools, except for two variants predicted as benign by CardioVAI (c.3701A>C and c.3551A>T). Given the high level of agreement between classifications, we propose that variants highlighted in Figure 7 should be prioritized for further clinical and functional studies.
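The agreement-based prioritization described above amounts to counting, per variant, how many tools call it pathogenic. The following sketch shows one way this could be expressed; the tool names, thresholds and score directions are placeholders rather than the calibrated values from this study, and missing scores are simply skipped, which is an assumption.

```python
def pathogenic_fraction(scores, thresholds, higher_is_pathogenic):
    """Fraction of tools (among those with a score) that call the variant
    pathogenic. Returns None if no tool scored the variant."""
    calls = []
    for tool, thr in thresholds.items():
        s = scores.get(tool)
        if s is None:
            continue  # missing prediction: tool abstains (assumption)
        if higher_is_pathogenic[tool]:
            calls.append(s >= thr)
        else:
            calls.append(s <= thr)
    return sum(calls) / len(calls) if calls else None

def agreement_class(fraction):
    """Bucket a variant by the level of agreement between tools."""
    if fraction is None:
        return "unscored"
    if fraction == 1.0:
        return "all_pathogenic"
    if fraction > 0.5:
        return "majority_pathogenic"
    return "majority_not_pathogenic"

# Placeholder tools and cut-offs, for illustration only.
thresholds = {"tool_a": 0.5, "tool_b": 0.5, "tool_c": -1.5}
direction = {"tool_a": True, "tool_b": True, "tool_c": False}
vus = {
    "var1": {"tool_a": 0.9, "tool_b": 0.8, "tool_c": -3.0},
    "var2": {"tool_a": 0.9, "tool_b": 0.2, "tool_c": -2.0},
    "var3": {"tool_a": 0.1, "tool_b": 0.2, "tool_c": None},
}
classes = {
    name: agreement_class(pathogenic_fraction(s, thresholds, direction))
    for name, s in vus.items()
}
```

The `higher_is_pathogenic` flag is included because some predictors assign lower scores to more damaging substitutions, so a single comparison direction would misclassify their calls.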

Figure 7

High-confidence prioritization of HCM-associated VUS based on predictions of the 5 top-performant tools (ClinPred, MISTIC, FATHMM, MPC and MetaLR). On the left, 63 variants for which 100% of the tools predict pathogenicity. On the right, variants predicted to be benign by more than 50% of the tools.

Discussion

The accurate identification of genetic changes associated with increased risk for HCM remains challenging. According to the ACMG/AMP guidelines, computational predictions are included as one line of evidence to assess the clinical significance of genetic variation (18–20). Although many computational tools are currently available, it is unclear which should be selected for clinical genome interpretation. To date, multiple independent variant prediction benchmarking studies have been published (23–31). However, the results are usually not consistent, and one reason for this discrepancy may relate to the different benchmark datasets used (27, 32). In addition, these studies do not evaluate prediction tools in a disease-specific manner [except for (27)]. It is also common that performance is evaluated using the receiver operating characteristic (ROC) curve, which does not reflect the fixed thresholds used in medical genetic testing (32). Moreover, the frequently used dbNSFP resource lacks more recently developed approaches (68, 69). In this study, we developed a computational framework (VETA) to compare the performance of 39 algorithms in predicting missense variants known to be implicated in the pathogenesis of a specific disease, HCM. We focused on missense variants because HCM is frequently caused by this type of genetic variation (Figure 1B). HCM-associated missense variants may disrupt normal sarcomeric assembly and function by changing an amino acid in a highly conserved protein residue, altering important kinase domains that affect ligand interaction, or changing surface-exposed residues that affect protein-protein interaction (47). Missense variants can also cause protein misfolding and accelerated degradation, thus leading to haploinsufficiency (47). As “ground-truth” for prediction assessment, we used datasets of HCM-associated missense variants classified with high-confidence based on expert-reviewed clinical and functional evidence. 
We included the fraction of missing scores in performance metrics, and this clearly influenced the ranking (Figure 3). In particular, cardiac-specific methods such as CardioVAI and CardioBoost were designed to predict variants in “core” disease-linked genes such as MYH7. Many of the genes included in our benign dataset differ from those used by CardioVAI and CardioBoost and this is probably the reason why these tools failed to score multiple “ground-truth” benign variants in our analysis (Supplementary Figure S1). Another contribution of our study is the calculation of new thresholds specifically calibrated for HCM. We found all tools with predictive power above 0.7 (regarded as minimally useful) to have a recommended threshold that falls outside of the bootstrap estimate interval of the tuned thresholds (Table 2, bold tools and values). This reveals that previously reported reference thresholds are not ideal for application to HCM-related variants. To enable flexibility in the choice of new thresholds, we derived adjusted values in which the relative importance given to recall or precision varies. If the goal is to maximize the identification of pathogenic variants, one should use the threshold obtained using a Beta value of 1.5. However, for missense HCM variants, we observed that most errors come at the cost of lower specificity (benign variants predicted as harmful; Figures 6A,B), thus it might be preferable to use thresholds obtained at a Beta value of 0.5, especially when looking at rare missense variants. After tackling potential biases related to circularity issues, we identified ClinPred, MISTIC, FATHMM, MPC and MetaLR as the five best performers. ClinPred (63) incorporates two machine learning algorithms that use existing conservation, pathogenicity scores and population allele frequency from the gnomAD database as input features. 
MISTIC (64) combines two complementary machine learning algorithms using a soft voting system that integrates 113 missense features, ranging from allele frequencies from the Exome Aggregation Consortium (ExAC) and conservation/pathogenicity scores, to physicochemical and biochemical properties of amino acids. FATHMM (56) builds Hidden Markov models from multiple sequence alignments along with pathogenicity weights to predict the functional, molecular, and phenotypic consequences of amino acid substitutions. MPC (60) is a deleteriousness metric that incorporates depletion of missense variation across genes by leveraging the sequencing data from ExAC (60,706 individuals). Finally, MetaLR (25) is a Logistic Regression model that integrates multiple scoring methods. Thus, most of these top-ranked tools integrate several previous models as features in their algorithms, and their predictions tend to be highly correlated (Supplementary Figure S2). These results are in line with previous observations indicating that meta-predictors tend to perform better than individual counterparts (31, 90). While some experts argue that VUS reporting may lead to confusion and cause more harm than benefit to the patient and family (91), others highlight the importance of appropriate clinical follow-up as it may help clarify the variant's impact and eventually lead to its reclassification (92). We propose combining the best performing tools identified in this study to provide clinicians with a high-confidence prioritization of VUS and newly detected variants identified by genetic testing in HCM patients. Testing family members for a prioritized variant may reveal its presence in multiple affected individuals and absence in healthy individuals, indicating that the variant should be considered pathogenic. Several lines of evidence indicate that a disease-specific approach improves variant interpretation, namely in inherited cardiac disorders (32). 
Indeed, detailed knowledge about the penetrance and age at onset of phenotypes associated with each disease, and the percentage of clinical cases accounted for by pathogenic variants in known genes are essential prerequisites for interpreting variants effectively (93). However, the majority of genetic diseases are so rare that it is difficult to compile specific “ground truth” datasets for tool assessment. This prompted us to investigate how the best tools for HCM perform in a disease-agnostic context (Figure 6C). Despite a clear reduction in performance metrics, ClinPred and MISTIC ranked among the top five tools and were able to discriminate between pathogenic and benign missense variants with high sensitivity and specificity (>85%). In conclusion, this study provides an objective framework for selecting the best-performing computational predictors to assist clinical interpretation of unknown missense variants. The results reported here may lay the foundation for a more consistent, reproducible and transparent approach to variant prediction across clinical diagnostic centers.

Limitations of the study

The robustness of the analysis described in this study is highly dependent on the number and gene distribution of variants in the test datasets. In the pathogenic datasets used here, there is an over-representation of pathogenic variants in the MYH7 gene because the vast majority of missense variation in HCM occurs in this gene. In contrast, variants in the benign datasets are more uniformly distributed throughout different genes. As the performance metrics combine the scores for both pathogenic and benign variants, the potential bias related to MYH7 over-representation is in part counterbalanced. We did not evaluate prediction tools with high-throughput functional assays such as deep mutational scanning (30, 94) due to lack of HCM-specific data. Finally, our circularity-resilient analysis was limited to those tools for which the training datasets were available.

Data availability statement

Publicly available datasets were analyzed in this study. This data can be found here: https://github.com/PedroBarbosa/paper_HCM_benchmark.

Author contributions

MC-F, AF, and PB conceptualized and designed the study and wrote the manuscript. PB and MR collected and assembled the datasets. PB, MR, AF, and MC-F analyzed the data and interpreted the results. All authors contributed to the article and approved the submitted version.

Funding

This work was supported by Fundação para a Ciência e a Tecnologia (FCT), Portugal (Fellowship SFRH/BD/137062/2018 to PB, and research support to LASIGE, UIDB/00408/2020), by FEDER/POR Lisboa 2020-Programa Operacional Regional de Lisboa, PORTUGAL 2020 (Infogene, 045300; CAMELOT, LISBOA-01-0247-FEDER-045915), and la Caixa Foundation under the agreement LCF/PR/HR20/52400021.

Conflict of interest

Author MC-F is a cofounder and scientific advisor of GenoMed S.A., a molecular diagnosis company. Author AF is a consultant to GenoMed S.A. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Publisher's note

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.
