Comprehensive annotation of BRCA1 and BRCA2 missense variants by functionally validated sequence-based computational prediction models.

Steven N Hart¹, Tanya Hoskin¹, Hermela Shimelis², Raymond M Moore¹, Bingjian Feng³, Abigail Thomas¹, Noralane M Lindor⁴, Eric C Polley¹, David E Goldgar³, Edwin Iversen⁵, Alvaro N A Monteiro⁶, Vera J Suman¹, Fergus J Couch⁷,⁸.

Abstract

PURPOSE: To improve methods for predicting the impact of missense variants of uncertain significance (VUS) in BRCA1 and BRCA2 on protein function.
METHODS: Functional data for 248 BRCA1 and 207 BRCA2 variants from assays with established high sensitivity and specificity for damaging variants were used to recalibrate 40 in silico algorithms predicting the impact of variants on protein activity. Additional random forest (RF) and naïve voting method (NVM) metapredictors for both BRCA1 and BRCA2 were developed to increase predictive accuracy.
RESULTS: Optimized thresholds for in silico prediction models significantly improved the accuracy of predicted functional effects for BRCA1 and BRCA2 variants. In addition, new BRCA1-RF and BRCA2-RF metapredictors showed area under the curve (AUC) values of 0.92 (95% confidence interval [CI]: 0.88-0.96) and 0.90 (95% CI: 0.84-0.95), respectively. Similarly, the BRCA1-NVM and BRCA2-NVM models had AUCs of 0.93 and 0.90. The RF and NVM models were used to predict the pathogenicity of all possible missense variants in BRCA1 and BRCA2.
CONCLUSION: The recalibrated algorithms and new metapredictors significantly improved upon current models for predicting the impact of variants in cancer risk-associated domains of BRCA1 and BRCA2. Prediction of the functional impact of all possible variants in BRCA1 and BRCA2 provides important information about the clinical relevance of variants in these genes.

Keywords: BRCA1 and BRCA2; Functional evaluation; In silico prediction; Metapredictor; VUS

Year: 2018 PMID: 29884841 PMCID: PMC6287763 DOI: 10.1038/s41436-018-0018-4

Source DB: PubMed Journal: Genet Med ISSN: 1098-3600 Impact factor: 8.822

INTRODUCTION

Pathogenic variants in BRCA1 and BRCA2 account for 20–25% of hereditary breast and ovarian cancer[1], 5–10% of breast cancers[2], and up to 15% of ovarian cancers[3]. While most known pathogenic variants in these genes truncate the encoded proteins, missense variants can also predispose to cancer. More than 90% of missense variants in public databases[4] identified by clinical genetic testing are listed as variants of uncertain significance (VUS)[5]. Missense variants with definitive pathogenic or neutral status can inform clinical management, prevention, and treatment. Thus, accurate methods to establish variant pathogenicity are needed. Family-based studies yielding likelihoods of pathogenicity, based on segregation of variants with cancer and on personal and family history of cancer, are established methods for determining pathogenicity of variants in BRCA1 and BRCA2. However, few missense variants have been clinically annotated by this method owing to the limited availability of family-based data. Similarly, functional assays[6,7] with established specificity and sensitivity for known pathogenic and neutral BRCA1 or BRCA2 variants have been used alone or in combination with family-based segregation data to infer pathogenicity[8]. However, classification of all possible variants by functional assays is unlikely. Alternatively, the clinical relevance of variants can be assessed using sequence-based in silico prediction models, which can be applied to all possible missense VUS in these genes. Given the large number of unique VUS identified in BRCA1 and BRCA2, in silico prediction models will need to be incorporated in models that aim to predict the pathogenicity of VUS in these genes. Most commonly used prediction tools such as SIFT[9], PolyPhen[10], GERP[11], Align-GVGD[12], and CADD[13] have been developed using large-scale databases such as the Human Gene Mutation Database (HGMD)[14] or ClinVar[4].
While functional assays can out-perform these computational predictions of damage[6,15], development and/or calibration of in silico prediction models using well-characterized functional data from validated assays is expected to improve variant annotation. In this study, HDR[6] functional data from 207 BRCA2 variants and transcriptional integrity[16,17] data for 248 BRCA1 variants were used to evaluate the performance of existing in silico algorithms. Sensitivity and specificity of the algorithms were optimized by defining more accurate thresholds and by developing new high-performance random forest (RF) and naïve voting method (NVM) predictors. We show that optimization for one gene leads to poor performance when applied to the other, highlighting the importance of gene-specific features for prediction accuracy.

MATERIALS AND METHODS

BRCA1 transcription integrity assay

Results from functional studies of variants in the BRCT domains of BRCA1 using a transcription integrity assay have been reported previously[16,17]. The sensitivity and specificity of this assay for missense variants in the BRCT domains of BRCA1 have been estimated at 100% (Sensitivity, 95%CI: 75%−100%; Specificity, 95%CI: 83%−100%)[16]. The 95% probability of pathogenicity and neutrality from the VarCall two-component mixture model for classification of BRCA1 missense variants[8] was used to define 61 pathogenic, 21 indeterminate (partial effect on function), and 166 neutral variants (total of 248). These data were used to define BRCA1 activity.

BRCA2 HDR assay

A cell-based homology-directed DNA repair (HDR) activity assay was used to assess the influence of missense variants in the DNA binding domain of BRCA2 on protein activity[6]. In brief, BRCA2 activity in brca2-deficient V-C8 cells expressing mutant forms of full-length BRCA2 was measured with a DR-GFP reporter plasmid after induction of a DNA double-strand break using the I-SceI enzyme. The V-C8 hamster lung fibroblast cell line was a gift from Dr. Margaret Zdzienicka. Cells were verified by genotyping in the Mayo Clinic Medical Research Facility and routinely tested for mycoplasma contamination. The sensitivity and specificity of this assay for damaging missense variants in the DNA binding domain (DBD) of BRCA2 have previously been estimated at 100% (Sensitivity, 95%CI: 79%−100%; Specificity, 95%CI: 93%−100%) using 21 known neutral and 13 known pathogenic variants[6,18-20]. Results from the 68 variants evaluated in this study were combined with results from 139 previously characterized variants for a total of 207.

Damaging missense prediction tools

dbNSFP version 3.0a[21] was downloaded and converted into a BioR catalogue[22] to annotate variants. Align-GVGD[12] was accessed online. CAROL and CONDEL scores were gathered from Variant Effect Predictor (VEP)[23].

Optimized thresholds

Analyses included damaging, indeterminate, and neutral variants. Indeterminate variants were included in the neutral category (Scenario 1). An alternative approach, in which indeterminate variants are included in the pathogenic category (Scenario 2), is provided in Supplemental Materials. Optimal thresholds for individual predictive algorithms that maximized sensitivity and specificity for damaging variants were derived separately from the results of the BRCA1 transcriptional integrity assay and the BRCA2 HDR assay (Figures S1 and S2). Matthews correlation coefficients (MCC) were calculated for each resulting binary classification relative to the functional assay standards[24]. The areas under the curve (AUCs) were estimated and reported with 95% confidence intervals using the DeLong error method. Receiver operating characteristic (ROC) analyses were performed using the package optimalCutpoints[25] for R software (v3.3.3; http://www.R-project.org).
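The threshold optimization described above can be illustrated with a minimal Python sketch. It assumes that "maximizing sensitivity and specificity" means maximizing their sum (Youden's J), and that a score at or above the cut point is called damaging; both are assumptions, since the study itself relied on the optimalCutpoints R package. The function names and the toy scores are hypothetical and are not study data.

```python
def confusion(scores, labels, threshold):
    """Count TP, FP, TN, FN when score >= threshold is called damaging."""
    tp = fp = tn = fn = 0
    for score, is_damaging in zip(scores, labels):
        called = score >= threshold
        if called and is_damaging:
            tp += 1
        elif called and not is_damaging:
            fp += 1
        elif not called and is_damaging:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def best_threshold(scores, labels):
    """Scan every observed score as a cut point; keep the one with the largest
    sensitivity + specificity - 1 (Youden's J). Ties keep the lowest cut point.
    Assumes both classes are present in `labels`."""
    best = None
    for t in sorted(set(scores)):
        tp, fp, tn, fn = confusion(scores, labels, t)
        sens = tp / (tp + fn)
        spec = tn / (tn + fp)
        j = sens + spec - 1
        if best is None or j > best[1]:
            best = (t, j, sens, spec)
    return best


# Hypothetical predictor scores; 1 = damaging in the functional assay.
toy_scores = [0.10, 0.35, 0.40, 0.62, 0.70, 0.81, 0.90, 0.95]
toy_labels = [0, 0, 0, 1, 1, 0, 1, 1]
threshold, youden_j, sensitivity, specificity = best_threshold(toy_scores, toy_labels)
```

Scanning only the observed scores is sufficient because the confusion counts change only at those values.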

Naïve Voting Method (NVM) models

For each gene, a training set (a random sample of ~50% of the variants for each gene) and a test set (the remaining ~50%) were constructed using the sample function in R. The training set was used to determine the optimal number of individual prediction algorithms in the NVM model based on the maximal MCC. Starting with the individual prediction algorithm with the highest MCC, the prediction algorithm with the highest individual MCC among the models not previously chosen was added iteratively until the optimal numbers of prediction algorithms were included. If both a raw score (Score) and rank score (RankScore) for an algorithm were available, then only the RankScore was utilized. The NVM models and thresholds developed in the training sets were validated in the test sets. The MCC and other performance statistics were also re-calculated across the entire data sets (training and test combined) to be consistent with reporting of other models. Lollipop plots were generated with lollipops (v1.2, http://dx.doi.org/10.5281/zenodo.46184).
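The iterative construction of an NVM model can be sketched in Python as follows. This is an illustration under stated assumptions, not the published implementation (which is in the GitHub repository cited under Code availability): each algorithm is assumed to have already been reduced to a 0/1 call at its optimized threshold, the vote threshold is assumed to be chosen by scanning every possible vote count, and the training/test split is omitted. All function and variable names are hypothetical.

```python
from math import sqrt


def mcc_of_calls(calls, labels):
    """Matthews correlation coefficient of binary calls against binary labels."""
    tp = sum(1 for c, y in zip(calls, labels) if c and y)
    fp = sum(1 for c, y in zip(calls, labels) if c and not y)
    fn = sum(1 for c, y in zip(calls, labels) if not c and y)
    tn = sum(1 for c, y in zip(calls, labels) if not c and not y)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def fit_nvm(calls_by_algo, labels):
    """Rank algorithms by individual MCC, add them one at a time, and keep the
    panel size and minimum vote count that give the highest MCC."""
    ranked = sorted(calls_by_algo,
                    key=lambda name: mcc_of_calls(calls_by_algo[name], labels),
                    reverse=True)
    best_score, best_panel, best_votes = -2.0, [], 0
    for k in range(1, len(ranked) + 1):
        panel = ranked[:k]
        votes = [sum(calls_by_algo[name][i] for name in panel)
                 for i in range(len(labels))]
        for min_votes in range(1, k + 1):
            score = mcc_of_calls([v >= min_votes for v in votes], labels)
            if score > best_score:
                best_score, best_panel, best_votes = score, panel, min_votes
    return best_score, best_panel, best_votes


def predict_nvm(calls_for_variant, panel, min_votes):
    """Call a variant damaging when at least `min_votes` panel members agree."""
    return sum(calls_for_variant[name] for name in panel) >= min_votes


# Hypothetical calls from three algorithms on six variants (1 = damaging).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_calls = {
    "A": [1, 1, 0, 0, 0, 0],
    "B": [1, 0, 1, 0, 0, 1],
    "C": [0, 1, 1, 1, 0, 0],
}
score, panel, min_votes = fit_nvm(toy_calls, toy_labels)
```

In the toy data no single algorithm is perfect, but requiring two of three votes classifies every variant correctly, which is the behavior the voting approach is meant to exploit.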

Random Forest models

Random forest (RF) modelling utilized scores from each of the optimized individual prediction algorithms to identify the subset of prediction algorithms that maximized the accuracy of predicting damaging and non-damaging (indeterminate and neutral) variants in BRCA1 and BRCA2. The randomForest R package[26] was used with settings of n=500 trees and the number of predictor variables sampled as candidates at each split set to the recommended default of sqrt(p), where p is the number of predictor variables included in the model. For individual prediction algorithms available as both a Score and RankScore, only the RankScore was included in the random forest models. Variable importance was assessed using the mean decrease in accuracy resulting from exclusion of a given prediction model from the RF classifiers. Out-of-sample predictions on the probability scale were again derived for each model and used to estimate AUC, sensitivity, specificity, and MCC at optimized cut points for prediction of functional status.

Comparison to ClinVar

BRCA1 and BRCA2 classifications from ClinVar that were reviewed by an expert panel and had no conflicting interpretations were used. Pathogenic and likely pathogenic variants in ClinVar were grouped into the pathogenic (damaging) category, and benign and likely benign variants were grouped into the neutral category.

Code availability

All code and data required to replicate all analyses are available on GitHub (https://github.com/Steven-N-Hart/NVM).

RESULTS

Functional characterization of 68 novel BRCA2 missense variants

In this study, 68 BRCA2 variants from the BRCA2 DBD were evaluated using the HDR assay. Of these, 17 showed HDR fold change <1.66, with probabilities of pathogenicity >0.99 (Table 1, Figure 1, Table S1), and 48 variants showed HDR>2.41 and probabilities of neutrality >0.99. Another three variants (p.I2672T, p.D2733V and p.P3150L) displayed partial activity (HDR fold change >1.66 and <2.41) and were annotated as indeterminate variants (Figure 1, Table S1). When combined with previously classified variants[6,27], 69 were predicted deleterious (damaging), 21 were intermediate/partial (indeterminate), and 117 were predicted benign/neutral (neutral) (Table 1, Table S1).
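The fold-change cut points quoted above amount to a three-way rule, sketched here in Python. In the study the classes come from model-based probabilities of pathogenicity and neutrality, with 1.66 and 2.41 being the fold changes that correspond to the 99% probability bounds; the function name is hypothetical, and treating values exactly equal to a cut point as indeterminate is an assumption, because the text does not specify boundary handling.

```python
def classify_hdr(fold_change, damaging_below=1.66, neutral_above=2.41):
    """Map an HDR fold change to a functional class using the cut points
    reported in the text. Boundary values fall into 'indeterminate'
    (an assumption; the source does not state how ties are handled)."""
    if fold_change < damaging_below:
        return "damaging"
    if fold_change > neutral_above:
        return "neutral"
    return "indeterminate"


# 0.68 and 1.63 are fold changes listed in Table 1; 2.00 and 3.00 are
# hypothetical values chosen to exercise the other two classes.
examples = {fc: classify_hdr(fc) for fc in (0.68, 1.63, 2.00, 3.00)}
```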

Table 1.

Predicted pathogenic missense variants defined by the BRCA2 HDR assay

Variant	cDNA	Align-GVGD class	IARC class	FC*	SE#	p(pathogenicity)	p(neutrality)	Origin

G2748D	c.8243G>A	C65	Class 5	0.68	± 0.07	1.00	2.96E-12	Lindor et al., 2012
L2686P	c.8057T>C	C45		0.72	± 0.04	1.00	9.50E-12	Guidugli et al., 2017
L2653P	c.7958T>C	C65	Class 5	0.75	± 0.08	1.00	3.40E-11	Lindor et al., 2012
E2663K	c.7987G>A	C55		0.83	± 0.01	1.00	4.00E-10	Current study
R3052W	c.9154C>T	C65	Class 5	0.86	± 0.09	1.00	9.33E-10	Lindor et al., 2012
L2721H	c.8162T>A	C25		0.88	± 0.10	1.00	1.70E-09	Guidugli et al., 2017
Y2624D	c.7870T>G	C65		0.88	± 0.06	1.00	1.75E-09	Current study
Y2624H	c.7870T>C	C65		0.89	± 0.31	1.00	2.40E-09	Current study
L3125R	c.9374T>G	C65		0.91	± 0.00	1.00	3.41E-09	Current study
R2784W	c.8350C>T	C65		0.91	± 0.10	1.00	3.88E-09	Guidugli et al., 2017
A2603P	c.7807G>C	C25		0.92	± 0.60	1.00	4.78E-09	Guidugli et al., 2017
N3124I	c.9371A>T	C65	Class 4	0.92	± 0.10	1.00	4.65E-09	Guidugli et al., 2013
L2647P	c.7940T>C	C65	Class 4	0.93	± 0.10	1.00	5.79E-09	Lindor et al., 2012
S2670L	c.8009C>T	C15		0.93	± 0.07	1.00	6.20E-09	Guidugli et al., 2017
G3076E	c.9227G>A	C65		0.95	± 0.10	1.00	1.02E-08	Guidugli et al., 2017
Y2624N	c.7870T>A	C65		0.95	± 0.00	1.00	1.20E-08	Current study
L3125H	c.9374T>A	C65		0.96	± 0.07	1.00	1.36E-08	Guidugli et al., 2017
L2510P	c.7529T>C	C65		0.98	± 0.11	1.00	2.61E-08	Guidugli et al., 2017
K2630Q	c.7888A>C	C45		0.98	± 0.08	1.00	2.49E-08	Guidugli et al., 2017
R2824G	c.8470A>G	C65		1.00	± 0.20	1.00	3.72E-08	Current study
H2623R	c.7868A>G	C25		1.00	± 0.04	1.00	3.59E-08	Guidugli et al., 2017
D2723H	c.8167G>C	C65	Class 5	1.00	± 0.01	1.00	3.81E-08	Lindor et al., 2012
N2781I	c.8342A>T	C65		1.00	± 0.10	1.00	4.08E-08	Current study
I2627F	c.7879A>T	C15	Class 5	1.01	± 0.08	1.00	4.83E-08	Lindor et al., 2012
A2730P	c.8188G>C	C0		1.01	± 0.01	1.00	5.46E-08	Current study
L2688P	c.8063T>C	C65	Class 4	1.02	± 0.08	1.00	5.80E-08	Guidugli et al., 2013
G3076R	c.9227G>T	C65		1.03	± 0.08	1.00	8.66E-08	Guidugli et al., 2017
G3076V	c.9226G>C	C65		1.03	± 0.11	1.00	8.08E-08	Guidugli et al., 2017
D2723V	c.8168A>T	C65		1.04	± 0.08	1.00	1.01E-07	Guidugli et al., 2017
W2788R	c.8362T>C	C25		1.05	± 0.08	1.00	1.37E-07	Guidugli et al., 2017
D2723A	c.8168A>C	C65		1.06	± 0.11	1.00	1.49E-07	Guidugli et al., 2017
W2788S	c.8363G>C	C35		1.06	± 0.08	1.00	1.44E-07	Guidugli et al., 2017
N3124K	c.9372C>A	C65		1.07	± 0.06	1.00	1.89E-07	Current study
G2609V	c.7826G>T	C65		1.07	± 0.08	1.00	2.24E-07	Guidugli et al., 2017
F2642S	c.7925T>C	C45		1.07	± 0.08	1.00	2.02E-07	Guidugli et al., 2017
D3095E	c.9285C>G	C35	Class 4	1.07	± 0.12	1.00	1.89E-07	Guidugli et al., 2013
T2722R	c.8165C>G	C65	Class 5	1.08	± 0.08	1.00	2.78E-07	Lindor et al., 2012
G2508R	c.7522G>C	C65		1.09	± 0.15	1.00	2.92E-07	Current study
S2691F	c.8072C>T	C0		1.10	± 0.06	1.00	4.09E-07	Guidugli et al., 2017
E3002K	c.9004G>A	C55		1.10	± 0.08	1.00	4.35E-07	Guidugli et al., 2017
D2723G	c.8168A>G	C65	Class 5	1.11	± 0.12	1.00	5.19E-07	Lindor et al., 2012
W2626R	c.7876T>C	C65		1.12	± 0.01	1.00	6.04E-07	Current study
G2596E	c.7787G>A	C65		1.12	± 0.09	1.00	5.77E-07	Guidugli et al., 2017
Q2561P	c.7682A>C	C15		1.13	± 0.06	1.00	7.81E-07	Guidugli et al., 2017
V2687F	c.8059G>T	C0		1.14	± 0.02	1.00	8.96E-07	Current study
H2623Y	c.7867C>T	C65		1.15	± 0.01	1.00	1.32E-06	Current study
W2626C	c.7878G>C	C65	Class 5	1.16	± 0.13	1.00	1.56E-06	Lindor et al., 2012
G2793R	c.8377G>A	C65		1.18	± 0.09	1.00	2.23E-06	Guidugli et al., 2017
A3028P	c.9082G>C	C0		1.18	± 0.04	1.00	2.48E-06	Current study
L2792P	c.8375T>C	C65		1.19	± 0.09	1.00	3.03E-06	Guidugli et al., 2017
G2793E	c.8378G>A	C65		1.19	± 0.07	1.00	2.71E-06	Guidugli et al., 2017
A2786P	c.8356G>C	C0		1.23	± 0.09	1.00	5.97E-06	Guidugli et al., 2017
R2784Q	c.8351G>A	C35		1.27	± 0.14	1.00	1.46E-05	Guidugli et al., 2017
G2596R	c.7786G>C	C65		1.28	± 0.10	1.00	1.73E-05	Guidugli et al., 2017
G2585R	c.7753G>A	C65		1.30	± 0.10	1.00	2.47E-05	Guidugli et al., 2017
G3003E	c.9008G>A	C65		1.33	± 0.08	1.00	4.59E-05	Guidugli et al., 2017
G2609D	c.7826G>A	C65	Class 4	1.35	± 0.07	1.00	6.19E-05	Guidugli et al., 2013
W2725L	c.8174G>T	C55		1.35	± 0.10	1.00	6.73E-05	Guidugli et al., 2017
K2498E	c.7492A>G	C55		1.36	± 0.10	1.00	7.98E-05	Guidugli et al., 2017
Q2655R	c.7964A>G	C35		1.38	± 0.09	1.00	1.13E-04	Guidugli et al., 2017
Y2726C	c.8177A>G	C65		1.49	± 0.16	1.00	7.86E-04	Guidugli et al., 2017
Q2925K	c.8773C>A	C45		1.51	± 0.12	1.00	1.07E-03	Guidugli et al., 2017
D3073G	c.9218A>G	C65		1.52	± 0.12	1.00	1.17E-03	Guidugli et al., 2017
R2659G	c.7975A>G	C65		1.57	± 0.12	1.00	2.57E-03	Guidugli et al., 2017
Y2624C	c.7871A>G	C65		1.57	± 0.05	1.00	2.71E-03	Current study
R2842P	c.8525G>C	C65		1.59	± 0.05	1.00	3.40E-03	Current study
D2611G	c.7832A>G	C65		1.61	± 0.12	0.99	5.09E-03	Guidugli et al., 2017
Y2660D	c.7978T>G	C65		1.61	± 0.12	0.99	5.03E-03	Guidugli et al., 2017
N2622S	c.7865A>G	C45		1.63	± 0.13	0.99	6.76E-03	Current study

* FC: fold change in GFP-positive cells in the HDR assay

# SE: standard error

Figure 1.

HDR activity of 207 BRCA2 missense variants.

The model-based HDR fold change with standard error (SE) is displayed on a logarithmic scale. The SE is included as a measure of the reproducibility of the HDR assay for each variant. Solid lines represent 99% probability of pathogenicity and 99% probability of neutrality (fold increase in GFP (+) cells < 1.66 for damaging and fold increase in GFP (+) cells > 2.41 for neutral). Dotted lines separate variants classified as deleterious, indeterminate, and neutral.

Computational Predictions

Sensitivity and specificity of 40 computational prediction models with previously established cut points for damaging variants were determined using the functional assay data for BRCA1 and BRCA2 missense variants (Table S2, Table S3). These default thresholds yielded either high sensitivity with low specificity (e.g. BRCA2 SIFT Score: sensitivity 100%, specificity <20%) or low sensitivity with high specificity (e.g. BRCA1 PROVEAN Score: sensitivity <0.02%, specificity 100%), depending on the gene (Table S3). To optimize the predictive ability of each model, thresholds that maximized sensitivity and specificity for damaging variants were defined separately for BRCA1 and BRCA2. For the purposes of predicting damaging, clinically relevant variants, models were generated by combining indeterminate with neutral variants (Scenario 1). Performance characteristics and AUC values for optimized individual prediction models for BRCA1 and BRCA2 are shown in Table 2 and Figure S2. The best performing individual models for BRCA1 incorporated conservation measures including deep interspecies protein alignments and physicochemical changes in amino acids (MetaSVM Score and RankScore[21]; PERCH and PERCH_noMAF[28]; Align-GVGD[12]; Polyphen2Hvar Score and RankScore[10]; and VEST3 Score and RankScore[29]). These models yielded AUCs>0.87, sensitivity and specificity >80%, and MCCs up to 0.68 (VEST3Score) (Table 2, Table S4). These results represented a major improvement in performance over results based on default thresholds (mean MCC = 0.29) (Table S3). The best performing models for BRCA2 were PERCH and PERCH_noMAF; MetaLR RankScore and Score; MetaSVM RankScore and Score; and VEST3 RankScore and Score. These yielded AUCs of 0.83–0.89, sensitivity and specificity >78% (85% for PERCH), and MCCs>0.53 (Table 2, Table S4), which were substantially improved over models using default parameters (MCC<0.42) (Table S3).

Table 2.

Performance of in silico prediction models with optimized thresholds for classification of BRCA1 and BRCA2 missense variants

Gene	Model	OptimalThreshold	AUC (95%CI)	FN / FP / TP / TN	MCC

BRCA1	NVM-Validation	≥9	0.94 (0.897–0.983)	5 / 8 / 25 / 79	0.719
	Vest3RankScore	≥0.85546	0.9 (0.849–0.95)	8 / 24 / 51 / 153	0.678
	Vest3Score	≥0.868	0.9 (0.849–0.95)	8 / 24 / 51 / 153	0.678
	RF	≥0.298	0.92 (0.879–0.96)	8 / 26 / 51 / 151	0.663
	AlignGVGDPrior	≥0.29	0.88 (0.829–0.931)	7 / 37 / 54 / 150	0.614
	PERCHnoMAF	≥0.206316	0.87 (0.814–0.924)	10 / 31 / 51 / 156	0.614
	PERCH	≥0.239853	0.87 (0.819–0.922)	10 / 32 / 51 / 155	0.607
	Polyphen2HvarRankScore	≥0.91584	0.89 (0.845–0.93)	11 / 30 / 48 / 147	0.593
	Polyphen2HvarScore	≥0.999	0.89 (0.845–0.93)	11 / 30 / 48 / 147	0.593
	MetaSVMRankScore	≥0.9083	0.89 (0.844–0.928)	11 / 34 / 48 / 143	0.565

BRCA2	NVM-Validation	≥4	0.89 (0.826–0.963)	6 / 9 / 29 / 59	0.683
	PERCH	≥0.295957	0.89 (0.847–0.939)	11 / 21 / 60 / 115	0.672
	PERCHnoMAF	≥0.272149	0.88 (0.832–0.929)	12 / 23 / 59 / 113	0.642
	RFModel	≥0.371	0.9 (0.843–0.947)	12 / 24 / 59 / 111	0.633
	MetaSVMRankScore	≥0.93181	0.87 (0.824–0.923)	15 / 29 / 56 / 107	0.555
	MetaSVMScore	≥0.7002	0.87 (0.824–0.923)	15 / 29 / 56 / 107	0.555
	MetaLRRankScore	≥0.92107	0.87 (0.823–0.922)	16 / 30 / 55 / 106	0.535
	MetaLRScore	≥0.7679	0.87 (0.823–0.922)	16 / 30 / 55 / 106	0.535
	Vest3RankScore	≥0.79963	0.83 (0.776–0.893)	16 / 30 / 55 / 106	0.535
	Vest3Score	≥0.811	0.83 (0.776–0.893)	16 / 30 / 55 / 106	0.535

FN: False negative; FP: False positive; TP: True positive; TN: True negative

AUC: Area under the curve from receiver operating characteristic analysis

MCC: Matthews correlation coefficient
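As a worked check of how the last two columns of Table 2 relate, the short Python sketch below computes sensitivity, specificity, and MCC from a row's FN / FP / TP / TN counts using the standard MCC formula. The counts are copied from the BRCA1 RF and BRCA2 PERCH rows; the function name is hypothetical.

```python
from math import sqrt


def performance(fn, fp, tp, tn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts,
    taken in the FN / FP / TP / TN order used by Table 2."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    mcc = (tp * tn - fp * fn) / sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    return sensitivity, specificity, mcc


# Counts copied from Table 2.
brca1_rf = performance(fn=8, fp=26, tp=51, tn=151)
brca2_perch = performance(fn=11, fp=21, tp=60, tn=115)
```

For the BRCA1 RF row this gives sensitivity and specificity of roughly 86% and 85% and an MCC close to the tabulated 0.663; the BRCA2 PERCH row gives an MCC close to the tabulated 0.672.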

To assess whether meta-predictor models improved prediction of the damaging variants for each gene, two new models were developed for both BRCA1 and BRCA2: (1) Random Forest (RF) classifiers of prediction methods were derived from the continuous outputs from the functional data (BRCA1-RF and BRCA2-RF); (2) naïve voting methods (NVM) were applied to optimized thresholds for each prediction model (BRCA1-NVM and BRCA2-NVM). CAROL[30] and CONDEL[31] predictors were not included in development of new BRCA2 models because prediction scores for 29 of 207 (14.0%) variants were not available. Only 12 of 248 (4.8%) BRCA1 and 1 of 207 (0.5%) BRCA2 variants were excluded from new model development due to missing data or conflicts between protein and DNA sequences (Table S2).

RF-Models

Random Forest (RF) classifiers were used to evaluate the impact of excluding individual prediction methods on the accuracy of composite prediction models. VEST3 RankScore and Align-GVGD had the greatest impact on the accuracy of BRCA1-RF, whereas Mutation Assessor RankScore and PERCH had the greatest impact on BRCA2-RF. The BRCA1-RF model (threshold ≥0.298) (Table S4) showed the second highest AUC value of all models for BRCA1 (0.92, 95%CI:0.88–0.96), with 86% sensitivity and 85% specificity. The BRCA1-RF model predicted 8 of 59 (13.6%) functionally impaired BRCA1 variants as neutral (false negatives), 12 of 21 (57.1%) functionally indeterminate variants as damaging, and 14 of 156 (9.0%) functionally intact neutral variants as damaging (false positives) (Table S2). Similarly, the BRCA2-RF model (threshold ≥0.371) (Table S4) had the highest AUC for BRCA2 (0.90, 95%CI:0.84–0.95) (Table 2, Table S4) with 83% sensitivity and 82% specificity (Table S4, Figure 2).

Figure 2.

Matthews Correlation Coefficients (MCC) for 42 in silico predictors with optimized thresholds for damaging versus indeterminate/neutral variants in BRCA1 and BRCA2.

Higher values indicate increased classifier performance.

NVM-Models

NVM models based on the optimal number of individual prediction algorithms for BRCA1 and BRCA2 variants were also developed. The optimal NVM for BRCA1, following training and validation (BRCA1-NVM Combined) contained 13 prediction models (Table S5). BRCA1 variants are predicted damaging when ≥9 of the 13 models exceed their individual thresholds for damaging variants (Table S5). BRCA1-NVM yielded an AUC of 0.94 with sensitivity of 83% and specificity of 91%. The highest proportion of BRCA1 misclassifications involved variants with indeterminate function, with 9 of 21 (42.9%) annotated as damaging. In contrast, the optimal BRCA2-NVM (BRCA2-NVM Combined) model after training and validation incorporated six prediction models with a threshold of ≥4 models predicting damaging variants (Table S5). This model yielded sensitivity of 82% and specificity of 87% (Table 2, Table S4, Table S5), with 14 functionally damaging variants predicted as neutral, and 18 indeterminate/neutral variants predicted as damaging. As with BRCA1, the false positive results were disproportionately enriched for indeterminate function with 6 of 21 (28.6%) misclassified. Overall, the predictive abilities of the RF and NVM models showed substantial improvement over individual in silico prediction methods using default parameters, and modest improvements over the best performing individual in silico methods optimized at thresholds specific to BRCA1 and BRCA2.

Application of selected models to all possible missense variants in BRCA1 and BRCA2

The RF and NVM models were used to assess the damaging potential of all theoretically possible missense substitutions resulting from single nucleotide changes in BRCA1 and BRCA2, contingent on availability of prediction scores from all the individual methods contributing to each model (Table S2). Because a subset of the contributing prediction algorithms are in part based on nucleotide substitution rates, a missense variant that can be caused by different nucleotide changes may have more than one predicted RF or NVM score. Using BRCA1-NVM, 7.1% of BRCA1 variants were predicted as damaging. Similarly, 2.6% of BRCA2 variants were predicted as damaging using BRCA2-NVM. However, marked enrichment for NVM predicted damaging variants was observed in known functional domains (Figure 3). Analysis of the BRCA1 RING domain predicted that 30–40% of all missense changes disrupt protein function. Similarly, 46% of all possible missense variants in the C-terminal BRCT domains and >20% in the larger C-terminal region (residues 1660–1810) were predicted damaging (Table S2). Interestingly, ~10% of all possible variants between amino acids 300 and 550, which have been associated with TP53[32], RAD50[33], and c-MYC[32] interactions, were predicted damaging (Table S2). For BRCA2, only the region from residues 2574 to 2771 that contains the helical and OB1 domains of the DNA binding domain was predicted to have >20% damaging variants, although 10% of variants in OB3 were also predicted damaging (Figure 3, Table S2). Few damaging missense variants were predicted in the OB2 domain. Similar results were obtained using the RF model (Table S2). Damaging mutations were not predicted in the N-terminus of BRCA2, containing the PALB2 interaction domain[34], possibly because of the small size of the interaction site.

Figure 3.

Estimates of the proportion of damaging missense variants by position in each gene.

The AAPOS x-axis represents the amino acid position, and the y-axis is the probability of a missense mutation being damaging from the NVM model. The lines were smoothed using a 50 amino acid sliding window.
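The smoothing mentioned in the caption can be sketched as a moving average in Python. The caption gives only the window width (50 amino acids); centering the window on each position and truncating it at the ends of the protein are assumptions, and the function name and toy values are hypothetical.

```python
def sliding_mean(values, window=50):
    """Moving average over per-position values (e.g. the fraction of possible
    missense changes at each residue predicted damaging).
    Assumption: the window is centered on each position and shortened at the
    sequence ends rather than padded. `window` must be at least 1."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i - half + window)  # slice end, exclusive
        chunk = values[lo:hi]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy example with a window of 2 positions.
toy = sliding_mean([0, 0, 1, 1], window=2)
```

A wider window gives a smoother curve at the cost of blurring domain boundaries, which is worth keeping in mind when reading peak heights off the figure.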

DISCUSSION

Specific measures of BRCA1 and BRCA2 functional activity have been established as reliable measures of the functional impact and the likelihood of pathogenicity of variants in certain domains of BRCA1 and BRCA2[6,16]. However, in the absence of functional studies of individual variants, in silico models that incorporate functional or structural data are often considered useful predictors of function. Here, existing models for prediction of damaging missense variants were recalibrated based on BRCA1 and BRCA2 functional data and were combined in meta-predictor classifiers (NVM and RF). These meta-predictors leveraged the complementary strengths and weaknesses of the individual models and improved upon many of them in predicting the functional implications of missense variants in the cancer risk-associated domains of BRCA1 and BRCA2. We subsequently used these highly sensitive and specific models to annotate all missense variants from the BRCA1 and BRCA2 genes as damaging or neutral. Importantly, because the BRCA1 transcriptional integrity assay and the BRCA2 HDR assay used for calibration of the various prediction models have 100% sensitivity and specificity for clinically pathogenic variants in the BRCA1 BRCT and BRCA2 DNA binding domains, respectively, the models may also predict the clinical pathogenicity of missense variants in these domains. Whether prediction of functional effects in other parts of these proteins also reflects pathogenicity remains to be determined using additional pathogenic and neutral standards. Overall, these prediction models are likely to alter the interpretation of many VUS in BRCA1 and BRCA2, leading to improved clinical genetic testing, and perhaps improved risk management of patients found to carry VUS.
The current American College of Medical Genetics guidelines for variant classification recommend that in silico evidence be counted as supporting evidence for pathogenicity (or lack thereof) if all of the in silico programs tested agree on the prediction, and that in silico evidence not be used for classification if the predictions disagree. However, the guidelines do not recommend specific in silico methods or indicate the number of methods that should be evaluated[35]. This approach differs from the NVM model in two key areas. First, default thresholds of predictive models are not appropriate for BRCA1 and BRCA2 because the specificity is very low. The new thresholds for predictive models derived here should provide more accurate predictions of functional impact and therefore pathogenicity. Second, while using an ensemble of models is a rational strategy, requiring all models to be in agreement is overly stringent and results in decreased performance (Figure S3 and Figure S4). Rather, the number of in silico models, the choice of specific models, and the thresholds for those models that are required for an accurate consensus with both high sensitivity and specificity can vary by gene.

Effect of grouping indeterminate variants as either damaging or neutral

Generally, the performance of individual in silico prediction models, as well as that of the RF and NVM models, was similar when indeterminate variants were grouped with either damaging or neutral variants. However, the performance of some of the known prediction methods was highly sensitive to indeterminate variant classification. Interestingly, the prediction methods that had the greatest difference in thresholds, depending on the incorporation of the indeterminate variants in the damaging or neutral categories, also had higher AUCs (e.g. PERCH, NVM, RF, MetaSVM) compared to those with no change in threshold (e.g. PolyPhen2HDiv, PolyPhen2HVar, MutationTaster Score), suggesting that the former methods are better predictors of indeterminate impact on function. However, the clinical relevance of the indeterminate variants in BRCA1 and BRCA2 is not well understood. Further understanding of function, pathogenicity, cancer risk, and associated refinement of thresholds for damage and pathogenicity for each functional assay may allow recalibration of the prediction models and improved prediction of clinically relevant BRCA1 and BRCA2 variants in the future.

Extending missense prediction outside established functional domains

When cross-referencing the NVM predictions with well-annotated ClinVar classifications, the predictions clustered in well-known domains, with damaging missense variants mostly restricted to BRCT and RING domains of BRCA1 and the DNA binding domain of BRCA2 [12] (Figure S5). The BRCA1-NVM prediction model clearly delineated both regions, with as many as 40% of missense variants in the RING domain and 50% in parts of the BRCT regions annotated as damaging. Interestingly, enrichment between amino acids 400–500 was also observed, but no damaging variants in this region have been defined by functional studies and no pathogenic variants have yet been observed in the clinically tested population. According to the BRCA1-NVM model, the total proportion of all theoretically possible damaging variants in BRCA1 is ~8%, almost all of which are located in the known RING and BRCT domains. For BRCA2, family based studies in combination with the Align-GVGD prediction method were previously used to estimate that 33% of missense variants in the BRCA2 DNA binding domain were damaging[36]. While based on small numbers of missense variants, this is consistent with predictions from the NVM and RF models for BRCA2, although the frequency based on the BRCA2-NVM and BRCA2-RF models is as high as 50% in specific regions. A notable drop in the estimated pathogenic potential was observed in the BRCA2 OB2 DNA binding domain. This was also observed when considering all pathogenic BRCA2 missense mutations listed in ClinVar. However, it should be noted that when applied to genes other than BRCA1 and BRCA2 (Table S6) (or even BRCA1-RF or -NVM and applied to BRCA2 and vice versa), the performance of the NVM and RF models was much lower, with MCCs <0.40, as shown for the BRCA2-NVM (Table S7). The sizable reduction in model accuracy suggested that recalibrated models are specific to the initial gene of interest and cannot be effectively extrapolated to other disease genes. 
Another potential explanation for this phenomenon is that not all missense variants in other genes exert phenotypic effects through loss of activity. Because the BRCA1 and BRCA2 assays are limited to measurement of loss of function, more comprehensive assays that evaluate splicing alterations, gain-of-function mutations, and epigenetic influences on gene function may be needed to extend the NVM and RF prediction models to other genes. Separately, disruption by missense variants of functions other than transcriptional activation or homology-directed repair could result in recalibration of the NVM and RF prediction models. Other influences on model performance may include AT versus GC content of coding sequences, codon usage, and the structural effects of observed variants. Finally, the differences could be due to evolutionary constraint, since some models such as Align-GVGD and PolyPhen2 perform well for the highly conserved BRCA1 but profoundly less well for the less constrained BRCA2.
While the clinical implications of truncating mutations in the BRCA1 and BRCA2 breast cancer predisposition genes are clear, interpretation of missense variants is more challenging. Here we present an approach for predicting the functional impact, and potentially the pathogenicity, of missense BRCA1 and BRCA2 variants, based on functional evaluation of variants and in silico sequence-based analysis. Functional studies of BRCA2 variants, in combination with similar studies of BRCA1, now identify 130 variants in these genes that are damaging and likely pathogenic, and that may substantially increase the risk of breast, ovarian, and other cancers. In contrast, public databases currently identify fewer than 40 such variants. In the absence of functional results, other methods for variant assessment are needed.
Many in silico prediction methods exist for characterization of missense variants, but the interpretation of results from these methods and their accuracy in predicting whether variants in BRCA1 and BRCA2 are damaging or neutral are not well defined. Here we recalibrated established in silico prediction methods for missense variants using results from BRCA1 and BRCA2 functional assays, and developed RF and NVM models that incorporate multiple in silico prediction methods. These classifiers outperformed the individual in silico models. Overall, this approach leverages measures of BRCA1 and BRCA2 functional activity to improve the classification of BRCA1 and BRCA2 VUS detected by clinical genetic testing and tumor sequencing.
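The two ideas summarized above, recalibrating each predictor's cutoff against functional assay results and then combining predictors by voting, can be sketched as follows. This is an illustration under stated assumptions, not the published implementation: the tool names, scores, and labels are invented; choosing the cutoff that maximizes the MCC is one plausible recalibration criterion; and the simple majority rule stands in for whatever voting rule the NVM actually uses.

```python
import math
from typing import Dict, Sequence


def mcc(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Matthews correlation coefficient for binary labels (1 = damaging)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def best_cutoff(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Pick the cutoff (score >= cutoff means damaging) with the highest MCC.

    Assumption: higher scores mean "more damaging". Ties go to the lowest
    cutoff because max() returns the first maximal candidate.
    """
    candidates = sorted(set(scores))
    return max(candidates,
               key=lambda c: mcc(labels, [int(s >= c) for s in scores]))


def naive_vote(scores: Dict[str, float], cutoffs: Dict[str, float]) -> int:
    """Return 1 (damaging) if more than half of the tools vote damaging."""
    votes = [int(scores[tool] >= cutoffs[tool]) for tool in cutoffs]
    return int(sum(votes) > len(votes) / 2)


# Invented functional-assay labels and scores from three hypothetical tools.
labels = [0, 0, 0, 1, 1, 1]
training_scores = {
    "tool_a": [0.1, 0.2, 0.4, 0.6, 0.7, 0.9],
    "tool_b": [5, 10, 20, 35, 40, 60],
    "tool_c": [0.3, 0.1, 0.5, 0.45, 0.8, 0.9],
}

# Gene-specific recalibration: one cutoff per tool.
cutoffs = {tool: best_cutoff(s, labels) for tool, s in training_scores.items()}

# Classify a new (hypothetical) variant by majority vote.
new_variant = {"tool_a": 0.7, "tool_b": 20, "tool_c": 0.5}
print(cutoffs, naive_vote(new_variant, cutoffs))
```

Because the cutoffs are fitted to functional data from one gene, a sketch like this also makes clear why they would not be expected to transfer directly to another gene.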
