Hui-Heng Lin, Hongyan Xu, Hongbo Hu, Zhanzhong Ma, Jie Zhou, Qingyun Liang.
Abstract
High-throughput sequencing is increasingly used in clinical diagnosis, but it keeps uncovering novel gene variants of unknown clinical significance. Such variants complicate the interpretation of individuals' genetic data, precise disease diagnosis, and therapeutic decision-making, so methods to analyze and interpret them are critically needed. In this work, we identified BRCA1 variants of unknown clinical significance from clinical sequencing data and developed machine learning models to predict their pathogenicity. In performance benchmarking, the optimized random forest model achieved an area under the receiver operating characteristic curve (AUC) of 0.85, outperforming the other models. Finally, we applied the best random forest model to predict the pathogenicity of 6321 BRCA1 variants from both the sequencing data and the ClinVar database, obtaining predicted pathogenic risks for BRCA1 variants of unknown significance.
Year: 2021 PMID: 33937409 PMCID: PMC8062186 DOI: 10.1155/2021/6667201
Source DB: PubMed Journal: Biomed Res Int Impact factor: 3.411
Seven VUSs of BRCA1 identified from sequencing data and databases.
| ID | Variation | refSNP ID (a.k.a. rs number) | Clinical significance in database |
|---|---|---|---|
| 1 | c.1255G>C | rs876658873 | Unknown |
| 2 | c.824G>A | rs397509327 | Unknown |
| 3 | c.3448C>T | rs80357272 | Unknown |
| 4∗ | c.1348A>T | Not available | Unknown |
| 5 | c.2566T>C | rs80356892 | Likely benign |
| 6 | c.3748G>A | rs28897686 | Likely benign |
| 7 | c.571G>A | rs80357090 | Likely benign |
∗This BRCA1 variant was not found through querying databases.
Figure 1. ROC curves of four predictive models. The ROC curves show the differing performance of the models, and the corresponding AUCs summarize each model's overall performance. The AUC values for support vector machine (light blue curve), random forest (purple curve), PolyPhen (red curve), and SIFT (green curve) were 0.74, 0.78, 0.74, and 0.78, respectively.
Figure 2. Overall performance of the optimized support vector machine, the optimized random forest, the original (not optimized) support vector machine, and the original random forest. The optimized random forest (dark blue) had a clearly larger AUC than the optimized support vector machine (light blue) and the original random forest (purple), whereas the AUC of the support vector machine did not increase appreciably from the original (yellow) to the optimized version (light blue). The AUC values of the optimized random forest and the optimized support vector machine are 0.85 and 0.75, respectively, indicating that optimization improved the random forest but not the support vector machine.
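The AUC values quoted in the two figure captions summarize how well each model ranks pathogenic variants above benign ones. The following is a minimal sketch of that quantity, assuming binary labels (1 = pathogenic) and a score where higher means more likely pathogenic. It uses the pairwise (Mann-Whitney) formulation rather than whatever plotting toolchain the authors used, and the function name `roc_auc` and the toy data are illustrative assumptions, not taken from the paper.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) formulation.

    labels: sequence of 0/1 values (1 = positive class, assumed pathogenic)
    scores: sequence of model scores, higher = more likely positive
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")

    # AUC equals the fraction of (positive, negative) pairs in which the
    # positive example receives the higher score; ties count as half.
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data invented purely for illustration (not from the paper).
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"toy AUC = {toy_auc:.3f}")
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is the scale on which the reported values of 0.74 to 0.85 should be read.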
Other performance indicators of the best random forest model. Tp, Tn, Fp, and Fn denote the numbers of true-positive, true-negative, false-positive, and false-negative instances in the confusion matrix, respectively.
| ID | Indicator | Value | Calculation |
|---|---|---|---|
| 1 | True-positive rate (a.k.a. sensitivity or recall) | 0.84 | Tp/(Tp + Fn) |
| 2 | True-negative rate (a.k.a. specificity) | 0.86 | Tn/(Tn + Fp) |
| 3 | False-positive rate | 0.13 | Fp/(Fp + Tn) |
| 4 | False-negative rate | 0.16 | Fn/(Fn + Tp) |
| 5 | Positive predictive value (a.k.a. precision) | 0.77 | Tp/(Tp + Fp) |
| 6 | Accuracy | 0.85 | (Tp + Tn)/(Tp + Tn + Fp + Fn) |
| 7 | Balanced accuracy | 0.85 | (True‐positive rate + true‐negative rate)/2 |
| 8 | F1 score | 0.80 | 2Tp/(2Tp + Fp + Fn) |
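The formulas in the "Calculation" column can be evaluated directly from the four confusion-matrix counts. The sketch below does so. The record does not report the underlying counts, so the values passed in are hypothetical and chosen only to make the arithmetic easy to follow. The function name `confusion_metrics` is likewise an assumption.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Evaluate the indicators listed in the table from raw counts."""
    sensitivity = tp / (tp + fn)   # true-positive rate, recall
    specificity = tn / (tn + fp)   # true-negative rate
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "precision": tp / (tp + fp),   # positive predictive value
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "balanced_accuracy": (sensitivity + specificity) / 2,
        "f1": 2 * tp / (2 * tp + fp + fn),
    }


# Hypothetical counts for illustration only; NOT the paper's confusion matrix.
m = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in m.items():
    print(f"{name:>20s}: {value:.2f}")
```

By construction, the false-positive rate is the complement of specificity and the false-negative rate is the complement of sensitivity. The last formula in the table is algebraically the harmonic mean of precision and recall.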
Predictive pathogenic risks for 7 VUSs of BRCA1 identified from our sequencing data.
| ID | Variation | (Original) clinical significance | Predictive pathogenic risk |
|---|---|---|---|
| 1 | c.1255G>C | Unknown | Pathogenic |
| 2 | c.824G>A | Unknown | Benign |
| 3 | c.3448C>T | Unknown | Benign |
| 4ᵃ | c.1348A>T | Unknown | Pathogenic |
| 5 | c.2566T>C | Likely benign | Benign |
| 6 | c.3748G>A | Likely benign | Benign |
| 7 | c.571G>A | Likely benign | Benign |
ᵃExcept for this variant, all variants can be found in the ClinVar database.
Overview of predicted pathogenic risks for 6321 BRCA1 VUSs.
| Class | Clinical significance (ClinVar database) | Number of variants | Predicted pathogenic variants, n (%) | Predicted benign variants, n (%) |
|---|---|---|---|---|
| 1 | Likely benign | 937 | 227 (24.22%) | 710 (75.78%) |
| 2 | Likely pathogenic | 82 | 25 (30.49%) | 57 (69.51%) |
| 3 | Not provided | 2797 | 621 (22.20%) | 2176 (77.80%) |
| 4 | Uncertain significance | 2235 | 644 (28.81%) | 1591 (71.19%) |
| 5 | Conflicting interpretations of pathogenicity | 264 | 74 (28.03%) | 190 (71.97%) |
| 6 | Total | 6315 | 1591 (25.19%) | 4724 (74.81%) |
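The per-class counts in this table can be re-added to check the totals row and to recompute the percentages. The sketch below copies the pathogenic and benign counts from the table. The variable and function names are illustrative assumptions, and recomputed percentages may differ from the printed ones in the last digit depending on the rounding convention.

```python
# (predicted pathogenic, predicted benign) counts per ClinVar class,
# copied from the table above.
rows = {
    "Likely benign": (227, 710),
    "Likely pathogenic": (25, 57),
    "Not provided": (621, 2176),
    "Uncertain significance": (644, 1591),
    "Conflicting interpretations of pathogenicity": (74, 190),
}


def summarize(rows):
    """Return {class: (n_variants, pct_pathogenic, pct_benign)}."""
    out = {}
    for name, (pathogenic, benign) in rows.items():
        total = pathogenic + benign
        out[name] = (
            total,
            round(100 * pathogenic / total, 2),
            round(100 * benign / total, 2),
        )
    return out


summary = summarize(rows)
total_pathogenic = sum(p for p, _ in rows.values())
total_benign = sum(b for _, b in rows.values())
grand_total = total_pathogenic + total_benign

for name, (n, pct_p, pct_b) in summary.items():
    print(f"{name}: n={n}, pathogenic={pct_p}%, benign={pct_b}%")
print(f"Total: n={grand_total}, pathogenic={total_pathogenic}, benign={total_benign}")
```

The class counts sum to the 6315 variants shown in the totals row, with 1591 predicted pathogenic and 4724 predicted benign.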