K Joeri van der Velde, Eddy N de Boer, Cleo C van Diemen, Birgit Sikkema-Raddatz, Kristin M Abbott, Alain Knopperts, Lude Franke, Rolf H Sijmons, Tom J de Koning, Cisca Wijmenga, Richard J Sinke, Morris A Swertz.
Abstract
We present Gene-Aware Variant INterpretation (GAVIN), a new method that accurately classifies variants for clinical diagnostic purposes. Classifications are based on gene-specific calibrations of allele frequencies from the ExAC database, likely variant impact using SnpEff, and estimated deleteriousness based on CADD scores for >3000 genes. In a benchmark on 18 clinical gene sets, we achieve a sensitivity of 91.4% and a specificity of 76.9%. This accuracy is unmatched by 12 other tools. We provide GAVIN as an online MOLGENIS service to annotate VCF files and as an open source executable for use in bioinformatic pipelines. It can be found at http://molgenis.org/gavin.
Keywords: Allele frequency; Automated protocol; Clinical next-generation sequencing; Gene-specific calibration; Pathogenicity prediction; Protein impact; Variant classification
Year: 2017 PMID: 28093075 PMCID: PMC5240400 DOI: 10.1186/s13059-016-1141-7
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Variant and classification origins of the benchmark datasets used
| Dataset | Benign variants (n) | Pathogenic variants (n) | Origin |
|---|---|---|---|
| VariBench tolerance DS7, training set | 11,347 | 6143 | PhenCode database, IDbases, and 18 individual LSDBs |
| VariBench tolerance DS7, test set | 1377 | 510 | PhenCode database, IDbases, and 18 individual LSDBs |
| MutationTaster2 benchmark set | 1194 | 161 | HGMD Professional and 1000 Genomes |
| ClinVar (additions of Nov 2015 to Feb 2016) | 1668 | 1688 | Submissions by clinical molecular geneticists, expert panels, diagnostic laboratories, and companies |
| UMCG, variants exported from clinical diagnostic interpretation software | 1176 | 174 | Clinical diagnostic classifications of variants in cardiology, dermatology, epilepsy, dystonia, and preconception screening |
| UMCG, germline variants for familial cancer cases | 301 | 26 | Hereditary cancer variant classifications by an MD following ACMG guidelines |
| Total | 17,063 | 8702 | 25,765 variants in total |
Stratification of the combined variant dataset into manifestation categories
| CGD manifestation panel | Genes (n) | Variants (n) | Likely pathogenic/pathogenic variants (n) |
|---|---|---|---|
| Allergy/Immunology/Infectious | 253 | 1952 | 1324 |
| Audiologic/Otolaryngologic | 217 | 1215 | 668 |
| Biochemical | 354 | 2538 | 1933 |
| Cardiovascular | 446 | 4360 | 2408 |
| Craniofacial | 387 | 1861 | 1106 |
| Dental | 80 | 783 | 518 |
| Dermatologic | 345 | 2749 | 1662 |
| Endocrine | 240 | 1801 | 1340 |
| Gastrointestinal | 338 | 2351 | 1620 |
| Genitourinary | 149 | 1026 | 753 |
| Hematologic | 267 | 2571 | 1914 |
| Musculoskeletal | 676 | 4935 | 2864 |
| Neurologic | 1012 | 6363 | 4055 |
| Obstetric | 34 | 223 | 140 |
| Oncologic | 203 | 2157 | 1207 |
| Ophthalmologic | 479 | 3649 | 2406 |
| Pulmonary | 90 | 717 | 485 |
| Renal | 302 | 2143 | 1459 |
The categories are defined by the Clinical Genomics Database (CGD) and are associated with clinically relevant genes. Variants were allocated to the manifestation categories based on their gene and were placed in multiple categories if a gene was associated with multiple manifestations.
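The allocation rule in the note above can be illustrated with a minimal Python sketch. The gene names, the gene-to-category map, and the `allocate` helper are hypothetical and serve only to show the logic; they are not taken from the Clinical Genomics Database or from GAVIN itself.

```python
from collections import defaultdict

# Hypothetical gene-to-category map; names and categories are illustrative only.
GENE_CATEGORIES = {
    "GENE_A": ["Cardiovascular", "Musculoskeletal"],
    "GENE_B": ["Neurologic"],
}

def allocate(variants, gene_categories):
    """Group variants by manifestation category via their gene.

    A variant is placed in every category its gene is associated with;
    variants in genes without any category are skipped.
    """
    panels = defaultdict(list)
    for variant in variants:
        for category in gene_categories.get(variant["gene"], []):
            panels[category].append(variant)
    return dict(panels)

variants = [
    {"id": "v1", "gene": "GENE_A", "label": "pathogenic"},
    {"id": "v2", "gene": "GENE_B", "label": "benign"},
    {"id": "v3", "gene": "GENE_C", "label": "benign"},
]
panels = allocate(variants, GENE_CATEGORIES)
```

Because one variant can land in several panels, per-panel counts are not expected to sum to the size of the combined dataset; the variant counts in the table indeed add up to more than 25,765.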
Performance overview of all tested tools
| Tool | Median sensitivity (%) | Median specificity (%) |
|---|---|---|
| CADD (thr. 15) | 93.6 | 57.1 |
| CADD (thr. 20) | 90.4 | 68.8 |
| CADD (thr. 25) | 71.5 | 85.3 |
| Condel | 70.3 | 39.5 |
| DANN | 63.8 | 66.7 |
| FATHMM | 69.5 | 61.9 |
| FunSeq | 61.7 | 50.2 |
| GAVIN | 91.4 | 76.9 |
| GWAVA | 47.6 | 26.2 |
| MSC_ClinVar95CI | 84.7 | 64.4 |
| MSC_HGMD99CI | 97.1 | 25.7 |
| PolyPhen2 | 68.0 | 46.8 |
| PONP2 | 47.5 | 26.9 |
| PredictSNP2 | 66.8 | 70.6 |
| PROVEAN | 65.9 | 62.1 |
| SIFT | 67.9 | 57.9 |
Fig. 1. Performance of GAVIN and other tools across different clinical gene sets. Prediction quality is measured as sensitivity and specificity, i.e. the fraction of pathogenic variants correctly identified and the fraction of misclassifications/non-classifications while doing so.
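As a rough illustration of the two measures, the sketch below computes sensitivity and specificity with their standard definitions. How the benchmark itself scores non-classifications is not spelled out in this record, so counting them as incorrect is an assumption, and the `sensitivity_specificity` helper and the toy labels are hypothetical.

```python
def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) as percentages.

    truth: list of "pathogenic" / "benign" labels.
    predicted: list of "pathogenic" / "benign" / None, where None is a
    non-classification and is counted as incorrect (an assumption).
    """
    pathogenic = [p for t, p in zip(truth, predicted) if t == "pathogenic"]
    benign = [p for t, p in zip(truth, predicted) if t == "benign"]
    if not pathogenic or not benign:
        raise ValueError("need at least one pathogenic and one benign variant")
    sens = 100.0 * sum(p == "pathogenic" for p in pathogenic) / len(pathogenic)
    spec = 100.0 * sum(p == "benign" for p in benign) / len(benign)
    return sens, spec

# Toy example: 4 pathogenic and 4 benign variants, two left unclassified.
truth = ["pathogenic"] * 4 + ["benign"] * 4
predicted = ["pathogenic", "pathogenic", "pathogenic", None,
             "benign", "benign", "pathogenic", None]
sens, spec = sensitivity_specificity(truth, predicted)
```

In this toy case three of four pathogenic variants are recovered (75% sensitivity) and two of four benign variants are called benign (50% specificity). A median across gene sets, as reported in the performance table, could then be taken with `statistics.median` over the per-panel values.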
Fig. 2. Comparison of gene-specific classification thresholds with genome-wide fixed thresholds in three groups of genes: 737 genes for which CADD is predictive, 684 genes for which CADD is less predictive, and 766 genes with scarce training data. For each group, 10,000 sets of 100 benign and 100 pathogenic variants were randomly sampled and tested from the full set of 25,765 variants and accuracy was calculated for gene-specific and genome-wide CADD and MAF thresholds.
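The resampling procedure in this caption can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the data are synthetic, the genes and thresholds are hypothetical, only a single score threshold is used (no MAF threshold), and the example run uses 200 sets instead of 10,000 to keep it fast.

```python
import random

def classify(score, threshold):
    """Call a variant pathogenic when its score exceeds the threshold."""
    return "pathogenic" if score > threshold else "benign"

def resampled_accuracy(benign, pathogenic, threshold_for,
                       n_sets=10_000, n_each=100, seed=0):
    """Mean accuracy over balanced random samples of variants.

    benign, pathogenic: lists of (gene, score) tuples.
    threshold_for: function mapping a gene name to its score threshold.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_sets):
        sample = [(g, s, "benign") for g, s in rng.sample(benign, n_each)]
        sample += [(g, s, "pathogenic") for g, s in rng.sample(pathogenic, n_each)]
        correct = sum(classify(s, threshold_for(g)) == label
                      for g, s, label in sample)
        total += correct / len(sample)
    return total / n_sets

# Synthetic data: two hypothetical genes whose score distributions differ.
rng = random.Random(42)
MEANS = {"GENE_A": (10.0, 20.0), "GENE_B": (22.0, 32.0)}  # (benign, pathogenic)
benign = [(g, rng.gauss(m[0], 3.0)) for g, m in MEANS.items() for _ in range(500)]
pathogenic = [(g, rng.gauss(m[1], 3.0)) for g, m in MEANS.items() for _ in range(500)]

GENE_THRESHOLDS = {"GENE_A": 15.0, "GENE_B": 27.0}  # hypothetical calibrations
GENOME_WIDE = 20.0                                   # hypothetical fixed cut-off

gene_specific = resampled_accuracy(benign, pathogenic,
                                   lambda g: GENE_THRESHOLDS[g], n_sets=200)
genome_wide = resampled_accuracy(benign, pathogenic,
                                 lambda g: GENOME_WIDE, n_sets=200)
```

On this synthetic data the gene-specific thresholds score higher by construction, since each gene's cut-off sits between its own benign and pathogenic distributions. The sketch only shows the mechanics of the comparison and says nothing about the size of the effect on real variants.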