FATHMM-XF: accurate prediction of pathogenic point mutations via extended features.

Mark F Rogers¹, Hashem A Shihab², Matthew Mort³, David N Cooper³, Tom R Gaunt², Colin Campbell¹.

Abstract

Summary: We present FATHMM-XF, a method for predicting pathogenic point mutations in the human genome. Drawing on an extensive feature set, FATHMM-XF outperforms competitors on benchmark tests, particularly in non-coding regions where the majority of pathogenic mutations are likely to be found.

Availability and implementation: The FATHMM-XF web server is available at http://fathmm.biocompute.org.uk/fathmm-xf/, and as tracks on the Genome Tolerance Browser: http://gtb.biocompute.org.uk. Predictions are provided for human genome version GRCh37/hg19. The data used for this project can be downloaded from: http://fathmm.biocompute.org.uk/fathmm-xf/.

Contact: mark.rogers@bristol.ac.uk or c.campbell@bristol.ac.uk.

Supplementary information: Supplementary data are available at Bioinformatics online.

Year: 2018 PMID: 28968714 PMCID: PMC5860356 DOI: 10.1093/bioinformatics/btx536

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Many classifiers have been proposed for predicting the impact of single-nucleotide variants (SNVs) in the human genome (see Liu ). Initially these focused on non-synonymous mutations in coding regions of the genome, but most documented pathogenic SNVs come from non-coding regions, so more recent methods make predictions genome wide (Kircher ; Shihab ). CADD (Kircher ) has emerged as a standard for predicting pathogenic SNVs, although its performance has been challenged (Liu ). The recent GAVIN method adjusts CADD scores in a gene-specific manner, achieving greater accuracy than CADD, whilst assigning distinct Pathogenic and Benign labels that simplify interpretation (van der Velde ). Here we present FATHMM with an eXtended Feature set (FATHMM-XF) which yields highly accurate predictions for SNVs across the entire human genome. FATHMM-XF assigns a confidence score (a p-score) to every prediction, to simplify interpretation, and focus analysis on a subset of high-confidence predictions (cautious classification). In all tests, FATHMM-XF matches or outperforms competing methods, with its best performance in non-coding regions, where the majority of pathogenic SNVs are likely to be found. With cautious classification, FATHMM-XF consistently exceeds 94% accuracy on subsets of 80% of the highest-confidence predictions from benchmark test sets.

2 Materials and methods

To build FATHMM-XF we use supervised machine learning with labeled examples ascribed to pathogenic (positive) or benign (neutral) mutations. We obtain positive examples from the Human Gene Mutation Database (HGMD; Stenson ), and neutral examples from the 1000 Genomes Project (The 1000 Genomes Project Consortium, 2012). We restrict neutral data to SNVs with a global minor allele frequency ≤1% and remove any that appear in the pathogenic dataset. To mitigate potential bias, we filter neutral examples, selecting only those within 1000 positions of a pathogenic mutation (Supplementary Section S2). In addition, we remove sex chromosomes X and Y to avoid potential biases that might arise when allosomes are included. Our final training set consists of 156 775 coding examples and 25 720 non-coding examples.

We characterise SNVs using features from 27 data sets (herein called feature groups) from ENCODE (The ENCODE Project Consortium, 2012) and NIH Roadmap Epigenomics (Bernstein ) that have proved informative in other domains (Shihab , 2017b). We construct four additional feature groups from conservation scores, the Variant Effect Predictor (McLaren ), annotated gene models, and the DNA sequence itself (Supplementary Section S3). We convert feature groups into kernels to evaluate different combinations and kernel-based models.

k-fold cross-validation is commonly used to evaluate models, but can introduce bias if, for example, the same gene is represented in both training and test sets. Instead, we use leave-one-chromosome-out cross-validation (LOCO-CV): for each fold we set aside one chromosome for testing and use the remaining chromosomes for training. We use Platt scaling (Platt, 1999) to assign a p-score to each prediction (the probability that a particular SNV is pathogenic). For cautious classification, we then establish confidence thresholds to analyse sub-populations of high-confidence predictions.
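The neutral-example filter and the LOCO-CV split described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names `filter_neutral_by_proximity` and `loco_folds`, the `(chrom, pos)` tuple representation of an SNV, and the inclusive treatment of the 1000-position window are all assumptions made for the example. Removal of neutral SNVs that also appear in the pathogenic set is not shown.

```python
from bisect import bisect_left
from collections import defaultdict


def filter_neutral_by_proximity(neutral, pathogenic, window=1000):
    """Keep neutral SNVs within `window` positions of a pathogenic SNV.

    Both inputs are lists of (chrom, pos) tuples (an assumed representation).
    The window is treated as inclusive, which is also an assumption.
    """
    # Index pathogenic positions by chromosome and sort them for bisection.
    path_by_chrom = defaultdict(list)
    for chrom, pos in pathogenic:
        path_by_chrom[chrom].append(pos)
    for positions in path_by_chrom.values():
        positions.sort()

    kept = []
    for chrom, pos in neutral:
        positions = path_by_chrom.get(chrom, [])
        i = bisect_left(positions, pos)
        # The nearest pathogenic positions sit at i - 1 (left) and i (right).
        near_left = i > 0 and pos - positions[i - 1] <= window
        near_right = i < len(positions) and positions[i] - pos <= window
        if near_left or near_right:
            kept.append((chrom, pos))
    return kept


def loco_folds(chroms):
    """Yield (held_out_chrom, train_idx, test_idx) for each chromosome.

    `chroms` gives the chromosome label of every example, in order.
    """
    for held_out in sorted(set(chroms)):
        test_idx = [i for i, c in enumerate(chroms) if c == held_out]
        train_idx = [i for i, c in enumerate(chroms) if c != held_out]
        yield held_out, train_idx, test_idx


if __name__ == "__main__":
    # Toy coordinates, invented purely for illustration.
    pathogenic = [("1", 5000), ("2", 100)]
    neutral = [("1", 4000), ("1", 6001), ("1", 5500), ("2", 2000)]
    print(filter_neutral_by_proximity(neutral, pathogenic))
    for chrom, train, test in loco_folds(["1", "1", "2", "3"]):
        print(chrom, train, test)
```

Because every example from the held-out chromosome lands in the test fold, no gene can contribute to both training and testing within a fold, which is the source of bias the text attributes to ordinary k-fold splitting.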

3 Results

For non-coding regions, the best model incorporates five feature groups, achieving 92.3% accuracy in LOCO-CV (Supplementary Table S6). Briefly, these feature groups encapsulate sequence conservation, proximity to genomic features (e.g. splice sites or transcription start sites) and chromatin accessibility. Cautious classification reaches 99% peak accuracy at a p-score threshold of τ = 0.96 (Supplementary Fig. S2). This high-confidence subset of examples (p ≥ 0.96 or ≤ 0.04) comprises nearly 40% of test examples, demonstrating that the threshold is not prohibitively restrictive. Relaxing the threshold enlarges this subset dramatically whilst retaining high accuracy: with a lower threshold, we cover 90% of examples with accuracy over 95% (Supplementary Section S4).

For coding regions, the best model uses six feature groups, reaching 88.0% accuracy (Supplementary Table S8). Again, conservation features are most informative, along with proximity to genomic features and nucleotide sequence features (Supplementary Section S3). Cautious classification achieves peak accuracy of 98% at τ = 0.97 (Supplementary Fig. S2). This highest-confidence subset again comprises nearly 40% of examples; with a lower threshold, it includes 80% of examples with accuracy above 94.0%. We use these peak accuracy thresholds (0.96 for non-coding, 0.97 for coding) in subsequent analyses.

We compared FATHMM-XF with four genome-wide SNV prediction methods: CADD (Kircher ), DANN (Quang et al., 2014), FATHMM-MKL (Shihab ) and GAVIN (van der Velde ). When we compared FATHMM-MKL LOCO-CV test results with competitors evaluated on the same data, FATHMM-XF achieved the highest accuracy of all, at 93% (Supplementary Section S5). In coding regions, FATHMM-XF and its closest competitor, GAVIN, yielded similar accuracy (88 and 89%, respectively). As reported earlier, FATHMM-XF yielded exceptionally high accuracy in cautious classification on these data, whilst consistently yielding predictions for nearly 40% of examples.

To evaluate how well FATHMM-XF will generalise, we tested all methods on test sets we assembled from ClinVar data (Landrum ) (Supplementary Section S5). After removing any ClinVar examples found in our training sets, the test sets comprised 31 099 non-coding and 62 884 coding SNVs. In non-coding regions, FATHMM-XF matches or outperforms other methods, reaching 89% accuracy and 0.97 area under the ROC curve (AUC, Table 1, top). FATHMM-MKL yields comparable accuracy, but tends to under-perform the new model. GAVIN achieves higher MCC and PPV scores at the expense of lower accuracy. In cautious classification, FATHMM-XF yields exceptionally high scores, covering 30.9% of examples. In coding regions, it reaches 88% accuracy and 0.96 AUC (Table 1, bottom). GAVIN yields nominally higher accuracy (and, notably, 26 percentage points higher than CADD, upon which it is based), but at lower MCC and PPV. With cautious classification, FATHMM-XF again yields exceptional performance, covering 42.4% of examples. FATHMM-XF at its default threshold covers 100% of test examples, as do the other methods tested.
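The trade-off between coverage and accuracy under cautious classification can be made concrete with a short sketch. It assumes a symmetric rule consistent with the "p ≥ 0.96 or ≤ 0.04" description: an SNV is called pathogenic when its p-score is at least τ, benign when it is at most 1 − τ, and left unclassified otherwise. The function name `cautious_classify` and the toy scores are hypothetical, and treating τ = 0.5 as the default threshold that covers every example is an assumption.

```python
def cautious_classify(p_scores, labels, tau=0.96):
    """Score only the predictions whose p-score is confident enough.

    p_scores: probabilities that each SNV is pathogenic, in [0, 1].
    labels:   true classes, 1 = pathogenic, 0 = benign.
    Returns (coverage, accuracy); accuracy is None when nothing is covered.
    """
    if not 0.5 <= tau <= 1.0:
        raise ValueError("tau must lie in [0.5, 1.0]")

    covered = 0
    correct = 0
    for p, y in zip(p_scores, labels):
        if p >= tau:
            pred = 1  # confident pathogenic call
        elif p <= 1.0 - tau:
            pred = 0  # confident benign call
        else:
            continue  # abstain: the p-score falls in the uncertain band
        covered += 1
        correct += int(pred == y)

    coverage = covered / len(p_scores) if p_scores else 0.0
    accuracy = correct / covered if covered else None
    return coverage, accuracy


if __name__ == "__main__":
    # Invented p-scores and labels, for illustration only.
    p = [0.99, 0.98, 0.02, 0.01, 0.60, 0.30, 0.97, 0.03]
    y = [1, 0, 0, 0, 1, 0, 1, 1]
    for tau in (0.5, 0.96):
        print(tau, cautious_classify(p, y, tau))
```

Raising τ shrinks the covered subset, so reported accuracy at a given τ should always be read together with its coverage, as the text does when quoting 30.9% and 42.4%.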

Table 1. FATHMM-XF yields state-of-the-art accuracy on unseen ClinVar examples in both non-coding regions and coding regions.

Non-coding regions
Method	Acc.	AUC	Sens.	Spec.	MCC	PPV
FATHMM-XF	0.89	0.97	0.95	0.84	0.53	0.36
Cautious (τ=0.96)	0.96	0.99	0.99	0.93	0.87	0.82
FATHMM-MKL	0.88	0.95	0.94	0.82	0.49	0.33
GAVIN	0.87	—	0.82	0.93	0.61	0.52
CADD (v1.3)	0.64	0.95	0.98	0.30	0.18	0.12
DANN	0.61	0.95	0.99	0.23	0.15	0.11

Coding regions
Method	Acc.	AUC	Sens.	Spec.	MCC	PPV
FATHMM-XF	0.88	0.96	0.84	0.92	0.76	0.83
Cautious (τ=0.97)	0.97	0.99	0.94	1.00	0.96	0.99
GAVIN	0.89	—	0.90	0.87	0.74	0.76
FATHMM-MKL	0.80	0.90	0.91	0.70	0.56	0.58
CADD (v1.3)	0.63	0.91	0.98	0.29	0.30	0.38
DANN	0.60	0.89	0.99	0.20	0.25	0.36

Note: (Top) FATHMM-XF yields the highest accuracy on unseen ClinVar examples for non-coding regions, outperforming its nearest competitor, FATHMM-MKL. Cautious classification yields exceptionally high scores, yielding predictions for 31% of examples. (Bottom) FATHMM-XF yields higher accuracy, AUC, MCC and PPV scores than competitors on unseen ClinVar examples in coding regions. The lone exception is GAVIN, with nominally higher accuracy. Cautious classification again achieves extremely high scores, yielding predictions for more than 42% of test examples.
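For readers less familiar with the column headings in Table 1, the sketch below computes accuracy, sensitivity, specificity, PPV and MCC from a confusion matrix using their standard definitions. The function name `binary_metrics` and the counts in the usage example are hypothetical and are not taken from the paper; whether the paper's accuracy column is computed in exactly this way is not stated here, so treat this only as a reference for the definitions.

```python
import math


def _ratio(num, den):
    """Return num / den, or NaN when the denominator is zero."""
    return num / den if den else float("nan")


def binary_metrics(tp, fp, tn, fn):
    """Standard binary-classification metrics from confusion-matrix counts.

    tp/fn count pathogenic examples called pathogenic/benign;
    tn/fp count benign examples called benign/pathogenic.
    """
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "acc": _ratio(tp + tn, tp + fp + tn + fn),
        "sens": _ratio(tp, tp + fn),   # true positive rate
        "spec": _ratio(tn, tn + fp),   # true negative rate
        "ppv": _ratio(tp, tp + fp),    # positive predictive value
        "mcc": _ratio(tp * tn - fp * fn, mcc_den),
    }


if __name__ == "__main__":
    # Hypothetical imbalanced test set: 100 pathogenic vs 1000 benign SNVs.
    for name, value in binary_metrics(tp=95, fp=160, tn=840, fn=5).items():
        print(f"{name}: {value:.3f}")
```

With these invented counts, sensitivity and specificity are both high while PPV stays well below 0.5, because false positives drawn from the much larger benign class outnumber the true positives. This is one way a classifier can show the pattern of high Sens. and Spec. but low PPV.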

4 Discussion

At default thresholds, FATHMM-XF matches or outperforms competing methods using an eclectic mixture of data sources. Even when all methods are optimised, FATHMM-XF yields substantially higher accuracy in all of our tests (Supplementary Figs S7–S10). Under cautious classification, accuracy exceeds 95%, whilst producing predictions for up to 80% of positions genome-wide. While the proposed classifiers achieve high accuracy, further improvement seems possible. Notably, all methods exhibit low PPV on non-coding data except for FATHMM-XF’s cautious classification. Analysis of these variants (Supplementary Fig. S1) reveals differences in the proportions of intron and UTR variants represented in the training and test sets. Hence region-specific models may improve performance in non-coding regions, just as GAVIN’s gene-specific thresholding improves accuracy for CADD scores—by up to 26 percentage points in our tests. We will explore these approaches in future work. The FATHMM-XF web server for GRCh37/hg19 is available at fathmm.biocompute.org.uk/fathmm-xf, and as tracks on the Genome Tolerance Browser (gtb.biocompute.org.uk; Shihab ).

Funding

MR was supported by the Engineering and Physical Sciences Research Council (EPSRC) grants [EP/M01715X/1] and [EP/K008250/1]. TRG was supported by the Medical Research Council Integrative Epidemiology Unit (MRC IEU) [MC_UU_12013/8]. MM & DNC gratefully acknowledge the financial support of Qiagen Inc. through a licence agreement with Cardiff University.

Conflict of Interest: none declared.

References

1. DANN: a deep learning approach for annotating the pathogenicity of genetic variants.

Authors: Daniel Quang; Yifei Chen; Xiaohui Xie
Journal: Bioinformatics Date: 2014-10-22 Impact factor: 6.937

2. The NIH Roadmap Epigenomics Mapping Consortium.

Authors: Bradley E Bernstein; John A Stamatoyannopoulos; Joseph F Costello; Bing Ren; Aleksandar Milosavljevic; Alexander Meissner; Manolis Kellis; Marco A Marra; Arthur L Beaudet; Joseph R Ecker; Peggy J Farnham; Martin Hirst; Eric S Lander; Tarjei S Mikkelsen; James A Thomson
Journal: Nat Biotechnol Date: 2010-10 Impact factor: 54.908

3. An integrative approach to predicting the functional effects of non-coding and coding sequence variation.

Authors: Hashem A Shihab; Mark F Rogers; Julian Gough; Matthew Mort; David N Cooper; Ian N M Day; Tom R Gaunt; Colin Campbell
Journal: Bioinformatics Date: 2015-01-11 Impact factor: 6.937

4. HIPred: an integrative approach to predicting haploinsufficient genes.

Authors: Hashem A Shihab; Mark F Rogers; Colin Campbell; Tom R Gaunt
Journal: Bioinformatics Date: 2017-06-15 Impact factor: 6.937

5. The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies.

Authors: Peter D Stenson; Matthew Mort; Edward V Ball; Katy Evans; Matthew Hayden; Sally Heywood; Michelle Hussain; Andrew D Phillips; David N Cooper
Journal: Hum Genet Date: 2017-03-27 Impact factor: 4.132

6. GAVIN: Gene-Aware Variant INterpretation for medical sequencing.

Authors: K Joeri van der Velde; Eddy N de Boer; Cleo C van Diemen; Birgit Sikkema-Raddatz; Kristin M Abbott; Alain Knopperts; Lude Franke; Rolf H Sijmons; Tom J de Koning; Cisca Wijmenga; Richard J Sinke; Morris A Swertz
Journal: Genome Biol Date: 2017-01-16 Impact factor: 13.583

7. An integrated encyclopedia of DNA elements in the human genome.

Authors: The ENCODE Project Consortium
Journal: Nature Date: 2012-09-06 Impact factor: 49.962

8. An integrated map of genetic variation from 1,092 human genomes.

Authors: Goncalo R Abecasis; Adam Auton; Lisa D Brooks; Mark A DePristo; Richard M Durbin; Robert E Handsaker; Hyun Min Kang; Gabor T Marth; Gil A McVean
Journal: Nature Date: 2012-11-01 Impact factor: 49.962

9. ClinVar: public archive of relationships among sequence variation and human phenotype.

Authors: Melissa J Landrum; Jennifer M Lee; George R Riley; Wonhee Jang; Wendy S Rubinstein; Deanna M Church; Donna R Maglott
Journal: Nucleic Acids Res Date: 2013-11-14 Impact factor: 16.971

10. The Ensembl Variant Effect Predictor.

Authors: William McLaren; Laurent Gil; Sarah E Hunt; Harpreet Singh Riat; Graham R S Ritchie; Anja Thormann; Paul Flicek; Fiona Cunningham
Journal: Genome Biol Date: 2016-06-06 Impact factor: 13.583
