Yongzhuang Liu1, Bingshan Li2, Renjie Tan1, Xiaolin Zhu2, Yadong Wang2. 1, 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China; Center for Human Genome Variation, Duke University, Durham, NC 27708, USA; and Center for Human Genetics Research, Department of Molecular Physiology and Biophysics, Vanderbilt University, Nashville, TN 37235, USA.
Abstract
MOTIVATION: Whole-genome and -exome sequencing of parent-offspring trios is a powerful approach to identifying disease-associated genes by detecting de novo mutations in patients. Accurate detection of de novo mutations from sequencing data is a critical step in trio-based genetic studies. Because of sequencing artifacts and alignment issues, existing bioinformatic approaches usually have high error rates: they may miss true de novo mutations or call too many false ones, which makes downstream validation and analysis difficult. In particular, current approaches have much worse specificity than sensitivity, and developing effective filters to discriminate genuine from spurious de novo mutations remains an unsolved challenge. RESULTS: In this article, we curated 59 sequence features from whole-genome and exome alignment contexts that are considered relevant to discriminating true de novo mutations from artifacts, and then employed a machine-learning approach to classify candidates as true or false de novo mutations. Specifically, we built a classifier, named De Novo Mutation Filter (DNMFilter), using gradient boosting as the classification algorithm. We built the training set from experimentally validated true and false de novo mutations, together with false de novo mutations collected from an in-house large-scale exome-sequencing project. We evaluated DNMFilter's theoretical performance and investigated the relative importance of different sequence features for classification accuracy. Finally, we applied DNMFilter to our in-house whole-exome trios and to one CEU trio from the 1000 Genomes Project and found that DNMFilter can be coupled with commonly used de novo mutation detection approaches as an effective filter that significantly reduces the false discovery rate without sacrificing sensitivity. AVAILABILITY: DNMFilter, implemented in a combination of Java and R, is freely available at http://humangenome.duke.edu/software.
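To make the classification step concrete, the following is a minimal, illustrative sketch of gradient boosting applied to candidate de novo mutations. It is not the DNMFilter implementation (which is written in Java and R) and does not use its 59 features: the three features (child alternate-allele fraction, parental alternate-read count, mapping quality), the synthetic data generator, and all function names are hypothetical. The booster is a simplified variant that fits single-split regression stumps to the gradient of the logistic loss.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_toy_candidates(n_per_class, seed=0):
    """Synthetic candidates with three HYPOTHETICAL features:
    [child alt-allele fraction, parental alt-read count, mapping quality].
    Label 1 = true de novo mutation, 0 = artifact."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_per_class):
        X.append([rng.gauss(0.50, 0.06), rng.randint(0, 1), rng.gauss(58, 3)])
        y.append(1)
        X.append([rng.gauss(0.20, 0.08), rng.randint(0, 4), rng.gauss(45, 10)])
        y.append(0)
    return X, y


def fit_stump(X, residuals):
    """Least-squares fit of a one-split regression stump to the residuals."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm = sum(left) / len(left)
            rm = sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    if best is None:
        raise ValueError("no feature has more than one distinct value")
    return best[1:]


def train(X, y, rounds=30, lr=0.5):
    """Boost stumps on the negative gradient of the logistic loss (y - p)."""
    pos = sum(y) / len(y)
    base = math.log(pos / (1.0 - pos))  # initial log-odds; needs both classes
    scores = [base] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        scores = [s + lr * (lm if row[j] <= thr else rm)
                  for row, s in zip(X, scores)]
    return base, lr, stumps


def predict_proba(model, row):
    """Probability that a candidate is a true de novo mutation."""
    base, lr, stumps = model
    score = base + sum(lr * (lm if row[j] <= thr else rm)
                       for j, thr, lm, rm in stumps)
    return sigmoid(score)


if __name__ == "__main__":
    X, y = make_toy_candidates(100)
    model = train(X, y)
    kept = [row for row in X if predict_proba(model, row) >= 0.5]
    print(f"kept {len(kept)} of {len(X)} candidates")
```

In a filtering workflow of the kind the abstract describes, candidates emitted by a de novo mutation caller would be scored this way and only those above a chosen probability cutoff retained; the 0.5 cutoff here is an arbitrary choice for illustration.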