Christos Kozanitis¹, Andrew Heiberg, George Varghese, Vineet Bafna.
¹ Computer Science and Engineering, University of California San Diego, 9500 Gilman Drive, San Diego, CA 92123 and Microsoft Research, 1065 La Avenida, Mountain View, CA 94043, USA.
Abstract
MOTIVATION: With high-throughput DNA sequencing costs dropping below $1000 per human genome, data storage, retrieval and analysis have become the major bottlenecks in biological studies. To address these large-data challenges, we advocate a clean separation between evidence collection and inference in variant calling. We define and implement a Genome Query Language (GQL) that allows rapid collection of the evidence needed for calling variants. RESULTS: We provide a number of cases that showcase the use of GQL for complex evidence collection, such as the evidence for large structural variations. Specifically, typical GQL queries can be written in 5-10 lines of high-level code and search large datasets (100 GB) in minutes. We also demonstrate its complementarity with other variant calling tools: popular variant calling tools can achieve a speed-up of one order of magnitude by using GQL to retrieve evidence. Finally, we show how GQL can be used to query and compare multiple datasets. By separating evidence from inference in variant calling, GQL frees all variant detection tools from data-intensive evidence collection and lets them focus on statistical inference. AVAILABILITY: GQL can be downloaded from http://cseweb.ucsd.edu/~ckozanit/gql.
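The separation between evidence collection and inference described in the abstract can be illustrated with a minimal Python sketch. This is not GQL syntax and not the authors' implementation: the `ReadPair` type, the `collect_evidence` and `infer_deletion` functions, the span threshold, and the toy data are all hypothetical, chosen only to show a filter over read pairs (evidence) kept apart from the decision rule (inference).

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ReadPair:
    """Hypothetical paired-end read record (not a GQL or SAM type)."""
    chrom: str
    left_start: int  # leftmost mapped position of the pair
    right_end: int   # rightmost mapped position of the pair

    @property
    def span(self) -> int:
        return self.right_end - self.left_start


def collect_evidence(pairs: Iterable[ReadPair], chrom: str,
                     start: int, end: int, max_span: int) -> List[ReadPair]:
    """Evidence step: keep pairs fully inside [start, end] on `chrom`
    whose mapped span exceeds `max_span` (discordant pairs)."""
    return [
        p for p in pairs
        if p.chrom == chrom
        and start <= p.left_start
        and p.right_end <= end
        and p.span > max_span
    ]


def infer_deletion(evidence: List[ReadPair], min_support: int) -> bool:
    """Inference step: a deliberately naive rule (assumption) that calls a
    deletion when enough discordant pairs support it."""
    return len(evidence) >= min_support


# Toy data, invented for illustration only.
pairs = [
    ReadPair("chr1", 1000, 1400),  # concordant: span 400
    ReadPair("chr1", 1200, 3900),  # discordant: span 2700
    ReadPair("chr1", 1250, 3950),  # discordant: span 2700
    ReadPair("chr2", 1300, 4000),  # different chromosome
]

evidence = collect_evidence(pairs, "chr1", 1000, 5000, max_span=1000)
print(len(evidence), infer_deletion(evidence, min_support=2))
```

Because the inference function only sees the filtered evidence, it could be replaced by a statistical caller without touching the data-intensive filtering step, which is the division of labor the abstract argues for.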