Samarth Rangavittal (1), Robert S Harris (1), Monika Cechova (1), Marta Tomaszkiewicz (1), Rayan Chikhi (2), Kateryna D Makova (1,3,4), Paul Medvedev (5,3,6,4).
Affiliations: 1. Department of Biology, Pennsylvania State University, University Park, PA 16802, USA. 2. CNRS, CRIStAL, 59655 Villeneuve d'Ascq, France. 3. The Center for Computational Biology and Bioinformatics. 4. The Center for Medical Genomics, Pennsylvania State University, University Park, PA 16802, USA. 5. Department of Computer Science and Engineering. 6. Department of Biochemistry and Molecular Biology.
Abstract
Motivation: The mammalian Y chromosome is usually under-represented in genome assemblies because of its high repeat content and because, being haploid, it is sequenced at low depth. One strategy to ameliorate the low coverage of Y sequences is to experimentally enrich Y-specific material before assembly. Because the enrichment process is imperfect, algorithms are needed to identify putative Y-specific reads prior to downstream assembly. A strategy that uses k-mer abundances to identify such reads was used to assemble the gorilla Y. However, that strategy required key parameters to be set manually, a time-consuming process that led to sub-optimal assemblies.

Results: We develop a method, RecoverY, that selects Y-specific reads by automatically choosing the abundance level at which a k-mer is deemed to originate from the Y. The algorithm uses prior knowledge about the Y chromosome of a related species or known Y transcript sequences. We evaluate RecoverY on both simulated and real data, for human and gorilla, and investigate its robustness to important parameters. We show that RecoverY leads to a vastly superior assembly compared to alternative strategies of filtering the reads or contigs. Compared to the preliminary strategy used by Tomaszkiewicz et al., we achieve a 33% improvement in assembly size and a 20% improvement in the NG50, demonstrating the power of automatic parameter selection.

Availability and implementation: Our tool RecoverY is freely available at https://github.com/makovalab-psu/RecoverY. Contact: kmakova@bx.psu.edu or pashadag@cse.psu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
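The idea of selecting reads by k-mer abundance, with the cutoff derived from known Y sequences rather than set by hand, can be illustrated with a minimal Python sketch. This is not the RecoverY implementation: the function names, the quantile rule for picking the cutoff, and the `min_fraction` read-retention rule are all assumptions made for illustration, and reverse complements are ignored for brevity.

```python
from collections import Counter


def kmers(seq, k):
    """Return all overlapping k-mers of a sequence (forward strand only)."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def count_kmers(reads, k):
    """Count how often each k-mer occurs across all reads."""
    counts = Counter()
    for read in reads:
        counts.update(kmers(read, k))
    return counts


def choose_threshold(counts, known_y_seqs, k, quantile=0.05):
    """Pick an abundance cutoff from k-mers seen in known Y sequences.

    Assumption: the cutoff is a low quantile of the abundances of those
    k-mers in the read set; the published method may use a different rule.
    """
    abundances = sorted(
        counts[km]
        for seq in known_y_seqs
        for km in kmers(seq, k)
        if counts[km] > 0
    )
    if not abundances:
        raise ValueError("no known Y k-mers found in the reads")
    return abundances[int(quantile * (len(abundances) - 1))]


def select_reads(reads, counts, k, threshold, min_fraction=0.5):
    """Keep reads in which enough k-mers reach the abundance cutoff.

    Assumption: min_fraction is a hypothetical retention parameter.
    """
    kept = []
    for read in reads:
        kms = kmers(read, k)
        if not kms:
            continue  # read shorter than k
        hits = sum(1 for km in kms if counts[km] >= threshold)
        if hits / len(kms) >= min_fraction:
            kept.append(read)
    return kept


# Toy data: an enriched "Y" read seen five times and one contaminant read.
reads = ["ACGTACGTAC"] * 5 + ["TTGGCCAATT"]
counts = count_kmers(reads, k=4)
threshold = choose_threshold(counts, known_y_seqs=["ACGTACG"], k=4)
y_reads = select_reads(reads, counts, k=4, threshold=threshold)
```

In this toy case the enriched reads share high-abundance k-mers with the known Y sequence and are retained, while the single low-abundance contaminant read is discarded. Real data would call for a dedicated k-mer counter and much longer k-mers than used here.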
Authors: Helen Skaletsky; Tomoko Kuroda-Kawaguchi; Patrick J Minx; Holland S Cordum; LaDeana Hillier; Laura G Brown; Sjoerd Repping; Tatyana Pyntikova; Johar Ali; Tamberlyn Bieri; Asif Chinwalla; Andrew Delehaunty; Kim Delehaunty; Hui Du; Ginger Fewell; Lucinda Fulton; Robert Fulton; Tina Graves; Shun-Fang Hou; Philip Latrielle; Shawn Leonard; Elaine Mardis; Rachel Maupin; John McPherson; Tracie Miner; William Nash; Christine Nguyen; Philip Ozersky; Kymberlie Pepin; Susan Rock; Tracy Rohlfing; Kelsi Scott; Brian Schultz; Cindy Strong; Aye Tin-Wollam; Shiaw-Pyng Yang; Robert H Waterston; Richard K Wilson; Steve Rozen; David C Page Journal: Nature Date: 2003-06-19 Impact factor: 49.962
Authors: Y Q Shirleen Soh; Jessica Alföldi; Tatyana Pyntikova; Laura G Brown; Tina Graves; Patrick J Minx; Robert S Fulton; Colin Kremitzki; Natalia Koutseva; Jacob L Mueller; Steve Rozen; Jennifer F Hughes; Elaine Owens; James E Womack; William J Murphy; Qing Cao; Pieter de Jong; Wesley C Warren; Richard K Wilson; Helen Skaletsky; David C Page Journal: Cell Date: 2014-10-30 Impact factor: 41.582
Authors: Jennifer F Hughes; Helen Skaletsky; Laura G Brown; Tatyana Pyntikova; Tina Graves; Robert S Fulton; Shannon Dugan; Yan Ding; Christian J Buhay; Colin Kremitzki; Qiaoyan Wang; Hua Shen; Michael Holder; Donna Villasana; Lynne V Nazareth; Andrew Cree; Laura Courtney; Joelle Veizer; Holland Kotkiewicz; Ting-Jan Cho; Natalia Koutseva; Steve Rozen; Donna M Muzny; Wesley C Warren; Richard A Gibbs; Richard K Wilson; David C Page Journal: Nature Date: 2012-02-22 Impact factor: 49.962
Authors: Jennifer F Hughes; Helen Skaletsky; Tatyana Pyntikova; Tina A Graves; Saskia K M van Daalen; Patrick J Minx; Robert S Fulton; Sean D McGrath; Devin P Locke; Cynthia Friedman; Barbara J Trask; Elaine R Mardis; Wesley C Warren; Sjoerd Repping; Steve Rozen; Richard K Wilson; David C Page Journal: Nature Date: 2010-01-13 Impact factor: 49.962
Authors: Michael R Crusoe; Hussien F Alameldin; Sherine Awad; Elmar Boucher; Adam Caldwell; Reed Cartwright; Amanda Charbonneau; Bede Constantinides; Greg Edvenson; Scott Fay; Jacob Fenton; Thomas Fenzl; Jordan Fish; Leonor Garcia-Gutierrez; Phillip Garland; Jonathan Gluck; Iván González; Sarah Guermond; Jiarong Guo; Aditi Gupta; Joshua R Herr; Adina Howe; Alex Hyer; Andreas Härpfer; Luiz Irber; Rhys Kidd; David Lin; Justin Lippi; Tamer Mansour; Pamela McA'Nulty; Eric McDonald; Jessica Mizzi; Kevin D Murray; Joshua R Nahum; Kaben Nanlohy; Alexander Johan Nederbragt; Humberto Ortiz-Zuazaga; Jeramia Ory; Jason Pell; Charles Pepe-Ranney; Zachary N Russ; Erich Schwarz; Camille Scott; Josiah Seaman; Scott Sievert; Jared Simpson; Connor T Skennerton; James Spencer; Ramakrishnan Srinivasan; Daniel Standage; James A Stapleton; Susan R Steinman; Joe Stein; Benjamin Taylor; Will Trimble; Heather L Wiencko; Michael Wright; Brian Wyss; Qingpeng Zhang; En Zyme; C Titus Brown Journal: F1000Res Date: 2015-09-25
Authors: Marta Tomaszkiewicz; Samarth Rangavittal; Monika Cechova; Rebeca Campos Sanchez; Howard W Fescemyer; Robert Harris; Danling Ye; Patricia C M O'Brien; Rayan Chikhi; Oliver A Ryder; Malcolm A Ferguson-Smith; Paul Medvedev; Kateryna D Makova Journal: Genome Res Date: 2016-03-02 Impact factor: 9.043
Authors: Sven Warris; Elio Schijlen; Henri van de Geest; Rahulsimham Vegesna; Thamara Hesselink; Bas Te Lintel Hekkert; Gabino Sanchez Perez; Paul Medvedev; Kateryna D Makova; Dick de Ridder Journal: BMC Genomics Date: 2018-11-06 Impact factor: 3.969
Authors: Benjamin N Sacks; Zachary T Lounsberry; Halie M Rando; Kristopher Kluepfel; Steven R Fain; Sarah K Brown; Anna V Kukekova Journal: Genes (Basel) Date: 2021-01-14 Impact factor: 4.096