MOTIVATION: Sequences produced by automated Sanger sequencing machines frequently contain fragments of the cloning vector at their ends. Currently available software tools for identifying and removing vector sequence require knowledge of the vector sequence, the specific splice sites and any adapter sequences used in the experiment, information that is often omitted from public databases. Furthermore, the clipping coordinates themselves are frequently missing or incorrectly reported. For example, of the approximately 1.24 billion shotgun sequences deposited in the NCBI Trace Archive, approximately 735 million (about 60%) lack vector clipping information. Correct clipping information is essential to scientists attempting to validate, improve and even finish the increasingly large number of genomes released at a 'draft' quality level.

RESULTS: We present Figaro, a novel software tool for identifying and removing vector from raw sequence data without prior knowledge of the vector sequence. The vector sequence is inferred automatically by analyzing the frequency of occurrence of short oligonucleotides using Poisson statistics. We show that Figaro achieves 99.98% sensitivity when tested on approximately 1.5 million shotgun reads from Drosophila pseudoobscura. We further explore the impact of accurate vector trimming on the quality of whole-genome assemblies by re-assembling two bacterial genomes from shotgun sequences deposited in the Trace Archive. Designed as a module in large computational pipelines, Figaro is fast, lightweight and flexible.

AVAILABILITY: Figaro is released under an open-source license through the AMOS package (http://amos.sourceforge.net/Figaro).
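
The idea of inferring vector from oligonucleotide frequencies can be illustrated with a minimal sketch. This is not Figaro's implementation; it only shows one way the principle stated in the abstract could work. Everything specific here is an assumption made for illustration: the function names, the k-mer length of 8, the 40-base prefix window, the significance threshold, the uniform background model used for the Poisson rate, and the made-up 20-nt "vector". The sketch counts how many reads contain each k-mer near their start, flags k-mers that occur far more often than a Poisson model predicts, and clips each read after the last flagged k-mer.

```python
import math
import random
from collections import Counter


def poisson_sf(k, lam):
    """Return P(X >= k) for X ~ Poisson(lam), by direct summation."""
    if k <= 0:
        return 1.0
    term = math.exp(-lam)      # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i        # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def overrepresented_kmers(reads, k=8, prefix_len=40, alpha=1e-6):
    """Flag k-mers seen in read prefixes more often than chance predicts.

    Assumption: under the null model every one of the 4**k possible k-mers
    is equally likely, so the expected count per k-mer is total / 4**k.
    """
    counts = Counter()
    for read in reads:
        prefix = read[:prefix_len]
        # Count each k-mer at most once per read.
        counts.update({prefix[i:i + k] for i in range(len(prefix) - k + 1)})
    if not counts:
        return set()
    lam = sum(counts.values()) / 4 ** k
    return {kmer for kmer, c in counts.items() if poisson_sf(c, lam) < alpha}


def clip_point(read, vector_kmers, k=8, prefix_len=40):
    """Return the index just past the last flagged k-mer in the read prefix."""
    end = 0
    prefix = read[:prefix_len]
    for i in range(len(prefix) - k + 1):
        if prefix[i:i + k] in vector_kmers:
            end = i + k
    return end


# Toy data: 50 reads that all start with the same made-up vector fragment.
rng = random.Random(0)
vector = "ACGTTGCAAGGCTTAACCGG"
reads = [vector + "".join(rng.choices("ACGT", k=60)) for _ in range(50)]

flagged = overrepresented_kmers(reads)
clips = [clip_point(r, flagged) for r in reads]
print(min(clips), max(clips))
```

In this toy setup every clip point falls at or after the end of the shared 20-base prefix. It can overshoot by a few bases, because k-mers that straddle the vector/insert junction and contain only one or two insert bases also recur across reads. A real tool would need to handle that boundary more carefully, along with sequencing errors and non-uniform base composition.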