Julia Herold (1), Stefan Kurtz, Robert Giegerich. (1) Center of Biotechnology, Bielefeld University, Postfach 10 01 31, 33501 Bielefeld, Germany. jherold@cebitec.uni-bielefeld.de
Abstract
BACKGROUND: Analysis of sequence composition is a routine task in genome research. Organisms are characterized by their base composition, dinucleotide relative abundance, codon usage, and so on. Unique subsequences are markers of special interest in genome comparison, expression profiling, and genetic engineering. Relative to a random sequence of the same length, unique subsequences are overrepresented in real genomes. Shortest words absent from a genome have been addressed in two recent studies.

RESULTS: We describe a new algorithm and software for the computation of absent words. It is more efficient than previous algorithms and easier to use. It directly computes unwords without the need to specify a length estimate. Moreover, it avoids the space requirements of index structures such as suffix trees and suffix arrays. Our implementation is available as an open source package. We compute unwords of human and mouse as well as some other organisms, covering a genome size range from 10^9 down to 10^5 bp.

CONCLUSION: The new algorithm computes absent words for the human genome in 10 minutes on standard hardware, using only 2.5 Mb of space. This enables us to perform this type of analysis not only for the largest genomes available so far, but also for the emerging pan- and meta-genome data.
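To make the notion of an unword concrete, the following is a minimal brute-force sketch, not the algorithm described in this paper. It simply tries word lengths k = 1, 2, 3, ... and returns the first length at which some word over the alphabet does not occur. The function name `shortest_absent_words`, the restriction to the alphabet ACGT, and the choice to scan only the given strand (no reverse complement) are assumptions made for illustration. Its memory use grows with the number of distinct k-mers, so it is suitable only for small inputs.

```python
from itertools import product


def shortest_absent_words(sequence, alphabet="ACGT"):
    """Return (k, words): the smallest length k at which some word over
    `alphabet` is missing from `sequence`, and the sorted missing words.

    Naive illustration only: only the given strand is scanned, and windows
    containing characters outside `alphabet` never match a candidate word.
    """
    seq = sequence.upper()
    k = 1
    while True:
        # Collect every length-k window that occurs in the sequence.
        seen = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        # Enumerate all candidate words of length k in lexicographic order.
        absent = []
        for letters in product(sorted(alphabet), repeat=k):
            word = "".join(letters)
            if word not in seen:
                absent.append(word)
        if absent:
            return k, absent
        # Every word of length k occurs, so try the next length.
        k += 1


if __name__ == "__main__":
    k, words = shortest_absent_words("AACAGATCCGCTGGTTA")
    print(k, len(words))
```

The loop always terminates, because once k exceeds the sequence length no window exists and every word of that length is absent. No length estimate has to be supplied up front, which mirrors the usability point made in the abstract, though the time and space behaviour here is far from what the paper reports.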