
The need for speed.

Paul Flicek

Abstract

DNA sequence data are being produced at an ever-increasing rate. The Bowtie sequence-alignment algorithm uses advanced data structures to help data analysis keep pace with data generation.

Year:  2009        PMID: 19344490      PMCID: PMC2690991          DOI: 10.1186/gb-2009-10-3-212

Source DB:  PubMed          Journal:  Genome Biol        ISSN: 1474-7596            Impact factor:   13.583


In this month's Genome Biology, Langmead and colleagues [1] present the Bowtie algorithm. Bowtie is designed to align large numbers of relatively short DNA sequencing reads to an entire reference genome. It does so by first taking the reference genome assembly and changing the order of the sequence using something called the Burrows-Wheeler Transform. Why is this useful? Speed is the best answer: Bowtie is more than 30 times faster than other published tools designed to do the same task.

Let's step back and see why the need for speed in our analysis algorithms is greater now than at any time in the genomic age. Over the past three years massively high-throughput sequencing, often called 'next-generation' sequencing, has developed from a few beta devices in key genome centers to a large installed base in research labs around the world. The success of sequencing machines such as Illumina/Solexa, ABI SOLiD and 454 FLX has facilitated the development of sequencing as a general-purpose experimental tool for many biological applications. The range of possible uses is rapidly establishing DNA sequencing as the microscope of modern biology.

The scale of data generation is amazing; for example, in the course of its pilot phase the 1000 Genomes Project [2] has already generated almost 2,000-fold total coverage of the human genome from 180 individual samples, an amount orders of magnitude larger than the original Human Genome Project. There is a very real chance that before 2012 the amount of data generated by worldwide DNA sequencing will exceed the expected 15 petabytes of data per year produced by CERN's Large Hadron Collider.

In the light of these spectacular developments in data-generation capacity, it should come as no surprise that the computational requirements for supporting large-scale genome sequencing are growing dramatically. A key question is whether bioinformaticians are up to the task.
Fortunately, the sheer number of new algorithms - some, like Bowtie, are based on data structures and methods either newly introduced to biology or rediscovered in the light of challenges posed by next-generation sequence data - suggests that bioinformatics, if not yet entering a new golden age [3], is responding to the waves of data by building better surfboards rather than running for higher ground.

Alignment is one of the first and most fundamental problems for any sequencing-based project in which a reference genome assembly already exists for the species concerned. Today's resequencing and functional studies (Box 1) directly leverage the effort required to create high-quality finished and draft genome assemblies such as those available for the human and mouse genomes. For next-generation sequencing studies the collected DNA sequencing reads are almost completely meaningless until they are aligned. Even whether the experiment succeeded remains unknown until the sequencing reads are aligned to the reference genome.
Box 1. Resequencing and functional studies

A small sampling of recent work leveraging the developments in DNA sequencing technology.

How do we address this essential step in the analysis and get as quickly as possible to the point where we can start to make sense of the biology? Programs such as Bowtie dramatically accelerate the alignment step by storing the reference genome in a highly ordered manner that facilitates very rapid searching of sequence. The key technology in Bowtie is called the Burrows-Wheeler Transform (BWT), which was originally developed for data compression. It works by reordering the original genome sequence such that certain patterns within the sequence are made explicit, which in turn simplifies compression of the sequence. Importantly, the BWT reordering is reversible, so we are always able to reconstruct the original sequence. In fact, those readers who have ever downloaded compressed files from the Internet have probably already benefited from the BWT, which is at the heart of the bzip2 data compression algorithm [4].

Once the BWT has been constructed for the given genome assembly it is indexed for optimal searching by creating an FM index, which is, roughly speaking, a compressed suffix array of the genome sequence. These existing techniques and novel modifications by Langmead et al. [1] to existing sequence-matching algorithms allow Bowtie to use the FM index to rapidly align both exactly matching DNA sequencing reads and those with mismatches caused by sequencing error or sequence polymorphism, all while maintaining a memory footprint low enough to run on many standard laptop computers.

The BWT and the FM index are not complete strangers to bioinformatics. Several groups have adopted the data structure to solve specific problems mostly related to comparing many short segments of the genome to the genome as a whole. Before massive resequencing datasets existed, a common application of this problem was microarray probe design [5,6].
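The ideas in the preceding paragraphs (a reversible reordering of the sequence, and an index over it that supports fast exact matching) can be illustrated with a toy sketch. This is not Bowtie's implementation: all function names are hypothetical, the suffix array and occurrence tables are stored uncompressed (a real FM index samples them to save memory), the construction is deliberately naive, and only exact matches are handled, whereas Bowtie also tolerates mismatches.

```python
def bwt(text):
    """Return (bwt_string, suffix_array) for text plus a '$' sentinel.

    Naive construction by sorting all suffixes; fine for a toy example,
    far too slow and memory-hungry for a real genome.
    """
    s = text + "$"  # '$' sorts before A, C, G and T
    sa = sorted(range(len(s)), key=lambda i: s[i:])
    # Each BWT character is the one preceding the corresponding suffix.
    return "".join(s[i - 1] for i in sa), sa


def inverse_bwt(last):
    """Reconstruct the original text from its BWT (reversibility)."""
    counts, ranks = {}, []
    for ch in last:
        ranks.append(counts.get(ch, 0))  # rank of this occurrence of ch
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):  # start of each character's block in column 1
        first[ch] = total
        total += counts[ch]
    row, out = 0, []  # row 0 is the rotation that starts with '$'
    for _ in range(len(last) - 1):
        ch = last[row]
        out.append(ch)
        row = first[ch] + ranks[row]  # last-to-first mapping
    return "".join(reversed(out))


def build_fm_index(last):
    """Build uncompressed FM-index tables from a BWT string.

    c_table[c]: number of characters in the text smaller than c.
    occ[c][i]:  number of occurrences of c in last[:i].
    """
    alphabet = sorted(set(last))
    occ = {c: [0] for c in alphabet}
    for ch in last:
        for c in alphabet:
            occ[c].append(occ[c][-1] + (ch == c))
    c_table, total = {}, 0
    for c in alphabet:
        c_table[c] = total
        total += occ[c][-1]
    return c_table, occ


def find(pattern, c_table, occ, sa):
    """Return sorted start positions of exact matches (backward search)."""
    lo, hi = 0, len(sa)
    for ch in reversed(pattern):  # extend the match one base to the left
        if ch not in c_table:
            return []
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


genome = "GATTACA"
last_column, suffix_array = bwt(genome)
c_table, occ = build_fm_index(last_column)
print(last_column)                            # reordered sequence
print(inverse_bwt(last_column))               # original sequence recovered
print(find("TA", c_table, occ, suffix_array))
```

The search visits one pattern character per step regardless of genome length, which is the property that makes this family of indexes attractive for aligning very large numbers of short reads.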
For microarray probe design, one effective way to estimate cross-hybridization potential for a given array design is to do a brute-force comparison of all short DNA segments (that is, possible array probes) to the genome as a whole.

Even when there are hundreds of billions of short sequencing reads the problem of alignment remains relatively easy compared with the problem of de novo genome assembly from short sequencing reads (especially for mammalian-sized genomes). A key difference comes from how easy it is to distribute the required computational work over the nodes of the compute clusters that are commonly used for bioinformatics analysis. For example, alignment is considered 'embarrassingly parallel', so named because of how easy it is to achieve parallelization. For the case of read alignment to the reference genome, the most common way to distribute the task across a compute cluster is to store the complete reference genome on each of the nodes of the cluster and then distribute the collection of reads equally across the nodes. The read alignments can be merged at the end of the process.

De novo assembly, by contrast, requires that essentially all the information needed to solve the problem (that is, how sequencing reads are related to each other) is available to the assembly program. For short-read datasets and mammalian-sized genomes, this generally leads to extremely large memory requirements that grow with the genome size and number of sequencing reads or to software implementations based on complex message passing between compute nodes. To achieve large-scale alignment parallelization one only needs to be able to store the entire reference genome in the memory available at each compute node.
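The distribution scheme described above (full reference on every node, reads split evenly, alignments merged at the end) can be sketched as follows. This is an illustration under stated assumptions rather than how any particular cluster or aligner works: a thread pool stands in for the cluster nodes, Python's str.find stands in for a real aligner, and the function names and toy data are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(reads, n_nodes):
    """Deal reads out round-robin so each node gets a near-equal share."""
    return [reads[i::n_nodes] for i in range(n_nodes)]


def align_chunk(reference, chunk):
    """Stand-in aligner: exact-match each read against the whole reference.

    Returns (read_id, position) pairs; position -1 means unaligned.
    """
    return [(read_id, reference.find(seq)) for read_id, seq in chunk]


def align_distributed(reference, reads, n_nodes=4):
    """Align every chunk independently, then merge the per-node results."""
    chunks = split_evenly(reads, n_nodes)
    # Every worker sees the complete reference but only its own reads,
    # so no communication between workers is needed during alignment.
    with ThreadPoolExecutor(max_workers=n_nodes) as pool:
        per_node = list(pool.map(lambda c: align_chunk(reference, c), chunks))
    merged = [hit for hits in per_node for hit in hits]
    return sorted(merged)  # restore read order after merging


reference = "GATTACAGATTACC"
reads = [(0, "TTAC"), (1, "ACAG"), (2, "CCCC"), (3, "GATT"), (4, "TACC")]
alignments = align_distributed(reference, reads, n_nodes=3)
print(alignments)
```

Because no read depends on any other, the merged result is the same as a serial run over all reads; de novo assembly has no comparably simple decomposition.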
Without the BWT and the data compression it provides, storing a search-optimized data structure such as a suffix array for the entire genome is not feasible on each of the compute nodes found in today's clusters (see [5] for a more detailed discussion of the memory requirements of a mammalian genome suffix array both before and after a BWT).

Bowtie is not the only alignment program designed for next-generation sequence data using an index based on the BWT, but it does appear to be the first reported in the literature. The creators of SOAP [7] have recently introduced SOAP2 [8] and the creators of MAQ [9] have produced BWA [10], both of which provide a significant improvement in speed over the hash-table-based implementations of SOAP and MAQ.

For applications such as ChIP-seq and for rapid confirmation that the sequencing experiment performed as expected, Bowtie is likely to be the most effective solution. For some other applications, including whole-genome, paired-end resequencing projects, it may not yet be the right choice. Although much faster, Bowtie is not as accurate as MAQ in the case of a real dataset aligned with Bowtie's default parameters [1]. Parameter choices can increase Bowtie's accuracy, but at the cost of speed. Bowtie is also currently missing some critical functionality (for example, the ability to align paired reads). This functionality will certainly be added soon - either by the Bowtie developers, who have already implemented preliminary support for paired-end alignment in the most up-to-date version available on the Bowtie website [11], or by someone else enabled by Bowtie's open-source license.

Bowtie is yet another example of a common story in bioinformatics. Whereas default alignment programs are provided by the instrument manufacturers, the wider scientific community has developed the programs now used by many, if not most, researchers.
This is a testament to the software-development skills within the research community and the desire within that community to create tools that are easy to deploy and use within existing analysis pipelines. There can be no doubt that open data formats and the ability to tap into the widest segment of the community in the search for solutions are the best way forward for DNA sequence analysis.

For now, sequence-alignment algorithms based on the BWT allow us to keep pace with the sequencing machines for at least another year. In today's fast-moving world of sequence generation, this is indeed a dramatic development.
  19 in total

1.  Mapping short DNA sequencing reads and calling variants using mapping quality scores.

Authors:  Heng Li; Jue Ruan; Richard Durbin
Journal:  Genome Res       Date:  2008-08-19       Impact factor: 9.043

2.  Annotating large genomes with exact word matches.

Authors:  John Healy; Elizabeth E Thomas; Jacob T Schwartz; Michael Wigler
Journal:  Genome Res       Date:  2003-09-15       Impact factor: 9.043

3.  The diploid genome sequence of an Asian individual.

Authors:  Jun Wang; Wei Wang; Ruiqiang Li; Yingrui Li; Geng Tian; Laurie Goodman; Wei Fan; Junqing Zhang; Jun Li; Juanbin Zhang; Yiran Guo; Binxiao Feng; Heng Li; Yao Lu; Xiaodong Fang; Huiqing Liang; Zhenglin Du; Dong Li; Yiqing Zhao; Yujie Hu; Zhenzhen Yang; Hancheng Zheng; Ines Hellmann; Michael Inouye; John Pool; Xin Yi; Jing Zhao; Jinjie Duan; Yan Zhou; Junjie Qin; Lijia Ma; Guoqing Li; Zhentao Yang; Guojie Zhang; Bin Yang; Chang Yu; Fang Liang; Wenjie Li; Shaochuan Li; Dawei Li; Peixiang Ni; Jue Ruan; Qibin Li; Hongmei Zhu; Dongyuan Liu; Zhike Lu; Ning Li; Guangwu Guo; Jianguo Zhang; Jia Ye; Lin Fang; Qin Hao; Quan Chen; Yu Liang; Yeyang Su; A San; Cuo Ping; Shuang Yang; Fang Chen; Li Li; Ke Zhou; Hongkun Zheng; Yuanyuan Ren; Ling Yang; Yang Gao; Guohua Yang; Zhuo Li; Xiaoli Feng; Karsten Kristiansen; Gane Ka-Shu Wong; Rasmus Nielsen; Richard Durbin; Lars Bolund; Xiuqing Zhang; Songgang Li; Huanming Yang; Jian Wang
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962

4.  Genome-scale DNA methylation maps of pluripotent and differentiated cells.

Authors:  Alexander Meissner; Tarjei S Mikkelsen; Hongcang Gu; Marius Wernig; Jacob Hanna; Andrey Sivachenko; Xiaolan Zhang; Bradley E Bernstein; Chad Nusbaum; David B Jaffe; Andreas Gnirke; Rudolf Jaenisch; Eric S Lander
Journal:  Nature       Date:  2008-07-06       Impact factor: 49.962

5.  Ultrafast and memory-efficient alignment of short DNA sequences to the human genome.

Authors:  Ben Langmead; Cole Trapnell; Mihai Pop; Steven L Salzberg
Journal:  Genome Biol       Date:  2009-03-04       Impact factor: 13.583

6.  A Bayesian deconvolution strategy for immunoprecipitation-based DNA methylome analysis.

Authors:  Thomas A Down; Vardhman K Rakyan; Daniel J Turner; Paul Flicek; Heng Li; Eugene Kulesha; Stefan Gräf; Nathan Johnson; Javier Herrero; Eleni M Tomazou; Natalie P Thorne; Liselotte Bäckdahl; Marlis Herberth; Kevin L Howe; David K Jackson; Marcos M Miretti; John C Marioni; Ewan Birney; Tim J P Hubbard; Richard Durbin; Simon Tavaré; Stephan Beck
Journal:  Nat Biotechnol       Date:  2008-07       Impact factor: 54.908

7.  Transcriptome sequencing to detect gene fusions in cancer.

Authors:  Christopher A Maher; Chandan Kumar-Sinha; Xuhong Cao; Shanker Kalyana-Sundaram; Bo Han; Xiaojun Jing; Lee Sam; Terrence Barrette; Nallasivam Palanisamy; Arul M Chinnaiyan
Journal:  Nature       Date:  2009-01-11       Impact factor: 49.962

8.  DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome.

Authors:  Timothy J Ley; Elaine R Mardis; Li Ding; Bob Fulton; Michael D McLellan; Ken Chen; David Dooling; Brian H Dunford-Shore; Sean McGrath; Matthew Hickenbotham; Lisa Cook; Rachel Abbott; David E Larson; Dan C Koboldt; Craig Pohl; Scott Smith; Amy Hawkins; Scott Abbott; Devin Locke; Ladeana W Hillier; Tracie Miner; Lucinda Fulton; Vincent Magrini; Todd Wylie; Jarret Glasscock; Joshua Conyers; Nathan Sander; Xiaoqi Shi; John R Osborne; Patrick Minx; David Gordon; Asif Chinwalla; Yu Zhao; Rhonda E Ries; Jacqueline E Payton; Peter Westervelt; Michael H Tomasson; Mark Watson; Jack Baty; Jennifer Ivanovich; Sharon Heath; William D Shannon; Rakesh Nagarajan; Matthew J Walter; Daniel C Link; Timothy A Graubert; John F DiPersio; Richard K Wilson
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962

9.  Bioinformatics: alive and kicking.

Authors:  Lincoln D Stein
Journal:  Genome Biol       Date:  2008-12-17       Impact factor: 13.583

10.  Accurate whole human genome sequencing using reversible terminator chemistry.

Authors:  David R Bentley; Shankar Balasubramanian; Harold P Swerdlow; Geoffrey P Smith; John Milton; Clive G Brown; Kevin P Hall; Dirk J Evers; Colin L Barnes; Helen R Bignell; Jonathan M Boutell; Jason Bryant; Richard J Carter; R Keira Cheetham; Anthony J Cox; Darren J Ellis; Michael R Flatbush; Niall A Gormley; Sean J Humphray; Leslie J Irving; Mirian S Karbelashvili; Scott M Kirk; Heng Li; Xiaohai Liu; Klaus S Maisinger; Lisa J Murray; Bojan Obradovic; Tobias Ost; Michael L Parkinson; Mark R Pratt; Isabelle M J Rasolonjatovo; Mark T Reed; Roberto Rigatti; Chiara Rodighiero; Mark T Ross; Andrea Sabot; Subramanian V Sankar; Aylwyn Scally; Gary P Schroth; Mark E Smith; Vincent P Smith; Anastassia Spiridou; Peta E Torrance; Svilen S Tzonev; Eric H Vermaas; Klaudia Walter; Xiaolin Wu; Lu Zhang; Mohammed D Alam; Carole Anastasi; Ify C Aniebo; David M D Bailey; Iain R Bancarz; Saibal Banerjee; Selena G Barbour; Primo A Baybayan; Vincent A Benoit; Kevin F Benson; Claire Bevis; Phillip J Black; Asha Boodhun; Joe S Brennan; John A Bridgham; Rob C Brown; Andrew A Brown; Dale H Buermann; Abass A Bundu; James C Burrows; Nigel P Carter; Nestor Castillo; Maria Chiara E Catenazzi; Simon Chang; R Neil Cooley; Natasha R Crake; Olubunmi O Dada; Konstantinos D Diakoumakos; Belen Dominguez-Fernandez; David J Earnshaw; Ugonna C Egbujor; David W Elmore; Sergey S Etchin; Mark R Ewan; Milan Fedurco; Louise J Fraser; Karin V Fuentes Fajardo; W Scott Furey; David George; Kimberley J Gietzen; Colin P Goddard; George S Golda; Philip A Granieri; David E Green; David L Gustafson; Nancy F Hansen; Kevin Harnish; Christian D Haudenschild; Narinder I Heyer; Matthew M Hims; Johnny T Ho; Adrian M Horgan; Katya Hoschler; Steve Hurwitz; Denis V Ivanov; Maria Q Johnson; Terena James; T A Huw Jones; Gyoung-Dong Kang; Tzvetana H Kerelska; Alan D Kersey; Irina Khrebtukova; Alex P Kindwall; Zoya Kingsbury; Paula I Kokko-Gonzales; Anil Kumar; Marc A Laurent; Cynthia T Lawley; Sarah E Lee; Xavier Lee; 
Arnold K Liao; Jennifer A Loch; Mitch Lok; Shujun Luo; Radhika M Mammen; John W Martin; Patrick G McCauley; Paul McNitt; Parul Mehta; Keith W Moon; Joe W Mullens; Taksina Newington; Zemin Ning; Bee Ling Ng; Sonia M Novo; Michael J O'Neill; Mark A Osborne; Andrew Osnowski; Omead Ostadan; Lambros L Paraschos; Lea Pickering; Andrew C Pike; Alger C Pike; D Chris Pinkard; Daniel P Pliskin; Joe Podhasky; Victor J Quijano; Come Raczy; Vicki H Rae; Stephen R Rawlings; Ana Chiva Rodriguez; Phyllida M Roe; John Rogers; Maria C Rogert Bacigalupo; Nikolai Romanov; Anthony Romieu; Rithy K Roth; Natalie J Rourke; Silke T Ruediger; Eli Rusman; Raquel M Sanches-Kuiper; Martin R Schenker; Josefina M Seoane; Richard J Shaw; Mitch K Shiver; Steven W Short; Ning L Sizto; Johannes P Sluis; Melanie A Smith; Jean Ernest Sohna Sohna; Eric J Spence; Kim Stevens; Neil Sutton; Lukasz Szajkowski; Carolyn L Tregidgo; Gerardo Turcatti; Stephanie Vandevondele; Yuli Verhovsky; Selene M Virk; Suzanne Wakelin; Gregory C Walcott; Jingwen Wang; Graham J Worsley; Juying Yan; Ling Yau; Mike Zuerlein; Jane Rogers; James C Mullikin; Matthew E Hurles; Nick J McCooke; John S West; Frank L Oaks; Peter L Lundberg; David Klenerman; Richard Durbin; Anthony J Smith
Journal:  Nature       Date:  2008-11-06       Impact factor: 49.962
