Assembling millions of short DNA sequences using SSAKE.

René L Warren, Granger G Sutton, Steven J M Jones, Robert A Holt.

Abstract

Novel DNA sequencing technologies with the potential for up to three orders of magnitude more sequence throughput than conventional Sanger sequencing are emerging. The instrument now available from Solexa Ltd produces millions of short DNA sequences of 25 nt each. Due to ubiquitous repeats in large genomes and the inability of short sequences to uniquely and unambiguously characterize them, the short read length limits applicability for de novo sequencing. However, given the sequencing depth and the throughput of this instrument, stringent assembly of highly identical sequences can be achieved. We describe SSAKE, a tool for aggressively assembling millions of short nucleotide sequences by progressively searching through a prefix tree for the longest possible overlap between any two sequences. SSAKE is designed to help leverage the information from short sequence reads by stringently assembling them into contiguous sequences that can be used to characterize novel sequencing targets. AVAILABILITY: http://www.bcgsc.ca/bioinfo/software/ssake.

Year: 2006 PMID: 17158514 PMCID: PMC7109930 DOI: 10.1093/bioinformatics/btl629

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 INTRODUCTION

High-throughput DNA sequencing instrumentation capable of producing tens of millions of short (∼25 bp) sequences (reads) is becoming available (Bennett, 2004). The two most striking attributes of this technology, the large read depth and the short sequence length, make it suitable for re-sequencing applications where a known reference sequence is used as a template for alignment. However, the ability to decode novel sequencing targets, such as unsequenced genomes or metagenomic libraries, is limited. Twenty-five mers are far more ubiquitous than Sanger-size reads (500–1000 bp) in any given genome. Since the sequence complexity increases by a factor of 4 for every base added, the likelihood of observing redundant sequences increases dramatically with decreased read length for sequences shorter than 20 bp. The read length needed to achieve maximal uniqueness varies depending on the genome being sequenced, its size and repeat content (Whiteford et al., 2005). Although some studies have explored the feasibility of de novo genome assembly using 70–80 bp reads (Chaisson et al., 2004), none describe tools for de novo assembly of shorter sequences. Here we present an application to assemble millions of short DNA sequences. The Short Sequence Assembly by progressive K-mer search and 3′ read Extension (SSAKE) program cycles through sequence data stored in a hash table, and progressively searches through a prefix tree for the longest possible k-mer shared between any two sequences. We ran the algorithm on simulated error-free 25mers from the genomes of the bacteriophage PhiX174 (Sanger et al., 1977), the coronavirus SARS TOR2 (Marra et al., 2003) and the bacterium Haemophilus influenzae (Fleischmann et al., 1995), and on 40 million 25mers from the whole-genome shotgun (WGS) sequence data from the Sargasso Sea metagenomics project (Venter et al., 2004). Our results indicate that SSAKE could be used for complete assembly of sequencing targets that are 30 kb in length (e.g. viral targets) and to cluster millions of identical short sequences from a complex microbial community.
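The factor-of-4 growth in sequence complexity mentioned above can be made concrete with a few lines of arithmetic. The following is only an illustrative sketch; the function name `kmer_space` is hypothetical and not part of SSAKE.

```python
def kmer_space(k: int) -> int:
    """Number of distinct DNA words of length k over the alphabet {A, C, G, T}."""
    return 4 ** k


# Each added base multiplies the number of possible sequences by 4.
for k in (11, 20, 25):
    print(f"k={k}: {kmer_space(k):,} possible sequences")
```

Real genomes are far from random, so the size of the sequence space is only an upper bound on how unique reads of a given length can be; repeats reduce uniqueness further.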

2 METHODS

2.1 Material

The PhiX174, SARS TOR2 and H.influenzae genomes were downloaded from GenBank (GenBank identifiers J02482, AY274119 and L42023, respectively). All possible 25mers were extracted from both strands of these genomes. Sequences were selected at random to simulate up to 400× read coverage for the viral genomes and up to 100× read coverage for H.influenzae. Forty million 25mers were selected at random from the Sargasso Sea WGS metagenomics data obtained from the Venter Institute.
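The read simulation described above might be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, and sampling with replacement is an assumption (it is at least consistent with drawing more reads than there are distinct 25mers in a small genome). Coverage is taken here as reads × read length / genome size.

```python
import random

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def all_kmers_both_strands(genome: str, k: int = 25) -> list[str]:
    """Every k-mer from the forward strand and from its reverse complement."""
    strands = (genome, reverse_complement(genome))
    return [s[i:i + k] for s in strands for i in range(len(s) - k + 1)]


def reads_for_coverage(genome_size: int, coverage: float, k: int = 25) -> int:
    """Number of k-mer reads giving the requested coverage (reads * k / genome_size)."""
    return round(coverage * genome_size / k)


def simulate_reads(genome: str, coverage: float, k: int = 25, seed: int = 0) -> list[str]:
    """Draw reads at random, with replacement (an assumption), from all k-mers."""
    rng = random.Random(seed)
    pool = all_kmers_both_strands(genome, k)
    return rng.choices(pool, k=reads_for_coverage(len(genome), coverage, k))
```

Under this definition, 400× coverage of a 29 751 bp genome with 25mers corresponds to 476 016 reads, which matches the SARS TOR2 input size reported in Table 1.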

2.2 SSAKE algorithm

DNA sequences in a single multi-FASTA file are read into memory, populating a hash table keyed by unique sequence reads with values representing the number of occurrences of that sequence in the set. A prefix tree is used to organize the sequences and their reverse-complemented counterparts by their first eleven 5′ end bases. The sequence reads are sorted by decreasing number of occurrences to reflect coverage and minimize extension of reads containing sequencing errors. Each unassembled read, u, is used in turn to nucleate an assembly. Each possible 3′-most k-mer is generated from u and is used for the search until the word length is smaller than a user-defined minimum, m, or until the k-mer has a perfect match with the 5′ end bases of a read r. In the latter case, u is extended by the unmatched 3′ end bases contained in r, and r is removed from the hash table and prefix tree. The process of cycling through progressively shorter 3′-most k-mers is repeated after every extension of u. Since only left-most searches are possible with a prefix tree, when all possibilities have been exhausted for the 3′ extension, the complementary strand of the contiguous sequence generated (contig) is used to extend the contig on the 5′ end. The DNA prefix tree is used to limit the search space by efficiently binning the sequence reads.

There are two ways to control the stringency in SSAKE. The first is to stop the extension when a k-mer matches the 5′ end of more than one sequence read (−s 1). This leads to shorter contigs, but minimizes sequence misassemblies. The second is to stop the extension when a k-mer is smaller than a user-set minimum word length (m). SSAKE outputs a log file with run information along with two multi-FASTA files, one containing all sequence contigs constructed and the other containing the unassembled sequence reads.
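The procedure described above can be approximated by the following Python sketch. It is not the SSAKE source code and makes several simplifying assumptions, each also noted in the comments: a dictionary keyed by the first `prefix_len` bases stands in for the prefix tree, ties between several matching reads are broken arbitrarily (alphabetically) in non-strict mode, reads that never extend are returned as read-length contigs rather than written to a separate file, and all names (`assemble`, `min_overlap`, `strict`, ...) are hypothetical.

```python
from collections import Counter, defaultdict

_COMP = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    return seq.translate(_COMP)[::-1]


def _take(seq, bins, remaining, prefix_len):
    """Remove a read (both orientations) from the bins and the unassembled set."""
    for s in (seq, revcomp(seq)):
        bins[s[:prefix_len]].discard(s)
        remaining.discard(s)


def _extend_3prime(contig, bins, remaining, read_len, min_overlap, prefix_len, strict):
    """Greedily extend the 3' end, trying the longest 3'-most k-mer first."""
    while True:
        for k in range(min(read_len - 1, len(contig)), min_overlap - 1, -1):
            word = contig[-k:]
            hits = [s for s in bins.get(word[:prefix_len], ()) if s.startswith(word)]
            if not hits:
                continue
            if strict and len(hits) > 1:
                return contig          # ambiguous extension: stop (like -s 1)
            best = min(hits)           # assumption: arbitrary deterministic tie-break
            contig += best[k:]         # append the unmatched 3' bases of the read
            _take(best, bins, remaining, prefix_len)
            break                      # restart from the longest k-mer
        else:
            return contig              # no k-mer >= min_overlap matched


def assemble(reads, min_overlap=11, prefix_len=11, strict=False):
    if min_overlap < prefix_len:
        raise ValueError("this sketch requires min_overlap >= prefix_len")
    counts = Counter(reads)            # hash table: unique read -> occurrences
    if not counts:
        return []
    remaining = set(counts)
    bins = defaultdict(set)            # stand-in for the prefix tree
    for read in counts:
        for s in (read, revcomp(read)):
            bins[s[:prefix_len]].add(s)
    read_len = max(map(len, counts))
    contigs = []
    # Seed with the most frequent reads first.
    for seed, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        if seed not in remaining:
            continue
        _take(seed, bins, remaining, prefix_len)
        args = (bins, remaining, read_len, min_overlap, prefix_len, strict)
        contig = _extend_3prime(seed, *args)
        # Extend the 5' end by extending the reverse complement on its 3' end.
        contig = _extend_3prime(revcomp(contig), *args)
        contigs.append(contig)
    return contigs
```

As a toy usage example, tiling the 12-base sequence `ATGCGTACCTGA` with all seven overlapping 6-base reads and calling `assemble(reads, min_overlap=5, prefix_len=3)` rebuilds the sequence as one contig, possibly reported on the reverse-complement strand.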

3 RESULTS

SSAKE assembly of 4208 PhiX174 reads took 0.84 s on a single computer with two 2.2 GHz dual-core AMD Opteron™ CPUs and 4 GB RAM, and yielded a single contig bearing 100% sequence identity (sum of identical base matches between two sequences divided by the contig length) with the PhiX174 genome (Table 1). On the same hardware, we were able to assemble the SARS-associated coronavirus de novo into a single contig having 99.91% sequence identity with the genome. The read coverage needed to achieve this was 20 times higher than for PhiX174. Increased coverage was needed to ensure that only one valid path could be taken to assemble all reads. Assembly of H.influenzae reads was impaired by the presence, in the genome, of 28 perfectly repeated segments ranging in size from 70 to 5723 bases and 29 766 repeated 25mers. At best, we were able to assemble 7.3 million sequence reads into 284 contigs equal to or larger than 75 bp and totaling 1.78 Mb. Of these contigs, 241 showed single, unique, full-length alignments to H.influenzae, and covered 1007 kb (54.62% of the genome) with 99.43% sequence identity. The remaining 43 contigs totaled 776 kb and all incorporated k-mers that mapped to repeats, causing broken alignments between the contigs and the genome.
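Repeated 25mers such as those reported for H.influenzae can be counted with a simple tally. The sketch below is an assumption about how such a count might be obtained: it counts distinct k-mers that occur more than once on the forward strand only, which may differ from the counting convention used for the figure above. The function name is hypothetical.

```python
from collections import Counter


def repeated_kmers(genome: str, k: int = 25) -> int:
    """Number of distinct k-mers occurring more than once on the given strand."""
    counts = Counter(genome[i:i + k] for i in range(len(genome) - k + 1))
    return sum(1 for c in counts.values() if c > 1)
```

Any k-mer counted here offers more than one possible extension path, which is why a greedy assembler either stops or risks a misassembly at such positions.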

Table 1

Short read assembly of PhiX174, SARS TOR2 and H.influenzae genomes using SSAKE on a single 2× 2.2 GHz dual-core AMD Opteron™ CPU with 4 GB RAM

Species (size bp)	Input random 25mers	Coverage	Run time (s)	Contig N50 length (bp)	Genome covered (%)	Mean sequence identity (%)
PhiX174 (5386)	4208	20	0.84	5382	99.92	100
SARS TOR2 (29 751)	476 016	400	45.13	29 744	99.98	99.91
H.influenzae (1 830 023)^a	7 316 203	100	580.53	22 230	54.62	99.43
Sargasso Sea metagenome	40 000 000	NA	9.2E+4	423	NA	92.29

Assembly of 40 M Sargasso Sea 25mers was done on a single 4× 1.4 GHz AMD Opteron™ CPU with 32 GB RAM.

PhiX174 was assembled using −m 11 −s 0, SARS using −m 15 −s 0, H.influenzae using −m 16 −s 1 and Sargasso Sea using −m 16 −s 0.

^a Only contigs aligning once to the genome are shown. N50 length is the contig length that marks 50% of the genome content.
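An N50-style statistic can be computed as sketched below. This is an illustration under stated assumptions rather than the authors' exact procedure: contigs are sorted from longest to shortest and the length at which the running total reaches half of a reference total is returned. The reference total defaults to the sum of contig lengths, and the hypothetical `total` parameter allows the genome size to be used instead, as the table footnote suggests.

```python
def n50(lengths, total=None):
    """Length of the contig at which the cumulative size reaches 50% of `total`.

    `total` defaults to the sum of all contig lengths; pass the genome size
    to measure against genome content instead. Returns 0 if never reached.
    """
    if total is None:
        total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0
```

For example, `n50([2, 2, 2, 3, 3, 4, 8, 8])` returns 8, because the two longest contigs already account for half of the 32 bases.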

Forty million 25mers generated at random from Sargasso Sea genome shotgun Sanger reads (Venter et al., 2004) were assembled using −m 16 in ∼25 h on a 1.4 GHz Opteron™ computer with 32 GB of RAM, using at most 19 GB RAM. Up to 11% of the reads used as input to SSAKE were assembled into contigs equal to or larger than 100 bp, totaling 12.8 Mb. Unassembled reads accounted for 32.5% of the input sequences. The remaining reads were found in short contigs (26–99 bp). To evaluate assembly accuracy, we aligned all contigs ≥100 bp to a publicly available assembly of the Sargasso Sea WGS data using wuBLAST (Gish, 1996–2005). For this assembly, 99.6% of SSAKE contigs aligned to known Sargasso Sea contigs. The overall sequence identity of SSAKE contigs was 92.3%. Perfect alignments would not necessarily be expected due to the non-clonal nature of the members of this microbial community (Venter et al., 2004).

We benchmarked SSAKE on two separate Opteron computers (described above) using sets of 1 k, 10 k, 100 k, 1 M, 2 M, 5 M, 10 M and 40 M random 25mers simulated from the Sargasso Sea metagenomics WGS data. We found that the assembly running time followed a linear trend on both machines (data not shown). Consistent with this trend, a fast 2.2 GHz computer chip with sufficient RAM (32 GB) would assemble 40 M sequences in ca. 10 h.

4 CONCLUSION

We have shown that with high sequencing depth, short sequences can be used for de novo assembly of small DNA targets (e.g. viral genomes) that are up to tens of kb in length. For larger and more complex sequencing targets, such as bacterial genomes, short reads can be rapidly and stringently assembled into contigs that accurately represent the non-repetitive portion of the genome. The best approach for de novo sequencing of targets more complex than viral genomes will likely involve some combination of Sanger reads and assembled short reads. For metagenomics, our simulation involving 40 M short reads from the Sargasso Sea WGS data indicates that these types of reads can be used to produce conservative contigs in a robust and tractable manner, while minimizing probabilistic errors. As a stringent, efficient assembly tool, SSAKE is expected to have broad application in de novo sequencing.

REFERENCES

1. Solexa Ltd.

Authors: Simon Bennett
Journal: Pharmacogenomics Date: 2004-06 Impact factor: 2.533

2. Fragment assembly with short reads.

Authors: Mark Chaisson; Pavel Pevzner; Haixu Tang
Journal: Bioinformatics Date: 2004-04-01 Impact factor: 6.937

3. Nucleotide sequence of bacteriophage phi X174 DNA.

Authors: F Sanger; G M Air; B G Barrell; N L Brown; A R Coulson; C A Fiddes; C A Hutchison; P M Slocombe; M Smith
Journal: Nature Date: 1977-02-24 Impact factor: 49.962

4. Whole-genome random sequencing and assembly of Haemophilus influenzae Rd.

Authors: R D Fleischmann; M D Adams; O White; R A Clayton; E F Kirkness; A R Kerlavage; C J Bult; J F Tomb; B A Dougherty; J M Merrick
Journal: Science Date: 1995-07-28 Impact factor: 47.728

5. The Genome sequence of the SARS-associated coronavirus.

Authors: Marco A Marra; Steven J M Jones; Caroline R Astell; Robert A Holt; Angela Brooks-Wilson; Yaron S N Butterfield; Jaswinder Khattra; Jennifer K Asano; Sarah A Barber; Susanna Y Chan; Alison Cloutier; Shaun M Coughlin; Doug Freeman; Noreen Girn; Obi L Griffith; Stephen R Leach; Michael Mayo; Helen McDonald; Stephen B Montgomery; Pawan K Pandoh; Anca S Petrescu; A Gordon Robertson; Jacqueline E Schein; Asim Siddiqui; Duane E Smailus; Jeff M Stott; George S Yang; Francis Plummer; Anton Andonov; Harvey Artsob; Nathalie Bastien; Kathy Bernard; Timothy F Booth; Donnie Bowness; Martin Czub; Michael Drebot; Lisa Fernando; Ramon Flick; Michael Garbutt; Michael Gray; Allen Grolla; Steven Jones; Heinz Feldmann; Adrienne Meyers; Amin Kabani; Yan Li; Susan Normand; Ute Stroher; Graham A Tipples; Shaun Tyler; Robert Vogrig; Diane Ward; Brynn Watson; Robert C Brunham; Mel Krajden; Martin Petric; Danuta M Skowronski; Chris Upton; Rachel L Roper
Journal: Science Date: 2003-05-01 Impact factor: 47.728

6. Environmental genome shotgun sequencing of the Sargasso Sea.

Authors: J Craig Venter; Karin Remington; John F Heidelberg; Aaron L Halpern; Doug Rusch; Jonathan A Eisen; Dongying Wu; Ian Paulsen; Karen E Nelson; William Nelson; Derrick E Fouts; Samuel Levy; Anthony H Knap; Michael W Lomas; Ken Nealson; Owen White; Jeremy Peterson; Jeff Hoffman; Rachel Parsons; Holly Baden-Tillson; Cynthia Pfannkoch; Yu-Hui Rogers; Hamilton O Smith
Journal: Science Date: 2004-03-04 Impact factor: 47.728

7. An analysis of the feasibility of short read sequencing.

Authors: Nava Whiteford; Niall Haslam; Gerald Weber; Adam Prügel-Bennett; Jonathan W Essex; Peter L Roach; Mark Bradley; Cameron Neylon
Journal: Nucleic Acids Res Date: 2005-11-07 Impact factor: 16.971
