Jason R Miller, Arthur L Delcher, Sergey Koren, Eli Venter, Brian P Walenz, Anushka Brownley, Justin Johnson, Kelvin Li, Clark Mobarry, Granger Sutton.
Abstract
MOTIVATION: DNA sequence reads from Sanger and pyrosequencing platforms differ in cost, accuracy, typical coverage, average read length and the variety of available paired-end protocols. Both read types can complement one another in a 'hybrid' approach to whole-genome shotgun sequencing projects, but assembly software must be modified to accommodate their different characteristics. This is true even of pyrosequencing mated and unmated read combinations. Without special modifications, assemblers tuned for homogeneous sequence data may perform poorly on hybrid data.
Year: 2008 PMID: 18952627 PMCID: PMC2639302 DOI: 10.1093/bioinformatics/btn548
Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937
Fig. 1. Two representations of a best overlap graph. In (a), the layout resembles a multiple sequence alignment. In (b), each read is represented by two nodes joined by an undirected edge. Arrows represent best overlaps, where best means covering the most sequence. There are mutual best overlaps between successive pairs of reads A through D. Due to erroneous bases at one end (wavy line), read E has a non-mutual best overlap to B. Paths span undirected and directed edges alternately. Path EBA converges on path ABCD. CABOG scores read E lower than the others since only three reads are on paths from it. Starting with any one of the high-scoring reads, CABOG would build initial unitig ABCD, then E. Using saved information about each path intersection, CABOG would discount the intersection at B because the path from E spanned only one read before B. It would break ABCD only if there were also a change in read arrival rate at B, which is not the case here. Although linear-time directed-path following finds the longest possible unitig in this constructed case, it is not guaranteed to do so when paths span multiple intersections.
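The outcome described in the caption (unitig ABCD, then E) can be reproduced with a much simpler technique: chaining reads whose best overlaps are mutual. The sketch below is an illustration under stated assumptions, not CABOG's actual procedure. It ignores read orientation, path scoring, intersection bookkeeping and arrival-rate tests, and the names `build_unitigs`, `best_left` and `best_right` are hypothetical. Read E is placed on the right side of B, which is one reading of the figure.

```python
def build_unitigs(reads, best_left, best_right):
    """Chain reads joined by mutual best overlaps (simplified, orientation-free).

    best_right[r] is the read that best overlaps the right end of r;
    best_left[r] is the read that best overlaps the left end of r.
    An overlap r -> s is mutual when best_right[r] == s and best_left[s] == r.
    """
    placed = set()
    unitigs = []
    for seed in reads:
        if seed in placed:
            continue
        # Walk left from the seed while the overlap is mutual.
        start = seed
        seen = {seed}
        while True:
            prev = best_left.get(start)
            if (prev is None or prev in placed or prev in seen
                    or best_right.get(prev) != start):
                break
            start = prev
            seen.add(prev)
        # Walk right from the leftmost read, collecting the chain.
        chain = [start]
        current = start
        while True:
            nxt = best_right.get(current)
            if (nxt is None or nxt in placed or nxt in chain
                    or best_left.get(nxt) != current):
                break
            chain.append(nxt)
            current = nxt
        placed.update(chain)
        unitigs.append(chain)
    return unitigs


# Toy data modeled on the figure: A-B-C-D are mutual best overlaps,
# while E's best overlap points to B but B's best overlap points to C.
reads = ["A", "B", "C", "D", "E"]
best_right = {"A": "B", "B": "C", "C": "D"}
best_left = {"B": "A", "C": "B", "D": "C", "E": "B"}

print(build_unitigs(reads, best_left, best_right))
# [['A', 'B', 'C', 'D'], ['E']]
```

Because E's overlap to B is not reciprocated, E never joins the chain and ends up as a singleton unitig, matching the result the caption describes for this constructed case.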
Homogeneous components for hybrid datasets
| Sp | Cmp | Library | #Unmated | Len | #Mated | Len | Cov |
|---|---|---|---|---|---|---|---|
| | F1 | FLX unmated | 255 329 | 259 | 0 | - | 28.2 |
| | F2 | FLX unmated | 254 703 | 259 | 0 | - | 28.2 |
| | M1 | FLX 3-6Kbp | 184 680 | 243 | 80 304 | 116 | 23.1 |
| | M2 | FLX 3-6Kbp | 187 012 | 243 | 81 926 | 116 | 23.4 |
| | S1 | Sanger 40Kbp | 90 | 601 | 2786 | 728 | 1.0 |
| | F1 | FLX unmated | 230 517 | 253 | 0 | - | 12.6 |
| | F2 | FLX unmated | 216 458 | 253 | 0 | - | 11.8 |
| | M1 | FLX 3-6Kbp | 234 299 | 232 | 65 118 | 115 | 13.3 |
| | F1 | FLX unmated | 298 610 | 266 | 0 | - | 26.0 |
| | F2 | FLX unmated | 278 142 | 267 | 0 | - | 24.3 |
| | S1 | Sanger 40Kbp | 38 | 537 | 1522 | 830 | 0.4 |
| | F1 | FLX unmated | 434 956 | 243 | 0 | - | 11.7 |
| | S1 | Sanger 40Kbp | 3272 | 434 | 21 092 | 713 | 1.7 |
| | | Sanger 6-8Kbp | 4108 | 727 | 17 382 | 892 | 1.7 |
| | | Sanger 2-3Kbp | 2652 | 508 | 27 296 | 826 | 2.7 |
Sequence contribution from each component dataset. Sp, species name; Cmp, component name; Unmated/Mated, number of non-paired or paired-end reads; Len, for unmated and mated, the average clear range per read in bases; Cov, fold coverage of the genome by reads; FLX reads originate from the 454 GS FLX sequencer. Sanger reads originate from the ABI 3730 sequencer.
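As a rough illustration of how the Cov column relates to the other columns, fold coverage is conventionally the total number of clear-range bases divided by the genome length. The sketch below assumes that convention; the function name `fold_coverage` and the genome length used in the example are hypothetical, since the genome sizes are not given in this table.

```python
def fold_coverage(n_unmated, len_unmated, n_mated, len_mated, genome_length):
    """Total clear-range bases divided by genome length (conventional definition)."""
    total_bases = n_unmated * len_unmated + n_mated * len_mated
    return total_bases / genome_length


# Read counts and lengths from the first M1 row of the table;
# the genome length is an assumed value for illustration only.
assumed_genome_length = 2_346_000
print(round(fold_coverage(184_680, 243, 80_304, 116, assumed_genome_length), 1))
```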
CABOG and Newbler assemblies of hybrid data sets
| Assembler | #Contigs | Contig N50 | Contig Max | Contig Sum |
|---|---|---|---|---|
| CABOG | 48 | 67 993 | 205 585 | 2 332 097 |
| Newbler | 119 | 27 561 | 134 859 | 2 183 278 |
| CABOG | 65 | 51 745 | 169 923 | 2 266 305 |
| Newbler | 104 | 32 377 | 154 008 | 2 184 009 |
| CABOG | 34 | 101 101 | 307 732 | 2 314 836 |
| Newbler | 115 | 29 216 | 110 686 | 2 179 717 |
| CABOG | 22 | 440 632 | 861 331 | 4 642 198 |
| Newbler | 87 | 87 223 | 240 232 | 4 516 116 |
| CABOG | 39 | 126 165 | 336 216 | 2 992 650 |
| Newbler | 70 | 79 879 | 203 365 | 2 963 428 |
| CABOG | 42 | 138 508 | 365 104 | 2 983 118 |
| Newbler | 99 | 45 693 | 171 391 | 2 951 683 |
| CABOG | 69 | 323 162 | 819 035 | 9 186 849 |
| Newbler | 73 | 247 897 | 731 211 | 9 097 078 |
The analysis included all contigs 2 kb or longer found in each assembler's FASTA output. N50, the length of the shortest contig required to span 50% of the genome length; Max, the length of the longest contig; Sum, the total contig span. Contig size statistics are shown in bases. The codes in parentheses refer to component datasets described in Table 1. Assemblies are compared by contig size statistics. Selected combinations are shown; others are provided in the Supplementary Material.
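The contig statistics defined above can be computed from a list of contig lengths. The following is a minimal sketch, not the authors' analysis script: the function name `contig_stats` and its parameters are hypothetical. When no genome length is supplied, it falls back to the common convention of using the total length of the retained contigs as the denominator for N50.

```python
def contig_stats(lengths, min_length=2000, genome_length=None):
    """Return count, N50, max and sum for contigs of at least min_length bases."""
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return {"count": 0, "n50": 0, "max": 0, "sum": 0}
    total = sum(kept)
    # Half of the genome length if known, otherwise half of the assembled span.
    target = (genome_length if genome_length is not None else total) / 2
    running = 0
    n50 = 0  # stays 0 if the contigs never reach the target
    for n in kept:
        running += n
        if running >= target:
            n50 = n
            break
    return {"count": len(kept), "n50": n50, "max": kept[0], "sum": total}


# Made-up contig lengths; the 1500-base contig falls under the 2 kb threshold.
print(contig_stats([5000, 4000, 3000, 2000, 1500]))
# {'count': 4, 'n50': 4000, 'max': 5000, 'sum': 14000}
```

Passing `genome_length` changes only the N50 target, which matters when the assembled span differs noticeably from the true genome size.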
Assemblies of one hybrid data set by all assemblers
| Assembler | #Contigs | Contig N50 | Contig max | Contig sum |
|---|---|---|---|---|
| E. coli / FLX reads+FLX mates (F1+M1) | | | | |
| CABOG | 27 | 285 910 | 833 636 | 4 629 501 |
| Newbler | 89 | 82 668 | 209 279 | 4 519 532 |
| PCAP | 152 | 50 897 | 175 160 | 4 554 652 |
| Euler-SR | 328 | 22 159 | 71 505 | 4 343 338 |
| Velvet | 490 | 11 510 | 53 664 | 4 230 559 |
The analysis is described in Table 2. Only CABOG and Newbler were designed for FLX hybrid datasets. Euler-SR had been introduced for 454 GS 20 reads+Sanger mates. PCAP was designed for Sanger mates only. Velvet was designed for short reads. The Goldberg method was not run since it requires Sanger mates to improve Newbler contigs. Arachne and the traditional Celera Assembler did not assemble this dataset. The assemblies are summarized and compared using contig length statistics.
Scaffold analysis of CABOG and Newbler assemblies
| Assembler | #Scaf. | Scaf. N50 | Scaf. max | Scaf. sum | Cov. (%) |
|---|---|---|---|---|---|
| CABOG | 7 | 392 892 | 661 267 | 2 324 483 | 98.7 |
| Newbler | 9 | 268 678 | 718 704 | 2 187 430 | 94.1 |
| CABOG | 7 | 417 898 | 758 093 | 2 339 970 | 98.9 |
| Newbler | 11 | 266 698 | 718 559 | 2 183 668 | 93.9 |
| CABOG | 9 | 450 308 | 758 275 | 2 335 950 | 98.8 |
| Newbler | | 382 223 | 720 519 | 2 189 593 | 94.2 |
| CABOG | 6 | 1 507 760 | 1 507 760 | 2 268 548 | 96.6 |
| Newbler | 51 | 1 489 797 | 1 489 797 | 2 185 214 | 94.3 |
| CABOG | 1 | 2 317 095 | 2 317 095 | 2 317 095 | 98.7 |
| Newbler | 6 | 1 550 861 | 1 550 861 | 2 184 352 | 93.9 |
The analysis included all scaffolds 2 kb or longer found in each assembler's FASTA output. Scaffold length statistics are shown in bases excluding the lengths of the gaps between contigs. Note that scaffold sum may not equal contig sum (Table 2) due to the 2 kb threshold being applied at the scaffold not contig level. Cov, bases of the reference covered by a sum over single best alignments of each full or partial scaffold sequence.
Errors in CABOG assemblies
| Genome | Dataset | Chimeric join | Chimeric end | Bad end | Bad contig | Collapsed tandem |
|---|---|---|---|---|---|---|
| | F1 | 0 | 0 | 0 | 0 | 4 |
| | F1+M2 | 3 | 8 | 1 | 1 | 11 |
| | F1+S1 | 0 | 0 | 1 | 0 | 7 |
| | M2+S1 | 0 | 1 | 2 | 0 | 9 |
| | F1 | 0 | 1 | 0 | 0 | 1 |
| | F2 | 0 | 2 | 0 | 0 | 4 |
| | F1+S1 | 0 | 0 | 0 | 0 | 1 |
| | F2+S1 | 0 | 0 | 0 | 0 | 0 |
The analysis included contigs at least 2 kb long. Chimeric join, a concatenation of unrelated sequences of at least 1 kb. Chimeric End, concatenation of less than 1 kb to a contig end. Bad end, less than 1 kb of unaligned sequence at a contig end. Bad Contig, unaligned contig. Collapsed Tandem, multiple alignments between a contig and the reference, partially overlapping in either sequence. Errors were estimated by analysis of alignments to reference sequences. Estimates were confirmed by two other alignment-based methods.