Susan E Celniker1, David A Wheeler, Brent Kronmiller, Joseph W Carlson, Aaron Halpern, Sandeep Patel, Mark Adams, Mark Champe, Shannon P Dugan, Erwin Frise, Ann Hodgson, Reed A George, Roger A Hoskins, Todd Laverty, Donna M Muzny, Catherine R Nelson, Joanne M Pacleb, Soo Park, Barret D Pfeiffer, Stephen Richards, Erica J Sodergren, Robert Svirskas, Paul E Tabor, Kenneth Wan, Mark Stapleton, Granger G Sutton, Craig Venter, George Weinstock, Steven E Scherer, Eugene W Myers, Richard A Gibbs, Gerald M Rubin.
Abstract
BACKGROUND: The Drosophila melanogaster genome was the first metazoan genome to have been sequenced by the whole-genome shotgun (WGS) method. Two issues relating to this achievement were widely debated in the genomics community: how correct is the sequence with respect to base-pair (bp) accuracy and frequency of assembly errors? And, how difficult is it to bring a WGS sequence to the accepted standard for finished sequence? We are now in a position to answer these questions.
Year: 2002 PMID: 12537568 PMCID: PMC151181 DOI: 10.1186/gb-2002-3-12-research0079
Source DB: PubMed Journal: Genome Biol ISSN: 1474-7596 Impact factor: 13.583
Figure 1. Status of the Drosophila melanogaster euchromatic genome. Each chromosome arm is represented by a black horizontal line with a circle indicating its centromere. For each arm, seven tiers of information (A-G) are presented. (A) Each vertical green line represents the position of a transposable element. (B) Each vertical blue line represents the position of a 'declared' gap in Release 2. (C) Each vertical red line represents the position of an 'undeclared' gap in Release 2 greater than 20 bp, detected by comparing the Release 2 and Release 3 sequences. (D) Each vertical black line represents the position of a sequence gap that remains in Release 3. (E) The horizontal bars depict the regions of the genome assigned to LBNL (blue) or the HGSC, Baylor College of Medicine (brown) for generating Release 3. (F) The gray horizontal bar represents the status of the physical maps that supplied the initial BAC tiling paths for sequencing; presence of the gray bar indicates an available BAC contig. The sources of these BAC maps were as follows: chromosome X [12,50], chromosome arms 2L, 2R, 3L, and 3R [11] and chromosome 4 [13]. The black triangles represent the seven physical map gaps remaining in the euchromatic portion of the genome in Release 3. (G) The purple bar represents the position of cosmid, P1 or BAC clones that had been completely sequenced prior to Release 2. Those at the telomere of chromosome X were sequenced by the EDGP [51]; the other clones were sequenced by the BDGP at LBNL [1]. The numbers to the left of rows A, B, C and D are the chromosome arm totals for each category plotted. The scale in million bases (Mb) is shown at the bottom of the figure.
Status of Release 3
| Chromosomal region | Group | Size (bp) | Physical map gaps: number | Physical map gaps: location | Physical map gaps: estimated maximum size† | Finished BACs | Unfinished BACs | Sequence gaps‡ | Release 2 sequence (bp) | Estimated error rate*: 10^4 to 10^5 | Estimated error rate*: 10^5 to 10^6 | Estimated error rate*: >10^6 |
| X (1-11) | HGSC | 13,053,575 | 1 | 9EF | 150 kb | 85 | 14§ | 22 | 234,520¶ | 2 | 16 | 241 |
| X (12-20) | LBNL | 8,921,907 | 1 | 20B2 | 200 kb | 73¥ | 2# | 2 | 0 | 19 | 76 | 84 |
| 2L | LBNL | 22,217,931 | 1 | 39D | ~500 kb - 1 Mb | 177 | 2** | 3 | 0 | 14 | 30 | 397 |
| 2R | LBNL | 20,302,755 | 2 | 42B | 100 kb | 159 | 4†† | 5 | 0 | 11 | 56 | 335 |
| | | | | 57B | 300 kb | | | | | | | |
| 3L | HGSC | 23,352,213 | 1 | 64C | 100 kb | 175 | 9‡‡ | 11 | 47,653§§ | 6 | 50 | 409 |
| 3R | LBNL | 27,890,790 | 0 | NA | NA | 235 | 0 | 0 | 0 | 8 | 119 | 430 |
| 4 | LBNL | 1,237,864 | 1 | 102F | 100 kb | 14 | 0 | 1 | 0 | 3 | 7 | 13 |
| Total | | 116,914,271 | 7 | | | 917 | 31 | 44 | | 63 | 354 | 1,909 |
*Estimated error rates were determined for 100-kb bins, chosen to overlap by 50 kb. Estimated error rates were determined for bins containing sequence or physical map gaps. However, gaps represented by Ns in the sequence did not contribute to the estimated error rate; thus, the error rate reflects only those sequences present.
†In situ hybridization of flanking clones to polytene chromosomes and estimates of DNA content per band [47] allowed us to estimate the maximum size of the clone gaps. All of the gaps are in regions of tandem repeats and the flanking BACs extending into the gap might contain sufficient amounts of the repeat to lead to a misleading in situ mapping result. Therefore, we also examined the next BAC in the tiling path, not containing the repeat, to ensure we were using a unique sequence probe. Four BACs are listed for each gap, two on each side, in the order they occur in the genome. The gap at 9EF is flanked by BACR48E06 (location, 9C2-E1), BACR10I17 (ND) and BACR26N01 (9F1-10A2), BACR17B23 (10A1-2). The gap at 20B2 is flanked by BACR23I18 (19F3-A2), BACR22O16 (20A3-B2) and BACR06L03 (20B2-C2), BACR05K22 (20C1-2). The gap at 39D was sized by estimating the histone repeat copy number [16]. The estimate from the flanking BACs, BACR34H23 (39A6-C3) and BACR03L08 (39F1-2), is 400 kb. The gap at 42B is flanked by BACR13P06 (42A3-19), BACR36A03 (42B1-2) and BACR28N07 (42B1-3), BACR01C10 (42B3-C6). The gap at 57B is flanked by BACR03N16 (57A1-4), BACR08P05 (57A5-B3), and BACR10P16 (cytology 57B2-6), BACR04E05 (57B4-6). The gap at 64C is flanked by BACR23H09 (64B15-C2), BACR17L24 (64C1-4), and BACR12G07 (64C5-12), BACR12P14 (64C9-12). The gap at 102F is flanked by BACR13D24 (102D6-E6), BACR22J20 (102E3-F2), and BACH59K20 (102F1-5), BACN05O16 (cross-hybridizes to all telomeres, consistent with its location at the chromosome end).
‡This number includes all instances where we inserted a string of Ns to indicate missing sequences; it is the sum of physical map gaps and gaps due to failure to complete the sequence of cloned regions. In some cases a single physical map gap results in more than one sequence gap. For example, all three sequence gaps on 2L are found in the unfinished BACs that extend into the histone repeat region and four of the five sequence gaps on 2R are found in the unfinished BACs that extend into the repeat region of 42B. Excluding the physical map gaps, the gaps on X 1-11 total 60.6 kb; the gaps on 2R total 1,549 bp; the gaps on 3L total 26.2 kb, excluding the two gaps mapping to heterochromatin. There are no gaps, other than physical map gaps, on 2L, 4 or X 12-20.
§The Release 3 sequence of chromosome X 1-11 includes sequence from 14 unfinished BAC clones. Each of these BACs contains one or two regions of repeat sequence that are difficult to resolve. Eight of the unfinished clones contain Foldback (BACR40O10, BAC23M02, BACR19G09, BACR26B05, BACR29A04), multiple or rearranged roo (BACR17E02, BACR46E23) or 412 (BACR07P13) elements. Six of the clones (BACR01A14, BACR17E02, BACR19D19, BACR25I09, BACR29B18, BACR39C15) contain duplications of other, uncharacterized, repeats. BACR13J02 is the most distal clone in Release 3, extending the Release 2 assembly by approximately 15 kb. Seven of the 14 BACs that were unfinished at the time of Release 3 have since been finished. Five clones (CHORI 22340I08, BACR32E02, CHORI 221-14P20, CHORI 221-17A11 and CHORI 223-05O10) have been added to the tiling path to span the genomic regions that are still represented by Release 2 sequences (see ¶); these BACs were not sequenced for Release 3. 366 bp of sequence (coordinate 3.4 Mb, cytology 3EF) are not contained within a BAC but are spanned by 10-kb genomic clones. The EDGP identified two clones, BACR37M19 and BACR20K04, as mapping to this region [12] but we determined that their end sequences align elsewhere. The BAC clone coverage of the X chromosome is expected to be lower than the BAC clone coverage of the autosomes and may explain the BAC clone gap in 3EF. BACs whose names begin with CHORI are derived from a library made with randomly sheared DNA [48].
¶Four Release 2 segments not covered in finished BACs were used to produce the Release 3 sequence (see Materials and methods, Arm assembly and overlap verification): 18.3 kb starting at position 1,262,967 bp; 104 kb starting at position 3,412,482 bp; 12.2 kb starting at position 9,489,057 bp; 99.7 kb starting at position 10,462,912 bp. The latter segment extends into the clone gap at 9EF.
¥The last 36 kb of sequence at the centromeric end of the X chromosome are not contained within a BAC and are derived from a phrap assembly using WGS traces and the complete sequence of two 10-kb genomic clones.
#One of the two unfinished BACs (BACR22O16) extends into the physical map gap and the second (BACR39I01) contains a sequence gap resulting from our inability to assemble a difficult repetitive region that includes at least eight copies of a 4.7 kb tandem repeat having similarity to a degenerate mdg3 transposable element lacking LTRs.
**These two unfinished BACs (BACR05D08 and BACR43O11) flank and extend into the 1-Mb histone gene cluster.
††Three unfinished BACs (BACR48D05, BACR03A06 and BACR36A03) extend into the gap at 42B and one unfinished BAC (BACR08P05) extends into the 57B gap.
‡‡The nine unfinished BACs are BACR31B14, BACR43N11, BACR27G13, BACR29O22, BACR01D04, BACR01B21, BACR09G21, BACR30I05 and BACR34K23. BACR31B14, BACR43N11, and BACR27G13 contain sequence gaps that are a consequence of transposable elements (FB or roo) with complex internal rearrangements, tandem repeats or deletions. Two BACs, BACR29O22 and BACR01B21, contain a roo and a Doc element, respectively, and were not completed. One sequence gap in BACR01D04 is the result of a small misassembly that could not be resolved. Three other sequence gaps are in an unfinished segment of clone BACR34K23. Three (BACR09G21, BACR30I05 and BACR27G13) of the nine BACs that were unfinished at the time of Release 3 are now finished. Five clones (BACR29A07, CHORI 223-12D09, BACR15L14, CHORI221-06A19 and BACR03B05) have been added to the tiling path to span the genomic regions that are still represented by Release 2 sequences (see §§); these BACs were not sequenced for Release 3. The addition of BACR29A07 to the tiling path corrects an inversion in Release 3 at the 3L centromere. The BAC order is now BACR17M18, BACR29A07, BACR22B15 and BACR34K23. In addition, there are 13 finished BACs from 3L that have been submitted to GenBank with unresolved tandem repeat annotations, in accordance with the G16 finishing standards for the human genome project [49].
§§Three Release 2 segments not covered in finished BACs were used to produce the Release 3 sequence (see Materials and methods, Arm assembly and overlap verification): 10.8 kb starting at position 1, 18.9 kb starting at position 5,065,167, 12.6 kb starting at position 23,339,636 bp. The 18.9 kb sequence extends into the 64C clone gap. The 12.6 kb sequence contains two gaps mapping to BACR30H12.
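The windowing described in footnote * (100-kb bins chosen to overlap by 50 kb, with Ns excluded from the rate) can be sketched as follows. This is only an illustration under stated assumptions: the function names, the handling of the final partial bin, and the idea of counting non-N bases as the denominator are hypothetical choices, not the procedure the authors report.

```python
def overlapping_bins(length_bp, bin_size=100_000, step=50_000):
    """Return half-open (start, end) bins of bin_size that overlap by bin_size - step.

    Assumption: the last bin is truncated at the end of the sequence.
    """
    if length_bp <= 0:
        return []
    bins = []
    start = 0
    while True:
        end = min(start + bin_size, length_bp)
        bins.append((start, end))
        if end == length_bp:
            break
        start += step
    return bins


def non_gap_bases(sequence, start, end):
    """Count bases in [start, end) that are not N, so gap placeholders are ignored."""
    window = sequence[start:end].upper()
    return len(window) - window.count("N")


if __name__ == "__main__":
    # A 250-kb arm yields four bins, each sharing 50 kb with its neighbor.
    for start, end in overlapping_bins(250_000):
        print(start, end)
```

A per-bin error estimate would then be divided by `non_gap_bases` rather than by the bin width, which matches the footnote's statement that the rate reflects only the sequence present.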
Sequence content of gaps in Release 2
| | X | | 2L | | 2R | | 3L | | 3R | | 4 | | Subtotals | | Total |
| | D* | U† | D | U | D | U | D | U | D | U | D | U | D | U | |
| Total gaps (per arm) | 743 | | 154 | | 145 | | 269 | | 189 | | 31 | | | | 1531 |
| Total gaps (D and U) | 606 | 137 | 86 | 68 | 77 | 68 | 189 | 80 | 128 | 61 | 21 | 10 | 1107 | 424 | 1531 |
| Content | | | | | | | | | | | | | | | |
| TEs | 61 | 42 | 48 | 33 | 42 | 42 | 52 | 47 | 50 | 39 | 9 | 5 | 262 | 208 | 470 |
| Simple repeats | 353 | 10 | 19 | 9 | 15 | 4 | 109 | 2 | 38 | 9 | 10 | 2 | 544 | 36 | 580 |
| Homopolymers | 10 | 0 | 1 | 0 | 3 | 0 | 2 | 0 | 3 | 0 | 0 | 0 | 19 | 0 | 19 |
| Unique sequence | 150 | 23 | 13 | 8 | 8 | 11 | 21 | 24 | 28 | 3 | 1 | 0 | 221 | 69 | 290 |
| Tandem repeats | 14 | 34 | 1 | 12 | 3 | 6 | 0 | 0 | 1 | 10 | 0 | 1 | 19 | 63 | 82 |
| Misassemblies | 3 | 18 | 1 | 4 | 1 | 2 | 0 | 1 | 7 | 0 | 1 | 0 | 13 | 25 | 38 |
| Gross misassemblies | 1 | 0 | 1 | 1 | 3 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 6 | 2 | 8 |
| Not yet determined | 14 | 10 | 2 | 1 | 2 | 3 | 5 | 6 | 0 | 0 | 0 | 1 | 23 | 21 | 44 |
Analysis of the sequence gaps in Release 2 determined by comparison with Release 3 (see text for details). *Declared (D) gaps represented in Release 2 by sets of Ns. †Undeclared (U) gaps not recognized in Release 2, and identified by comparison to Release 3.
Scaffold, contig and gap statistics for the three assemblies
| | WGS1 | WGS2 | WGS3 |
| Number of scaffolds | 816 | 2,198 | 2,775 |
| Total Mb spanned | 122.92 | 133.47 | 137.6 |
| Total Mb of sequence | 119.52 | 129.12 | 132.94 |
| N50 scaffold length (Mb) | 10.70 | 14.26 | 13.68 |
| Number of gaps | 2,926 | 5,319 | 4,936 |
| Number of intra-scaffold gaps | 2,110 | 3,121 | 2,161 |
| Mean contig length (kb) | 40.8 | 24.3 | 26.9 |
| Mean gap length (bp) | 1,611 | 1,395 | 2,190 |
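For readers unfamiliar with the N50 statistic in the table above, the following is a minimal sketch of the standard definition: the length of the shortest scaffold in the smallest set of longest scaffolds that together hold at least half of the total length. The function name and the toy lengths are hypothetical, and the source does not state whether its N50 values were computed on spanned length or on sequenced bases.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold (or contig) lengths.

    Scaffolds are taken from longest to shortest until the running total
    reaches half of the combined length; the last length taken is the N50.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("n50() requires at least one length")


if __name__ == "__main__":
    # Toy lengths only; these are not values from the assemblies.
    print(n50([2, 3, 4, 5, 6]))
```

Because the statistic is weighted by length, adding many short scaffolds (as between WGS1 and WGS3) changes the scaffold count far more than it changes the N50.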
WGS scaffolds that align to the euchromatic portion of Release 3
| | WGS1 | WGS2 | WGS3 | Release 3 |
| Number of scaffolds covering Release 3 | 55 | 63 | 53 | 13 |
| Total Mb spanned | 116.39 | 117.44 | 117.6 | 116.91 |
| Total Mb of Release 3 spanned | 116.4 | 116.5 | 116.8 | - |
| Total Mb of sequence | 114.15 | 115.83 | 116.42 | 116.87 |
| Total Mb of Release 3 sequence | 114.1 | 115 | 115.6 | - |
| N50 scaffold length (Mb) | 10.85 | 14.45 | 13.89 | 18.5 |
| Number of gaps | 2,173 | 2,315 | 1,130 | 44 |
| Mean contig length (kb) | 52.2 | 49.5 | 102 | 2,335 |
| Mean gap length (bp) | 1,531 | 912 | 1,335 | - |
Order and orientation errors in the WGS assemblies compared to Release 3
| | WGS1 | | WGS2 | | WGS3 | |
| | Number of segments | Number of base pairs | Number of segments | Number of base pairs | Number of segments | Number of base pairs |
| Aligned segments | 2,125 | 113.30 Mb | 2,270 | 114.41 Mb | 1,087 | 114.99 Mb |
| Local errors* | 9 | 68.33 kb | 7 | 9.80 kb | 3 | 5.64 kb |
| Interleaving failures† | 17 | 39.42 kb | 28 | 137.75 kb | 33 | 139.42 kb |
| Repeat errors‡ | 25 | 42.52 kb | 1 | 0.66 kb | 1 | 0.98 kb |
| Gross misassemblies§ | 3 | 10.69 kb | 0 | 0 | ||
*Local errors include inversions and transpositions within a contig or that cause the order of contigs to be incorrect within a scaffold. †Interleaving failures are cases where it has not been recognized that two scaffolds overlap because the end contig in one scaffold lies in a gap in the adjacent scaffold. ‡Repeat errors are incorrect assemblies of transposable elements (see text for description). §Gross misassemblies are cases in which scaffolds themselves are out of order.
Sequence error rates for the WGS assemblies
| Errors per 10 kb | WGS1 | WGS2 | WGS3 |
| All sequence | 4.12 | 2.23 | 1.1 |
| In tandem repeats | 95.2 | 61.4 | 48.8 |
| In interspersed repeats | 78.2 | 15.8 | 9.62 |
| In unique sequence | 1.82 | 1.31 | 0.38 |
| > 10 bp from gap | 1.37 | 1.02 | 0.29 |
| > 50 bp from gap | 1.32 | 0.95 | 0.26 |
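The rates in this table are given as errors per 10 kb. A small sketch of the arithmetic needed to express them as a per-base rate or as the average number of bases between errors is shown below; the dictionary and function names are hypothetical, and only the "All sequence" row is reproduced.

```python
# "All sequence" row of the table, in errors per 10 kb.
ERRORS_PER_10KB = {"WGS1": 4.12, "WGS2": 2.23, "WGS3": 1.1}


def per_base_rate(errors_per_10kb):
    """Convert errors per 10 kb into errors per base."""
    return errors_per_10kb / 10_000


def bases_per_error(errors_per_10kb):
    """Convert errors per 10 kb into the mean number of bases between errors."""
    return 10_000 / errors_per_10kb


if __name__ == "__main__":
    for assembly, rate in ERRORS_PER_10KB.items():
        print(assembly, per_base_rate(rate), round(bases_per_error(rate)))
```

On this arithmetic, 1.1 errors per 10 kb for WGS3 corresponds to about one error every 9,091 bases across all sequence.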