Robert W Blakesley, Nancy F Hansen, Jyoti Gupta, Jennifer C McDowell, Baishali Maskeri, Beatrice B Barnabas, Shelise Y Brooks, Holly Coleman, Payam Haghighi, Shi-Ling Ho, Karen Schandler, Sirintorn Stantripop, Jennifer L Vogt, Pamela J Thomas, Gerard G Bouffard, Eric D Green.
Abstract
BACKGROUND: The approaches for shotgun-based sequencing of vertebrate genomes are now well-established, and have resulted in the generation of numerous draft whole-genome sequence assemblies. In contrast, the process of refining those assemblies to improve contiguity and increase accuracy (known as 'sequence finishing') remains tedious, labor-intensive, and expensive. As a result, the vast majority of vertebrate genome sequences generated to date remain at a draft stage.
Year: 2010 PMID: 20064230 PMCID: PMC2827409 DOI: 10.1186/1471-2164-11-21
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Human-grade finished BAC sequences from ENm001.
| Common name | Total BACs | Total Mb |
|---|---|---|
| Armadillo | 25 | 2.63 |
| Baboon | 15 | 2.26 |
| Black Lemur | 8 | 1.41 |
| Cat | 20 | 2.11 |
| Chicken | 7 | 0.84 |
| Chimpanzee | 13 | 1.73 |
| Colobus Monkey | 14 | 2.12 |
| Cow | 15 | 2.34 |
| Dog | 9 | 1.34 |
| Dusky Titi | 17 | 2.29 |
| Elephant | 24 | 2.25 |
| Galago | 13 | 2.17 |
| Gibbon | 16 | 2.43 |
| Gorilla | 11 | 1.94 |
| Ground Squirrel | 16 | 1.91 |
| Guinea Pig | 15 | 2.01 |
| Hedgehog | 25 | 2.86 |
| Horse | 14 | 2.13 |
| Horseshoe Bat | 16 | 1.98 |
| Little Brown Bat | 15 | 2.16 |
| Macaque | 14 | 1.79 |
| Marmoset | 13 | 2.13 |
| Mouse Lemur | 10 | 1.76 |
| Orangutan | 13 | 2.02 |
| Owl Monkey | 18 | 2.39 |
| Pig | 10 | 1.50 |
| Platypus | 16 | 1.68 |
| Rabbit | 17 | 2.32 |
| Rat | 17 | 2.16 |
| Ring-tailed Lemur | 10 | 1.51 |
| Sheep | 15 | 2.18 |
| Shrew | 19 | 2.07 |
| Squirrel Monkey | 15 | 2.38 |
| Tenrec | 12 | 1.92 |
| Tetraodon | 3 | 0.32 |
| Torafugu | 2 | 0.17 |
| Vervet Monkey | 13 | 1.93 |
| Wallaby | 16 | 2.05 |
| Totals | 541 | 73.20 |
The total number of BACs (Total BACs) and the amount of non-redundant human-grade finished sequence generated (Total Mb) for each species within genomic region ENm001 are shown.
Comparative-grade finished BAC sequences listed by species.
| Species | Total BACs | Total Mb |
|---|---|---|
| Armadillo | 137 | 18.8 |
| Baboon | 110 | 19.2 |
| Cat | 105 | 14.6 |
| Colobus Monkey | 78 | 16.2 |
| Dusky Titi | 88 | 15.3 |
| Elephant | 138 | 18.8 |
| Galago | 99 | 18.5 |
| Gibbon | 83 | 14.9 |
| Ground Squirrel | 89 | 13.4 |
| Guinea Pig | 93 | 15.6 |
| Hedgehog | 130 | 19.4 |
| Horseshoe Bat | 95 | 15.5 |
| Marmoset | 97 | 19.0 |
| Mouse Lemur | 60 | 13.8 |
| Owl Monkey | 88 | 15.3 |
| Platypus | 76 | 10.8 |
| Rabbit | 95 | 17.4 |
| Shrew | 115 | 15.7 |
| Squirrel Monkey | 73 | 14.6 |
| Tenrec | 95 | 17.1 |
| Vervet Monkey | 87 | 14.5 |
| Totals | 2,031 | 338.5 |
The total number of BACs (Total BACs) and the amount of comparative-grade finished sequence generated (Total Mb) for each species are shown.
Comparative-grade finished BAC sequences listed by ENCODE region.
| ENCODE region | Size in human (kb) | Total BACs | Total Mb |
|---|---|---|---|
| ENm001 | 1,877 | 357 | 58.9 |
| ENm003 | 500 | 91 | 15.3 |
| ENm005 | 1,696 | 256 | 42.6 |
| ENm010 | 500 | 96 | 16.1 |
| ENm012 | 1,000 | 179 | 29.4 |
| ENm013 | 1,114 | 184 | 30.3 |
| ENm014 | 1,163 | 214 | 35.4 |
| ENr111 | 500 | 84 | 14.4 |
| ENr211 | 500 | 89 | 15.2 |
| ENr213 | 500 | 90 | 15.4 |
| ENr221 | 500 | 100 | 16.5 |
| ENr222 | 500 | 98 | 16.5 |
| ENr312 | 500 | 94 | 15.7 |
| ENr323 | 500 | 99 | 16.9 |
| Totals | 14,850 | 2,031 | 338.5 |
The total number of BACs (Total BACs) and the amount of comparative-grade finished sequence generated (Total Mb) for each genomic region are shown.
Figure 1. Gaps in comparative-grade finished BAC sequences from ENm001. The human- and comparative-grade finished sequences of the 541 BACs summarized in Table 1 were compared, and the gaps detected in the comparative-grade finished sequence were analyzed. Indicated for each species are the total bases within gaps per Mb of human-grade finished sequence (A), the number of gaps per Mb of human-grade finished sequence (B), and the median size of the gaps in base pairs (C).
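The three per-species quantities plotted in this figure are simple normalizations of a list of gap sizes. A minimal sketch of that arithmetic follows; the function name `gap_metrics` and the example gap sizes are hypothetical and are not taken from the study's data or software.

```python
from statistics import median


def gap_metrics(gap_sizes_bp, finished_length_bp):
    """Summarize gaps relative to the human-grade finished sequence length.

    gap_sizes_bp: sizes (bp) of the gaps found in the comparative-grade sequence.
    finished_length_bp: length (bp) of the matching human-grade finished sequence.
    """
    if finished_length_bp <= 0:
        raise ValueError("finished_length_bp must be positive")
    length_mb = finished_length_bp / 1_000_000
    return {
        "gap_bases_per_mb": sum(gap_sizes_bp) / length_mb,   # panel A
        "gaps_per_mb": len(gap_sizes_bp) / length_mb,        # panel B
        "median_gap_bp": median(gap_sizes_bp) if gap_sizes_bp else None,  # panel C
    }


# Hypothetical input: five gaps found in 2.0 Mb of finished sequence.
example = gap_metrics([120, 45, 800, 310, 60], 2_000_000)
print(example)
```

Normalizing by the human-grade length, rather than by the comparative-grade length, keeps the denominator independent of how much sequence the gaps themselves remove.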
Figure 2. Characteristics of gaps in comparative-grade finished BAC sequences from ENm001. Sequences within the gaps (blue bars) summarized in Figure 1 were analyzed with respect to GC (A) and simple repeat (B) content; similar analyses were performed for the entire human-grade finished BAC sequences (orange bars). Results for the 19 species with the greatest differences (i.e., gap sequences vs. total BAC sequences) for each analysis are shown. Each error bar represents the 95% confidence interval.
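The GC comparison in panel A can be illustrated by pooling the bases that fall within gaps and comparing their GC fraction with that of the whole BAC sequence. The sketch below assumes gap coordinates are 0-based, half-open intervals on the human-grade sequence; the helper names and the toy sequence are hypothetical, and the confidence intervals shown in the figure are not reproduced because the caption does not state how they were derived.

```python
def gc_fraction(sequence):
    """Fraction of G/C among unambiguous bases (A, C, G, T); NaN if none."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    called = gc + at
    return gc / called if called else float("nan")


def gap_vs_total_gc(bac_sequence, gaps):
    """Return (GC fraction of pooled gap bases, GC fraction of the whole BAC).

    gaps: iterable of (start, end) intervals, 0-based and half-open (assumed).
    """
    pooled = "".join(bac_sequence[start:end] for start, end in gaps)
    return gc_fraction(pooled), gc_fraction(bac_sequence)


# Toy example: a GC-rich stretch falls inside the single gap.
gap_gc, total_gc = gap_vs_total_gc("ATATGCGCGCATAT", [(4, 10)])
print(gap_gc, total_gc)
```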
Repeat content of total sequences and gap sequences.
| Repeat Type | Total Sequence | Total Gaps | Captured Gaps | Uncaptured Gaps |
|---|---|---|---|---|
| All Repeats | 36.9 (0.4) | 48.9 (1.4) | 45.1 (1.7) | 49.9 (1.7) |
| Simple | 1.6 (0.0) | 4.6 (0.7) | 10.3 (1.0) | 2.9 (0.5) |
| LTR | 4.7 (0.1) | 5.6 (0.7) | 2.8 (0.5) | 6.4 (0.8) |
| SINE | 9.3 (0.2) | 8.8 (0.7) | 9.6 (1.0) | 8.6 (0.5) |
| LINE | 18.6 (0.3) | 27.9 (1.3) | 20.6 (1.6) | 29.9 (1.8) |
| DNA | 2.7 (0.0) | 2.0 (0.2) | 1.8 (0.4) | 2.1 (0.4) |
Sequences of the 541 BACs (Table 1) were analyzed in aggregate by RepeatMasker. The percentages of bases represented by the indicated types of sequence repeats are reported for the total sequence, total gaps, captured gaps, and uncaptured gaps. Numbers in parentheses are the one-sigma error values.
Figure 3. Gaps in comparative-grade finished BAC sequences from multiple genomic regions. The comparative-grade finished sequences of the 2,031 BACs summarized in Tables 2 and 3 were analyzed for the presence of uncaptured (orange bar) and captured (blue bar) gaps (see text for details). The numbers of captured and uncaptured gaps per Mb averaged across all BACs from each indicated genomic region (A) or species (B) are indicated. Each error bar represents the 95% confidence interval. In A, the data for three additional ENCODE pilot project regions (ENr231, ENr232, and ENr333) are shown for comparison because of the notably high frequency of gaps in their sequences; however, there were not sufficient numbers of sequenced BACs from these regions to qualify for inclusion in the second data set (see text for details).
Figure 4. Variation in the redundancy of sequence reads generated using shotgun libraries prepared with standard and copy-control E. coli strains. A shotgun-subclone library was prepared from each of three BACs [GenBank: AC153092, AC190087, and AC186717] and used to transform either standard DH10B tonA (Std) or copy-control EPI400 (CC) E. coli strains. From each library, paired forward and reverse sequence reads were then generated from randomly selected subclones to produce assemblies that provided an average of eightfold sequence redundancy. Aligned representative regions of the assemblies that highlight differences in sequence redundancy encountered with the two E. coli strains are shown for each BAC. Yellow lines indicate sequence-read redundancies on the upper and lower strands of the indicated sequence contig (black/red line); the horizontal orange lines depict a redundancy value of 10.
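Sequence-read redundancy at a position is the number of aligned reads that cover it. A minimal sketch of how such a profile could be computed from read alignment intervals is given below. The interval convention (0-based, half-open) and the use of the coefficient of variation as a numeric summary are assumptions made for illustration; in the study itself the variation was assessed qualitatively.

```python
from itertools import accumulate
from statistics import mean, pstdev


def depth_profile(contig_length, reads):
    """Per-base read redundancy for a contig.

    reads: iterable of (start, end) alignment intervals, 0-based and half-open.
    Uses a difference array so each read costs O(1) regardless of its length.
    """
    delta = [0] * (contig_length + 1)
    for start, end in reads:
        start, end = max(0, start), min(contig_length, end)
        if start < end:
            delta[start] += 1
            delta[end] -= 1
    return list(accumulate(delta))[:contig_length]


def redundancy_cv(depth):
    """Coefficient of variation of the depth profile (hypothetical summary)."""
    avg = mean(depth)
    return pstdev(depth) / avg if avg else float("nan")


# Toy example: three overlapping reads on a 10-base contig.
depth = depth_profile(10, [(0, 6), (2, 8), (4, 10)])
cv = redundancy_cv(depth)
print(depth, round(cv, 3))
```

A flat profile gives a value near zero, whereas libraries with strongly over- and under-represented regions give larger values.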
BAC sequence assemblies generated using standard versus copy-control bacterial host cells.
| Species | ENCODE region | Redundancy variation (standard) | Redundancy variation (copy control) | No. contigs (standard) | No. contigs (copy control) | No. uncaptured gaps (standard) | No. uncaptured gaps (copy control) |
|---|---|---|---|---|---|---|---|
| Cat | ENm011 | ++++ | - | 18 | 13 | 6 | 2 |
| Platypus | ENm009 | ++++ | - | 15 | 8 | 13 | 3 |
| Owl Monkey | ENm006 | ++++ | - | 14 | 9 | 6 | 3 |
| Owl Monkey | ENm011 | ++ | - | 22 | 15 | 9 | 1 |
| Shrew | ENr331 | ++ | - | 22 | 11 | 11 | 0 |
| Shrew | ENr322 | ++ | + | 21 | 22 | 9 | 10 |
| Platypus | ENm006 | ++ | - | 20 | 10 | 7 | 3 |
| Rabbit | ENm006 | ++ | + | 11 | 6 | 5 | 0 |
| Owl Monkey | ENr312 | ++ | ++ | 10 | 6 | 5 | 4 |
| Platypus | ENr312 | ++ | + | 9 | 12 | 7 | 3 |
| Hedgehog | ENr211 | + | - | 21 | 13 | 9 | 0 |
| Owl Monkey | ENm007 | + | - | 10 | 6 | 2 | 0 |
| Dusky Titi | ENm006 | + | - | 8 | 13 | 0 | 2 |
Representative BACs whose generated sequences were associated with atypical characteristics (in terms of large numbers of contigs and uncaptured gaps) were tested. Two exceptions [GenBank: AC190000 and AC188356] represent typical sequence assemblies with respect to redundancy of sequence reads, number of contigs, and number of uncaptured gaps. Only three BACs [GenBank: AC190002, AC187194, and AC172296] belong to the second data set (Tables 2 and 3), while the remaining BACs have sequences orthologous to other ENCODE regions. The variation in the redundancy of sequence reads across all contigs was qualitatively assessed as low (-), medium (+), high (++), or very high (++++); see Figure 4 for illustrative examples.
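The per-BAC counts in this table can be tallied to compare the two host strains across all 13 tested BACs. The sketch below transcribes the contig and uncaptured-gap columns from the table; the variable and function names are hypothetical, and the tally is only a descriptive summary, not a statistical test.

```python
# Rows transcribed from the table above:
# (species, ENCODE region, contigs Std, contigs CC, uncaptured gaps Std, uncaptured gaps CC)
ROWS = [
    ("Cat", "ENm011", 18, 13, 6, 2),
    ("Platypus", "ENm009", 15, 8, 13, 3),
    ("Owl Monkey", "ENm006", 14, 9, 6, 3),
    ("Owl Monkey", "ENm011", 22, 15, 9, 1),
    ("Shrew", "ENr331", 22, 11, 11, 0),
    ("Shrew", "ENr322", 21, 22, 9, 10),
    ("Platypus", "ENm006", 20, 10, 7, 3),
    ("Rabbit", "ENm006", 11, 6, 5, 0),
    ("Owl Monkey", "ENr312", 10, 6, 5, 4),
    ("Platypus", "ENr312", 9, 12, 7, 3),
    ("Hedgehog", "ENr211", 21, 13, 9, 0),
    ("Owl Monkey", "ENm007", 10, 6, 2, 0),
    ("Dusky Titi", "ENm006", 8, 13, 0, 2),
]


def summarize(rows):
    """Column totals plus the number of BACs improved by the copy-control host."""
    totals = {
        "contigs_std": sum(r[2] for r in rows),
        "contigs_cc": sum(r[3] for r in rows),
        "gaps_std": sum(r[4] for r in rows),
        "gaps_cc": sum(r[5] for r in rows),
    }
    fewer_contigs = sum(1 for r in rows if r[3] < r[2])
    fewer_gaps = sum(1 for r in rows if r[5] < r[4])
    return totals, fewer_contigs, fewer_gaps


totals, fewer_contigs, fewer_gaps = summarize(ROWS)
print(totals)                      # 201 vs 144 contigs; 89 vs 31 uncaptured gaps
print(fewer_contigs, fewer_gaps)   # 10 and 11 of the 13 BACs
```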