Michael J Moore, Amit Dhingra, Pamela S Soltis, Regina Shaw, William G Farmerie, Kevin M Folta, Douglas E Soltis.
Abstract
BACKGROUND: Plastid genome sequence information is vital to several disciplines in plant biology, including phylogenetics and molecular biology. The past five years have witnessed a dramatic increase in the number of completely sequenced plastid genomes, fuelled largely by advances in conventional Sanger sequencing technology. Here we report a further significant reduction in time and cost for plastid genome sequencing through the successful use of a newly available pyrosequencing platform, the Genome Sequencer 20 (GS 20) System (454 Life Sciences Corporation), to rapidly and accurately sequence the whole plastid genomes of the basal eudicot angiosperms Nandina domestica (Berberidaceae) and Platanus occidentalis (Platanaceae).
Year: 2006 PMID: 16934154 PMCID: PMC1564139 DOI: 10.1186/1471-2229-6-17
Source DB: PubMed Journal: BMC Plant Biol ISSN: 1471-2229 Impact factor: 4.215
Characteristics of the GS 20 combined run data assemblies
| Characteristic | Nandina | Platanus |
| combined run data length | 130503 bp | 136335 bp |
| no. of combined data contigs | 8 | 10 |
| average contig length | 16313 bp | 13634 bp |
| size of largest contig | 35901 bp | 28803 bp |
| total no. of reads | 31019 | 23743 |
| average read length | 103.6 bp | 99.8 bp |
| overall average read depth (incl. one IR) | 24.6× | 17.3× |
| overall average read depth (incl. both IRs) | 20.5× | 14.6× |
| IR average read depth | 24.2× | 28.2× |
| SC average read depth | 24.7× | 14.9× |
| proportion of bases ≥ Q40 | 99.8% | 99.4% |
| no. of gaps | 9 | 11 |
| total gap length | 34 bp | 390 bp |
| average gap length | 3.8 bp | 35.5 bp |
| no. of zero-length gaps | 7 | 5 |
| size of largest gap | 32 bp | 170 bp |
Characteristics of the GS 20 combined run data assemblies. The overall average read depth is calculated in two ways: by including one copy of the inverted repeat (IR) region (to reflect the fact that the two copies of the IR are indistinguishable during genome sequencing, and are therefore contigged together) and by including both copies of the IR region. SC = single-copy region.
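The two depth figures can be approximated from the table values. The sketch below assumes that depth is the total number of sequenced bases (reads × average read length) divided by the reference length, that the first data column is Nandina and the second Platanus (the order used in the captions), and that the "both IRs" length equals the combined run data length plus one IR length from the genome characteristics table. The source does not state the exact formula, so small rounding differences from the reported values are to be expected.

```python
def average_read_depth(n_reads, avg_read_len, ref_len):
    """Approximate mean depth as total sequenced bases / reference length."""
    return n_reads * avg_read_len / ref_len

# Values copied from the tables; the species assignment of columns is assumed.
nandina = {"reads": 31019, "read_len": 103.6, "assembly": 130503, "ir": 26062}
platanus = {"reads": 23743, "read_len": 99.8, "assembly": 136335, "ir": 25066}

def depths(d):
    """Return (depth counting one IR copy, depth counting both IR copies)."""
    one_ir = average_read_depth(d["reads"], d["read_len"], d["assembly"])
    # Counting both IR copies lengthens the reference by one IR.
    both_irs = average_read_depth(d["reads"], d["read_len"], d["assembly"] + d["ir"])
    return one_ir, both_irs

for name, d in (("Nandina", nandina), ("Platanus", platanus)):
    one_ir, both_irs = depths(d)
    print(f"{name}: {one_ir:.1f}x (one IR), {both_irs:.1f}x (both IRs)")
```

Under these assumptions the computed depths agree with the tabulated ones to within about 0.1×.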
Figure 1. Plastid genome map of Nandina domestica. Map of the plastid genome of Nandina domestica (Berberidaceae), showing annotated genes and introns. Asterisks (*) after the gene names indicate the presence of introns; the introns themselves are denoted by white boxes within genes. Within the genome map, the inverted repeat regions (IRA and IRB) are depicted by the solid black bars, and the large and small single-copy regions (LSC and SSC) are depicted by the solid gray bars. Regions that were conventionally sequenced are indicated by the blue bars to the inside of the genome map.
Figure 2. Plastid genome map of Platanus occidentalis. Map of the plastid genome of Platanus occidentalis (Platanaceae), showing annotated genes and introns. Asterisks (*) after the gene names indicate the presence of introns; the introns themselves are denoted by white boxes within genes. Within the genome map, the inverted repeat regions (IRA and IRB) are depicted by the solid black bars, and the large and small single-copy regions (LSC and SSC) are depicted by the solid gray bars. Regions that were conventionally sequenced are indicated by the blue bars to the inside of the genome map.
Basic characteristics of the Nandina and Platanus plastid genomes
| Characteristic | Nandina | Platanus |
| total genome length | 156599 | 161791 |
| IR length | 26062 | 25066 |
| SSC length | 19002 | 19509 |
| LSC length | 85473 | 92150 |
| total length of coding sequence (both IRs) | 92284 | 91397 |
| total length of coding sequence (one IR) | 75763 | 75716 |
| total length of noncoding sequence (both IRs) | 64315 | 70394 |
| total length of noncoding sequence (one IR) | 54774 | 61009 |
| overall G/C content | 38.3% | 38.0% |
Basic characteristics of the Nandina and Platanus plastid genomes. All lengths are given in base pairs (bp). IR = inverted repeat region; SSC = small single-copy region; LSC = large single-copy region.
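The tabulated lengths can be cross-checked against the quadripartite structure of the plastid genome, in which the total length is the LSC plus the SSC plus two copies of the IR. The following is a minimal sketch of that check; the function and dictionary names are illustrative, and the assignment of the first column to Nandina and the second to Platanus is assumed from the caption order.

```python
def genome_length(lsc, ssc, ir):
    """Quadripartite plastid genome: LSC + SSC + two IR copies (all in bp)."""
    return lsc + ssc + 2 * ir

# Values copied from the table above (bp).
genomes = {
    "Nandina": {"total": 156599, "ir": 26062, "ssc": 19002, "lsc": 85473,
                "coding_both": 92284, "noncoding_both": 64315},
    "Platanus": {"total": 161791, "ir": 25066, "ssc": 19509, "lsc": 92150,
                 "coding_both": 91397, "noncoding_both": 70394},
}

for name, g in genomes.items():
    structural = genome_length(g["lsc"], g["ssc"], g["ir"])
    # Coding plus noncoding sequence (both IRs) should also sum to the total.
    by_content = g["coding_both"] + g["noncoding_both"]
    print(name, structural == g["total"], by_content == g["total"])
```

Both checks hold for the two genomes as tabulated.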
List of genes present in the plastid genomes of Nandina and Platanus
| Gene Class | ||||
| Ribosomal RNAs | ||||
| Transfer RNAs | ||||
| Photosystem I | ||||
| Photosystem II | ||||
| Cytochrome b6/f | ||||
| ATP synthase | ||||
| NADH dehydrogenase | ||||
| Ribosomal proteins, large subunit | ||||
| Ribosomal proteins, small subunit | ||||
| RNA polymerase | ||||
| Miscellaneous proteins | ||||
| Hypothetical proteins | ||||
List of genes present in the plastid genomes of Nandina and Platanus. Genes with an asterisk (*) contain introns; genes that are present as duplicate copies due to their position within the inverted repeat regions are indicated as (×2). Ψ = pseudogene.
Error rates for the GS 20 plastid genome sequence
| Region | Nandina | Platanus | combined |
| overall genome | 0.043 | 0.031 | 0.037 |
| overall SC | 0.030 | 0.064 | 0.047 |
| overall IR | 0.054 | 0.004 | 0.029 |
| overall coding | 0.027 | 0.029 | 0.028 |
| overall noncoding | 0.085 | 0.036 | 0.062 |
| SC coding | 0.036 | 0.055 | 0.046 |
| SC noncoding | 0.000 | 0.161 | 0.057 |
| IR coding | 0.018 | 0.000 | 0.009 |
| IR noncoding | 0.115 | 0.011 | 0.063 |
Observed error rates for the GS 20 plastid genome sequence of Nandina, Platanus, and both genomes combined (given in percent). These error rates are based on known GS 20 errors discovered in regions of conventional comparison sequence. Only one copy of the IR was included in error calculation.
Raw values used in error calculations
| Region | Nandina # errors | Nandina length (bp) | Platanus # errors | Platanus length (bp) | combined # errors | combined length (bp) |
| overall genome | 20 | 46134 | 14 | 45249 | 34 | 91383 |
| overall SC | 6 | 20072 | 13 | 20183 | 19 | 40255 |
| overall IR | 14 | 26062 | 1 | 25066 | 15 | 51128 |
| overall coding | 9 | 33170 | 10 | 34006 | 19 | 67176 |
| overall noncoding | 11 | 12946 | 4 | 11243 | 15 | 24189 |
| SC coding | 6 | 16649 | 10 | 18325 | 16 | 34974 |
| SC noncoding | 0 | 3405 | 3 | 1858 | 3 | 5263 |
| IR coding | 3 | 16521 | 0 | 15681 | 3 | 32202 |
| IR noncoding | 11 | 9541 | 1 | 9385 | 12 | 18926 |
Raw values that were used in calculations of observed error in GS 20 plastid genome sequence. Length refers to the length of conventional sequence data used in error calculations.
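The percentages in the error-rate table follow from these raw values as errors divided by the length of conventional comparison sequence, times 100. The sketch below recomputes them; the function names are illustrative, and the column order (Nandina, then Platanus, then combined) is assumed from the error-rate caption.

```python
def error_rate_percent(n_errors, length_bp):
    """Observed error rate in percent: errors per base of comparison sequence."""
    return 100.0 * n_errors / length_bp

# (errors, length in bp) for Nandina and Platanus, copied from the raw-values table.
raw = {
    "overall genome":    ((20, 46134), (14, 45249)),
    "overall SC":        ((6, 20072), (13, 20183)),
    "overall IR":        ((14, 26062), (1, 25066)),
    "overall coding":    ((9, 33170), (10, 34006)),
    "overall noncoding": ((11, 12946), (4, 11243)),
    "SC coding":         ((6, 16649), (10, 18325)),
    "SC noncoding":      ((0, 3405), (3, 1858)),
    "IR coding":         ((3, 16521), (0, 15681)),
    "IR noncoding":      ((11, 9541), (1, 9385)),
}

def rates(region):
    """Return (Nandina, Platanus, combined) error rates in percent, 3 decimals."""
    (n_err, n_len), (p_err, p_len) = raw[region]
    return (
        round(error_rate_percent(n_err, n_len), 3),
        round(error_rate_percent(p_err, p_len), 3),
        # Combined rate pools errors and lengths rather than averaging rates.
        round(error_rate_percent(n_err + p_err, n_len + p_len), 3),
    )

for region in raw:
    print(region, rates(region))
```

Pooling errors and lengths for the combined column, rather than averaging the two per-genome rates, is what reproduces the tabulated combined values.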
Characteristics of GS 20 sequencing errors
| Characteristic | Nandina | Platanus | combined |
| proportion of length-variant HR errors | 100.0 | 61.5 | 84.8 |
| proportion of TLI HR errors | 0.0 | 38.5 | 15.2 |
| proportion of A/T HR errors | 95.0 | 100.0 | 97.0 |
| proportion of C/G HR errors | 5.0 | 0.0 | 3.0 |
| proportion of errors associated with HR sets | 55.0 | 46.2 | 51.5 |
| proportion of errors associated with HRs ≥ 5 | 45.0 | 76.9 | 57.6 |
| average length of HR associated with error | 5.4 | 6.5 | 5.8 |
| proportion of HR-associated insertion errors | 85.0 | 69.2 | 78.8 |
| proportion of HR-associated deletion errors | 15.0 | 30.8 | 21.2 |
Characteristics of observed GS 20 sequencing errors that were associated with homopolymer runs. All values are reported in percent. HR = homopolymer run; TLI = transposition-like insertion (see text).
Figure 3. Illustrations of a transposition-like insertion error and a homopolymer run set. (A) Comparison of a hypothetical stretch of GS 20 genome sequence (top) vs. the "correct" sequence (bottom) in order to illustrate an example of a transposition-like insertion error, in which a base identical in composition to a given HR is inserted in a nearby, nonadjacent position. The transposition-like insertion error in the GS 20 sequence is indicated by the arrow; the colon (:) in the "correct" sequence indicates the absence of the A at the same position. (B) Example of a homopolymer run set.
Figure 4. Distribution of errors associated with homopolymer runs, as a function of homopolymer run length.
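Because most of the observed errors are tied to homopolymer runs, it can be useful to locate such runs in a sequence before comparing it with conventional data. The following is a minimal sketch, not the authors' procedure: the function name, the `min_len` parameter, and the example sequence are all hypothetical, and the threshold of 5 simply mirrors the "HRs ≥ 5" category in the table of error characteristics.

```python
from itertools import groupby

def homopolymer_runs(seq, min_len=2):
    """Return (start, base, length) for each run of identical bases >= min_len."""
    runs = []
    pos = 0
    for base, group in groupby(seq.upper()):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((pos, base, length))
        pos += length  # advance to the 0-based start of the next run
    return runs

# Made-up sequence for illustration only.
example = "GATTTTTACCGAAAAAAT"
print(homopolymer_runs(example, min_len=5))
# → [(2, 'T', 5), (11, 'A', 6)]
```

Start positions are 0-based; runs shorter than `min_len` are skipped but still advance the position counter.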
GS 20 quality scores associated with insertion errors
| | # of insertion errors | | |
| GS 20 quality score | Nandina | Platanus | combined |
| < 20 | 14 | 8 | 22 |
| 20–40 | 2 | 1 | 3 |
| > 40 | 1 | 1 | 2 |
Number of insertion errors in GS 20 combined sequence, as a function of the GS 20 phred-equivalent quality score at the insertion error site.
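For orientation, a phred quality score Q corresponds to an estimated per-base error probability of 10^(-Q/10), so Q20 implies a 1% chance of error and Q40 a 0.01% chance. The sketch below shows the standard conversion; it assumes the GS 20 "phred-equivalent" scores can be read on this scale, which the source does not state explicitly, and the function names are illustrative.

```python
import math

def phred_to_error_probability(q):
    """Standard phred relation: P(error) = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)

def error_probability_to_phred(p):
    """Inverse relation: Q = -10 * log10(P(error))."""
    return -10 * math.log10(p)

# Bin boundaries used in the table above.
for q in (20, 40):
    print(f"Q{q}: error probability {phred_to_error_probability(q):.4%}")
```

On this reading, the concentration of insertion errors at sites scoring below 20 means that most of them fall where the platform itself estimates more than a 1% chance of error.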