Richard Cronn, Aaron Liston, Matthew Parks, David S Gernandt, Rongkun Shen, Todd Mockler.
Abstract
Organellar DNA sequences are widely used in evolutionary and population genetic studies; however, the conservative nature of chloroplast gene and genome evolution often limits phylogenetic resolution and statistical power. To gain maximal access to the historical record contained within chloroplast genomes, we have adapted multiplex sequencing-by-synthesis (MSBS) to simultaneously sequence multiple genomes using the Illumina Genome Analyzer. We PCR-amplified approximately 120 kb plastomes from eight species (seven Pinus, one Picea) in 35 reactions. Pooled products were ligated to modified adapters that included 3 bp indexing tags, and samples were multiplexed at four genomes per lane. Tagged microreads were assembled by de novo and reference-guided assembly methods, using previously published Pinus plastomes as surrogate references. Assemblies for these eight genomes are estimated at 88-94% complete, with an average sequence depth of 55x to 186x. Mononucleotide repeats interrupt contig assembly with increasing repeat length, and we estimate that the limit for their assembly is 16 bp. Comparisons to 37 kb of Sanger sequence show a validated error rate of 0.056%, and conspicuous errors are evident from the assembly process. This efficient sequencing approach yields high-quality draft genomes and should have immediate applicability to genomes with comparable complexity.
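The tag-based multiplexing described in the abstract can be illustrated with a minimal sketch. It assumes that the 3 bp tag sits at the very start of each microread and that only exact tag matches are accepted; the source states neither point. The tag-to-sample map is taken from the MPLX S1 columns of the read-summary table, and the function name and the example reads are hypothetical.

```python
from collections import defaultdict

# Tag-to-sample map for one lane (MPLX S1 columns of the read-summary table).
TAGS = {"CCT": "CONT", "GGT": "GERA", "AAT": "KREM", "ATT": "LAMB"}


def demultiplex(reads, tags=TAGS, tag_len=3):
    """Group reads by their leading index tag and trim the tag off.

    Assumption: the tag occupies the first `tag_len` bases of the read and
    must match exactly; reads with any other prefix are left untrimmed and
    collected under "unassigned".
    """
    bins = defaultdict(list)
    for read in reads:
        sample = tags.get(read[:tag_len])
        if sample is None:
            bins["unassigned"].append(read)
        else:
            bins[sample].append(read[tag_len:])
    return dict(bins)


# Made-up reads for illustration only.
example_reads = ["CCTACGTACGT", "GGTTTGACCA", "AATGGGCCC", "ANTGGGCCC"]
bins = demultiplex(example_reads)
```

Exact matching is the most conservative choice here; a tag with a single miscalled base simply falls into the unassigned bin rather than being guessed at.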
Year: 2008 PMID: 18753151 PMCID: PMC2577356 DOI: 10.1093/nar/gkn502
Source DB: PubMed Journal: Nucleic Acids Res ISSN: 0305-1048 Impact factor: 16.971
Pinus samples used in this study. Voucher specimens are deposited in the Oregon State University Herbarium
| Species | Taxonomy | Source | Reference | GenBank |
|---|---|---|---|---|
| | Subgenus | Newport, Lincoln Co., OR, USA. Accession CONT40 (A. Liston 1315) | | EU998740 |
| | Subgenus | Nanga Parbat Region, Gilgit, Pakistan. 35.400°N, 74.591°E. Accession GERA04 (R. Businský 41123) | | EU998741 |
| | Subgenus | NE Montague, Siskiyou Co., CA, USA. 41.850°N, 122.313°W. Accession LAMB08 (USFS Region 5 Seed Orchard) | | EU998743 |
| | Subgenus | White Mountains, Inyo County, CA, USA. 37.612°N, 118.241°W. Accession LONG01 (Kazmierski s.n.) | | EU998744 |
| | Subgenus | Near Eureka, UT, USA. 39.941°N, 112.146°W. Accession MONO11 (D. Gernandt 479) | | EU998745 |
| | Subgenus | Near San Antonio Peña Nevada, Nuevo León, Mexico. 23.767°N, 99.900°W. Accession NELS03 (D. Gernandt 10198-15098) | | EU998746 |
| | Subgenus | Bi Doup Mountain, Lam Dong, Vietnam. 12.0°N, 108.68°E. Accession KREM03 (Royal Botanic Garden First Darwin Expedition 242) | | EU998742 |
| | | Newport, Lincoln Co., OR, USA. Accession PICSIT04 (Liston 1314) | | EU998739 |
Summaries of total tagged and aligned reads from two multiplex experiments on the Illumina/Solexa 1G Genome Sequencer
| Experiment | MPLX S1 | | | | MPLX S6 | | | |
|---|---|---|---|---|---|---|---|---|
| Total reads | 6 391 206 | | | | 5 053 895 | | | |
| Adapters | 167 038 | | | | 147 632 | | | |
| Net | 6 224 168 | | | | 4 906 263 | | | |
| Tag | CCT | GGT | AAT | ATT | CCT | GGT | AAT | CGT |
| Taxon | CONT | GERA | KREM | LAMB | LONG | MONO | NELS | PICSIT |
| Total reads | 1 423 449 | 1 336 385 | 1 552 811 | 1 420 032 | 930 019 | 1 232 647 | 1 111 158 | 1 263 800 |
| Aligned reads | 1 082 697 | 1 023 041 | 1 204 585 | 1 090 216 | 756 726 | 995 128 | 852 081 | 1 001 719 |
| Aligned (%) | 76.1 | 76.6 | 77.6 | 76.8 | 81.4 | 80.7 | 76.7 | 79.3 |
| Mean coverage | 59 | 138 | 149 | 72 | 117 | 186 | 75 | 55 |
| Number of contigs (RGA) | 57 | 9 | 24 | 68 | 25 | 39 | 104 | 183 |
| Mean length | 2066 | 13 017 | 4852 | 1697 | 4665 | 2959 | 1098 | 626 |
| SD | 4196 | 12 213 | 11 768 | 5709 | 5604 | 4994 | 3908 | 1615 |
| Median contig length | 136 | 9454 | 349 | 86 | 2586 | 409 | 76 | 82 |
| N50 | 8012 | 26 178 | 10 580 | 10 437 | 9460 | 10 401 | 7135 | 4092 |
| Sum contig lengths | 117 784 | 117 153 | 116 448 | 115 444 | 117 189 | 116 456 | 114 246 | 114 679 |
| Exon gaps | 6467 | 4272 | 4861 | 6621 | 4918 | 7924 | 6200 | 8346 |
| Exon complete (%) | 91.0 | 94.0 | 93.2 | 90.7 | 93.1 | 88.9 | 91.3 | 88.3 |
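Two of the summary statistics in this table can be reproduced with a few lines of code. The sketch below uses the common definition of N50 (the length of the contig at which the cumulative length, counted from the longest contig down, first reaches half of the summed contig lengths); the source does not state how its N50 values were computed, so this definition is an assumption. The function names and the toy contig lengths are illustrative only; the read counts in the last line are the CONT values from the table.

```python
def n50(contig_lengths):
    """Return N50 under the usual definition (assumed, not stated in the source)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:  # cumulative length has reached half the total
            return length
    return 0  # empty input


def percent_aligned(aligned_reads, total_reads):
    """Aligned reads as a percentage of tagged reads, to one decimal place."""
    return round(100 * aligned_reads / total_reads, 1)


toy_n50 = n50([10, 8, 6, 4, 2])                  # toy contig lengths
cont_aligned = percent_aligned(1082697, 1423449)  # CONT column of the table
```

For the CONT column, `percent_aligned` gives 76.1, which matches the "Aligned (%)" row.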
Figure 1. Relative frequencies of barcode error by barcode tag (CCT, GGT), experiment (S1, S6) and nucleotide position (1, 2, 3). Observed frequencies of erroneous, nontag nucleotides are indicated by position: 1 (salmon), 2 (blue) and 3 (green); first and second position errors were far more common than third position errors. Slices within a position are scaled proportionately to the number of base calls for that nucleotide; if errors were present at equal frequencies within a base position, each slice would be of equal size and would not extend beyond the perimeter of the circle. In all experiments, errors involving substitutions to ‘A’ were more frequent than expected for positions 1 and 3, whereas errors involving substitutions to ‘T’ were more frequent than expected for position 2.
Figure 2. Plots showing sequencing depth by position for eight chloroplast genomes sequenced by multiplex sequencing-by-synthesis. Microreads per position (y-axis) are plotted in gray relative to the position in the assembly (x-axis, in kb). The median number of reads across each PCR amplicon is indicated by black lines.
Figure 3. Frequency spectrum of mononucleotide repeats observed in reference and microread assemblies of Pinus chloroplast genomes. The number of repeats per length class (6–24 bp) is plotted for P. thunbergii (THUN; salmon) and P. koraiensis (KORA; blue). The average and 95% confidence interval for eight microread assemblies (seven Pinus, one Picea, white circles) are also shown. Inset: relationship between the proportions of repeats terminating contigs and the length of each repeat class for the eight microread assemblies. The least squares regression line is indicated.
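A frequency spectrum of mononucleotide repeats, like the one summarized in Figure 3, can be tallied from any assembled sequence with a regular expression. The sketch below is one possible approach, not the method used in the study; the 6 bp minimum follows the smallest length class in the caption, and the function name and test sequence are made up for illustration.

```python
import re
from collections import Counter


def homopolymer_spectrum(seq, min_len=6):
    """Count mononucleotide runs of at least `min_len` bases, keyed by run length."""
    # One base followed by at least (min_len - 1) copies of itself.
    pattern = re.compile(r"([ACGT])\1{%d,}" % (min_len - 1))
    return Counter(len(m.group(0)) for m in pattern.finditer(seq.upper()))


# Made-up sequence: two 6 bp A runs and one 8 bp T run.
toy_seq = "ACGT" + "A" * 6 + "C" + "T" * 8 + "G" + "A" * 6 + "CG"
spectrum = homopolymer_spectrum(toy_seq)
```

Because the quantifier is greedy, each run is counted once at its full length rather than as several overlapping shorter repeats.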
Error estimates for six genic regions across eight species
| Locus | Length | CONT 40 | GERA 03 | KREM 03 | LAMB 08 | LONG 01 | MONO 10 | NELS 03 | PICSIT 04 | Row Total |
|---|---|---|---|---|---|---|---|---|---|---|
| | 641 | 0 | 0 | 0 | 0 | 1 | 0 | 7 | 0 | 8 |
| psbCD (a) | 2182 | 0 | 0 | 0 | 2 | 0 | 0 | 0 | N/A | 2 |
| | 454 | 0 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 4 |
| | 594 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 717 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| | 346 | 1 | 0 | 2 | 0 | 0 | 0 | 0 | 4 | 7 |
| Total | 4934 | 1 | 0 | 2 | 2 | 1 | 4 | 7 | 4 | 21 |
| Error (%) | | 0.020 | 0 | 0.040 | 0.040 | 0.020 | 0.081 | 0.142 | 0.145 | 0.056 |
Total sequence differences between Illumina/Solexa-derived assemblies and traditional Sanger sequencing are shown.
(a) The psbCD region was not amplified from Picea sitchensis.
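The error percentages in the table follow from dividing the number of sequence differences by the length of Sanger sequence compared. The sketch below reworks that arithmetic from the table values. It assumes that the 2182 bp row is the psbCD region, which is inferred from the N/A entry and the footnote, so the Picea sitchensis total is computed over 4934 - 2182 = 2752 bp. The variable names are illustrative.

```python
# Differences per sample, from the "Total" row of the error table.
diffs = {"CONT": 1, "GERA": 0, "KREM": 2, "LAMB": 2,
         "LONG": 1, "MONO": 4, "NELS": 7, "PICSIT": 4}

TOTAL_LEN = 4934   # summed length of the six loci (bp)
PSBCD_LEN = 2182   # assumed to be psbCD: not amplified from Picea sitchensis

# Length of Sanger sequence actually compared for each sample.
compared = {sample: TOTAL_LEN for sample in diffs}
compared["PICSIT"] -= PSBCD_LEN

# Per-sample and pooled error rates, in percent.
rates = {s: 100 * diffs[s] / compared[s] for s in diffs}
overall = 100 * sum(diffs.values()) / sum(compared.values())
total_compared = sum(compared.values())
```

Under this assumption the compared length sums to 37 290 bp, consistent with the "37 kb of Sanger sequence" in the abstract, and the pooled rate rounds to 0.056%.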
Figure 4. Simulations of higher multiplex levels. Random subsets of microreads from the P. gerardiana data set were sampled to simulate multiplex levels ranging from 4× (1.37 million microreads) to 16× (0.34 million microreads). Triplicate random subsets were assembled with Velvet de novo assembly, and assemblies were evaluated for sequencing depth (A), the number of contigs (B) and the summed contig lengths (C). Solid lines show the best fit line from least squares regression and shaded regions show the 95% confidence interval of the best fit line. The curved line (C) shows the best fit with a smoothing spline (λ = 5 × 10¹⁵; r² = 0.973).
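The read subsampling behind this simulation can be sketched as follows. The caption gives the read counts at the two ends of the range (1.37 million at 4×, 0.34 million at 16×), which is consistent with scaling the read count by 4 divided by the multiplex level; that scaling rule, sampling without replacement, the seeds and the function name are all assumptions made for illustration. Assembly with Velvet is outside the scope of the sketch.

```python
import random


def subsample_for_multiplex(reads, level, base_level=4, seed=None):
    """Draw a random subset of reads sized for a higher multiplex level.

    Assumption: the full data set corresponds to `base_level` samples per
    lane, so a level of `level` keeps len(reads) * base_level / level reads.
    """
    if level < base_level:
        raise ValueError("level must be at least base_level")
    k = len(reads) * base_level // level
    rng = random.Random(seed)
    return rng.sample(reads, k)  # sampling without replacement


# Stand-in for real microreads: integer IDs instead of sequences.
all_reads = list(range(1000))

# Triplicate subsets per level, each with its own seed.
triplicates = {
    level: [subsample_for_multiplex(all_reads, level, seed=rep) for rep in range(3)]
    for level in (4, 8, 16)
}
```

Applied to 1 370 000 reads, the same rule keeps 342 500 reads at 16×, in line with the 0.34 million quoted in the caption.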