Ricardo Ramirez-Gonzalez, Douglas W Yu, Catharine Bruce, Darren Heavens, Mario Caccamo, Brent C Emerson.
Abstract
High-throughput parallel sequencing is a powerful tool for the quantification of microbial diversity through the amplification of nuclear ribosomal gene regions. Recent work has extended this approach to the quantification of diversity within otherwise difficult-to-study metazoan groups. However, nuclear ribosomal genes present both analytical challenges and practical limitations that are a consequence of the mutational properties of nuclear ribosomal genes. Here we exploit useful properties of protein-coding genes for cross-species amplification and denoising of 454 flowgrams. We first use experimental mixtures of species from the class Collembola to amplify and pyrosequence the 5' region of the COI barcode, and we implement a new algorithm called PyroClean for the denoising of Roche GS FLX pyrosequences. Using parameter values from the analysis of experimental mixtures, we then analyse two communities sampled from field sites on the island of Tenerife. Cross-species amplification success of target mitochondrial sequences in experimental species mixtures is high; however, there is little relationship between template DNA concentrations and pyrosequencing read abundance. Homopolymer error correction and filtering against a consensus reference sequence reduced the volume of unique sequences to approximately 5% of the original unique raw reads. Filtering of remaining non-target sequences attributed to PCR error, sequencing error, or numts further reduced unique sequence volume to 0.8% of the original raw reads. PyroClean reduces or eliminates the need for an additional, time-consuming step to cluster reads into Operational Taxonomic Units, which facilitates the detection of intraspecific DNA sequence variation. PyroCleaned sequence data from field sites in Tenerife demonstrate the utility of our approach for quantifying evolutionary diversity and its spatial structure. 
Comparison of our sequence data to public databases reveals that we are able to successfully recover both interspecific and intraspecific sequence diversity.
Year: 2013 PMID: 23469211 PMCID: PMC3585932 DOI: 10.1371/journal.pone.0057615
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Primers, MIDs, sequence formats and consensus reference sequence used in this study.
| Primer | Sequence (5′ - 3′) |
| ColFol-for | TTTCAACAAATCATAARGAYATYGG |
| ColFol-rev | TAAACTTCNGGRTGNCCAAAAAATCA |
| 454-ColFol-for | Adaptor A+MID+TTTCAACAAATCATAARGAYATYGG |
| 454-Col307-rev | Adaptor B+CANCCNGTNCCNGCNCCNCTYTC |
| >GGQCYR401C2J7Z length = 354 xy = 1142_0845 region = 1 run = R_2010_05_04_08_38_41_ |
| >GGQCYR401C2J7Z_1 |
| >Seq1_2343 |
| >EMBOSS_001 NHNNNNNTNNNTNNWNHTNKSNNNNNKNNNNNSNHYNNYNGGNDYNDNNYTNARNNYNNNNNTNNSNNNNRANNTNRSNVRNNYNRGNNNNNWNNTNRRNNRNGANCANVYNTANAAYRYNNYDRTNACNKCNCANGCNKTYDYNATRATNTTYTTYRYDGTNAKNCCNNTHWTRVTHGGNGGNHTHGGNAANTKRHTNVTNCCNNTNATRVTNRRNKCNSCNGAYATNKCNTTNCCNCGNHTNANNAAYHTRAGNTTYTGRYTNYTNCCNCCNDSNHTNNNNNTNNTNNBNNNNRGNDSNNYNDBNNANDNNGRNDNNGGNACNGGNTGRNNNNYNTAYCCNCCNNTNKCNDVNNNNNYNDBNCANNNNGGNNBNDSNRTNGANNTNDNNATYTTYWSNYTNCANYYNRCNGGNRYNNSNTMNATYYTNGGNGCNRTNARYTTYANNWSNWCNDBHDDNNAYATNNRNNNNNNNNNNNTNNNNTGRRANNDNNYNHBNYTNYTNNBNTGNDSNRTNHWHNTNACNDCNDYHYTNYTNBYNNYNDSNHTNCCNKTNNTNNNNGGNGCNRTNWCNATRYTNNTNWYNGAYCGNAANNTNAANNCNDSNTTYTTYNNNCCNDSNGGNGGNGNNGANYMNRWHYTNTWNCANCNYHWNNYY |
ColFol-for and ColFol-rev are the Sanger primers. 454-ColFol-for and 454-Col307-rev are the primers for mass amplification. Adaptors A and B are used by the ‘454’ sequencer to attach individual DNA molecules to microscopic beads for subsequent sequencing. MIDs (Multiplex Identifiers) are 7 bp sequences that allow different samples to be sequenced together on a single ‘454’ plate and then separated bioinformatically for downstream analysis. There is no MID with 454-Col307-rev because we pyrosequenced only from the forward direction. Row 5 is an example of a 454 read after demultiplexing with the Roche tools. The forward primer is underlined, and the reverse primer is dashed underlined. The MID tag is removed during demultiplexing. Row 6 is an example of a sequence after the reads have been processed into a file of unique sequences.
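The role of the degenerate primer and the MID tags can be illustrated with a minimal Python sketch. It is not the Roche demultiplexing tool or any part of PyroClean: the function names are hypothetical, the two MID tags are invented placeholders of the stated 7 bp length, and the read is a made-up example. Only the ColFol-for primer sequence is taken from the table.

```python
import re

# Standard IUPAC nucleotide ambiguity codes and the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def primer_to_regex(primer):
    """Translate a degenerate primer into a regular expression string."""
    return "".join(
        IUPAC[base] if len(IUPAC[base]) == 1 else "[" + IUPAC[base] + "]"
        for base in primer.upper()
    )


FORWARD_PRIMER = "TTTCAACAAATCATAARGAYATYGG"  # ColFol-for, from the table
FORWARD_RE = re.compile(primer_to_regex(FORWARD_PRIMER))


def demultiplex(read, mids):
    """Assign a read to a sample by its leading MID tag (hypothetical helper).

    `mids` maps a tag to a sample name. Returns (sample, insert) when the
    read starts with a known tag followed directly by the forward primer,
    with both tag and primer stripped from the insert; otherwise None.
    """
    for tag, sample in mids.items():
        if read.startswith(tag):
            match = FORWARD_RE.match(read, len(tag))
            if match:
                return sample, read[match.end():]
    return None


# Invented 7 bp tags and read, for illustration only.
mids = {"ACGAGTG": "MID1", "ACGCTCG": "MID2"}
read = "ACGAGTG" + "TTTCAACAAATCATAAAGATATTGG" + "AACATTATATTTTATTTTTGG"
print(demultiplex(read, mids))
```

Exact matching is used here for brevity; a real pipeline would usually tolerate a small number of mismatches in the tag and primer.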
PyroCleaning results for mtDNA COI amplicons generated from experimental pools constructed from 27 genomic extracts from 23 Collembola species.
| | Pool 1 | Pool 1 | Pool 1 | Pool 2 | Pool 2 | Pool 2 | Pool 3 | Pool 3 | Pool 3 | Pool 4 | Pool 4 | Pool 4 | Pool 5 | Pool 5 | Pool 5 |
| | MID1 | MID2 | MID3 | MID4 | MID5 | MID6 | MID7 | MID8 | MID9 | MID10 | MID11 | MID12 | MID13 | MID14 | MID15 |
| Total read count |
| Raw reads | 16,035 | 15,644 | 16,964 | 13,536 | 15,679 | 13,987 | 9,898 | 13,808 | 6,225 | 9,012 | 13,478 | 14,654 | 14,638 | 17,891 | 14,586 |
| Reads above min. length | 10,843 | 11,931 | 11,608 | 10,333 | 12,220 | 10,738 | 7,587 | 10,628 | 4,060 | 7,087 | 10,272 | 11,518 | 10,848 | 12,538 | 10,391 |
| Step 3 | 10,839 [100] | 11,930 [100] | 11,602 [100] | 10,329 [100] | 12,217 [100] | 10,738 [100] | 7,585 [100] | 10,625 [100] | 4,060 [100] | 7,087 [100] | 10,268 [100] | 11,517 [100] | 10,847 [100] | 12,535 [100] | 10,383 [100] |
| Step 4 | 2,715 | 2,833 | 2,806 | 2,074 | 2,659 | 2,130 | 1,856 | 2,725 | 1,006 | 1,672 | 2,276 | 2,735 | 2,302 | 2,790 | 2,087 |
| Unique read count |
| Reads above min. length | 9,777 | 10,879 | 10,522 | 9,408 | 11,036 | 9,800 | 6,862 | 9,605 | 3,751 | 6,385 | 9,257 | 10,422 | 9,585 | 11,026 | 9,211 |
| Step 3 | 638 [6.5] | 718 [6.6] | 676 [6.4] | 634 [6.7] | 713 [6.5] | 667 [6.8] | 435 [6.3] | 621 [6.5] | 225 [6.0] | 380 [6.0] | 596 [6.4] | 721 [7.0] | 592 [6.2] | 677 [6.1] | 553 [6.0] |
| Step 4 | 169 [1.7] | 190 [1.7] | 173 [1.6] | 149 [1.6] | 166 [1.5] | 125 [1.3] | 124 [1.8] | 161 [1.7] | 65 [1.7] | 89 [1.4] | 138 [1.5] | 156 [1.5] | 110 [1.1] | 153 [1.4] | 114 [1.2] |
Each pool has been amplified in triplicate. Sequences were generated on half of a 454 plate, which yielded a total of 156,315 raw reads. Raw reads had a maximum length of 534 bp and an average length of 343.4 bp. Steps 3–4 are summarised in the text. Numbers in brackets represent sequence reduction as the % of raw reads above a minimum length of 170 nucleotides.
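The distinction between the total and unique read counts can be sketched in Python as a length filter followed by dereplication. This is only an illustration of the bookkeeping, not the PyroClean implementation: the function name is hypothetical, the toy reads are invented, and treating the 170-nucleotide threshold as inclusive is an assumption.

```python
from collections import Counter

MIN_LENGTH = 170  # nucleotides, from the table note


def dereplicate(reads, min_length=MIN_LENGTH):
    """Keep reads of at least `min_length` nt and count identical sequences.

    Returns (number of reads kept, Counter mapping sequence -> frequency).
    The inclusive threshold is an assumption made for this sketch.
    """
    kept = [read.upper() for read in reads if len(read) >= min_length]
    return len(kept), Counter(kept)


# Toy reads invented for illustration: two identical 200 nt reads (one in
# lower case), one distinct 200 nt read, and one 40 nt read that is dropped.
reads = ["ACGT" * 50, "acgt" * 50, "TTGA" * 50, "ACGT" * 10]
total, unique = dereplicate(reads)
print("reads above min. length:", total)
print("unique reads:", len(unique))
for sequence, count in unique.most_common():
    print(count, sequence[:12] + "...")
```

Keeping the frequency of each unique sequence alongside the sequence itself preserves the abundance information that later filtering steps and the tree labels rely on.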
Figure 1. Heat map of the percentage representation of the 27 Collembola genomes within each of the 5 genomic pools, and within each of the three MID tag pools derived from each of the 5 genomic pools.
For visual clarity, percentage representation of 2% or more is presented as maximum representation. See Supplementary Table 1 for the actual percentages of each sequence found, and for species names.
Figure 2. Regression analyses of the percentage of collembolan mtDNA COI sequences within an MID tag pool (y axis) against the percentage of genomic template from which they are derived within an experimental genomic pool (x axis).
Data come from Table S1. The panels from top to bottom are MID1 against pool 1, MID4 against pool 2, and MID7 against pool 3.
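A regression of this kind can be sketched in plain Python as an ordinary least-squares fit of read percentage on template percentage. Whether the authors used exactly this model is an assumption, the function name is hypothetical, and the data points below are invented placeholders rather than values from Table S1.

```python
def linear_regression(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, r_squared). Requires at least two points
    and some variation in both x and y.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared


# Invented placeholder data: % genomic template (x) and % reads (y).
template_pct = [1.0, 2.0, 5.0, 10.0, 20.0]
read_pct = [4.0, 0.5, 9.0, 3.0, 6.0]
slope, intercept, r_squared = linear_regression(template_pct, read_pct)
print(f"slope={slope:.3f} intercept={intercept:.3f} R^2={r_squared:.3f}")
```

A low coefficient of determination in such a fit would correspond to the weak relationship between template concentration and read abundance described in the abstract.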
PyroCleaning results for mtDNA COI amplicons generated from community samples of Collembola from two forest sites on the island of Tenerife.
| | Site 1 | Site 1 | Site 1 | Site 2 | Site 2 | Site 2 |
| | MID7 | MID8 | MID9 | MID10 | MID11 | MID12 |
| Total read count |
| Raw reads | 11,249 | 11,479 | 9,234 | 7,583 | 11,426 | 11,584 |
| Reads above min. length | 7,311 | 7,613 | 6,008 | 5,255 | 7,515 | 8,165 |
| Step 3 | 7,311 [100] | 7,613 [100] | 6,007 [100] | 5,255 [100] | 7,515 [100] | 8,165 [100] |
| Step 4 | 4,492 [61] | 4,697 [62] | 3,501 [58] | 3,276 [62] | 4,289 [57] | 4,923 [60] |
| Step 5 | 3,644 [50] | 3,815 [50] | 2,879 [48] | 2,844 [54] | 3,540 [47] | 4,114 [50] |
| Step 6 | 3,596 [49] | 3,764 [49] | 2,828 [47] | 2,744 [52] | 3,409 [45] | 4,006 [49] |
| Unique read count |
| Reads above min. length | 4,742 | 4,962 | 4,067 | 3,356 | 5,234 | 5,473 |
| Step 3 | 450 [9.5] | 427 [8.6] | 381 [9.3] | 304 [9.1] | 493 [9.4] | 504 [9.2] |
| Step 4 | 252 [5.3] | 231 [4.7] | 215 [5.3] | 170 [5.1] | 289 [5.5] | 287 [5.2] |
| Step 5 | 27 [0.57] | 27 [0.54] | 30 [0.74] | 44 [1.31] | 49 [0.94] | 46 [0.84] |
| Step 6 | 17 [0.36] | 17 [0.34] | 19 [0.46] | 20 [0.60] | 20 [0.38] | 21 [0.38] |
Each site has been amplified in triplicate. Of a total of 103,850 raw reads generated on half of a 454 plate, 62,825 belonged to this experiment (the remainder belonged to another experiment). Raw reads had a maximum length of 521 bp and an average length of 343 bp. Steps 3–6 are summarised in the text. Numbers in brackets represent sequence reduction as the % of raw reads above a minimum length of 170 nucleotides.
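The bracketed values can be reproduced as a simple percentage of the corresponding "Reads above min. length" count. The short Python check below uses the MID7 column of the table; the function name is hypothetical, and the choice of rounding precision per row is inferred from the table rather than stated by the authors.

```python
def percent_retained(count, baseline, digits):
    """Express `count` as a percentage of `baseline`, rounded to `digits`."""
    return round(100.0 * count / baseline, digits)


# MID7 values taken from the table above.
total_baseline = 7311   # total reads above min. length
unique_baseline = 4742  # unique reads above min. length

total_steps = {"Step 4": 4492, "Step 5": 3644, "Step 6": 3596}
unique_steps = {"Step 3": 450, "Step 4": 252, "Step 5": 27, "Step 6": 17}

for step, count in total_steps.items():
    print("total ", step, percent_retained(count, total_baseline, 0))
for step, count in unique_steps.items():
    print("unique", step, percent_retained(count, unique_baseline, 2))
```

For example, the 17 unique MID7 sequences remaining after step 6 are 0.36% of the 4,742 unique reads above the minimum length, which matches the bracketed value in the table.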
Figure 3. Neighbour joining tree of PyroCleaned sequences derived from two forest sampling sites on the island of Tenerife, and 21 taxonomically identified Sanger sequence samples from Tenerife (in bold).
Sequence identifiers represent MID tag, the name of the sequence, and the frequency representation of the sequence. Sequences were assessed for taxonomic identity against the BOLD Identification Database, and we report closest matches for Collembola, and for non-Collembola if the match is higher. Letters in bold represent sequence matches to Collembola of 99% or higher: A – Tetrodontophora bielanensis (88%), Pelosia muscerda (Lepidoptera, 91%); B – Tetrodontophora bielanensis (89%), Myotis ikonnikova (Chiroptera, 91%); C – Tetrodontophora bielanensis (89%), Myotis ikonnikova (Chiroptera, 91%); D – Verhoeffiella sp. (89%), Stenopsyche (Trichoptera, 92%); E – no significant match, Stenopsyche (Trichoptera, 94%); F – Folsomia (94%), Ophiogomphus (Odonata, 95%); G – Protaphorura (99%); H – Protaphorura (100%); I – Xenylla humicola (90%), Carterocephalus silvicola (Lepidoptera, 91%); J – Bourletiella (89%); K – Lepidocyrtus violaceus (88%), Herona marathus (Lepidoptera, 90%); L – Tullbergia sp. (99%); M – Parisotoma notabilis L3 (92%); N – Folsomina yosii (90%); O – Isotomiella (99%); P – Parisotoma notabilis (100%); Q – Parisotoma notabilis (100%); R – Parisotoma notabilis (100%); S – Xenylla humicola (90%), Finlaya (Culicidae, 91%); T – Verhoeffiella sp. (90%); U – Entomobrya atrocincta (100%); V – Entomobrya atrocincta (99%); W – Entomobrya atrocincta (100%); X – Parisotoma notabilis (94%), Phasus (Lepidoptera, 96%); Y – Isotomurus (92%); Z – Desoria sp. (91%), Bathymunida nebulosa (Decapoda, 95%); AA – Entomobryidae (90%), Polytremis pellucida (Lepidoptera, 93%); AB – Isotoma sp. (91%), Neophylax rickeri (Trichoptera, 92%); AC – Isotomurus (86%), Chordate (90%); AD – Neanura muscorum (100%); AE – Ceratophysella sp. (99%); AF – Ceratophysella sp. (100%). The five sequences with an asterisk are potential numts or PCR errors that exceed the final filter threshold. The two sequences with a filled circle are left in as examples of probable numts.
Denoising of mtDNA COI amplicons generated from community samples of Collembola from two forest sites on the island of Tenerife with the pipeline of Yu et al. (2012).
| | Site 1 | Site 1 | Site 1 | Site 2 | Site 2 | Site 2 |
| | MID7 | MID8 | MID9 | MID10 | MID11 | MID12 |
| Total read count |
| Step 1 (quality control) | 10,413 | 10,581 | 8,416 | 7,043 | 10,635 | 10,737 |
| Step 2 (PyNAST, 60%) | 10,394 | 10,572 | 8,392 | 7,040 | 10,603 | 10,712 |
| Unique read count |
| Step 2 (PyNAST, 60%) | 7,040 | 7,306 | 5,822 | 4,752 | 7,492 | 7,443 |
| Step 2 (USEARCH) | 2,021 | 2,032 | 1,621 | 1,452 | 2,255 | 2,147 |
| Step 3 (MACSE) | 709 | 719 | 305 | 152 | 940 | 817 |
| Step 4 (DNACLUST, 99%) | 580 | 610 | 267 | 139 | 805 | 710 |
| Step 5 (CROP, 98%) | 69 | 52 | 52 | 65 | 103 | 98 |
Steps 1–3 represent reduction of unique sequence volume by denoising, while steps 4 and 5 further reduce unique sequence volume by the creation of summary clusters of sequences. See Yu et al. (2012) for a detailed explanation of each of the steps. Bracketed values in step 2 represent sequences inferred to be chimeras with the de novo chimera detection function UCHIME in USEARCH, all of which were removed in step 4 of PyroClean.