Perrine Cruaud1, Jean-Yves Rasplus1, Lillian Jennifer Rodriguez1,2, Astrid Cruaud1.
Abstract
Until now, the potential of NGS for the construction of barcode libraries or integrative taxonomy has been seldom realised. Here, we amplified (two-step PCR) and simultaneously sequenced (MiSeq) multiple markers from hundreds of fig wasp specimens. We also developed a workflow for quality control of the data. Illumina and Sanger sequences accumulated in the past years were compared. Interestingly, primers and PCR conditions used for the Sanger approach did not require optimisation to construct the MiSeq library. After quality controls, 87% of the species (76% of the specimens) had a valid MiSeq sequence for each marker. Importantly, major clusters did not always correspond to the targeted loci. Nine specimens exhibited two divergent sequences (up to 10%). In 95% of the species, MiSeq and Sanger sequences obtained from the same sampling were similar. For the remaining 5%, species were paraphyletic or the sequences clustered into divergent groups on the Sanger + MiSeq trees (>7%). These problematic cases may represent coding NUMTs or heteroplasms. Our results illustrate that Illumina approaches are not artefact-free and confirm that Sanger databases can contain non-target genes. This highlights the importance of quality controls, working with taxonomists and using multiple markers for DNA-taxonomy or species diversity assessment.
Year: 2017 PMID: 28165046 PMCID: PMC5292727 DOI: 10.1038/srep41948
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Illustration of the two-step PCR approach.
Table 1. DNA regions targeted for amplification.
| Gene region | ||||
|---|---|---|---|---|
| Primer pair | LCO1490puc-HCO2198puc | UEA3-HCO2198 | CB1-CB2 | F2-557F-F2-1118R |
| “ | “ | |||
| Primer position | ||||
| Amplicon size (nt) | 658 | 409 | 433 | 518 |
| MiSeq sequenced product* (nt) | 709 | 460 | 485 | 563 |
| Read overlap? | no | yes | yes | no |
(*Sequenced product = forward primer + amplicon + reverse primer).
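The footnote's relation (sequenced product = forward primer + amplicon + reverse primer) implies that the combined length of each primer pair is the difference between the two size rows of the table. A minimal sketch of that arithmetic follows; the dictionary and function names are illustrative and not taken from the article.

```python
# Sizes copied from the table above: (amplicon size, MiSeq sequenced product), in nt.
regions = {
    "LCO1490puc-HCO2198puc": (658, 709),
    "UEA3-HCO2198": (409, 460),
    "CB1-CB2": (433, 485),
    "F2-557F-F2-1118R": (518, 563),
}


def combined_primer_length(amplicon_nt, product_nt):
    """Forward + reverse primer length implied by product = primers + amplicon."""
    return product_nt - amplicon_nt


# Hypothetical helper output: combined primer length per primer pair.
primer_lengths = {
    pair: combined_primer_length(amplicon, product)
    for pair, (amplicon, product) in regions.items()
}

for pair, length in primer_lengths.items():
    print(f"{pair}: {length} nt of primer sequence")
```

The check is only a consistency test on the table; it says nothing about how the length is split between the forward and reverse primers.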
Figure 2. Analytical workflow.
Step 1, from read filtering to clustering.
Figure 3. Analytical workflow.
Step 2, quality control of clusters of reads.
Table 2. Sequencing results of the MiSeq data set.
| Gene region | ||
|---|---|---|
| PCR success (MiSeq) | 272 specimens (73.7%) 99 species (86.1%) | 272 specimens (73.7%) 98 species (85.2%) |
| Number (%) of specimens for which at least one cluster of reads was obtained | 325 (88.1%) [incl. 58 with PCR−] | 280 (75.9%) [incl. 17 with PCR−] |
| Number of specimens with PCR+ but no cluster of reads | 5 (1.8%) | 9 (3.3%) |
| Average (maximum) number of clusters per specimen | 2.2 (12) | 1.6 (8) |
| Number of specimens for which the consensus sequence of at least one cluster successfully passed the translation to AA step | 270 (83.1%) [incl. 39 with PCR−] | 270 (96.4%) [incl. 15 with PCR−] |
| Number of specimens for which the consensus sequence of the major cluster2 did not pass the translation to AA step | 77 (23.7%) [incl. 25 with PCR−] | 17 (6.1%) [incl. 2 with PCR−] |
| Number of specimens for which the consensus sequence of the major cluster | 0 | 1 (0.4%) [incl. 1 with PCR−] (lab. aerosol contamination) |
| Number of specimens for which the consensus sequence of the major cluster was identical to sequence(s) of another species of Ceratosolen | 21 (6.5%) [incl. 9 with PCR−] | 8 (2.9%) [incl. 1 with PCR−] |
(Regions for which paired-end reads did overlap).
1 As revealed by a visual inspection of the gel after the first PCR step.
2 The cluster that contains the largest proportion of reads/sequences is called the “major cluster”.
3 At this stage of the process, “major cluster” stands for the cluster that contains the largest proportion of reads/sequences AND whose consensus sequence successfully passed the translation to amino acids step.
4 As revealed by NCBI-BLAST (e.g. symbionts, parasites, or laboratory aerosol contamination).
5 As revealed by visual inspection of trees. In this case, sequences belong to the target group (Ceratosolen) but are identical to sequences from another species (100% BP). This may be due to cross-contamination during library preparation or conversion of indexes due to mixed clusters on the flow cell (clonal clusters derived from more than one template molecule), but also to mtDNA introgression (which is undetectable without taxonomic knowledge of the group or comparison with nuDNA data sets).
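The two definitions of "major cluster" given in the notes above (largest share of reads, then largest share of reads among clusters whose consensus translates) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the `Cluster` type, both function names, and the translation test (at least one forward reading frame free of stop codons under the invertebrate mitochondrial code) are hypothetical choices, and the record does not say how the translation step was actually performed.

```python
from dataclasses import dataclass

# Stop codons of the invertebrate mitochondrial genetic code (NCBI table 5).
# Assumption: a mitochondrial marker; a nuclear marker would need the
# standard code, which additionally treats TGA as a stop codon.
INVERTEBRATE_MITO_STOPS = frozenset({"TAA", "TAG"})


@dataclass
class Cluster:
    consensus: str  # consensus sequence of the cluster
    n_reads: int    # number of reads assigned to the cluster


def has_open_frame(seq, stops=INVERTEBRATE_MITO_STOPS):
    """True if at least one forward reading frame contains no stop codon."""
    seq = seq.upper()
    for frame in range(3):
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        if not any(codon in stops for codon in codons):
            return True
    return False


def major_cluster(clusters, require_translation=False):
    """Cluster with the most reads, optionally restricted to clusters whose
    consensus passes the (hypothetical) translation check."""
    candidates = [
        c for c in clusters
        if not require_translation or has_open_frame(c.consensus)
    ]
    return max(candidates, key=lambda c: c.n_reads, default=None)
```

With `require_translation=False` the function follows the earlier definition; with `require_translation=True` it follows the later one, so a read-rich cluster with stop codons in every frame is skipped in favour of the next largest translatable cluster.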
Table 3. Sequencing results of the MiSeq data set.
| Gene region | ||||
|---|---|---|---|---|
| Forward reads | Reverse reads | Forward reads | Reverse reads | |
| PCR success (MiSeq) | 244 specimens (66.1%) 93 species (80.9%) | 244 specimens (66.1%) 89 species (77.4%) | ||
| Number (%) of specimens for which at least one cluster of reads was obtained | 273 (74.0%) [incl. 36 with PCR−] | 305 (82.7%) [incl. 72 with PCR−] | 300 (81.3%) [incl. 61 with PCR−] | 330 (89.4%) [incl. 93 with PCR−] |
| Number of specimens with PCR+ but no cluster of reads | 7 (2.9%) | 11 (4.5%) | 5 (2.0%) | 7 (2.9%) |
| Average (maximum) number of clusters per specimen | 3.0 (17) | 3.5 (12) | 3.7 (17) | 4.3 (23) |
| Number of specimens for which the consensus sequence of at least one cluster successfully passed the translation to AA step | 218 (79.9%) [incl. 18 with PCR−] | 199 (65.2%) [incl. 25 with PCR−] | 291 (97.0%) [incl. 50 with PCR−] | 227 (68.8%) [incl. 27 with PCR−] |
| Number of specimens for which the consensus sequence of the major cluster2 did not pass the translation to AA step | 116 (42.5%) [incl. 25 with PCR−] | 278 (91.1%) [incl. 63 with PCR−] | 33 (11.0%) [incl. 20 with PCR−] | 279 (84.5%) [incl. 86 with PCR−] |
| Number of specimens for which the consensus sequence of the major cluster | 28 (10.3%) [incl. 4 with PCR−] (27 nematodes, 1 lab. aerosol contamination) | 47 (15.4%) [incl. 2 with PCR−] (41 | 0 | 0 |
| Number of specimens for which the consensus sequence of the major cluster was identical to sequence(s) of another species of Ceratosolen | 6 (2.2%) [incl. 2 with PCR−] | 1 (0.3%) [incl. 0 with PCR−] | 11 (3.7%) [incl. 4 with PCR−] | 6 (1.8%) [incl. 2 with PCR−] |
(Regions for which paired-end reads did not overlap). See Table 2 for legends.
Figure 4. Success rate of amplification as a function of time since storage of specimens in alcohol.
Table 4. Final results obtained on the combined data set (Sanger + MiSeq), after completion of the workflow.
| Gene region | ||||
|---|---|---|---|---|
| Number of specimens with a valid sequence | 195 | 264 | 203 | |
| Number (%) of specimens/species with at least one consensus sequence | 306 (82.9%) [incl. 27 with PCR−]/109 (94.8%) | 261 (70.7%) [incl. 13 with PCR−]/95 (82.6%) | 273 (74.0%) [incl. 45 with PCR−]/97 (84.3%) | |
| Number of specimens/species with PCR+ but no sequence | 14 | 18/4 | 9/3 | |
| Number of specimens/species with PCR− but at least one consensus sequence | 27 | 14/3 | 45/11 | |
| Number of specimens with two valid sequences | 1 (0.3%) | 8 (3.1%) | 0 | |
| Number of problematic species (See text for discussion) | 3 (2.6%) | 4 (3.5%) | 0 | |
| 2 (1.7%) | ||||
1 At least one of the two PCR reactions (COI-short or COI-long) was positive.
2 At least one of the two PCR reactions was positive for at least one specimen.
3 The two PCR reactions were negative.
4 The two PCR reactions were negative for all specimens.
Figure 5. RAxML tree for the Cytb data set (MiSeq + Sanger) (BP: 1000 replicates).
Red and blue circles represent sequences produced by Sanger and MiSeq sequencing, respectively. Dotted lines show problematic cases as discussed in the text (see also Fig. S2).