| Literature DB >> 34674330 |
Pedro G Ribeiro1,2, María Fernanda Torres Jiménez3,4, Tobias Andermann3,4,5,6, Alexandre Antonelli3,4,7,8, Christine D Bacon3,4, Pável Matos-Maraví1,4.
Abstract
The increasing availability of short-read whole genome sequencing (WGS) provides unprecedented opportunities to study ecological and evolutionary processes. Although loci of interest can be extracted from WGS data and combined with target sequence data, this requires suitable bioinformatic workflows. Here, we test different assembly and locus extraction strategies and implement them into secapr, a pipeline that processes short-read data into multilocus alignments for phylogenetics and molecular ecology analyses. We integrate the processing of data from low-coverage WGS (<30×) and target sequence capture into a flexible framework, while optimizing de novo contig assembly and loci extraction. Specifically, we test different assembly strategies by contrasting their ability to recover loci from targeted butterfly protein-coding genes, using four data sets: a WGS data set across different average coverages (10×, 5× and 2×) and a data set for which these loci were enriched prior to sequencing via target sequence capture. Using the resulting de novo contigs, we account for potential errors within contigs and infer phylogenetic trees to evaluate the ability of each assembly strategy to recover species relationships. We demonstrate that choosing multiple sizes of kmer simultaneously for assembly results in the highest yield of extracted loci from de novo assembled contigs, while data sets derived from sequencing read depths as low as 5× recovers the expected species relationships in phylogenetic trees. By making the tested assembly approaches available in the secapr pipeline, we hope to inspire future studies to incorporate complementary data and make an informed choice on the optimal assembly strategy.Entities:
Keywords: zzm321990secaprzzm321990; de novo assembly; loci extraction; low-coverage whole genome sequencing; target sequence capture
Mesh:
Year: 2021 PMID: 34674330 PMCID: PMC9298010 DOI: 10.1111/mec.16240
Source DB: PubMed Journal: Mol Ecol ISSN: 0962-1083 Impact factor: 6.622
FIGURE 2Flowchart summarizing the process of contig assemblage implemented in abyss (left, Jackman et al., 2017; Simpson et al., 2009) and spades (right, Bankevich et al., 2012). The chart briefly describes the de Bruijn graph construction and the differences between both assemblers. See Box 1 for more details
FIGURE 1Schematic representation of the workflow implemented in this study using the secapr version 2.2.3 pipeline. The bash commands used in each step are shown inside coloured (blue) boxes
Pairwise contrasts of the marginal estimated means of recovered loci per sample between multi‐kmer and single‐kmer strategies and between both multi‐kmer strategies
| Pairwise comparisons among strategies | Completeness of alignments | WGS 10× read depth | Target sequence capture | ||
|---|---|---|---|---|---|
| Estimate |
| Estimate |
| ||
|
| Unprocessed | 136 | <.0001 | 126 | <.0001 |
|
| 99 | <.0001 | 120 | <.0001 | |
|
| 143 | <.0001 | 115 | <.0001 | |
|
| 191 | <.0001 | 104 | <.0001 | |
|
| 68 | <.0001 | 85 | <.0001 | |
|
| 68 | <.0001 | 41 | .012 | |
|
| 31 | .1311 | 35 | .0535 | |
|
| 75 | <.0001 | 30 | .1386 | |
|
| 123 | <.0001 | 19 | .6389 | |
|
| Processed | 117 | <.0001 | 107 | <.0001 |
|
| 81 | <.0001 | 102 | <.0001 | |
|
| 129 | <.0001 | 100 | <.0001 | |
|
| 180 | <.0001 | 92 | <.0001 | |
|
| 27 | .2523 | 44 | .006 | |
|
| 90 | <.0001 | 63 | <.0001 | |
|
| 54 | .0002 | 58 | <.0001 | |
|
| 102 | <.0001 | 56 | .0001 | |
|
| 153 | <.0001 | 48 | .0015 | |
Contrasts are made considering an average of 10× coverage for the low‐coverage WGS and the target sequence capture data sets. Standard error = 12.4 for all pairwise comparisons; degrees of freedom = 477 for all pairwise comparisons
Pairwise contrasts of the marginal estimated means of recovered loci per sample between multi‐kmer and single‐kmer strategies and between both multi‐kmer strategies for each of the subsets of depths of coverage of low‐coverage WGS data set
| Pairwise comparisons among strategies | Completeness of alignments | WGS 10× read depth | WGS 5× read depth | WGS 2× read depth | |||
|---|---|---|---|---|---|---|---|
| Estimate |
| Estimate |
| Estimate |
| ||
|
| Unprocessed | 149.1 | <.0001 | 133.55 | <.0001 | 42.55 | .0005 |
|
| 116.02 | <.0001 | 126.17 | <.0001 | 61.62 | <.0001 | |
|
| 166.22 | <.0001 | 172.62 | <.0001 | 95.97 | <.0001 | |
|
| 217.75 | <.0001 | 229.5 | <.0001 | 109.85 | <.0001 | |
|
| 70.38 | <.0001 | 69.28 | <.0001 | −33.97 | .0118 | |
|
| 78.72 | <.0001 | 64.27 | <.0001 | 76.52 | <.0001 | |
|
| 45.63 | .0001 | 56.88 | <.0001 | 95.58 | <.0001 | |
|
| 95.83 | <.0001 | 103.33 | <.0001 | 129.93 | <.0001 | |
|
| 147.37 | <.0001 | 160.22 | <.0001 | 143.82 | <.0001 | |
|
| Processed | 117.8 | <.0001 | 102.25 | <.0001 | 11.25 | .8786 |
|
| 82.78 | <.0001 | 92.93 | <.0001 | 28.38 | .061 | |
|
| 131.48 | <.0001 | 137.88 | <.0001 | 61.23 | <.0001 | |
|
| 182.45 | <.0001 | 194.2 | <.0001 | 74.55 | <.0001 | |
|
| 33.12 | .0155 | 32.02 | <.0001 | −71.23 | <.0001 | |
|
| 84.68 | <.0001 | 70.23 | <.0001 | 82.48 | <.0001 | |
|
| 49.67 | <.0001 | 60.92 | <.0001 | 99.62 | <.0001 | |
|
| 98.37 | <.0001 | 105.87 | <.0001 | 132.47 | <.0001 | |
|
| 149.33 | <.0001 | 162.18 | .022 | 145.78 | <.0001 | |
Standard error for all pairwise contrasts = 10.2, degrees of freedom for all pairwise contrasts = 375.
FIGURE 3Boxplot of the median recovered loci per sample for each assembly strategy for the 10× average coverage WGS and target sequence capture data sets. (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the two different types of sequencing approaches that our data are derived from, target sequence capture and low‐coverage WGS at 10× coverage
FIGURE 4Boxplot of the median recovered loci per sample for each assembly strategy for the three subsets of different average coverages (10×, 5× and 2×). (a) Unprocessed data, without the exclusion of sequences with more than 50% missing information – Ns. (b) Processed data (when sequences with more than 50% missing information ‐ Ns ‐ are excluded). Different colours represent the different average depths of coverage for the WGS data sets
FIGURE 5Species tree obtained with ASTRAL III (Zhang et al., 2018) by using gene trees estimated with iq‐tree version 2.0.7 (Minh et al., 2020). Tips represent each of the studied species and numbers represent local posterior probabilities inferred by ASTRAL III for the specific node. This species tree was obtained using the alignment derived from our abyss multi‐kmer strategy for the WGS data set with an average of 10× read depth coverage. Only this tree is shown since topology is the same for other well‐resolved species tree
FIGURE 6Kernel density of the percentage of error per locus, averaged across species. Percentage of error was calculated from the mismatches and alignment lengths in the blast results. (a) Comparisons between spades and our multi‐kmer abyss approach, showing the density of the average percentage of error in assembly strategy (same data set, different assembly approach); WGS indicates our whole genome sequence data set with 10× average depth of coverage and TC indicates the target sequence capture data set. (b) Comparisons showing the density of the average percentage of error from comparing the same assembly approach but different coverages. Multi‐kmer indicates the multi‐kmer abyss approach. In each density graph, the term before “vs” indicates the subject and the term after “vs” indicates the query. Numbers under each density graph indicate the average number of loci aligned per species in the blast alignments. Graphs show that the percentage of errors tend to increase when using reduced read coverage, regardless of the used assembly strategy