Johanna R Jantzen, Prabha Amarasinghe, Ryan A Folk, Marcelo Reginato, Fabian A Michelangeli, Douglas E Soltis, Nico Cellinese, Pamela S Soltis.
Abstract
PREMISE: Putatively single-copy nuclear (SCN) loci, which are identified using genomic resources of closely related species, are ideal for phylogenomic inference. However, suitable genomic resources are not available for many clades, including Melastomataceae. We introduce a versatile approach to identify SCN loci for clades with few genomic resources and use it to develop probes for target enrichment in the distantly related Memecylon and Tibouchina (Melastomataceae).
Keywords: HybPiper; MarkerMiner; Memecylon; Tibouchina; phylogenomics; target capture
Year: 2020 PMID: 32477841 PMCID: PMC7249273 DOI: 10.1002/aps3.11345
Source DB: PubMed Journal: Appl Plant Sci ISSN: 2168-0450 Impact factor: 1.936
Figure 1. Overview of the steps in the two‐tiered probe development pipeline. In the first tier, loci are selected using MarkerMiner, and loci from other sources are added. In the second tier, genome‐skimming reads are assembled with HybPiper against the reference sequence for each locus selected in the first tier; alignments of the assembled sequences are then used to design probes for target capture.
Figure 2. Venn diagrams of the transcriptomic sources (Miconia and Medinilla) of loci identified by MarkerMiner in Tier 1, shown by genomic resource. (A) Loci common to the Arabidopsis thaliana genome and the Medinilla and Miconia transcriptomes. (B) Loci common to the Theobroma cacao genome and the Medinilla and Miconia transcriptomes. The overlap between the two genomes is not shown.
Sources of loci and template sequences by method and genome source.
| Locus sources (identification method) | Genomic sources for template sequences |
|---|---|
| MarkerMiner |
|
| Angiosperms353 |
|
| Published SCN (Reginato and Michelangeli, 2016b) |
|
| Functional loci (TAIR database) | Melastomateae genome skims (two species) |
|
| |
| Angiosperms353 sequences (rosids) |
Sequencing statistics for target enrichment of representative species of Memecylon and Tibouchina, averaged when multiple samples were sequenced per species.
| Species | Percent on‐target reads | No. of total reads | Mean locus length per species (excluding zeros) | Maximum locus length per species | No. of templates with sequences (min–max) | No. of templates with sequences at 50% of reference length (min–max) | No. of loci with potential paralogs |
|---|---|---|---|---|---|---|---|
| | 84.6 | 4,060,390 | 551 | 4314 | 332 (329–334) | 218 (207–229) | 4 |
| | 86.6 | 4,306,749 | 539 | 4698 | 323 (290–347) | 209 (180–229) | 4 |
| | 79.8 | 2,625,799 | 536 | 4230 | 306 | 198 | 8 |
| | 83.9 | 1,271,186 | 553 | 4227 | 288 | 192 | 2 |
| | 80.8 | 1,540,585 | 552 | 4224 | 295 | 193 | 2 |
| | 92.9 | 2,146,671 | 493 | 4233 | 305 | 186 | 4 |
| | 83.1 | 2,127,129 | 555 | 4314 | 310 (289–334) | 207 (177–226) | 2 |
| | 84.2 | 2,281,370 | 559 | 4230 | 305 | 205 | 5 |
| | 89.6 | 2,600,174 | 530 | 4338 | 311 | 196 | 1 |
| | 76.1 | 821,062 | 572 | 4314 | 290 | 202 | 3 |
| | 73.3 | 2,657,756 | 529 | 4104 | 318 | 203 | 15 |
| | 85.9 | 2,332,960 | 545 | 4092 | 315 (301–328) | 206 (201–211) | 5 |
| | 88.7 | 1,536,920 | 516 | 4215 | 311 | 187 | 3 |
| | 81.4 | 1,774,516 | 550 | 4302 | 349 (295–366) | 261 (219–275) | 44 |
| | 87.4 | 3,226,663 | 566 | 3969 | 348 (243–402) | 262 (154–323) | 14 |
| | 86.1 | 2,344,437 | 571 | 4512 | 364 (312–394) | 279 (222–316) | 51 |
| | 89.8 | 1,273,246 | 527 | 3840 | 324 (312–335) | 235 (228–241) | 8 |
| | 68.0 | 2,389,578 | 591 | 3831 | 320 | 262 | 21 |
| | 89.7 | 1,560,620 | 523 | 3831 | 346 (339–353) | 243 (239–247) | 28 |
| | 87.6 | 2,048,631 | 532 | 3840 | 328 (265–382) | 241 (197–280) | 35 |
| | 83.2 | 3,072,283 | 549 | 3891 | 364 (338–397) | 271 (232–321) | 61 |
| | 84.7 | 5,222,221 | 574 | 3837 | 360 | 277 | 25 |
Sequencing statistics for target enrichment of population‐level sampling of Memecylon.
| Sample | Species | No. of total reads | Percent on‐target reads | No. of templates with sequences | No. of templates with sequences at 50% of reference length | No. of potential paralogs |
|---|---|---|---|---|---|---|
| C3 | | 2,956,804 | 84.0 | 316 | 215 | 2 |
| E3 | | 5,766,861 | 86.5 | 328 | 199 | 3 |
| E5 | | 973,415 | 86.6 | 296 | 189 | 1 |
| G3 | | 2,845,352 | 90.1 | 320 | 205 | 3 |
| GM5 | | 2,166,529 | 89.4 | 306 | 197 | 3 |
| MK3 | | 12,565,563 | 87.6 | 350 | 225 | 8 |
| MK6 | | 798,693 | 88.9 | 290 | 180 | 0 |
| OM10 | | 5,766,164 | 83.8 | 334 | 222 | 6 |
| OM12 | | 2,796,409 | 87.8 | 320 | 202 | 5 |
| SIL1 | | 5,976,421 | 87.4 | 346 | 217 | 8 |
| U5 | | 4,217,194 | 86.6 | 329 | 218 | 3 |
| B2 | | 1,086,552 | 80.4 | 302 | 199 | 3 |
| B4 | | 2,564,946 | 81.8 | 321 | 216 | 4 |
| L1 | | 1,205,197 | 80.1 | 299 | 206 | 3 |
| L3 | | 1,917,426 | 87.8 | 312 | 194 | 5 |
| LP2 | | 4,587,459 | 88.8 | 326 | 210 | 4 |
| M4 | | 4,929,180 | 78.7 | 332 | 229 | 3 |
| MG1 | | 3,948,452 | 86.2 | 319 | 209 | 2 |
| BR1 | | 524,245 | 86.0 | 252 | 174 | 1 |
| ME1 | | 5,329,524 | 89.0 | 327 | 211 | 1 |
| MO2 | | 2,126,241 | 87.3 | 310 | 202 | 3 |
| MO3 | | 4,695,432 | 87.3 | 332 | 221 | 4 |
| NK1 | | 4,651,542 | 82.6 | 331 | 227 | 0 |
| O1 | | 1,265,648 | 76.0 | 303 | 209 | 1 |
| O4 | | 1,035,677 | 76.4 | 298 | 214 | 1 |
| O5 | | 1,025,963 | 79.6 | 294 | 205 | 0 |
| O7 | | 1,003,379 | 83.2 | 296 | 201 | 2 |
| O8 | | 1,040,441 | 77.9 | 297 | 202 | 4 |
| OH1 | | 744,716 | 88.0 | 289 | 177 | 1 |
| OM1 | | 891,562 | 83.4 | 301 | 203 | 1 |
| S2 | | 3,837,706 | 85.1 | 327 | 221 | 3 |
| S7 | | 1,141,947 | 81.4 | 298 | 211 | 0 |
| W6 | | 4,327,313 | 86.5 | 334 | 216 | 7 |
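
The column "No. of templates with sequences at 50% of reference length" can be read as a count of templates whose recovered sequence reaches at least half the length of its reference. The sketch below illustrates that reading only; the function name, the dictionary inputs, and the locus lengths are hypothetical and are not taken from the study or from HybPiper output.

```python
def count_templates_at_fraction(recovered, reference, fraction=0.5):
    """Return (templates with any sequence, templates reaching `fraction` of reference length).

    `recovered` and `reference` map template names to lengths in bp.
    Templates missing from `recovered` are treated as length 0.
    """
    n_with_sequence = 0
    n_at_fraction = 0
    for name, ref_len in reference.items():
        rec_len = recovered.get(name, 0)
        if rec_len > 0:
            n_with_sequence += 1
            if rec_len >= fraction * ref_len:
                n_at_fraction += 1
    return n_with_sequence, n_at_fraction


# Hypothetical lengths for illustration only
reference = {"locusA": 1000, "locusB": 600, "locusC": 900, "locusD": 300}
recovered = {"locusA": 950, "locusB": 250, "locusC": 450}

print(count_templates_at_fraction(recovered, reference))  # (3, 2)
```

Here three templates have some sequence, but only two reach the 50% threshold, which mirrors the gap between the two template columns in the table.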
Sequencing statistics for target enrichment of Memecylon and Tibouchina, averaged for each clade.
| Statistics | Memecylon | Tibouchina |
|---|---|---|
| Mean percent on‐target reads (min–max) | 84.5 (42–95) | 84.3 (68–92) |
| Mean reads mapped | 2,492,585 | 1,973,410 |
| Mean total reads | 2,913,061 | 2,334,623 |
| Mean read depth | 702× | 554× |
| Mean locus count per species (min–max) | 218 (101–263) | 226 (206–247) |
| Mean template count per species (min–max) | 304 (105–442) | 377 (285–455) |
| Mean taxon count per locus (template) | 30 (29) | 24 (28) |
| Mean number of templates with sequences (min–max) | 305 (105–347) | 377 (79–411) |
| Mean number of templates with sequences at 50% of length (min–max) | 200 (38–267) | 259 (10–340) |
| Mean total exon length, bp | 401 | 439 |
| Mean total intron length, bp | 822 | 568 |
| Mean supercontig length, bp | 1209 | 1019 |
| Mean number of potential paralogs per species (min–max) | 5 (0–15) | 40 (7–102) |
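
As a quick check on how the read counts relate to the on-target percentages, the snippet below divides the mean mapped reads by the mean total reads for each column. It assumes that "reads mapped" corresponds to on-target reads, which the table does not state explicitly. The results differ slightly from the reported means, as expected: a ratio of means is not the same quantity as a mean of per-sample ratios.

```python
def percent_on_target(mapped_reads, total_reads):
    """Percentage of reads that mapped, given raw counts."""
    return 100.0 * mapped_reads / total_reads


# Values copied from the "Mean reads mapped" and "Mean total reads" rows
first_clade = percent_on_target(2_492_585, 2_913_061)
second_clade = percent_on_target(1_973_410, 2_334_623)

print(round(first_clade, 1))   # 85.6 (table reports a per-sample mean of 84.5)
print(round(second_clade, 1))  # 84.5 (table reports a per-sample mean of 84.3)
```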
Sequencing statistics for target enrichment of Memecylon and Tibouchina, grouped by genomic resource used to design probes.
| Sample clades | Genome source for probe design | ||||
|---|---|---|---|---|---|
|
|
| Angiosperms353 Rosid sequences |
|
| |
|
| |||||
| No. of taxa per template | 60/62 | 8/62 | 23/62 | 31/62 | 37/62 |
| Average (min–max) percent identity between templates and target sequences | 82.8 (10–100) | NA | 30.2 (10–100) | 49.8 (10–100) | 39.0 (10–81.3) |
| Average (min–max) exon length | 762 (60–4698) | NA | 231 (51–1287) | 341 (57–3837) | 582 (69–1752) |
| Average (min–max) intron length | 1545 (0–28,880) | NA | 387 (0–3626) | 697 (0–17,048) | 595 (0–2964) |
| Average (min–max) supercontig length | 2268 (95–19,522) | NA | 617 (92–4280) | 1029 (73–20,164) | 1154 (74–3348) |
| Percent reads on target | 67.8 (MM) / 10.6 (A353) | 0.56 | 6.06 | 44.8 (MM) / 7.24 (A353) | 60.3 (MM) / 0.38 (A353) |
|
| |||||
| No. of taxa per template | 26/40 | 17/40 | 13/40 | 37/40 | 33/40 |
| Average (min–max) percent identity between templates and target sequences | 50.8 (10–100) | 25.2 (10.2–94.9) | 39.7 (10–97.8) | 81.0 (10–100) | 38.0 (10–97.5) |
| Average (min–max) exon length | 498 (57–3894) | 206 (69–795) | 407 (57–3540) | 614 (54–5673) | 522 (57–2043) |
| Average (min–max) intron length | 475 (0–17,558) | 399 (1–3609) | 530 (0–10,108) | 923 (0–19,505) | 337 (0–6400) |
| Average (min–max) supercontig length | 978 (69–22,850) | 655 (88–4368) | 931 (67–13,649) | 1563 (63–21,330) | 860 (83–7660) |
| Percent reads on target | 28.5 (MM) / 3.2 (A353) | 0.68 | 3.93 | 56.8 (MM) / 12.4 (A353) | 33.6 (MM) / 0.96 (A353) |
A353 = Angiosperms353; MM = MarkerMiner.
Figure 3. Heatmap showing sequencing success of target enrichment for Memecylon and Tibouchina (Melastomataceae). Loci are on the x axis, grouped first by the method of development and then by genomic source of template sequence. Taxa are on the y axis. (A) Colors represent length (bp) of assembled sequences. (B) Shading represents the proportion of sequence length recovered relative to the template sequence. A = Angiosperms353 loci, F = functional loci, Me = Memecylon, MM = MarkerMiner loci, R = rosids, S = published SCN loci (Reginato and Michelangeli, 2016b), Ti = Tibouchina, Tr = transcriptomes.
Figure 4. Percent identity of templates to recovered sequences (x axis) plotted against (A) mean total exon length and (B) percent of taxa recovered for each template sequence (y axis). Templates are grouped by the genomic source used for probe design for target enrichment of Memecylon (purple circles) and Tibouchina (blue squares).
Variation statistics for alignments from the recovered sequences of Memecylon in interspecific and intraspecific phylogenetic analyses.
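| Statistics | Interspecific exons, All | Interspecific exons, A353 | Interspecific exons, MM | Interspecific introns, All | Interspecific introns, A353 | Interspecific introns, MM | Interspecific supercontigs, All | Interspecific supercontigs, A353 | Interspecific supercontigs, MM | Intraspecific exons, All | Intraspecific exons, A353 | Intraspecific exons, MM | Intraspecific introns, All | Intraspecific introns, A353 | Intraspecific introns, MM | Intraspecific supercontigs, All | Intraspecific supercontigs, A353 | Intraspecific supercontigs, MM |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|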
| No. of genes | 103 | 10 | 38 | 103 | 10 | 38 | 103 | 10 | 38 | 87 | 23 | 52 | 87 | 23 | 52 | 87 | 23 | 52 |
| Alignment length | 40,306 | 3381 | 25,922 | 26,109 | 4648 | 8998 | 79,170 | 8510 | 33,673 | 50,122 | 6483 | 40,735 | 42,087 | 14,253 | 23,377 | 123,983 | 26,447 | 66,082 |
| Parsimony‐informative sites (%) | 3819 (9.5) | 278 (8.2) | 2537 (9.8) | 6812 (26.1) | 1133 (24.4) | 2819 (31.3) | 11,182 (14.1) | 1534 (18.0) | 4536 (13.5) | 777 (1.6) | 86 (1.3) | 651 (1.6) | 2221 (5.3) | 557 (3.9) | 1540 (6.6) | 3385 (2.7) | 740 (2.8) | 1797 (2.7) |
| Constant sites | 30,471 | 2607 | 19,292 | 10,616 | 1894 | 3146 | 51,421 | 4492 | 21,175 | 47,826 | 6278 | 38,734 | 33,201 | 11,783 | 17,720 | 110,819 | 23,132 | 58,890 |
| Missing data, % | 4.2 | 2.1 | 0.7 | 5.6 | 3 | 2.4 | 3 | 0.2 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1.9 | 0.9 | 0.6 |
All = all selected loci for phylogenetic analysis; A353 = Angiosperms353 loci; MM = MarkerMiner loci.
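
The "Parsimony‐informative sites (%)" row follows the standard definition: an alignment column is parsimony-informative when at least two character states each occur in at least two sequences. The sketch below counts such columns in a toy alignment. The function names, the toy sequences, and the choice to ignore gaps and ambiguous characters (`-`, `?`, `N`) are assumptions made for illustration, not the procedure used in the paper.

```python
from collections import Counter


def is_parsimony_informative(column, ignore="-?N"):
    """True if at least two states each occur in at least two sequences."""
    counts = Counter(c.upper() for c in column if c.upper() not in ignore)
    return sum(1 for n in counts.values() if n >= 2) >= 2


def count_informative_sites(alignment):
    """Count parsimony-informative columns in a list of equal-length sequences."""
    if len({len(seq) for seq in alignment}) != 1:
        raise ValueError("all sequences must have the same length")
    return sum(is_parsimony_informative(col) for col in zip(*alignment))


# Toy alignment (hypothetical sequences)
alignment = [
    "ACGTA",
    "ACGTC",
    "ATGAC",
    "ATG-A",
]

n_sites = count_informative_sites(alignment)
percent = 100.0 * n_sites / len(alignment[0])
print(n_sites, percent)  # 2 40.0

# The percentage in the table is the site count divided by alignment length,
# e.g. interspecific exons, all loci: 3819 of 40,306 sites
print(round(100.0 * 3819 / 40306, 1))  # 9.5
```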