Hao Yuan, Calder Atta, Luke Tornabene, Chenhong Li.
Abstract
Exon capture across species has been one of the most broadly applied approaches to acquire multi-locus data in phylogenomic studies of non-model organisms. Methods for assembling loci from short-read sequences (e.g., Illumina platforms) that rely on mapping reads to a reference genome may not be suitable for studies comprising species across a wide phylogenetic spectrum; thus, de novo assembly methods are more generally applied. Current approaches for assembling targeted exons from short reads are not particularly optimized, as they cannot (1) assemble loci with low read depth, (2) handle large files efficiently, and (3) reliably address issues with paralogs. Thus, we present Assexon: a streamlined pipeline that de novo assembles targeted exons and their flanking sequences from raw reads. We tested our method using reads from Lepisosteus osseus (4.37 Gb) and Boleophthalmus pectinirostris (2.43 Gb), which were captured using baits designed based on the genome sequences of Lepisosteus oculatus and Oreochromis niloticus, respectively. We compared the performance of Assexon to PHYLUCE and HybPiper, which are commonly used pipelines to assemble ultra-conserved element (UCE) and Hyb-seq data. A custom exon capture analysis pipeline (CP) developed by Yuan et al. was compared as well. Assexon accurately assembled 3400 to 3800 (20%-28%) more loci than PHYLUCE and 1900 to 2300 (8%-14%) more loci than HybPiper across different levels of phylogenetic divergence. Assexon ran at least twice as fast as PHYLUCE and HybPiper. The number of loci assembled using CP was comparable to that of Assexon in both tests, but Assexon ran at least 7 times faster than CP. In addition, some steps of CP require user interaction and are not fully automated, and this user time was not counted in our calculation. Both Assexon and CP retrieved no paralogs in the testing runs, but PHYLUCE and HybPiper did.
In conclusion, Assexon is a tool for accurate and efficient assembly of large read sets from exon capture experiments. Furthermore, Assexon includes scripts to filter poorly aligned coding regions and flanking regions, calculate summary statistics of loci, and select loci with reliable phylogenetic signal. Assexon is available at https://github.com/yhadevol/Assexon.
Keywords: Exon capture; data filtering; de novo assembly; hybrid enrichment; phylogenomics; read assembly
Year: 2019 PMID: 31523128 PMCID: PMC6732846 DOI: 10.1177/1176934319874792
Source DB: PubMed Journal: Evol Bioinform Online ISSN: 1176-9343 Impact factor: 1.625
Figure 1. Outline of the assembly procedure. (1) PCR duplicates are removed from trimmed reads using rmdup.pl. (2) De-duplicated reads are parsed to homologous loci using ubxandp.pl. (3) Parsed reads are assembled separately into contigs using sga_assemble.pl. (4) Contigs are elongated, and exons are then extracted from the contigs with the best hit to the references using exonerate_best.pl and merge.pl. (5) Potential paralogs are removed using reblast.pl. (6) The resulting assemblies are aligned using mafft_aln.pl. (7) Alignments can be filtered using filter.pl, flank_filter.pl, monophyly_test.pl, or clocklikeness_test.pl.
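
The step order in Figure 1 can be written down as a small ordered plan. The sketch below only records the step numbers and script names given in the caption; it does not run the scripts, and it shows no command-line options because the caption gives none. The `Step` class, the `PIPELINE` tuple, and the `describe` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One numbered step of the Figure 1 outline (illustrative container)."""
    number: int
    action: str
    scripts: tuple


# Step order and script names are copied from the Figure 1 caption.
PIPELINE = (
    Step(1, "Remove PCR duplicates from trimmed reads", ("rmdup.pl",)),
    Step(2, "Parse de-duplicated reads to homologous loci", ("ubxandp.pl",)),
    Step(3, "Assemble parsed reads into contigs", ("sga_assemble.pl",)),
    Step(4, "Elongate contigs and extract best-hit exons",
         ("exonerate_best.pl", "merge.pl")),
    Step(5, "Remove potential paralogs", ("reblast.pl",)),
    Step(6, "Align resulting assemblies", ("mafft_aln.pl",)),
    Step(7, "Optionally filter alignments",
         ("filter.pl", "flank_filter.pl", "monophyly_test.pl",
          "clocklikeness_test.pl")),
)


def describe(pipeline):
    """Return one human-readable line per step, in execution order."""
    return [f"({s.number}) {s.action}: {', '.join(s.scripts)}" for s in pipeline]


for line in describe(PIPELINE):
    print(line)
```

Keeping the plan as data makes it easy to see that steps 1 to 6 form the assembly proper, while step 7 lists alternative filters that can be applied to the alignments.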
Figure 2. Diagram of an ambiguous overlap between 2 contigs. Green and black lines represent the reference and the overlapping contigs, respectively. Blue dashed lines represent unaligned bases. Substitutions occur at the bases neighboring the crossed region, so these bases cannot be aligned to the reference. In this diagram, the ambiguous overlap consists of the crossed region, the unaligned bases at the end of contig 1, and the unaligned bases at the start of contig 2.
Summary statistics of cross-species and cross-order capture data.
| Test | Species of sample | Trimmed reads (bp) [a] | Species of reference targets | Number of target loci [b] | Divergence time (Myr) [c] | The closest species with genome available |
|---|---|---|---|---|---|---|
| Test 1 | Lepisosteus osseus | 1 659 770 753 | Lepisosteus oculatus | 13 843 | 3.2 | |
| Test 2 | Boleophthalmus pectinirostris | 915 882 266 | Oreochromis niloticus | 17 688 | 128 | |

[a] The total base pairs of the reads after removing low-quality bases and adaptor sequences.
[b] Number of target loci in reference species.
[c] Divergence time between the target species and reference.
Number of recovered, accurately assembled, perfectly assembled loci, and paralogs produced using 4 pipelines in 2 tests.
| Test | Pipeline | Recovered loci (%) | Accurately assembled loci (%) | Perfectly assembled loci (%) | Paralogs |
|---|---|---|---|---|---|
| Test 1 | Assexon | 12 064 (87.2) | 9489 (68.6) | 8684 (62.7) | 0 |
| Test 1 | CP | 11 823 (85.4) | 9634 (69.6) | 9183 (66.3) | 0 |
| Test 1 | PHYLUCE | 6900 (49.8) | 5638 (40.7) | 5369 (38.8) | 0 |
| Test 1 | HybPiper | 9334 (67.4) | 7561 (54.6) | 7445 (53.8) | 3 |
| Test 2 | Assexon | 6830 (38.6) | 5783 (32.7) | 4913 (27.8) | 0 |
| Test 2 | CP | 6891 (39.0) | 5770 (32.6) | 5288 (29.9) | 0 |
| Test 2 | PHYLUCE | 2382 (13.5) | 2304 (13.0) | 2176 (12.3) | 1 |
| Test 2 | HybPiper | 4205 (30.4) | 3486 (25.2) | 3405 (24.6) | 2 |
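
The comparison quoted in the abstract (how many more loci Assexon accurately assembled than PHYLUCE and HybPiper) follows from simple subtraction on the table above. The sketch below is a minimal illustration that uses the counts and percentages exactly as printed in the table; the dictionary `ACCURATE` and the function `gain` are hypothetical names, not part of Assexon.

```python
# (count, percentage as printed in the table) of accurately assembled loci.
ACCURATE = {
    "Test 1": {
        "Assexon": (9489, 68.6),
        "CP": (9634, 69.6),
        "PHYLUCE": (5638, 40.7),
        "HybPiper": (7561, 54.6),
    },
    "Test 2": {
        "Assexon": (5783, 32.7),
        "CP": (5770, 32.6),
        "PHYLUCE": (2304, 13.0),
        "HybPiper": (3486, 25.2),
    },
}


def gain(test, other, focal="Assexon"):
    """Extra accurately assembled loci of `focal` over `other`.

    Returns (difference in counts, difference in printed percentages).
    """
    focal_n, focal_pct = ACCURATE[test][focal]
    other_n, other_pct = ACCURATE[test][other]
    return focal_n - other_n, round(focal_pct - other_pct, 1)


for test in ACCURATE:
    for other in ("PHYLUCE", "HybPiper"):
        extra, pct_points = gain(test, other)
        print(f"{test}: Assexon vs {other}: +{extra} loci, +{pct_points} percentage points")
```

With the table's values, Assexon has 3851 (test 1) and 3479 (test 2) more accurately assembled loci than PHYLUCE, and 1928 and 2297 more than HybPiper, which is close to the ranges quoted in the abstract.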
Figure 3. Number of perfectly assembled loci in different length categories for the 4 pipelines in test 1 (left) and test 2 (right).
Peak RAM usage (Gb) of various steps of each pipeline in 2 tests.
| Test | Step | Assexon | CP | PHYLUCE | HybPiper |
|---|---|---|---|---|---|
| Test 1 | Remove PCR duplication | 1.2 | 6.5 | NA | NA |
| Test 1 | Parse reads to homologous loci | 1.87 | 2.3 | NA | 1.0 |
| Test 1 | De novo assembly | 0.2 | 1.2 | 5.8 | 1.1 |
| Test 1 | Extract exons | 0.1 | 8.5 | 0.3 | 1.0 |
| Test 1 | Remove potential paralogs | 2.5 | 0.2 | NA | NA |
| Test 2 | Remove PCR duplication | 1.1 | 5.0 | NA | NA |
| Test 2 | Parse reads to homologous loci | 1.5 | 1.8 | NA | 0.4 |
| Test 2 | De novo assembly | 0.2 | 1.2 | 4.3 | 1.2 |
| Test 2 | Extract exons | 0.3 | 2.5 | 0.2 | 0.3 |
| Test 2 | Remove potential paralogs | 2.5 | 0.2 | NA | NA |
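
If the steps of a pipeline run one after another, the memory a machine needs for a whole run is set by the most demanding step. The sketch below takes the per-step values from the table above (NA entries omitted) and reports that maximum per pipeline. The sequential-execution assumption and the names `PEAK_RAM` and `overall_peak` are illustrative and not taken from the paper.

```python
# Per-step peak RAM from the table above, in the table's units; NA entries omitted.
PEAK_RAM = {
    "Test 1": {
        "Assexon": [1.2, 1.87, 0.2, 0.1, 2.5],
        "CP": [6.5, 2.3, 1.2, 8.5, 0.2],
        "PHYLUCE": [5.8, 0.3],
        "HybPiper": [1.0, 1.1, 1.0],
    },
    "Test 2": {
        "Assexon": [1.1, 1.5, 0.2, 0.3, 2.5],
        "CP": [5.0, 1.8, 1.2, 2.5, 0.2],
        "PHYLUCE": [4.3, 0.2],
        "HybPiper": [0.4, 1.2, 0.3],
    },
}


def overall_peak(test):
    """Largest per-step peak for each pipeline (assumes steps run sequentially)."""
    return {pipeline: max(values) for pipeline, values in PEAK_RAM[test].items()}


for test in PEAK_RAM:
    print(test, overall_peak(test))
```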
Total CPU time (min) of various steps of each pipeline in 2 tests.
| Test | Step | Assexon | CP | PHYLUCE | HybPiper |
|---|---|---|---|---|---|
| Test 1 | Remove PCR duplication | 4 | 10 | NA | NA |
| Test 1 | Parse reads to homologous loci | 358 | 1435 | NA | 1142 |
| Test 1 | De novo assembly | 279 | 1831 | 2515 | 775 |
| Test 1 | Extract exons | 6 | 447 | 2 | 45 |
| Test 1 | Remove potential paralogs | 130 | 2320 | NA | NA |
| Test 1 | Total time | 777 | 6043 | 2517 | 1962 |
| Test 2 | Remove PCR duplication | 2 | 5 | NA | NA |
| Test 2 | Parse reads to homologous loci | 258 | 1441 | NA | 646 |
| Test 2 | De novo assembly | 230 | 1598 | 1409 | 341 |
| Test 2 | Extract exons | 4 | 178 | 2 | 20 |
| Test 2 | Remove potential paralogs | 92 | 2800 | NA | NA |
| Test 2 | Total time | 586 | 6022 | 1411 | 1007 |
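
The "Total time" rows are the sums of the per-step CPU times, and the relative speed of two pipelines is the ratio of their totals. The sketch below recomputes both from the per-step values in the table above (NA entries omitted); `CPU_MIN`, `total`, and `speedup` are hypothetical names introduced for illustration.

```python
# Per-step CPU time in minutes from the table above; NA entries omitted.
CPU_MIN = {
    "Test 1": {
        "Assexon": [4, 358, 279, 6, 130],
        "CP": [10, 1435, 1831, 447, 2320],
        "PHYLUCE": [2515, 2],
        "HybPiper": [1142, 775, 45],
    },
    "Test 2": {
        "Assexon": [2, 258, 230, 4, 92],
        "CP": [5, 1441, 1598, 178, 2800],
        "PHYLUCE": [1409, 2],
        "HybPiper": [646, 341, 20],
    },
}


def total(test, pipeline):
    """Total CPU minutes of one pipeline in one test."""
    return sum(CPU_MIN[test][pipeline])


def speedup(test, other, focal="Assexon"):
    """How many times longer `other` ran than `focal` in the given test."""
    return total(test, other) / total(test, focal)


for test in CPU_MIN:
    for other in ("CP", "PHYLUCE", "HybPiper"):
        print(f"{test}: {other}/Assexon = {speedup(test, other):.2f}")
```

The recomputed sums match the "Total time" rows, and the CP/Assexon ratio is about 7.8 in test 1 and about 10.3 in test 2.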
Figure 4. Length distribution of flanking sequences generated by 4 pipelines in test 1 (left) and test 2 (right).