Yu-Long Li, Dong-Xiu Xue, Bai-Dong Zhang, Jin-Xian Liu.
Abstract
Restriction site-associated DNA (RAD) sequencing is revolutionizing studies in ecological, evolutionary and conservation genomics. However, the assembly of paired-end RAD reads with random-sheared ends is still challenging, especially for non-model species with high genetic variance. Here, we present an efficient, optimized approach implemented in a software pipeline, RADassembler, which makes full use of paired-end RAD reads with random-sheared ends from multiple individuals to assemble RAD contigs. RADassembler integrates algorithms for choosing the optimal number of mismatches within and across individuals at the clustering stage, and then uses a two-step assembly approach at the assembly stage. RADassembler also uses data reduction and parallelization strategies to improve efficiency. In comparisons with other tools on both simulated and real RAD datasets, RADassembler assembled an appropriate number of high-quality contigs, and more read pairs were properly mapped to the assembled contigs. This approach provides an optimal tool for dealing with the complexity of assembling paired-end RAD reads with random-sheared ends for non-model species in ecological, evolutionary and conservation studies. RADassembler is available at https://github.com/lyl8086/RADscripts.
Keywords: RAD-seq; optimal clustering; optimized assembly; overlapping paired-end sequencing; pipeline software
Year: 2018 PMID: 29515871 PMCID: PMC5830760 DOI: 10.1098/rsos.171589
Source DB: PubMed Journal: R Soc Open Sci ISSN: 2054-5703 Impact factor: 2.963
Figure 1. Flow chart for the two-step assembly approach on RPE reads. (i) The first reads (the forward reads with enzyme cut sites) were clustered. (ii) The second reads (the reverse reads with random-sheared ends) were sorted into separate files accordingly (each locus, represented by a different colour, contained reads from multiple individuals). Reads were assembled by a two-step assembly strategy: (iii) in the first step, the second reads were locally assembled into contigs and merged with the corresponding consensus sequences of the first reads; (iv) in the second step, the merged files were locally assembled again into the final RAD contigs. If the contigs of the second reads do not overlap with the consensus sequences, ten ‘N’s are padded between them (locus in blue).
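The merge-or-pad rule in the caption can be illustrated with a minimal sketch. This is not RADassembler's own code: the function name `merge_with_padding`, the `min_overlap` parameter and the exact-match overlap test are assumptions made for illustration, and both sequences are assumed to be on the same strand already. A real pipeline would rely on a local assembler that tolerates mismatches.

```python
def merge_with_padding(consensus: str, contig: str,
                       min_overlap: int = 10, pad: str = "N" * 10) -> str:
    """Join a first-read consensus to a second-read contig (hypothetical helper).

    If a suffix of `consensus` exactly matches a prefix of `contig` over at
    least `min_overlap` bases, the two are merged on that overlap. Otherwise
    ten 'N's are placed between them, as described in the caption.
    """
    longest = min(len(consensus), len(contig))
    # Try the longest candidate overlap first, then shorter ones.
    for k in range(longest, min_overlap - 1, -1):
        if consensus[-k:] == contig[:k]:
            return consensus + contig[k:]
    # No acceptable overlap: pad with ten 'N's.
    return consensus + pad + contig


if __name__ == "__main__":
    # Toy sequences, invented for illustration only.
    print(merge_with_padding("ACGTACGTTTGACCA", "TTGACCAGGCT", min_overlap=5))
    print(merge_with_padding("AAAAAA", "CCCCCC", min_overlap=3))
```

With the toy inputs, the first call merges on the shared `TTGACCA` and the second falls back to padding.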
Figure 2. The selection of the optimal number of mismatches within (a) and across (b) individuals on simulation datasets. Reads from each individual were grouped into loci by ustacks, and loci were grouped together across individuals by cstacks to build the catalogue. The optimal number of mismatches within individuals (ustacks) was chosen to maximize the number of loci with two alleles (Y-axis on the left) while minimizing the number of loci with one allele. In this case, six mismatches is an appropriate value for ustacks. For cstacks, the optimal number of mismatches across individuals was chosen at the point of inflection, where the number of incremental loci (Y-axis on the right) for each merging individual (X-axis on the right) changed little between mismatch thresholds (represented by different line types). In this case, four mismatches is an appropriate value for cstacks.
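The two selection rules in this caption can be sketched as follows. This is one possible reading, not the pipeline's actual implementation: the function names, the tie-breaking order and the 5% tolerance used to decide that the curve "changed little" are all assumptions, and the counts are toy numbers invented for illustration, not values read from Figure 2.

```python
from typing import Dict, List, Tuple


def pick_within_mismatch(counts: Dict[int, Tuple[int, int]]) -> int:
    """counts maps mismatch value -> (loci with two alleles, loci with one allele)."""
    # Prefer the most two-allele loci, then the fewest one-allele loci,
    # then the smaller mismatch value (assumed tie-breaking order).
    return max(counts, key=lambda m: (counts[m][0], -counts[m][1], -m))


def pick_across_mismatch(new_loci: Dict[int, List[int]], tol: float = 0.05) -> int:
    """new_loci maps mismatch value -> incremental loci per merged individual."""
    thresholds = sorted(new_loci)
    totals = {n: sum(new_loci[n]) for n in thresholds}
    for n, nxt in zip(thresholds, thresholds[1:]):
        # Relative change in incremental loci when moving to the next threshold.
        change = abs(totals[n] - totals[nxt]) / max(totals[n], 1)
        if change <= tol:
            return n  # the curve has flattened: treat this as the inflection
    return thresholds[-1]


if __name__ == "__main__":
    # Toy counts, invented for illustration only.
    within = {1: (1200, 9000), 2: (2100, 7600), 3: (2600, 7000),
              4: (2550, 7050), 5: (2400, 7300)}
    across = {1: [900, 700, 600], 2: [600, 450, 380],
              3: [500, 370, 300], 4: [490, 365, 296]}
    print(pick_within_mismatch(within), pick_across_mismatch(across))
```

In practice the authors inspect the curves visually; a fixed tolerance is only a convenient stand-in for that judgement.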
Figure 3. The selection of the optimal number of mismatches within (a) and across (b) individuals on real datasets (L. polyactis). The optimal number of mismatches within individuals should be 3 (a), and the optimal number of mismatches across individuals should be 3 (b), although more liberal values might be appropriate.
Assembly statistics of the four tools on simulation datasets. The comparison statistics are (from left to right): number of clusters (loci) assembled, number of clusters that mapped to the reference genome (Identical Clusters), N50 (bp), mean contig length (Mean, bp), total coverage (Total Cov, bp), bases identical to the reference genome (Identical Cov, bp), bases identical to the reference genome as a proportion of the total coverage (Cov Ratio), mean identity of the clusters that mapped to the reference genome (Mean Identity), total mapping rate of the read pairs (Total Mapped), and proper mapping rate of the read pairs (Proper Paired).
| Tool (simulation datasets) | No. clusters | Identical Clusters | N50 (bp) | Mean (bp) | Total Cov (bp) | Identical Cov (bp) | Cov Ratio (%) | Mean Identity (%) | Total Mapped (%) | Proper Paired (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| RADassembler | 29 533 | 29 521 | 698 | 661 | 19 633 933 | 19 401 812 | 98.82 | 98.78 | 99.94 | 98.60 |
| Stacks | 8717 | 8491 | 617 | 480 | 4 889 191 | 4 774 282 | 97.65 | 98.25 | 37.00 | 11.12 |
| Rainbow | 154 410 | 154 375 | 696 | 695 | 107 402 013 | 102 362 273 | 95.31 | 97.72 | 99.97 | 95.05 |
| dDocent | 20 248 | 20 222 | 262 | 261 | 5 297 706 | 4 884 498 | 92.20 | 98.67 | 77.66 | 36.62 |
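As a worked check of two of the statistics above, the sketch below recomputes Cov Ratio as Identical Cov divided by Total Cov, using the values in the simulation table, and shows the usual definition of N50 (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half the total). The function names are illustrative, and it is an assumption that the paper computed N50 with this standard definition; the contig lengths passed to `n50` are toy values.

```python
def cov_ratio(identical_cov: int, total_cov: int) -> float:
    """Identical bases as a percentage of total coverage, to two decimals."""
    return round(100.0 * identical_cov / total_cov, 2)


def n50(lengths):
    """Standard N50: contig length at which the running total, over contigs
    sorted from longest to shortest, first reaches half of the total length."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    raise ValueError("no contigs given")


# (Total Cov, Identical Cov) in bp, copied from the simulation table.
SIMULATION = {
    "RADassembler": (19_633_933, 19_401_812),
    "Stacks": (4_889_191, 4_774_282),
    "Rainbow": (107_402_013, 102_362_273),
    "dDocent": (5_297_706, 4_884_498),
}

if __name__ == "__main__":
    for tool, (total, identical) in SIMULATION.items():
        print(tool, cov_ratio(identical, total))
    print(n50([100, 200, 300, 400]))  # toy contig lengths
```

Recomputing the ratio from the two coverage columns reproduces the Cov Ratio column of the simulation table for all four tools.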
Figure 4. Length distribution of contigs assembled by the four tools on simulation datasets. Program versions: Stacks 1.48, Rainbow 2.04, dDocent 2.2.20.
Assembly statistics of the four tools on real datasets of L. polyactis. The comparison statistics are the same as those used for the simulation datasets.
| Tool (real datasets) | No. clusters | Identical Clusters | N50 (bp) | Mean (bp) | Total Cov (bp) | Identical Cov (bp) | Cov Ratio (%) | Mean Identity (%) | Total Mapped (%) | Proper Paired (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| RADassembler | 303 929 | 298 866 | 539 | 511 | 157 941 578 | 144 559 665 | 91.53 | 95.85 | 98.98 | 95.99 |
| Stacks | 460 525 | 451 478 | 344 | 282 | 181 151 234 | 164 990 228 | 91.08 | 96.27 | 87.89 | 49.16 |
| Rainbow | 330 584 | 325 298 | 579 | 550 | 182 080 648 | 160 647 819 | 88.23 | 95.47 | 92.34 | 85.47 |
| dDocent | 183 763 | 179 928 | 262 | 271 | 49 820 676 | 43 255 514 | 86.82 | 96.70 | 83.44 | 61.62 |
Figure 5. Length distribution of contigs assembled by the four tools on real datasets (L. polyactis). Program versions: Stacks 1.48, Rainbow 2.04, dDocent 2.2.20.