Joanna Moreton, Stephen P Dunham, Richard D Emes.
Abstract
For vertebrate organisms where a reference genome is not available, de novo transcriptome assembly offers cost-effective insight into tissue-specific or differentially expressed genes and into variation in the coding part of the genome. However, because a number of different tools and parameters can be used to reconstruct transcripts, it is difficult to determine an optimal method. Here we suggest a pipeline based on (1) assessing the performance of three different assembly tools, (2) using both single and multiple k-mer (MK) approaches, (3) examining the influence of the number of reads used in the assembly, and (4) merging assemblies from different tools. We use an example dataset from the vertebrate Anas platyrhynchos domestica (Pekin duck). We find that taking a subset of the data enables a robust assembly to be produced by multiple methods without the need for very high memory capacity. The use of reads mapped back to transcripts (RMBT) and CEGMA (Core Eukaryotic Genes Mapping Approach) provides useful metrics for determining the completeness of the assembly obtained. For this dataset, the use of MK generated a more complete assembly, as measured by a greater number of RMBT and a higher CEGMA score. Merged single k-mer assemblies are generally smaller but consist of longer transcripts, suggesting an assembly with fewer fragmented transcripts. We suggest that using a subset of reads during assembly allows relatively rapid investigation of assembly characteristics and can guide the user to the most appropriate transcriptome for a particular downstream use. Transcriptomes generated by the compared assembly methods and the final merged assembly are freely available for download at http://dx.doi.org/10.6084/m9.figshare.1032613.
Keywords: Illumina; RNA-seq; assembly; de novo transcriptome; high-throughput sequencing
Year: 2014 PMID: 25009556 PMCID: PMC4070175 DOI: 10.3389/fgene.2014.00190
Source DB: PubMed Journal: Front Genet ISSN: 1664-8021 Impact factor: 4.599
The effect of trimming, duplicate read removal, and filtering on the number of reads.
| Sequences in pairs | 411,488,930 | 403,760,298 | 287,704,384 | 274,607,074 |
| Orphans | 0 | 3,592,413 | 2,944,998 | 2,393,609 |
| Sum | | | 290,649,382 | 277,000,683 |
The number of reads is shown at each stage (bp, base pair).
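The Sum row can be checked directly from the pair and orphan counts. The following is a minimal sketch in Python, assuming the four columns are successive processing stages; the column headers are not present in this copy of the table, so the stages are simply numbered 1 to 4 here.

```python
# Read counts copied from the table above, one value per column.
# Assumption: columns are successive processing stages; their names are not
# given in this copy, so they are indexed 1-4.
pairs = [411_488_930, 403_760_298, 287_704_384, 274_607_074]
orphans = [0, 3_592_413, 2_944_998, 2_393_609]

# Total reads at each stage = sequences in pairs + orphaned reads.
totals = [p + o for p, o in zip(pairs, orphans)]

# Fraction of the stage-1 reads that remain at each stage.
retained = [t / totals[0] for t in totals]

for stage, (total, frac) in enumerate(zip(totals, retained), start=1):
    print(f"stage {stage}: {total:,} reads ({frac:.1%} of stage 1)")
```

The totals for the last two columns reproduce the two values given in the Sum row.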
Assembly statistics for all reads and sub-sample of reads.
| Oases SK | 39 | 153,729 | 255 | 1662 | 3659 | 92 | 64,243 | n/d |
| Oases MK | | 1,208,328 | 2601 | 2153 | 3487 | 96 | 763,694 | n/d |
| CLC SK | 25 | 220,829 | 145 | 657 | 877 | 86 | 32,545 | n/d |
| CLC MK | | 201,432 | 210 | 1042 | 1707 | 95 | 62,038 | n/d |
| Oases SK | 23 | 78,640 | 125 | 1588 | 3144 | 90 | 34,850 | 87.5/98.8 |
| Oases MK | | 507,954 | 1014 | 1996 | 3068 | 94 | 319,913 | 92.7/99.6 |
| CLC SK | 25 | 97,375 | 76 | 781 | 1346 | 87 | 16,121 | 79.8/94.0 |
| CLC MK | | 144,789 | 190 | 1315 | 2635 | 95 | 51,516 | 94.8/99.2 |
| ABySS SK | 35 | 53,368 | 47 | 878 | 1439 | 59 | 12,936 | 33.1/65.3 |
| ABySS MK | | 89,457 | 108 | 1204 | 2158 | 87 | 32,797 | 83.5/96.8 |
Abbreviations: bp, base pair; Mbp, megabase pair; kbp, kilobase pair; MK, multiple k-mer; SK, single k-mer; RMBT, reads mapped back to transcripts. Only non-redundant contigs >200 bp were assessed. CEGMA, percentage of complete and partial conserved genes identified using the CEGMA tool.
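For readers unfamiliar with the summary metrics, the sketch below shows one common way to compute N50 and an RMBT percentage in Python. It is an illustration, not the authors' code: the function names and the toy inputs are hypothetical, and only the >200 bp length cut-off is taken from the legend above.

```python
def n50(lengths, min_length=200):
    """N50 of contig lengths, counting only contigs longer than min_length bp."""
    kept = sorted((n for n in lengths if n > min_length), reverse=True)
    if not kept:
        return 0
    half = sum(kept) / 2
    running = 0
    # Walk from the longest contig down; N50 is the length of the contig
    # at which the running total first reaches half of the assembly length.
    for n in kept:
        running += n
        if running >= half:
            return n


def rmbt_percent(mapped_reads, total_reads):
    """Percentage of reads that map back to the assembled transcripts."""
    return 100.0 * mapped_reads / total_reads


# Toy values for illustration only (not taken from the paper).
toy_lengths = [150, 300, 500, 800, 1200, 2000]
print(n50(toy_lengths))          # the 150 bp contig is excluded by the cut-off
print(rmbt_percent(900, 1000))
```

With the toy lengths, the retained contigs total 4800 bp, so N50 is the length at which the running sum first reaches 2400 bp, which is 1200.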
Figure 1. N50 values and assembly length in base pairs (bp) for every ABySS .
Merged assembly statistics.
| 3 SK | 48,302 | 25,573 | 63 | 2463 | 4006 | 77 | 17,101 | 79.8/88.7 |
| 6 SK | 40,805 | 24,834 | 67 | 2689 | 4155 | 84 | 18,104 | 81.9/89.1 |
Abbreviations: bp, base pair; Mbp, megabase pair; kbp, kilobase pair; SK, single k-mer; RMBT, reads mapped back to transcripts. Only non-redundant contigs >200 bp were assessed. CEGMA, percentage of complete and partial conserved genes identified using the CEGMA tool. The assemblies were generated from the sub-sample of 30 M reads. Three SK assemblies: Oases k = 23, CLC k = 25, and ABySS k = 35. Six SK assemblies: Oases k = 23 and k = 79, CLC k = 25 and k = 61, and ABySS k = 35 and k = 79. Three SK robust contigs = contigs that contained contigs from all original assemblies. Six SK robust contigs = contigs assembled by three different tools but not necessarily from all six assemblies.
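The two definitions of a robust contig can be expressed as simple set tests. The sketch below is a hypothetical Python illustration: it assumes each merged contig is annotated with labels of the source assemblies that contributed to it, written as `tool_k` (for example `oases_k23`). That labelling scheme, the function names, and the toy contigs are assumptions made for the example, not part of the described pipeline.

```python
# The six single k-mer assemblies listed in the legend, with hypothetical labels.
SIX_SK = {"oases_k23", "oases_k79", "clc_k25", "clc_k61", "abyss_k35", "abyss_k79"}
THREE_SK = {"oases_k23", "clc_k25", "abyss_k35"}


def tools_in(sources):
    """Set of tool names found in source labels such as 'oases_k23'."""
    return {label.split("_")[0] for label in sources}


def is_robust_three_sk(sources):
    """Three SK rule: the merged contig contains contigs from all original assemblies."""
    return THREE_SK <= set(sources)


def is_robust_six_sk(sources, n_tools=3):
    """Six SK rule: contributed to by three different tools, not necessarily all six assemblies."""
    return set(sources) <= SIX_SK and len(tools_in(sources)) >= n_tools


# Toy merged contigs (hypothetical) mapped to the assemblies that contributed to them.
merged = {
    "contig_a": {"oases_k23", "clc_k25", "abyss_k35"},
    "contig_b": {"oases_k23", "oases_k79", "clc_k61"},
    "contig_c": {"oases_k79", "clc_k61", "abyss_k79"},
}

robust_three = [name for name, src in merged.items() if is_robust_three_sk(src)]
robust_six = [name for name, src in merged.items() if is_robust_six_sk(src)]
print(robust_three, robust_six)
```

In this toy set, `contig_b` fails both rules because only two tools contributed to it, while `contig_c` passes the six SK rule without drawing on any of the three SK assemblies.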