Chien-Chih Chen1, Wen-Dar Lin2, Yu-Jung Chang3, Chuen-Liang Chen1, Jan-Ming Ho3.
Abstract
Background. The emergence of next-generation sequencing platforms has given rise to a new generation of assembly algorithms. Compared with Sanger sequencing data, next-generation sequencing data have shorter reads, higher coverage depth, and different error profiles. These features pose new challenges for de novo transcriptome assembly.

Methodology. To explore how these features affect assembly algorithms, we studied the relationship between read overlap size, coverage depth, and error rate using simulated data. Based on this relationship, we propose a de novo transcriptome assembly procedure called Euler-mix and demonstrate its performance on a real mouse transcriptome dataset. The simulation tool and evaluation tool are freely available as open source.

Significance. Euler-mix is a straightforward pipeline that focuses on handling the variation in coverage depth across a short-read dataset. The experimental results show that Euler-mix improves the performance of de novo transcriptome assembly.
Year: 2012 PMID: 25969752 PMCID: PMC4417554 DOI: 10.5402/2012/816402
Source DB: PubMed Journal: ISRN Bioinform ISSN: 2090-7338
Figure 1. The relationship between optimum k and coverage depth.
Figure 2. The relationship between optimum k and coverage depth for one transcriptome sequence at different error rates.
Figure 3. Overview of the Euler-mix procedure.
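
The paper's own relationship between optimum k and coverage depth is not reproduced in this record. As general background only, the sketch below uses the rule of thumb from the Velvet documentation that k-mer coverage equals base coverage scaled by (L - k + 1) / L, where L is the read length. The function name, the 75 bp read length, and the 30x base coverage are illustrative assumptions, not values taken from the paper.

```python
def kmer_coverage(base_coverage: float, read_length: int, k: int) -> float:
    """Expected k-mer coverage for a given base coverage (Velvet rule of thumb).

    Each read of length L contains L - k + 1 k-mers, so larger k values
    reduce the effective coverage seen by a de Bruijn graph assembler.
    """
    if not 0 < k <= read_length:
        raise ValueError("k must be between 1 and the read length")
    return base_coverage * (read_length - k + 1) / read_length


# Hypothetical inputs: 75 bp reads at 30x base coverage.
for k in (17, 21, 25, 29, 33):
    print(k, round(kmer_coverage(30.0, 75, k), 2))
```

Under this rule a transcript with low expression loses usable coverage quickly as k grows, which is one way to see why a single k may not suit a dataset with widely varying coverage depth.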
Evaluation of assemblies of simulated mouse data produced by Velvet with different k-mer sizes, compared with Euler-mix.
| Metric | k = 17 | k = 19 | k = 21 | k = 23 | k = 25 | k = 27 | k = 29 | k = 31 | k = 33 | k = 35 | Euler-mix |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Precision (overlap) | 99.93% | 99.98% | 99.99% | 100.00% | 99.99% | 99.99% | 99.99% | 99.99% | 99.98% | 99.99% | 99.99% |
| Recall (overlap) | 9.97% | 74.26% | 80.27% | 79.39% | 77.44% | 74.22% | 69.53% | 60.53% | 42.08% | 7.95% | 85.32% |
| F-measure (overlap) | 18.14% | 85.22% | 89.05% | 88.51% | 87.29% | 85.20% | 82.02% | 75.41% | 59.24% | 14.73% | 92.08% |
| Precision (consistent) | 96.59% | 81.29% | 93.71% | 95.01% | 94.50% | 93.57% | 91.86% | 89.02% | 85.28% | 85.05% | 96.31% |
| Recall (consistent) | 8.13% | 50.28% | 60.73% | 60.55% | 58.51% | 55.28% | 50.29% | 42.22% | 27.58% | 4.25% | 64.71% |
| F-measure (consistent) | 15.00% | 62.13% | 73.70% | 73.96% | 72.27% | 69.50% | 65.00% | 57.27% | 41.68% | 8.09% | 77.41% |
| Number of contigs >= 100 bp | 48236 | 170563 | 80166 | 65348 | 62817 | 63982 | 68447 | 76573 | 75896 | 15378 | 48183 |
| Mean size (bp) | 125.28 | 261.08 | 582.66 | 701.67 | 708.77 | 663.93 | 575.11 | 445.32 | 306.37 | 233.43 | 1001.79 |
| Largest contig (bp) | 413 | 3272 | 11696 | 27350 | 45270 | 81929 | 81794 | 27866 | 14648 | 8312 | 81929 |
| N50 (bp) | 121 | 311 | 1112 | 1697 | 1949 | 2011 | 1798 | 1230 | 485 | 250 | 2783 |
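
The unlabeled percentage row that follows each precision and recall pair in the table is consistent with the standard F-measure, the harmonic mean of precision and recall. The sketch below assumes that definition (the record itself does not state the formula) and checks it against the k = 21 overlap column; the function name is illustrative.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall.

    Both inputs must use the same scale (both fractions or both percentages);
    the result is on that same scale.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Overlap measures for Velvet at k = 21, taken from the table (in percent).
print(round(f_measure(99.99, 80.27), 2))
```

Recomputing other columns from the rounded precision and recall values can differ from the tabulated F-measure by about 0.01 percentage points, which is expected if the published figures were derived from unrounded values.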
Evaluation of assemblies of a real mouse transcriptome dataset produced by Velvet, compared with Euler-mix.
| Assembly pipeline | Number of contigs >= 100 bp | Mean size (bp) | Largest contig (bp) | N50 (bp) | Number of transcripts detected | Precision (overlap) | Recall (overlap) | F-measure (overlap) | Precision (consistent) | Recall (consistent) | F-measure (consistent) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Velvet (75–21) Velvet (39) | 80644 | 336.32 | 6668 | 477 | 10646 | 96.44% | 39.97% | 56.52% | 89.82% | 22.30% | 35.73% |
| Velvet (33) | 106546 | 191.41 | 2375 | 202 | 2544 | 97.59% | 28.27% | 43.85% | 86.23% | 16.22% | 27.31% |
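
Both tables report N50 alongside contig counts. As a reminder of what that statistic measures, here is a minimal sketch of the usual N50 definition: the contig length at which the running total of lengths, taken from longest to shortest, first reaches half of the assembly size. The 100 bp cutoff mirrors the "contigs >= 100 bp" row, but whether the evaluation tool applies the cutoff before computing N50 is an assumption, and the contig lengths are toy values.

```python
from typing import Iterable


def n50(contig_lengths: Iterable[int], min_length: int = 100) -> int:
    """Return the N50 of the contigs that are at least min_length bp long."""
    kept = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    if not kept:
        raise ValueError("no contigs at or above the length cutoff")
    total = sum(kept)
    running = 0
    for length in kept:
        running += length
        # Stop at the first contig that brings the running sum to half the total.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable: the running sum always reaches the total")


# Toy contig lengths in bp (hypothetical, not from the paper).
print(n50([80, 100, 200, 300, 400]))
```

Because N50 is weighted by length, an assembly with many short fragments can have a high contig count and a low N50 at the same time, which is the pattern the Velvet (33) row shows.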
Figure 4. Histogram of the coverage depths (expression levels) of the 26,332 mouse transcripts.
Figure 5. Illustration of the overlap and consistent measures.