Kaitlin Clarke, Yi Yang, Ronald Marsh, Linglin Xie, Ke K Zhang.
Abstract
The rapid development of next-generation sequencing technology presents a major computational challenge for data processing and analysis. The de Bruijn graph, a fast algorithmic approach, has been used successfully for de novo assembly of genomic DNA; nevertheless, its performance for transcriptome assembly is unclear. In this study, we used both simulated and real RNA-Seq data, from either artificial RNA templates or human transcripts, to evaluate five de novo assemblers: ABySS, Mira, Trinity, Velvet and Oases. Of these, ABySS, Trinity, Velvet and Oases are based on the de Bruijn graph, whereas Mira uses an overlap graph algorithm. Various numbers of short RNA reads were selected from the External RNA Control Consortium (ERCC) data and human chromosome 22. A number of statistics were then calculated for the contigs produced by each assembler. Each experiment was repeated multiple times to obtain the mean of each statistic and an estimate of its standard error. Trinity performed relatively well on both ERCC and human data, but it may not consistently generate full-length transcripts. ABySS was the fastest method, but its assembly quality was low. Mira gave a good rate of mapping its contigs onto human chromosome 22, but its computational speed was not satisfactory. Our results suggest that transcript assembly remains a challenging problem for the bioinformatics community. A novel assembler is therefore needed for assembling transcriptome data generated by next-generation sequencing.
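To make the de Bruijn graph idea in the abstract concrete, the following is a minimal Python sketch, not the implementation used by any of the evaluated assemblers. It assumes error-free reads on a single strand; the function names `kmers`, `de_bruijn_graph` and `share_kmer` and the example reads are hypothetical. Each k-mer in a read becomes an edge from its (k-1)-mer prefix to its (k-1)-mer suffix, so two sequences that share a k-mer also share an edge in the graph.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return every substring of length k in seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def de_bruijn_graph(reads, k):
    """Build a toy de Bruijn graph.

    Nodes are (k-1)-mers; each k-mer found in a read adds an edge
    from its prefix (first k-1 bases) to its suffix (last k-1 bases).
    Assumption: reads are error-free and reverse complements are ignored.
    """
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


def share_kmer(seq_a, seq_b, k):
    """Return True if the two sequences have at least one k-mer in common."""
    return not set(kmers(seq_a, k)).isdisjoint(kmers(seq_b, k))


# Hypothetical reads, for illustration only.
example_graph = de_bruijn_graph(["ACGTAC", "GTACGG"], k=3)
```

In this toy example the node `CG` has two outgoing edges, to `GT` and to `GG`. Branches of this kind are what an assembler has to resolve when different transcripts share k-mers.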
Year: 2013 PMID: 23393031 PMCID: PMC5778448 DOI: 10.1007/s11427-013-4444-x
Source DB: PubMed Journal: Sci China Life Sci ISSN: 1674-7305 Impact factor: 6.038
Figure 1. The numbers of transcript pairs that have shared k-mers. The total number of transcript pairs is 10000. The error bars show the standard errors.
Figure 2. The distribution of transcript lengths. The histogram shows the lengths of 337 transcripts that are encoded in human chromosome 22 between position 35000000 and 40000000.
Figure 3. Comparison of assembly statistics for simulated transcriptome data. The x axis indicates the number of short reads generated for each transcript. A, The number of contigs from each assembler. B, N50 statistics. C, The percentage of short reads that can be mapped to the contigs. D, The percentage of contigs that can be mapped to the original 337 transcripts.
Figure 4. Comparison of assembly statistics for ERCC data. The x axis indicates the total number of selected short reads for assembly. A, The number of contigs from each assembler. B, N50 statistics. C, The percentage of short reads that can be mapped to the contigs. D, The percentage of contigs that can be mapped to the 10 RNA templates.
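The figures report N50 statistics and standard errors over repeated runs. As a minimal sketch of how such summary values are commonly computed (the record does not state which scripts the authors used), the Python below uses the conventional definition of N50: the contig length at which the contigs of that length or longer contain at least half of all assembled bases. The function names `n50` and `mean_and_standard_error` and the example values are hypothetical.

```python
import math
import statistics


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Contigs are visited from longest to shortest; the N50 is the length
    of the contig at which the running total first reaches half of the
    total assembled bases.
    """
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("contig_lengths must contain at least one contig")


def mean_and_standard_error(values):
    """Return (mean, standard error of the mean) for repeated measurements.

    The standard error is the sample standard deviation divided by the
    square root of the number of repeats; at least two values are needed.
    """
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem


# Hypothetical contig lengths from one assembly run, for illustration only.
example_n50 = n50([2, 3, 4, 5, 6])

# Hypothetical N50 values from three repeated runs.
example_mean, example_sem = mean_and_standard_error([2, 4, 6])
```

For the hypothetical lengths above, the total is 20 bases, and the two longest contigs (6 and 5) together reach 11 bases, so the N50 is 5.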