Satshil B Rana, Frank J Zadlock, Ziping Zhang, Wyatt R Murphy, Carolyn S Bentivegna.
Abstract
BACKGROUND: De novo assembly of non-model organisms' transcriptomes has been on the rise, in concert with the number of de novo transcriptome assembly software programs. There is a knowledge gap as to which assembler or k-mer strategy is best for constructing an optimal de novo assembly. Additionally, there is a lack of consensus on which evaluation metrics should be used to assess the quality of de novo transcriptome assemblies.

RESULT: Six assembly strategies were evaluated using four assemblers. Trinity was run with its default single k-mer value of 25, while Bridger, Oases, and SOAPdenovo-Trans were run with multiple k-mer strategies. Bridger, Oases, and SOAPdenovo-Trans used a small multiple k-mer (SMK) strategy consisting of k-mer lengths of 21, 25, 27, 29, 31, and 33. Additionally, Oases and SOAPdenovo-Trans were run with a large multiple k-mer (LMK) strategy consisting of k-mer lengths of 25, 35, 45, 55, 65, 75, and 85. Eleven metrics were used to evaluate each assembly strategy: three genome-related evaluation metrics (contig number, N50 length, and contigs >1 kb) and eight transcriptome evaluation metrics (reads mapped back to transcripts (RMBT), number of full-length transcripts, number of open reading frames, Detonate RSEM-EVAL score, and percent alignment to the southern platyfish, Amazon molly, BUSCO, and CEGMA databases). Ranked by the number of evaluation metrics for which a strategy placed in the top three, the Bridger assembly performed best (10 of 11), followed by the Oases SMK assembly (8 of 11), the Oases LMK assembly (6 of 11), the Trinity assembly (4 of 11), the SOAP LMK assembly (4 of 11), and the SOAP SMK assembly (3 of 11).
Year: 2016 PMID: 27054874 PMCID: PMC4824410 DOI: 10.1371/journal.pone.0153104
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Statistics of the raw reads after Illumina sequencing and processing.
| Statistic | Killifish |
|---|---|
| Number of nucleotide bases | 1,033,683,143 |
| Number of raw reads | 7,197,900 |
| Number of clean reads for assembly | 6,349,606 |
| Percent of used reads for assembly | 88.21% |
Statistics of the Assemblies.
| Metric | SOAP LMK | SOAP SMK | Oases LMK | Oases SMK | Bridger | Trinity |
|---|---|---|---|---|---|---|
| Contig Number | 198,085 | 187,104 | 99,567 | 135,312 | 303,906 | 180,658 |
| N50 Length (bp) | 917 | 1,042 | 1,676 | 1,743 | 1,668 | 1,189 |
| Minimum Contig Length (bp) | 200 | 200 | 200 | 200 | 201 | 224 |
| Largest Contig Length (bp) | 11,658 | 11,658 | 16,019 | 21,489 | 15,151 | 11,023 |
| Average Contig Length (bp) | 622 | 672 | 1,021 | 1,035 | 879 | 711 |
| Contigs >1 kb | 33,326 | 36,136 | 33,847 | 45,992 | 81,769 | 35,527 |
| RMBT | 83.78% | 80.77% | 85.28% | 84.18% | 87.71% | 82.96% |
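As a reading aid for the N50 and contigs >1 kb rows, the following is a minimal Python sketch of how these two statistics are commonly computed from a list of contig lengths. It is not the authors' pipeline: the function names `n50` and `contigs_over` and the toy lengths are hypothetical, and treating ">1 kb" as strictly greater than 1,000 bases is an assumption.

```python
def n50(lengths):
    """Return the N50: the length of the contig at which the running total,
    taken from longest to shortest, first reaches half of all assembled bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no contig lengths given")


def contigs_over(lengths, threshold=1000):
    """Count contigs strictly longer than `threshold` bases (assumed cutoff)."""
    return sum(1 for length in lengths if length > threshold)


# Made-up contig lengths, for illustration only (not data from the study).
toy_lengths = [200, 300, 500, 800, 1200, 2000]
print(n50(toy_lengths))           # 1200
print(contigs_over(toy_lengths))  # 2
```

Because N50 depends on the whole length distribution, an assembly with many short contigs can have a high contig count but a modest N50, so the two rows are best read together.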
Fig 1. Phylogenetic tree analysis.
A phylogenetic tree analysis of the 11 publicly available fish genomes and killifish testis. As highlighted in red, the Ensembl results showed that southern platyfish (X. maculatus) and Amazon molly (P. formosa) are the closest relatives of killifish (F. heteroclitus).
BLASTX alignments from the six different assemblies against the southern platyfish and Amazon molly databases.
| Database | SOAP LMK | SOAP SMK | Oases LMK | Oases SMK | Bridger | Trinity |
|---|---|---|---|---|---|---|
| southern platyfish | 34.60% | 35.22% | 50.14% | 47.00% | 37.88% | 32.68% |
| Amazon molly | 36.46% | 37.19% | 51.82% | 48.71% | 39.35% | 34.15% |
BLASTX alignments of the six different assemblies to the CEGMA dataset.
| Assembly | CEGs | CEGs Missing | Partials | Partials Missing |
|---|---|---|---|---|
| SOAP LMK | 98.39% | KOG0261, KOG0209, KOG0462, KOG2311 | 99.60% | KOG2311 |
| SOAP SMK | 98.39% | KOG0261, KOG0209, KOG0462, KOG2311 | 99.60% | KOG2311 |
| Oases LMK | 98.79% | KOG0292, KOG2311, KOG4392 | 100% | |
| Oases SMK | 98.79% | KOG0292, KOG2311, KOG4392 | 100% | |
| Bridger | 98.39% | KOG0261, KOG0292, KOG0969, KOG2311 | 99.60% | KOG0969 |
| Trinity | 97.58% | KOG0292, KOG0209, KOG0434, KOG0469, KOG0481, KOG2623 | 98.79% | KOG0209, KOG0292, KOG0434 |
Fig 2. BUSCO Analysis.
The Trinity assembly performed the best, having the fewest missing BUSCOs.
Fig 3. Full Length Transcript Analysis.
a) Trinity had the most proteins with greater than 70% alignment coverage. b) Trinity had the most proteins with greater than 90% alignment coverage.
Fig 4. Open Reading Frames Analysis.
The Bridger assembly produced the most open reading frames for sequences of >799 bp (red), >999 bp (blue), and >1,199 bp (green).
Detonate's RSEM-EVAL scores suggest that the Trinity assembly performed the best.
The higher the score, the better the assembly.
| Assembly | SOAP LMK | SOAP SMK | Oases LMK | Oases SMK | Bridger | Trinity |
|---|---|---|---|---|---|---|
| Score | -5,488.0 × 10^6 | -5,715.0 × 10^6 | -5,602.0 × 10^6 | -6,125.0 × 10^6 | -5,448.0 × 10^6 | -5,426.0 × 10^6 |
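Since all the scores are negative, "higher is better" means the value closest to zero ranks first. The short Python sketch below sorts the six scores from the table above to make that ordering explicit. The dictionary name `rsem_eval_scores` is hypothetical, and the values are entered in units of 10^6 as printed.

```python
# RSEM-EVAL scores copied from the table above, in units of 10^6.
rsem_eval_scores = {
    "SOAP LMK": -5488.0,
    "SOAP SMK": -5715.0,
    "Oases LMK": -5602.0,
    "Oases SMK": -6125.0,
    "Bridger": -5448.0,
    "Trinity": -5426.0,
}

# Higher (less negative) scores rank first.
ranked = sorted(rsem_eval_scores, key=rsem_eval_scores.get, reverse=True)
top_three = ranked[:3]
print(top_three)  # ['Trinity', 'Bridger', 'SOAP LMK']
```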
Summary of the top three performers for each evaluation metric category.
| Metric | First | Second | Third |
|---|---|---|---|
| Contig Number | Bridger | SOAP LMK | SOAP SMK |
| N50 Length | Oases SMK | Oases LMK | Bridger |
| Contigs >1kb | Bridger | Oases SMK | SOAP SMK |
| RMBT | Bridger | Oases LMK | Oases SMK |
| southern platyfish DB | Oases LMK | Oases SMK | Bridger |
| Amazon molly DB | Oases LMK | Oases SMK | Bridger |
| CEGMA | Oases LMK and Oases SMK | Bridger, SOAP LMK, and SOAP SMK | |
| BUSCO | Trinity | Bridger | Oases SMK |
| Full Length Transcripts | Trinity | Oases SMK | Oases LMK |
| ORFs | Bridger | Trinity | Oases SMK |
| Detonate | Trinity | Bridger | SOAP LMK |
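One way to turn the summary table into a per-strategy count of top-three placements is sketched below in Python. This is an illustrative tally, not the authors' procedure: the variable names are hypothetical, and the CEGMA row is handled by counting every strategy named in its tied positions, which is an assumption. Counts obtained under a different tie rule may differ.

```python
from collections import Counter

# Top-three placements transcribed from the summary table above.
# Assumption: every strategy named in a tied CEGMA position is counted.
top_three_by_metric = {
    "Contig Number": ["Bridger", "SOAP LMK", "SOAP SMK"],
    "N50 Length": ["Oases SMK", "Oases LMK", "Bridger"],
    "Contigs >1kb": ["Bridger", "Oases SMK", "SOAP SMK"],
    "RMBT": ["Bridger", "Oases LMK", "Oases SMK"],
    "southern platyfish DB": ["Oases LMK", "Oases SMK", "Bridger"],
    "Amazon molly DB": ["Oases LMK", "Oases SMK", "Bridger"],
    "CEGMA": ["Oases LMK", "Oases SMK", "Bridger", "SOAP LMK", "SOAP SMK"],
    "BUSCO": ["Trinity", "Bridger", "Oases SMK"],
    "Full Length Transcripts": ["Trinity", "Oases SMK", "Oases LMK"],
    "ORFs": ["Bridger", "Trinity", "Oases SMK"],
    "Detonate": ["Trinity", "Bridger", "SOAP LMK"],
}

# Count how many metrics list each strategy among the top performers.
tally = Counter(
    name for names in top_three_by_metric.values() for name in names
)

for name, count in tally.most_common():
    print(f"{name}: {count} of {len(top_three_by_metric)}")
```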