Abstract
BACKGROUND: The recent advance of high-throughput sequencing makes it feasible to study entire transcriptomes through the application of de novo sequence assembly algorithms. While a popular strategy is to first construct an intermediate de Bruijn graph structure to represent the transcriptome, an additional step is needed to construct predicted transcripts from the graph.
Year: 2014 PMID: 25082000 PMCID: PMC4120145 DOI: 10.1186/1471-2164-15-S5-S6
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Figure 1. Difference between the traditional strategy and our strategy.
Figure 2. Illustration of two successive sets of nodes that contain SNPs. The sequences within all the nodes in the second column and the sequences within all the nodes in the fourth column must be of the same length in order to contain SNPs. Note that there can be more than one SNP within each of these columns, and all these nodes will be merged into a single node. Other incoming edges that go into the starting node and other outgoing edges that go out of the final node are allowed.
Figure 3. Example of the decomposition of a connected component. Each of the three edges on the left-hand side is a strongly connected component by itself, and the subgraph containing these three edges represents the simpler region. The cycle on the right-hand side is one single strongly connected component that represents the complicated region.
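The decomposition into strongly connected components described in this caption can be illustrated with a small sketch. It uses Kosaraju's algorithm on a plain adjacency dictionary; this is an assumption chosen for illustration, since the caption does not say which SCC algorithm the authors use. The function name `strongly_connected_components` and the toy graph are hypothetical.

```python
from collections import defaultdict


def strongly_connected_components(graph):
    """Return the SCCs of a directed graph as a list of sets.

    graph maps each vertex to an iterable of its successors.
    Illustrative sketch only (Kosaraju's algorithm, iterative DFS).
    """
    vertices = set(graph)
    for successors in graph.values():
        vertices.update(successors)

    # Pass 1: record vertices in order of DFS completion.
    order, seen = [], set()
    for root in vertices:
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(graph.get(root, ())))]
        while stack:
            vertex, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the vertex is finished.
                order.append(vertex)
                stack.pop()

    # Pass 2: explore the reversed graph in decreasing finish order.
    reverse = defaultdict(list)
    for vertex, successors in graph.items():
        for nxt in successors:
            reverse[nxt].append(vertex)

    components, assigned = [], set()
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            vertex = stack.pop()
            component.add(vertex)
            for prev in reverse.get(vertex, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


# Toy graph: a linear chain (simple region) leading into a cycle (complicated region).
toy = {"a": ["b"], "b": ["c"], "c": ["x"], "x": ["y"], "y": ["z"], "z": ["x"]}
sccs = strongly_connected_components(toy)
largest_scc = max(len(component) for component in sccs)
```

In this toy graph the chain vertices each form their own component, while the cycle forms a single component of size three, which is the kind of quantity a "largest SCC" statistic would report.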
Figure 4. Example of junction adjustment.
Table 1. Statistics of the simulated transcriptome assemblies of Drosophila using its known complete genome over different values of k and k-mer coverage cutoff c with 0.1% mismatches in the reads.
| k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25_3 | 38884 | 17900 | 9937 | 15713 | 37380 | 2366 | 1361 | 3106 | 10 | 883 | 12731 | 10162 | 643 | 27 | 80,3 | 21,2 |
| 25_5 | 34822 | 16979 | 9255 | 15521 | 37380 | 2374 | 1351 | 266 | 7 | 517 | 12708 | 10160 | 643 | 27 | 80,3 | 21,2 |
| 25_10 | 34494 | 16712 | 9057 | 15486 | 37380 | 2373 | 1345 | 194 | 7 | 481 | 12699 | 10158 | 639 | 27 | 80,3 | 21,2 |
| 31_3 | 28342 | 5037 | 2080 | 13819 | 45158 | 2704 | 1719 | 1007 | 7 | 496 | 12523 | 11112 | 546 | 12 | 76,3 | 18,2 |
| 31_5 | 27307 | 4971 | 1898 | 13740 | 45158 | 2714 | 1717 | 167 | 6 | 381 | 12494 | 11110 | 552 | 13 | 76,3 | 18,2 |
| 31_10 | 27265 | 4947 | 1885 | 13829 | 45158 | 2704 | 1698 | 161 | 6 | 377 | 12536 | 11109 | 542 | 13 | 76,3 | 18,2 |
Initial nodes denotes the number of nodes that are in the initial assembly. Largest tangle denotes the number of nodes of the largest connected component. Largest SCC denotes the number of nodes of the largest strongly connected component. Splicing graphs denotes the number of splicing graphs. Max length denotes the length (in nucleotides) of the longest path over all splicing graphs. N50 denotes the N50 value of the length (in nucleotides) of the longest path in each graph. >1-node graphs denotes the number of graphs with more than one node. Max nodes denotes the maximum number of nodes in these non-linear graphs. Avg nodes denotes the average number of nodes in these non-linear graphs. SNPs denotes the number of SNPs recovered. Total hits denotes the total number of hits from translated BLAST search of each node to Drosophila (isoforms are considered the same gene, only the top hit with E-value below 10^-7 is included for each node in a splicing graph, and hits from nodes within the same splicing graph to the same gene are counted once). Unique hits denotes the number of unique hits to different genes. >1-hit graphs denotes the number of splicing graphs that have BLAST hits to more than one gene. Max hits denotes the maximum number of different genes that have BLAST hits to a splicing graph. Time (mins) denotes the computational time in minutes, with the values to the left and to the right of "," indicating the running time of Velvet and our postprocessing algorithm respectively. Memory (GB) denotes the memory requirement in gigabytes, with the values to the left and to the right of "," indicating the memory requirement of Velvet and our postprocessing algorithm respectively.
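As an aid to reading the N50 column, the following is a minimal sketch of how an N50 value can be computed from a list of lengths, such as the longest-path length of each splicing graph. It assumes the common definition of N50 (the largest length L such that items of length at least L account for at least half of the total length); the notes above do not spell out the exact definition used, and the function name `n50` is hypothetical.

```python
def n50(lengths):
    """Return the N50 of a non-empty collection of positive lengths.

    Assumes the usual definition: walk the lengths from longest to
    shortest and return the length at which the running total first
    reaches half of the overall total.
    """
    if not lengths:
        raise ValueError("lengths must be non-empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


# Hypothetical longest-path lengths (in nucleotides) for nine graphs.
example_lengths = [10, 9, 8, 7, 6, 5, 4, 3, 2]
example_n50 = n50(example_lengths)
```

Because N50 weights long sequences more heavily than a mean or median does, it is less affected by the many short graphs that a de novo assembly typically produces.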
Table 2. Statistics of the simulated transcriptome assemblies of Drosophila using its known complete genome over different values of k and k-mer coverage cutoff c with 0.2% mismatches in the reads. The notations are the same as in Table 1.
| k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 25_3 | 45305 | 23504 | 15883 | 13240 | 26909 | 2255 | 634 | 8671 | 27 | 2049 | 8258 | 6188 | 315 | 16 | 94,3 | 30,2 |
| 25_5 | 29090 | 16349 | 11411 | 11734 | 27251 | 2321 | 606 | 1832 | 11 | 337 | 8156 | 6180 | 321 | 12 | 94,3 | 30,2 |
| 25_10 | 26297 | 15235 | 10367 | 11606 | 27251 | 2329 | 595 | 165 | 8 | 257 | 8116 | 6176 | 319 | 13 | 94,3 | 30,2 |
| 31_3 | 23544 | 5604 | 2331 | 11993 | 44990 | 2536 | 583 | 1520 | 12 | 611 | 9561 | 8488 | 281 | 17 | 83,3 | 21,2 |
| 31_5 | 19869 | 4299 | 2097 | 11650 | 44990 | 2545 | 571 | 253 | 7 | 248 | 9548 | 8488 | 281 | 13 | 83,3 | 21,2 |
| 31_10 | 19541 | 4222 | 2056 | 11642 | 44990 | 2545 | 572 | 96 | 7 | 233 | 9544 | 8484 | 281 | 13 | 83,3 | 21,2 |
Figure 5. Comparisons of the protein sequence BLAST results in the simulated transcriptome assemblies of Drosophila. Sensitivity is defined to be the percentage of coding positions in the genome that are recovered in the assembly considering only Drosophila gene transcripts that are found in BLAST hits (each position that is within some coding region is counted once). Specificity is defined to be the percentage of predicted transcript positions in the assembly that are included in BLAST alignments considering only predictions that have BLAST hits.
Figure 6. Comparisons of the alternative splicing junction results in the simulated transcriptome assemblies of Drosophila. Sensitivity is defined to be the percentage of junctions in the gene transcripts of Drosophila that appear somewhere in the assembly. Specificity is defined to be the percentage of junctions in the assembly that appear somewhere in the gene transcripts of Drosophila. Junctions in the gene transcripts of Drosophila are defined by concatenating the two sequences of length k that are immediately to the left and immediately to the right of all alternative splicing locations to obtain a sequence of length 2k. Junctions in the assembly are defined by concatenating the two k-mers at the beginning and ending nodes of an edge to obtain a sequence of length 2k after the elimination of overlapping sequence fragments between adjacent nodes. Up to three mismatches are allowed when looking for these sequence occurrences.
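The junction definition in this caption can be made concrete with a short sketch. It builds the 2k-length junction around a splice location in a transcript, searches for it in assembled sequences while tolerating up to three mismatches, and reports sensitivity as a percentage. All names (`junction`, `occurs`, `junction_sensitivity`) are hypothetical, the search is a brute-force scan, and reverse-complement matches are ignored for brevity; the caption does not describe how the authors implemented the search.

```python
def junction(transcript, pos, k):
    """Return the 2k-mer spanning a splice location, or None if too close to an end.

    pos is the index of the first base to the right of the splice location,
    so the junction is the k bases before pos followed by the k bases from pos.
    """
    if pos < k or pos + k > len(transcript):
        return None
    return transcript[pos - k:pos] + transcript[pos:pos + k]


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def occurs(query, sequence, max_mismatches=3):
    """True if query matches some window of sequence within max_mismatches."""
    n = len(query)
    return any(
        hamming(query, sequence[i:i + n]) <= max_mismatches
        for i in range(len(sequence) - n + 1)
    )


def junction_sensitivity(reference_junctions, assembly_sequences, max_mismatches=3):
    """Percentage of reference junctions found somewhere in the assembly."""
    found = sum(
        any(occurs(j, s, max_mismatches) for s in assembly_sequences)
        for j in reference_junctions
    )
    return 100.0 * found / len(reference_junctions)


# Toy example with k = 4: one junction is recovered with a single mismatch.
ref = junction("AAAACCCCGGGGTTTT", pos=8, k=4)
sens = junction_sensitivity([ref, "TTTTTTTT"], ["AACCCCGGGAAA"])
```

Specificity under the caption's definition would follow the same pattern with the roles swapped: junctions taken from edges of the assembly are searched for in the reference transcripts.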
Table 3. Comparisons of the Drosophila transcriptome assemblies of our postprocessing algorithm, Oases and Trans-ABySS using six publicly available libraries over different values of k-mer coverage cutoff c.
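postprocess
| k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35_3 | 227614 | 178545 | 88094 | 75367 | 10539 | 544 | 2048 | 124 | 6 | 16703 | 38448 | 10719 | 392 | 5 | 86,18 | 22,2 |
| 35_5 | 125414 | 87895 | 41654 | 47958 | 8678 | 705 | 1720 | 93 | 6 | 11334 | 27010 | 9889 | 429 | 13 | 86,17 | 22,2 |
| 35_10 | 57978 | 31785 | 12695 | 27695 | 6383 | 705 | 1020 | 63 | 6 | 5034 | 17271 | 8070 | 308 | 5 | 86,16 | 22,2 |

Oases
| k_c | locus | max length | N50 | >1-trans locus | max trans | avg trans | total hits | unique hits | >1-hit locus | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 35_3 | 39584 | 15586 | 801 | 3824 | 13 | 3 | 29928 | 10898 | 256 | 4 | 94,28 | 29,32 |
| 35_5 | 28537 | 15586 | 936 | 2616 | 16 | 3 | 22460 | 10103 | 245 | 4 | 94,26 | 29,30 |
| 35_10 | 17075 | 11104 | 982 | 1377 | 14 | 3 | 13800 | 8201 | 185 | 5 | 94,24 | 29,26 |

Trans-ABySS
| k_c | trans | max length | N50 | >1-node trans | max nodes | avg nodes | total hits | unique hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 35_3 | 91365 | 15586 | 898 | 50467 | 60 | 8 | 33600 | 10639 | 205,1 | 4,1 |
| 35_5 | 55164 | 10582 | 997 | 27763 | 46 | 7 | 25779 | 9944 | 195,1 | 4,1 |
| 35_10 | 28455 | 8865 | 929 | 13665 | 43 | 6 | 16032 | 8154 | 178,1 | 4,1 |
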
The k-mer length is fixed to 35 because Oases is only capable of assembling these libraries on machines with 32 GB physical memory when k is large. For our postprocessing algorithm, the notations are the same as in Table 1. For Oases, locus denotes the number of predicted loci, max length denotes the length of the longest predicted transcript, N50 denotes the N50 value of the longest transcript length in a predicted locus, >1-trans locus denotes the number of predicted loci with more than one transcript, max trans denotes the maximum number of transcripts in a predicted locus, avg trans denotes the average number of transcripts in predicted loci with more than one transcript, total hits denotes the total number of hits from translated BLAST search of each predicted transcript to Drosophila (isoforms are considered the same gene, only the top hit with E-value below 10^-7 is considered for each transcript in a predicted locus, and hits from transcripts within the same predicted locus to the same gene are counted once), unique hits denotes the number of unique hits to different genes, >1-hit locus denotes the number of predicted loci that have BLAST hits to more than one gene, max hits denotes the maximum number of different genes that have BLAST hits to a predicted locus, time (mins) denotes the computational time in minutes, with the values to the left and to the right of "," indicating the running time of Velvet (without setting cov_cutoff) and Oases respectively, and memory (GB) denotes the memory requirement in gigabytes, with the values to the left and to the right of "," indicating the memory requirement of Velvet (without setting cov_cutoff) and Oases respectively.
For Trans-ABySS, trans denotes the total number of predicted transcripts, max length denotes the length of the longest predicted transcript, N50 denotes the N50 value of the length of predicted transcripts, >1-node trans denotes the number of predicted transcripts that are the concatenation of more than one node, max nodes denotes the maximum number of nodes in a predicted transcript, avg nodes denotes the average number of nodes in predicted transcripts with more than one node, total hits denotes the total number of predicted transcripts that have BLAST hits, unique hits denotes the number of unique hits to different genes, time (mins) denotes the computational time in minutes, with the values to the left and to the right of "," indicating the running time of ABySS and Trans-ABySS respectively, and memory (GB) denotes the memory requirement in gigabytes, with the values to the left and to the right of "," indicating the memory requirement of ABySS and Trans-ABySS respectively.
Figure 7. Comparisons of the protein sequence BLAST results in the . The notations are the same as in Figure 5.
Figure 8. Comparisons of the alternative splicing junction results in the . The notations are the same as in Figure 6.
Table 4. Comparisons of the Drosophila transcriptome assemblies of our postprocessing algorithm, Oases and Trans-ABySS using four publicly available libraries over different values of k and k-mer coverage cutoff c. The notations are the same as in Table 3.
postprocess
| k_c | initial nodes | largest tangle | largest SCC | splicing graphs | max length | N50 | >1-node graphs | max nodes | avg nodes | SNPs | total hits | unique hits | >1-hit graphs | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31_3 | 293034 | 251819 | 132958 | 87306 | 7571 | 542 | 1914 | 36 | 5 | 13216 | 37135 | 10752 | 516 | 7 | 81,24 | 20,2 |
| 31_5 | 153123 | 115511 | 60504 | 53199 | 9708 | 748 | 1881 | 98 | 5 | 8419 | 27103 | 9868 | 683 | 8 | 81,23 | 20,2 |
| 31_10 | 70809 | 36861 | 19839 | 35955 | 7393 | 621 | 1224 | 108 | 5 | 3746 | 22037 | 8399 | 442 | 8 | 81,21 | 20,2 |
| 35_3 | 175184 | 123605 | 85923 | 73584 | 7525 | 559 | 2246 | 79 | 6 | 10311 | 37115 | 10565 | 737 | 8 | 81,22 | 20,2 |
| 35_5 | 98897 | 58409 | 40689 | 47081 | 9382 | 731 | 1808 | 134 | 6 | 6741 | 26560 | 9631 | 743 | 12 | 81,21 | 20,2 |
| 35_10 | 48595 | 19438 | 13375 | 28269 | 7008 | 706 | 1062 | 90 | 5 | 2967 | 17829 | 7883 | 461 | 8 | 81,19 | 20,2 |

Oases
| k_c | locus | max length | N50 | >1-trans locus | max trans | avg trans | total hits | unique hits | >1-hit locus | max hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31_3 | 35587 | 15986 | 994 | 4881 | 18 | 3 | 26559 | 10819 | 410 | 5 | 87,24 | 25,27 |
| 31_5 | 26679 | 15906 | 1109 | 3234 | 20 | 3 | 21084 | 9944 | 336 | 5 | 87,21 | 25,21 |
| 31_10 | 21283 | 8174 | 877 | 1637 | 16 | 3 | 17225 | 8449 | 188 | 4 | 87,19 | 25,21 |
| 35_3 | 37377 | 9826 | 846 | 3724 | 16 | 3 | 28492 | 10652 | 346 | 6 | 75,14 | 17,17 |
| 35_5 | 27573 | 12562 | 979 | 2644 | 14 | 3 | 21992 | 9751 | 332 | 5 | 75,13 | 17,17 |
| 35_10 | 18072 | 7934 | 939 | 1389 | 12 | 3 | 14614 | 7953 | 194 | 5 | 75,12 | 17,17 |

Trans-ABySS
| k_c | trans | max length | N50 | >1-node trans | max nodes | avg nodes | total hits | unique hits | time (mins) | memory (GB) |
|---|---|---|---|---|---|---|---|---|---|---|
| 31_3 | 113157 | 14353 | 1149 | 72266 | 56 | 6 | 33780 | 10527 | 201,1 | 4,1 |
| 31_5 | 62292 | 14395 | 1282 | 37656 | 72 | 6 | 24614 | 9810 | 193,1 | 4,1 |
| 31_10 | 32509 | 17057 | 1075 | 19837 | 50 | 5 | 16676 | 8313 | 172,1 | 4,1 |
| 35_3 | 76220 | 14351 | 1142 | 40606 | 79 | 6 | 31619 | 10288 | 179,1 | 4,1 |
| 35_5 | 46431 | 14385 | 1239 | 23632 | 38 | 5 | 23451 | 9603 | 172,1 | 4,1 |
| 35_10 | 24956 | 9139 | 1095 | 12968 | 30 | 5 | 15057 | 7956 | 154,1 | 4,1 |

Figure 9. Comparisons of the protein sequence BLAST results in the . The notations are the same as in Figure 5.
Figure 10. Comparisons of the alternative splicing junction results in the . The notations are the same as in Figure 6.