Jialei Duan, Chuan Xia, Guangyao Zhao, Jizeng Jia, Xiuying Kong.
Abstract
BACKGROUND: Rapid advances in next-generation sequencing methods have provided new opportunities for transcriptome sequencing (RNA-Seq). The unprecedented sequencing depth provided by RNA-Seq makes it a powerful and cost-efficient method for transcriptome study, and it has been widely used in model organisms and non-model organisms to identify and quantify RNA. For non-model organisms lacking well-defined genomes, de novo assembly is typically required for downstream RNA-Seq analyses, including SNP discovery and identification of genes differentially expressed by phenotypes. Although RNA-Seq has been successfully used to sequence many non-model organisms, the results of de novo assembly from short reads can still be improved by using recent bioinformatic developments.
Year: 2012 PMID: 22891638 PMCID: PMC3485621 DOI: 10.1186/1471-2164-13-392
Source DB: PubMed Journal: BMC Genomics ISSN: 1471-2164 Impact factor: 3.969
Statistics of trimmed reads
| Sample | No. of reads (% trimmed) | No. of bases (% trimmed) | Mean read length (bp) | GC content (before trimming) |
|---|---|---|---|---|
| AF | 33,067,246 (1.5%) | 2,426,965,166 (23.3%) | 73.4 | 39.15% (52.7%) |
| AS | 80,879,822 (4.6%) | 6,204,420,998 (18.7%) | 76.7 | 40.03% (50.8%) |
| SF | 32,678,050 (1.3%) | 2,410,831,266 (22.8%) | 73.8 | 38.95% (51.9%) |
| SS | 65,967,360 (4.5%) | 5,148,749,102 (17.1%) | 78.0 | 40.48% (50.3%) |
Numbers in parentheses indicate the proportions of reads and bases removed by trimming, and the GC content before trimming.
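The fourth column of the table is consistent with the mean read length after trimming, that is, the number of bases divided by the number of reads. The following minimal Python sketch checks this using the values copied from the table. The reading of that column as a mean read length is an assumption based on the arithmetic, and the names `trimmed` and `mean_read_length` are illustrative only.

```python
# Reads and bases remaining after trimming, copied from the table above.
trimmed = {
    "AF": (33_067_246, 2_426_965_166),
    "AS": (80_879_822, 6_204_420_998),
    "SF": (32_678_050, 2_410_831_266),
    "SS": (65_967_360, 5_148_749_102),
}


def mean_read_length(reads: int, bases: int) -> float:
    """Average number of bases per read."""
    return bases / reads


# Round to one decimal place, matching the precision used in the table.
mean_lengths = {
    sample: round(mean_read_length(reads, bases), 1)
    for sample, (reads, bases) in trimmed.items()
}

for sample, length in mean_lengths.items():
    print(f"{sample}: {length:.1f} bp")
```

The computed values agree with the tabulated 73.4, 76.7, 73.8 and 78.0 for AF, AS, SF and SS, respectively.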
Figure 1. N50 and total length of the assemblies produced by ABySS with different k-mer sizes. Every odd number from 45 to 87 was used as a k-mer size, and different assembly strategies were compared. For example, AF refers to an assembly built from the reads of the AF sample only, whereas AFSFASSS refers to an assembly built from the reads of all four samples together. A) N50, B) total length. The total length of each assembly is shown on a logarithmic scale.
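Because N50 is the main metric compared here, a short Python sketch of how it is commonly computed may be useful. It assumes the usual definition (the length of the contig at which the cumulative length of the longest contigs first reaches half of the total assembly length) and interprets "No. of contigs in N50" as the number of contigs needed to reach that point. It is not the authors' script; the function name `n50_stats` and the toy contig lengths are hypothetical.

```python
from typing import Iterable, Tuple


def n50_stats(lengths: Iterable[int], min_length: int = 300) -> Tuple[int, int]:
    """Return (N50, number of contigs in N50) for contigs >= min_length."""
    # Drop short contigs, then sort from longest to shortest.
    kept = sorted((n for n in lengths if n >= min_length), reverse=True)
    if not kept:
        return 0, 0
    half = sum(kept) / 2
    running = 0
    for count, length in enumerate(kept, start=1):
        running += length
        if running >= half:
            return length, count
    return kept[-1], len(kept)  # not reached for non-empty input


# k-mer sizes described in the caption: every odd number from 45 to 87.
kmer_sizes = list(range(45, 88, 2))

# Toy contig lengths (hypothetical); the 250 bp contig is filtered out.
toy_lengths = [250, 300, 400, 500, 800, 1000, 2000]
n50, contigs_in_n50 = n50_stats(toy_lengths)
print(len(kmer_sizes), n50, contigs_in_n50)
```

The default cutoff of 300 bp mirrors the filter stated in the table notes; pass a different `min_length` to change it.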
Statistics of initial assembly
| | | | | | | | |
| No. of contigs | 227,879 | 110,162 | 198,382 | 69,721 | 133,556 | 76,474 | 126,848 |
| Total Bases | 217,485,766 | 89,132,635 | 184,623,552 | 39,925,904 | 115,757,220 | 55,932,501 | 100,283,444 |
| No. of contigs (≥ 1 kbp) | 69,402 | 28,016 | 58,759 | 6,817 | 36,825 | 16,409 | 30,694 |
| Total Bases (in contigs ≥ 1 kbp) | 135,112,080 | 47,002,017 | 112,433,788 | 9,818,210 | 65,495,871 | 25,146,108 | 50,760,639 |
| Max contig length | 12,488 | 7,710 | 10,506 | 5,723 | 11,199 | 5,670 | 5,294 |
| Mean contig length | 954 | 809 | 930 | 572 | 866 | 731 | 790 |
| N50 | 1,370 | 1,060 | 1,323 | 600 | 1,165 | 903 | 1,015 |
| No. of contigs in N50 | 46,815 | 25,597 | 41,204 | 20,241 | 29,723 | 19,351 | 30,035 |
| | | | | | | | |
| No. of contigs | 571,835 | 258,087 | 478,056 | 157,039 | 334,220 | 158,856 | 279,549 |
| Total Bases | 359,411,943 | 152,450,028 | 296,430,486 | 88,785,112 | 202,564,854 | 90,257,267 | 165,879,838 |
| No. of contigs (≥ 1 kbp) | 66,393 | 22,844 | 52,783 | 11,276 | 33,779 | 11,757 | 25,615 |
| Total Bases (in contigs ≥ 1 kbp) | 88,342,878 | 29,539,332 | 69,895,282 | 14,500,455 | 44,327,763 | 14,876,118 | 33,225,449 |
| Max contig length | 8,216 | 4,077 | 5,694 | 5,031 | 6,687 | 4,725 | 4,804 |
| Mean contig length | 628 | 590 | 620 | 565 | 606 | 568 | 593 |
| N50 | 686 | 634 | 674 | 597 | 653 | 604 | 637 |
| No. of contigs in N50 | 178,083 | 82,822 | 149,467 | 51,187 | 105,491 | 51,861 | 89,188 |
Contigs shorter than 300 bp were excluded.
Statistics of each assembly step
| | | | | | | |
|---|---|---|---|---|---|---|
| | | | | | | |
| No. of contigs | 227,879 | 571,835 | 297,319 | 636,921 | 379,837 | 720,131 |
| Total Bases | 217,485,766 | 359,411,943 | 267,708,300 | 398,478,666 | 298,801,419 | 444,024,673 |
| N50 | 1,370 | 686 | 1,243 | 683 | 1,000 | 670 |
| No. of contigs in N50 | 46,815 | 178,083 | 64,054 | 199,771 | 89,338 | 228,145 |
| | | | | | | |
| No. of contigs | 165,174 | 152,963 | 179,876 | 138,487 | 191,858 | 128,990 |
| Total Bases | 139,479,895 | 117,699,005 | 151,369,559 | 104,950,724 | 151,376,565 | 94,829,611 |
| N50 | 1,146 | 935 | 1,134 | 914 | 1,016 | 878 |
| No. of contigs in N50 | 34,500 | 39,253 | 38,937 | 35,969 | 44,201 | 33,877 |
| | | | | | | |
| No. of contigs | 162,090 | 152,694 | 176,983 | 138,452 | 188,653 | 129,464 |
| Total Bases | 139,459,722 | 120,117,056 | 151,354,485 | 107,614,336 | 151,363,021 | 97,866,545 |
| N50 | 1,177 | 960 | 1,161 | 942 | 1,041 | 907 |
| No. of contigs in N50 | 33,941 | 39,124 | 38,409 | 35,915 | 43,481 | 33,927 |
| Proportion of Ns | 0.12% | 0.13% | 0.12% | 0.13% | 0.12% | 0.13% |
| GC content | 47.93% | 48.69% | 48.18% | 48.76% | 48.51% | 48.81% |
a Trinity_AFSFASSS and Trans-ABySS_AFSFASSS had not been merged.
Detailed statistics can be found in Additional file 2. Contigs shorter than 300 bp were excluded.
Figure 2. Effects of redundancy removal. The influence of redundancy removal on the six assemblies, shown as the proportion changed.
Figure 3. Cumulative scaffold lengths generated by different assembly programs and strategies. For the six assemblies, scaffolds shorter than 300 bp were filtered out.
Figure 4. Pairwise comparisons between the six assemblies. Assemblies were compared pairwise using BLAT, and the proportions covered are shown.
Statistics of reads mapped to assemblies
| AF aligned | 83.83% | 87.53% | 89.17% | 93.62% | 92.44% | 91.48% |
| aligned more than 3 hits | 4.18% | 8.12% | 9.63% | 2.93% | 3.89% | 4.91% |
| uniquely aligned | 51.85% | 48.67% | 48.48% | 71.74% | 67.46% | 63.89% |
| AS aligned | 82.21% | 86.50% | 88.27% | 92.82% | 91.85% | 90.73% |
| aligned more than 3 hits | 4.23% | 8.35% | 10.45% | 3.00% | 3.85% | 4.93% |
| uniquely aligned | 50.67% | 48.48% | 47.70% | 70.94% | 66.89% | 62.74% |
| SF aligned | 80.06% | 86.13% | 88.61% | 93.55% | 92.23% | 91.41% |
| aligned more than 3 hits | 3.88% | 8.30% | 11.46% | 3.92% | 4.79% | 6.08% |
| uniquely aligned | 50.48% | 48.73% | 47.06% | 71.63% | 67.09% | 62.58% |
| SS aligned | 74.98% | 86.39% | 88.47% | 92.67% | 91.68% | 90.81% |
| aligned more than 3 hits | 3.81% | 7.83% | 10.71% | 3.19% | 3.98% | 5.07% |
| uniquely aligned | 46.73% | 51.45% | 49.93% | 72.35% | 68.42% | 64.23% |
| All aligned | 79.89% | 86.57% | 88.52% | 93.01% | 91.95% | 90.98% |
| aligned more than 3 hits | 4.04% | 8.15% | 10.56% | 3.19% | 4.04% | 5.15% |
| uniquely aligned | 49.60% | 49.47% | 48.41% | 71.61% | 67.48% | 63.35% |
Comparisons between full-length cDNA transcripts and assembled transcripts
| % of fl-cDNA hit | 92.94% | 92.83% | 92.62% | 91.33% | 90.39% | 89.69% |
| (% of bases covered) | 82.84% | 82.64% | 81.68% | 79.52% | 76.71% | 74.09% |
| % of fl-cDNA hit with at least 90% of its length | 92.06% | 92.05% | 92.13% | 90.83% | 89.73% | 88.90% |
| (% of bases covered) | 82.08% | 82.02% | 81.14% | 79.24% | 76.40% | 73.79% |
Comparisons between Chinese Spring cDNA transcripts and assembled transcripts
| % of CS ESTs hit | 92.64% | 93.97% | 93.60% | 92.89% | 91.65% | 90.63% |
| (% of bases covered) | 85.82% | 87.97% | 87.91% | 87.73% | 85.72% | 83.99% |
| % of CS ESTs hit with at least 98% identity | 51.56% | 65.21% | 76.07% | 73.98% | 71.58% | 69.19% |
| (% of bases covered) | 59.46% | 65.15% | 69.54% | 81.52% | 79.25% | 77.08% |
Comparisons between public ESTs and assembled transcripts
| | | | | | | |
|---|---|---|---|---|---|---|
| % of ESTs hit (a) | 87.83% | 88.89% | 88.86% | 88.64% | 87.91% | 87.29% |
| (% of bases covered) | 77.76% | 79.76% | 80.04% | 80.49% | 79.50% | 78.61% |
| % of assembled transcripts hit (b) | 59.54% | 65.43% | 69.82% | 78.05% | 80.58% | 80.78% |
| (% of bases covered) | 57.09% | 62.59% | 66.09% | 70.91% | 74.24% | 75.90% |
a Proportion of public ESTs matched by assembled transcripts.
b Proportion of assembled transcripts matched by public ESTs.
Statistics of transcripts aligned to the draft diploid genome
| | | | | | | | | |
|---|---|---|---|---|---|---|---|---|
| Mean coverage | 1.9 | 2.2 | 2.3 | 2 | 2.1 | 2.1 | 11.0 | 2.9 |
| Coverage range | 1 - 279 | 1 - 339 | 1 - 254 | 1 - 26 | 1 - 24 | 1 - 24 | 1 - 10,122 | 1 - 658 |
| Bases covered | 62,103,048 | 58,923,609 | 56,105,411 | 51,499,767 | 45,678,989 | 41,571,199 | 46,506,263 | 63,042,809 |
Assembled transcripts were aligned to the draft genomic sequence. For the regions of the reference covered by assembled transcripts, mean coverage indicates how many transcripts covered a given region on average.
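To make the three rows of the table concrete, the following Python sketch computes the number of bases covered, the mean coverage over covered bases, and the maximum coverage from a list of alignment intervals. It is a toy illustration under stated assumptions: alignments are given as half-open `(start, end)` intervals on a single reference sequence, and the function name `coverage_summary` and the example intervals are hypothetical. It does not reproduce the pipeline used to generate the table.

```python
from typing import List, Tuple


def coverage_summary(intervals: List[Tuple[int, int]]) -> Tuple[int, float, int]:
    """Return (bases covered, mean depth over covered bases, max depth).

    Each interval is a half-open [start, end) alignment on one reference.
    """
    # Turn each alignment into a +1 event at its start and a -1 event at its end.
    events = []
    for start, end in intervals:
        if end > start:
            events.append((start, 1))
            events.append((end, -1))
    # Sorting puts -1 before +1 at equal positions, as half-open intervals require.
    events.sort()

    covered = 0      # reference bases with depth >= 1
    total_depth = 0  # sum of depth over all covered bases
    max_depth = 0
    depth = 0
    prev = None
    for pos, delta in events:
        if prev is not None and depth > 0:
            span = pos - prev
            covered += span
            total_depth += span * depth
        depth += delta
        max_depth = max(max_depth, depth)
        prev = pos

    mean_depth = total_depth / covered if covered else 0.0
    return covered, mean_depth, max_depth


# Three hypothetical transcript alignments; the first two overlap by 50 bases.
example = [(0, 100), (50, 150), (200, 250)]
bases_covered, mean_cov, max_cov = coverage_summary(example)
print(bases_covered, mean_cov, max_cov)
```

In this toy case 200 reference bases are covered, 50 of them by two transcripts, so the mean coverage is 250 / 200 = 1.25 and the coverage range is 1 to 2.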