Nicolas Cerveau, Daniel J Jackson.
Abstract
BACKGROUND: Next-generation sequencing (NGS) technologies are arguably the most revolutionary technical development to join the list of tools available to molecular biologists since PCR. For researchers working with nonconventional model organisms, one major problem with the currently dominant NGS platform (Illumina) stems from the obligatory fragmentation of nucleic acid material that occurs prior to sequencing during library preparation. This step creates a significant bioinformatic challenge for accurate de novo assembly of novel transcriptome data. This challenge becomes apparent when a variety of modern assembly tools (of which there is no shortage) are applied to the same raw NGS dataset. With the same assembly parameters these tools can generate markedly different assembly outputs.
Keywords: De novo assembly; Eukaryote; Merge; Protein coding; Redundant; Transcriptome
Year: 2016 PMID: 27938328 PMCID: PMC5148890 DOI: 10.1186/s12859-016-1406-x
Source DB: PubMed Journal: BMC Bioinformatics ISSN: 1471-2105 Impact factor: 3.169
Fig. 1 Schematic outline of our pipeline. Schematic representation of the steps involved in our pipeline.
Assembly statistics
| | Step^a | Illumina-derived datasets | | | | | | Simulated datasets | |
|---|---|---|---|---|---|---|---|---|---|
| Number of concatenated transcripts | 6 | 576,412 | 25,854 | 152,491 | 184,892 | 278,987 | 885,944 | 42,535 | 15,340 |
| CDS number | 7 | 139,727 | 22,180 | 112,813 | 81,598 | 137,601 | 379,596 | 37,920 | 41,103 |
| uniCDS number^b | 8 | 59,178 (58%) | 9,942 (55%) | 40,116 (64%) | 27,735 (66%) | 63,092 (54%) | 131,656 (65%) | 12,118 (68%) | 14,890 (64%) |
| Total transcript number | 9 | 58,185 | 9,744 | 39,022 | 26,968 | 61,798 | 127,526 | 11,582 | 14,283 |
| Total CDS number | 9 | 64,659 | 11,605 | 51,416 | 34,363 | 68,288 | 153,118 | 14,231 | 15,412 |
| Transcripts with multiple CDSs^c | 9 | 5,759 (10%) | 1,529 (15%) | 9,756 (19%) | 5,838 (22%) | 5,999 (10%) | 21,060 (17%) | 2,218 (19%) | 949 (7%) |
| Redundant CDSs^d | 9 | 5,481 (9%) | 1,663 (14%) | 11,300 (22%) | 6,628 (19%) | 5,196 (8%) | 21,462 (14%) | 2,113 (15%) | 522 (3%) |
| Transcriptome size (bp) | 9 | 131,591,076 | 16,164,888 | 69,689,679 | 69,421,322 | 86,181,833 | 206,036,224 | 34,121,269 | 19,765,122 |
| Smallest transcript (bp) | 9 | 300 | 300 | 300 | 300 | 300 | 300 | 300 | 300 |
| Largest transcript (bp) | 9 | 35,470 | 15,061 | 21,466 | 51,362 | 13,117 | 19,833 | 29,220 | 26,756 |
| N50 | 9 | 3,483 | 2,414 | 2,366 | 3,866 | 1,823 | 2,116 | 4,479 | 1,666 |
^a Step number in Fig. 1
^b Proportion of discarded CDSs is indicated in brackets
^c Proportion of transcripts with >1 CDS is indicated in brackets
^d Proportion of non-unique CDSs is indicated in brackets
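The bracketed values in the uniCDS row can be checked against the CDS counts in the row above. The Python sketch below assumes the proportion is computed as 1 − uniCDS / CDS; this definition is inferred from the table values rather than stated in the text, and the function name `discarded_fraction` is hypothetical.

```python
def discarded_fraction(cds_total: int, unicds: int) -> float:
    """Fraction of CDSs discarded when reducing to unique CDSs.

    Assumed definition (inferred from the table): 1 - uniCDS / CDS.
    """
    if cds_total <= 0:
        raise ValueError("cds_total must be positive")
    return 1 - unicds / cds_total


# Values from the first and last dataset columns (steps 7 and 8).
first = discarded_fraction(139_727, 59_178)
last = discarded_fraction(41_103, 14_890)
print(f"{first:.0%}, {last:.0%}")  # prints "58%, 64%"
```

Both results agree with the bracketed percentages reported for those columns.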
Comparison between original and assembled transcriptomes derived from simulated reads generated from D. melanogaster and C. elegans datasets
| Organism | Assembler | Original transcripts: total number | Original transcripts lacking a BLASTn hit in assembled transcripts | Assembled transcripts: total number | Assembled transcripts lacking a BLASTn hit in original transcripts |
|---|---|---|---|---|---|
| D. melanogaster | Concatenated | 11,856 | 1,277 (11%) | 12,273 | 2,685 (22%) |
| | CLC | 11,856 | 2,513 (21%) | 8,113 | 1,972 (24%) |
| | IDBA_tran | 11,856 | 1,775 (15%) | 8,282 | 1,150 (14%) |
| | Trinity | 11,856 | 1,843 (16%) | 11,395 | 2,810 (25%) |
| C. elegans | Concatenated | 16,513 | 4,774 (29%) | 14,922 | 4,087 (27%) |
| | CLC | 16,513 | 5,614 (34%) | 11,853 | 2,948 (25%) |
| | IDBA_tran | 16,513 | 5,018 (30%) | 11,069 | 1,557 (14%) |
| | Trinity | 16,513 | 5,473 (33%) | 12,843 | 2,869 (22%) |
Fig. 2 Categorization of concatenated clusters according to their presence/absence in the individual sub-assemblies. Category 1: clusters found in all three assemblers; category 2: clusters found only in CLC and Trinity; category 3: clusters found only in CLC and IDBA_tran; category 4: clusters found only in IDBA_tran and Trinity; category 5: clusters found only in CLC; category 6: clusters found only in Trinity; and category 7: clusters found only in IDBA_tran.
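The seven categories in this caption amount to a lookup on the set of assemblers in which a cluster is present. The Python sketch below illustrates that mapping; it assumes each cluster is assigned by the exact set of sub-assemblies containing it, and the names `CATEGORY` and `categorize` are hypothetical rather than taken from the authors' pipeline.

```python
# Category numbers follow the Fig. 2 caption; keys are the exact sets of
# assemblers in which a concatenated cluster is found.
CATEGORY = {
    frozenset({"CLC", "IDBA_tran", "Trinity"}): 1,
    frozenset({"CLC", "Trinity"}): 2,
    frozenset({"CLC", "IDBA_tran"}): 3,
    frozenset({"IDBA_tran", "Trinity"}): 4,
    frozenset({"CLC"}): 5,
    frozenset({"Trinity"}): 6,
    frozenset({"IDBA_tran"}): 7,
}


def categorize(found_in):
    """Return the Fig. 2 category for a cluster given the assemblers it occurs in."""
    key = frozenset(found_in)
    if key not in CATEGORY:
        raise ValueError(f"unexpected assembler set: {sorted(key)}")
    return CATEGORY[key]


print(categorize(["Trinity", "CLC"]))  # prints 2
```

Using `frozenset` keys makes the lookup independent of the order in which assemblers are listed.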
In vitro validation of L. stagnalis clusters
| | Number | Positive | Incongruent | Negative |
|---|---|---|---|---|
| Category 1 | 10 | 8 | 2 | 0 |
| Category 5 | 10 | 2 | 5 | 3 |
| Category 6 | 10 | 2 | 2 | 6 |
| Category 7 | 10 | 4 | 1 | 5 |
Fig. 3 Characterisation of CDSs present in the final concatenated assembly and their presence/absence in the individual sub-assemblies. a Proportion of CDSs present in the final concatenated assembly present in each individual assembler for each dataset. b For those CDSs present in each sub-assembler (as in a), the proportion of CDSs from each individual assembler that matches (or exceeds) the length of the final concatenated CDS.
Effect of concatenating assemblies on CDS length
| | Assembler | Illumina-derived datasets | | | | | | Simulated datasets | |
|---|---|---|---|---|---|---|---|---|---|
| Number of extended CDSs^a | CLC | 8,195 (38%) | 1,333 (23%) | 7,255 (44%) | 5,490 (34%) | 11,508 (38%) | 24,218 (47%) | 1,888 (23%) | 1,534 (13%) |
| | IDBA_tran | 13,401 (27%) | 957 (16%) | 11,717 (38%) | 7,586 (25%) | 11,427 (24%) | 48,830 (36%) | 1,960 (23%) | 2,054 (18%) |
| | Trinity | 23,532 (34%) | 2,879 (27%) | 28,918 (44%) | 11,638 (33%) | 21,032 (35%) | 78,038 (40%) | 7,817 (38%) | 5,808 (33%) |
| Cumulated extended CDS length (bp) | CLC | 9,289,434 | 913,872 | 6,758,496 | 5,921,460 | 6,816,978 | 23,238,501 | 2,662,092 | 1,107,756 |
| | IDBA_tran | 23,112,789 | 121,116 | 9,431,703 | 5,795,478 | 5,143,413 | 36,408,087 | 2,311,569 | 1,433,037 |
| | Trinity | 33,055,113 | 2,749,041 | 30,317,865 | 14,029,893 | 16,390,158 | 53,633,142 | 10,201,314 | 4,291,554 |
| Mean extended CDS length (bp) | CLC | 1,134 | 686 | 932 | 1,079 | 592 | 960 | 1,410 | 722 |
| | IDBA_tran | 1,725 | 127 | 805 | 764 | 450 | 746 | 1,179 | 698 |
| | Trinity | 1,405 | 955 | 1,048 | 1,206 | 779 | 687 | 1,305 | 739 |

^a The proportion of CDSs with an extension is indicated in brackets
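The three blocks of this table are related arithmetically: the mean extended CDS length appears to be the cumulated extension length divided by the number of extended CDSs. The Python sketch below checks that relationship for two cells; the formula is an assumption inferred from the reported numbers, and `mean_extension` is a hypothetical helper name.

```python
def mean_extension(cumulated_bp: int, n_extended: int) -> int:
    """Mean extension length in bp, rounded to the nearest integer.

    Assumed relationship (inferred from the table): cumulated / number.
    """
    if n_extended <= 0:
        raise ValueError("n_extended must be positive")
    return round(cumulated_bp / n_extended)


# CLC, first dataset column: 9,289,434 bp over 8,195 extended CDSs.
print(mean_extension(9_289_434, 8_195))  # prints 1134
# IDBA_tran, second dataset column: 121,116 bp over 957 extended CDSs.
print(mean_extension(121_116, 957))  # prints 127
```

Both results match the corresponding entries in the "Mean extended CDS length (bp)" rows.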
Comparisons of assembly annotatability
| | Assembler | Illumina-derived datasets | | | | | | Simulated datasets | |
|---|---|---|---|---|---|---|---|---|---|
| Number of uniCDS | Concatenated | 59,178 | 9,942 | 40,116 | 27,735 | 63,092 | 131,656 | 12,118 | 14,890 |
| | CLC | 21,527 | 5,673 | 15,466 | 15,988 | 29,898 | 51,151 | 8,113 | 11,853 |
| | IDBA_tran | 36,726 | 5,608 | 21,384 | 19,556 | 38,353 | 79,612 | 8,271 | 11,069 |
| | Trinity | 44,545 | 9,339 | 34,356 | 23,509 | 46,571 | 88,428 | 11,351 | 12,838 |
| Overall number of BLASTx hits | Concatenated | 38,838 | 9,922 | 25,502 | 19,789 | 49,565 | 93,781 | 9,007 | 9,751 |
| | CLC^a | 14,034 (36%) | 5,666 (57%) | 9,983 (39%) | 11,221 (57%) | 23,523 (47%) | 36,587 (39%) | 5,777 (64%) | 7,740 (79%) |
| | IDBA_tran^a | 23,634 (61%) | 5,598 (56%) | 13,996 (55%) | 14,107 (71%) | 30,491 (62%) | 56,665 (60%) | 6,218 (69%) | 7,489 (77%) |
| | Trinity^a | 30,134 (78%) | 9,320 (94%) | 21,730 (85%) | 16,765 (85%) | 36,232 (73%) | 63,079 (67%) | 8,342 (93%) | 8,559 (88%) |
| Number of unique BLASTx hits | Concatenated | 15,232 | 5,404 | 9,242 | 9,575 | 15,524 | 16,700 | 4,957 | 5,767 |
| | CLC^a | 10,958 (72%) | 5,094 (94%) | 7,492 (81%) | 8,376 (87%) | 12,405 (80%) | 12,902 (77%) | 4,664 (94%) | 5,529 (96%) |
| | IDBA_tran^a | 12,893 (85%) | 5,223 (97%) | 8,069 (87%) | 8,868 (93%) | 13,840 (89%) | 14,900 (89%) | 4,655 (94%) | 5,302 (92%) |
| | Trinity^a | 14,124 (93%) | 5,243 (97%) | 9,051 (98%) | 9,174 (96%) | 14,314 (92%) | 14,855 (89%) | 4,817 (97%) | 5,340 (93%) |

^a Each value is also expressed as a percentage of the corresponding Concatenated dataset value (numbers in brackets)
Results of BUSCO annotations
| | Assembler | Illumina-derived datasets | | | | | | Simulated datasets | |
|---|---|---|---|---|---|---|---|---|---|
| BUSCO dataset | | Metazoa | Fungi | Metazoa | Arthropods | Plants | Plants | Arthropods | Metazoa |
| Number of BUSCO entries | | 843 | 1,438 | 843 | 2,675 | 956 | 956 | 2,675 | 843 |
| Detected BUSCO entries | Concatenated | 822 | 1,357 | 720 | 2,455 | 903 | 934 | 1,204 | 442 |
| | CLC | 779 | 1,207 | 655 | 2,159 | 843 | 811 | 1,079 | 437 |
| | IDBA_tran | 813 | 1,195 | 699 | 2,242 | 881 | 903 | 1,137 | 428 |
| | Trinity | 811 | 1,356 | 710 | 2,424 | 887 | 927 | 1,143 | 415 |
| Duplicated copies | Concatenated | 344 | 430 | 445 | 945 | 525 | 745 | 368 | 49 |
| | CLC | 59 | 35 | 52 | 89 | 224 | 284 | 48 | 25 |
| | IDBA_tran | 210 | 70 | 189 | 662 | 361 | 569 | 149 | 28 |
| | Trinity | 259 | 155 | 324 | 520 | 389 | 633 | 291 | 42 |
| Fragmented copies | Concatenated | 19 | 53 | 16 | 143 | 20 | 5 | 89 | 54 |
| | CLC | 66 | 106 | 39 | 230 | 88 | 167 | 94 | 57 |
| | IDBA_tran | 20 | 59 | 16 | 174 | 33 | 63 | 93 | 54 |
| | Trinity | 35 | 136 | 24 | 189 | 52 | 8 | 100 | 53 |
Comparison of BLASTx annotation rate of both N. benthamiana cumulative transcriptomes
| | Our study | Nakasugi et al. |
|---|---|---|
| Total number of transcripts | 127,526 | 234,526 |
| Number of transcripts with BLASTx hits against Swiss-Prot | 95,929 (75%) | 176,540 (75%) |
| Number of transcripts with a unique Swiss-Prot hit | 16,472 | 16,222 |
| Number of shared transcripts with a unique Swiss-Prot hit | 13,938 | 13,938 |
| Number of unique transcripts with a unique Swiss-Prot hit | 2,534 | 2,284 |
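The last three rows of this comparison are linked by simple subtraction: the transcripts unique to each study are those with a unique Swiss-Prot hit minus the shared ones. The Python sketch below verifies this with the reported counts; the function name `unique_to_each` is hypothetical, and the relationship is inferred from the table rather than described in the text.

```python
def unique_to_each(hits_a: int, hits_b: int, shared: int) -> tuple[int, int]:
    """Counts exclusive to each of two sets, given their sizes and overlap."""
    if shared > min(hits_a, hits_b):
        raise ValueError("shared count cannot exceed either set size")
    return hits_a - shared, hits_b - shared


# Transcripts with a unique Swiss-Prot hit: this study vs Nakasugi et al.
ours, theirs = unique_to_each(16_472, 16_222, 13_938)
print(ours, theirs)  # prints "2534 2284"
```

The results reproduce the 2,534 and 2,284 transcripts reported in the final row.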