Rajesh Ghangal, Saurabh Chaudhary, Mukesh Jain, Ram Singh Purty, Prakash Chand Sharma.
Abstract
Seabuckthorn (Hippophae rhamnoides L.) has been known for its medicinal, nutritional and environmental importance since ancient times. However, very limited efforts have been made to characterize the genome and transcriptome of this wonder plant. Here, we report the use of next-generation massively parallel sequencing technology (Illumina platform) and de novo assembly to gain a comprehensive view of the seabuckthorn transcriptome. We assembled 86,253,874 high-quality short reads using six assembly tools. In our hands, assembly of non-redundant short reads following a two-step procedure was found to be the best considering various assembly quality parameters. Initially, the ABySS tool was used following an additive k-mer approach. The assembled transcripts were subsequently subjected to the TGICL suite. Finally, de novo short read assembly yielded 88,297 transcripts (> 100 bp), representing about 53 Mb of the seabuckthorn transcriptome. The average transcript length was 610 bp, the N50 length was 1198 bp, and 91% of the short reads mapped back uniquely to the seabuckthorn transcriptome. A total of 41,340 (46.8%) transcripts showed significant similarity with sequences present in the nr protein databases of NCBI (E-value < 1E-06). We also screened the assembled transcripts for the presence of transcription factors and simple sequence repeats. Our strategy, which uses a short read assembler (ABySS) followed by TGICL, will be useful for researchers working with the transcriptome of a non-model organism, as it saves time and reduces complexity in data management. The seabuckthorn transcriptome data generated here provide a valuable resource for gene discovery and development of functional molecular markers.
Year: 2013 PMID: 23991119 PMCID: PMC3749127 DOI: 10.1371/journal.pone.0072516
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Figure 1. Flow chart showing the strategy used for generation of de novo short read transcriptome assembly.
Summary of Illumina HiSeq 2000 short read data used for seabuckthorn transcriptome assembly.
| | | |
|---|---|---|
| Total number of reads | 44791536 | 49222400 |
| Read length | 91 | 91 |
| Total number of HQ reads | 41067522 | 45186352 |
| Percentage of HQ reads | 91.69% | 91.80% |
| Total number of bases | 4031238240 (4.0 Gb) | 4430016000 (4.4 Gb) |
| Total number of bases in HQ reads | 3696076980 (3.7 Gb) | 4066771680 (4.0 Gb) |
| Total number of HQ bases in HQ reads | 3619124516 | 3985259341 |
| Percentage of HQ bases in HQ reads | 97.92% | 98% |
HQ stands for high quality, as determined by running the NGS QC Toolkit on the raw data.
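The figures in the read summary can be cross-checked with a few lines of arithmetic. The sketch below is only an illustration: the values are copied from the table, while the names `dataset_1` and `dataset_2` are placeholders, since the column headers are not available here.

```python
# Values copied from the read summary table. The dataset names are
# placeholders (assumption); the original column headers are not shown.
summary = {
    "dataset_1": {
        "reads": 44791536,
        "hq_reads": 41067522,
        "bases": 4031238240,
        "hq_read_bases": 3696076980,
        "hq_bases": 3619124516,
    },
    "dataset_2": {
        "reads": 49222400,
        "hq_reads": 45186352,
        "bases": 4430016000,
        "hq_read_bases": 4066771680,
        "hq_bases": 3985259341,
    },
}


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


# Total HQ reads across both datasets (the abstract reports 86,253,874).
combined_hq_reads = sum(d["hq_reads"] for d in summary.values())

# Share of reads that passed quality filtering, per dataset.
hq_read_pct = {name: pct(d["hq_reads"], d["reads"]) for name, d in summary.items()}

# Share of high-quality bases within the HQ reads, per dataset.
hq_base_pct = {name: pct(d["hq_bases"], d["hq_read_bases"]) for name, d in summary.items()}

# Bases per read implied by the "total number of bases" row.
bases_per_read = {name: d["bases"] / d["reads"] for name, d in summary.items()}

print(combined_hq_reads, hq_read_pct, hq_base_pct, bases_per_read)
```

The combined HQ read count agrees with the number of reads stated in the abstract. The base totals divide out to 90 bases per read, whereas the table lists a read length of 91; the source does not explain this difference.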
Comparison of different short read assemblers and strategies employed on the basis of various assembly parameters.
| | | | | | | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of transcripts | 77022 | 62804 | 102826 | 97306 | 279673 | 240723 | 72302 | 55238 | 139923 | 83813 | 283777 | 234329 |
| | 3293 | 8567 | 6650 | 3548 | 7070 | 14182 | 4182 | 8849 | 5001 | 3793 | 6257 | 9473 |
| Average length (bp) | 378 | 733 | 698 | 330 | 310 | 548 | 403 | 794 | 335 | 368 | 300 | 460 |
| N50 length (bp) | 454 | 1314 | 1144 | 429 | 486 | 1284 | 485 | 1378 | 482 | 474 | 434 | 791 |
| | 73.79 | 68.28 | 89.21 | 76.65 | 85.77 | 90.8 | 77.69 | 72.99 | 69.42 | 81.55 | 83.41 | 90.24 |
| Reads uniquely mapped (%) | 71.15 | 66.95 | 41.93 | 73.2 | 84.74 | 52.27 | 75.5 | 72.16 | 68.91 | 79.01 | 82.64 | 55.95 |
| | | | | | | | | | | | | |
| Number of transcripts | 347958 | 115006 | 111040 | 314576 | 279673 | 240723 | 332830 | 97691 | 167892 | 295569 | 283777 | 234329 |
| | 4668 | 10607 | 7659 | 4654 | 7070 | 14182 | 4931 | 10387 | 6552 | 5384 | 6257 | 9473 |
| Average length (bp) | 321 | 798 | 804 | 339 | 310 | 548 | 330 | 856 | 421 | 352 | 300 | 460 |
| N50 length (bp) | 419 | 1504 | 1317 | 458 | 486 | 1284 | 435 | 1533 | 647 | 484 | 434 | 791 |
| | 93.73 | 94.89 | 94.14 | 94.07 | 85.77 | 90.8 | 94.53 | 94.77 | 94.05 | 95.05 | 83.41 | 90.24 |
| Reads uniquely mapped (%) | 66.1 | 46.97 | 45.27 | 69.28 | 84.74 | 52.27 | 66.17 | 48.9 | 71.54 | 70.28 | 82.64 | 55.95 |
| | | | | | | | | | | | | |
| Number of transcripts | 228750 | 101055 | 85902 | 211863 | 270245 | 212093 | 219896 | 86124 | 88297 | 202713 | 274248 | 202089 |
| | 7608 | 16199 | 14243 | 11102 | 9026 | 14182 | 9030 | 17565 | 10252 | 9015 | 6257 | 9474 |
| Average length (bp) | 374 | 871 | 930 | 396 | 317 | 475 | 385 | 932 | 610 | 407 | 307 | 444 |
| N50 length (bp) | 631 | 1606 | 1539 | 690 | 513 | 1035 | 658 | 1629 | 1198 | 725 | 456 | 766 |
| | 93.43 | 94.79 | 94.08 | 93.69 | 85.83 | 91.66 | 94.22 | 94.76 | 93.21 | 94.77 | 83.46 | 91.03 |
| Reads uniquely mapped (%) | 90.07 | 50.25 | 54.11 | 90.67 | 85.15 | 67.2 | 90.7 | 52.15 | 91.03 | 91.16 | 83.07 | 70.15 |
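The abstract summarizes the assembly by its average transcript length and N50 length. As a reminder of how these two numbers are usually derived from a list of transcript lengths, here is a minimal sketch. It uses the common definition of N50 (the largest length L such that transcripts of length at least L contain half or more of all assembled bases); the function name `assembly_stats` and the toy lengths are illustrative and not taken from the paper.

```python
def assembly_stats(lengths):
    """Return (count, total_bases, mean_length, n50) for transcript lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)  # longest first
    total = sum(ordered)
    running = 0
    n50 = ordered[-1]
    for length in ordered:
        running += length
        # N50 is the length at which the running sum first reaches half the total.
        if 2 * running >= total:
            n50 = length
            break
    return len(ordered), total, total / len(ordered), n50


# Toy lengths in bp (hypothetical, not from the paper).
toy_lengths = [200, 300, 500, 800, 1200]
count, total, mean_length, n50 = assembly_stats(toy_lengths)
print(count, total, mean_length, n50)
```

N50 is weighted toward long sequences, which is why it can sit well above the mean, as it does for the reported assembly (610 bp average, 1198 bp N50). As a rough consistency check, 88,297 transcripts at a mean of 610 bp amount to about 53.9 Mb, in line with the roughly 53 Mb stated in the abstract.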
Figure 2. Top hit species representation of seabuckthorn transcriptome based on sequence similarity search (BLASTX).
Data on distribution of Simple Sequence Repeats (SSRs) in seabuckthorn transcriptome.
| | |
|---|---|
| Total number of sequences screened | 88297 |
| Total number of identified SSRs | 13299 |
| Number of sequences containing SSRs | 10980 (12.4) |
| Number of sequences containing more than one SSR | 1850 (16.8) |
| Number of SSRs present in compound formation | 1099 (8.2) |
| | |
| Mono-nucleotides | 7502 (56.4) |
| Di-nucleotides | 2860 (21.5) |
| Tri-nucleotides | 2520 (18.9) |
| Tetra-nucleotides | 213 (1.6) |
| Penta-nucleotides | 62 (0.46) |
| Hexa-nucleotides | 142 (1.06) |
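To make the mono- to hexa-nucleotide classes in the table concrete, the following sketch scans a sequence for perfect tandem repeats with motifs of 1 to 6 bases. It is not the tool or the settings used in the study, which are not given here: the minimum repeat counts in `MIN_REPEATS`, the function names and the example sequence are all assumptions chosen for illustration, and compound SSRs are not handled.

```python
import re

# Minimum number of repeats per motif length. These thresholds are
# illustrative assumptions, not the values used in the paper.
MIN_REPEATS = {1: 10, 2: 6, 3: 5, 4: 5, 5: 5, 6: 5}

# One compiled pattern per motif length: a motif followed by enough copies of itself.
PATTERNS = {
    k: re.compile(r"([ACGT]{%d})\1{%d,}" % (k, n - 1))
    for k, n in MIN_REPEATS.items()
}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. 'ATAT')."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq):
    """Return (start, motif, repeat_count) for perfect SSRs, scanning left to right."""
    seq = seq.upper()
    hits = []
    pos = 0
    while pos < len(seq):
        for k in sorted(PATTERNS):  # try shorter motifs first
            m = PATTERNS[k].match(seq, pos)
            if m and is_primitive(m.group(1)):
                hits.append((pos, m.group(1), len(m.group(0)) // k))
                pos = m.end()  # continue after the repeat
                break
        else:
            pos += 1  # no SSR starts here
    return hits


# Hypothetical sequence with a mono-, a di- and a tri-nucleotide repeat.
example = "TTGC" + "A" * 12 + "CGT" + "AG" * 7 + "CCA" + "GAT" * 5 + "C"
hits = find_ssrs(example)
for start, motif, repeats in hits:
    print(start, motif, repeats)
```

Counting hits by motif length over all transcripts would give a breakdown of the same shape as the table, although the actual counts depend strongly on the thresholds chosen.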
Figure 3. Distribution of seabuckthorn transcripts in different transcription factor families.