Wei Zhou, Yiyi Hu, Zhenghong Sui, Feng Fu, Jinguo Wang, Lianpeng Chang, Weihua Guo, Binbin Li.
Abstract
Gracilariopsis lemaneiformis has a high economic value and is one of the most important aquaculture species in China. Despite its economic importance, it has remained largely unstudied at the genomic level. In this study, we conducted a genome survey of Gp. lemaneiformis using next-generation sequencing (NGS) technologies. In total, 18.70 Gb of high-quality sequence data were obtained by HiSeq 2000 sequencing, and the genome size was estimated at 97 Mb. The reads were assembled into 160,390 contigs with an N50 length of 3.64 kb, which were further assembled into 125,685 scaffolds with a total length of 81.17 Mb. Genome analysis predicted 3,490 genes and a GC content of 48%. The identified genes have an average transcript length of 1,429 bp, an average coding sequence size of 1,369 bp, 1.36 exons per gene, an average exon length of 1,008 bp, and an average intron length of 191 bp. In the initial assembled scaffolds, transposable elements constituted 54.64% (44.35 Mb) of the genome, and 7,737 simple sequence repeats (SSRs) were identified. Among these SSRs, the trinucleotide repeat type was the most abundant (73.20% of total SSRs), followed by the di- (17.41%), tetra- (5.49%), hexa- (2.90%), and penta- (1.00%) nucleotide repeat types. These characteristics suggest that Gp. lemaneiformis is a model organism for genetic study. This is the first report of genome-wide characterization within this taxon.
Year: 2013 PMID: 23875008 PMCID: PMC3713064 DOI: 10.1371/journal.pone.0069909
Source DB: PubMed Journal: PLoS One ISSN: 1932-6203 Impact factor: 3.240
Summary of two paired-end libraries used for HiSeq 2000 Sequencing and paired-end sequencing datasets.
| Library | Sex of algae | Insert Size/bp | Read Length/bp | Data/Mb | Sequence Depth/X |
| L1 | female | 500 | 95 | 9,087.49 | 53.93 |
| L2 | female | 170 | 95 | 9,608.67 | 60.05 |
| Total | | | | 18,696.16 | 113.98 |
Statistics of the genome assembly.
| | Contig size (bp) | Contig number | Scaffold size (bp) | Scaffold number |
| N90 | 127 | 94,867 | 127 | 57,716 |
| N80 | 193 | 38,389 | 560 | 7,482 |
| N70 | 648 | 15,203 | 5,754 | 2,422 |
| N60 | 1,674 | 7,523 | 12,961 | 1,511 |
| N50 | 3,638 | 4,379 | 20,007 | 1,013 |
| Longest | 51,340 | - | 159,753 | - |
| Total size | 77,537,041 | - | 81,167,384 | - |
| Total number (≥100 bp) | 160,390 | - | 125,685 | - |
| Total number (≥2 kb) | 6,649 | - | 3,704 | - |
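The N-statistics in this table can be illustrated with a short sketch. It assumes the usual definition: sequences are sorted from longest to shortest, and Nx is the length of the sequence at which the running total first reaches x% of the assembly, with "Number" taken to be the count of sequences needed to get there. The function name `nx_stat` and the toy lengths are hypothetical and are not taken from the study.

```python
def nx_stat(lengths, fraction):
    """Return (Nx length, number of sequences needed) for an assembly.

    Assumed definition: walk the sequences from longest to shortest and
    stop when the running total reaches `fraction` of the total length.
    """
    ordered = sorted(lengths, reverse=True)
    threshold = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= threshold:
            return length, count
    raise ValueError("lengths must be a non-empty list of positive values")


# Toy contig lengths (hypothetical, for illustration only).
toy_contigs = [100, 80, 60, 40, 20]

print(nx_stat(toy_contigs, 0.5))  # (80, 2)
print(nx_stat(toy_contigs, 0.9))  # (40, 4)
```

Assemblers differ slightly in how they treat ties and minimum sequence length, so values computed this way may not match a given tool's report exactly.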
Figure 1. GC content and average sequencing depth of the genome data used for assembly.
(The x-axis shows percent GC content in 10-kb non-overlapping sliding windows.)
Figure 2. Distribution of 17-mer frequency in the 3.5 Gb of sequence data (3.5 Gb = 18.70 Gb × 30X/162X).
Estimation of Gp. lemaneiformis genome size based on K-mer statistics.
| K-mer value | K-mer number | Depth | Genome size (bp) | Used bases | Used reads | Depth (X) |
| 17 | 2,910,526,453 | 30 | 97,017,548 | 3,500,000,165 | 36,842,107 | 36.07 |
Note: 30X data are generally preferred for estimating genome size because of their accuracy (based on the evaluation experience of the Beijing Genomics Institute). Used bases were calculated as 18.70 Gb × 30X/162X. Genome size = K-mer_num/Peak_depth.
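The arithmetic behind this table can be checked with a minimal sketch. It uses the formula from the note (genome size = K-mer number / peak depth) and assumes, as is not stated explicitly in the source, that every read is 95 bp long, so that each read contributes 95 - 17 + 1 = 79 17-mers. Function and variable names are illustrative.

```python
def kmer_count(num_reads, read_len, k):
    """Total k-mers from reads of uniform length (assumption: all reads equal)."""
    return num_reads * (read_len - k + 1)


def genome_size(kmer_num, peak_depth):
    """Genome size estimate from the note: K-mer_num / Peak_depth."""
    return kmer_num / peak_depth


# Values from the K-mer table.
used_reads = 36_842_107
read_len = 95      # read length from the library table
k = 17
peak_depth = 30    # peak of the 17-mer frequency distribution

used_bases = used_reads * read_len
kmers = kmer_count(used_reads, read_len, k)
size = genome_size(kmers, peak_depth)
base_depth = used_bases / size   # base-level depth implied by the estimate

print(used_bases)        # 3500000165
print(kmers)             # 2910526453
print(f"{size:,.0f}")    # 97,017,548
print(f"{base_depth:.3f}")
```

Under these assumptions the used bases, K-mer number and genome size reproduce the tabulated values, and the implied base-level depth is close to the 36.07X listed.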
Percentage of the genome masked as each class of transposable elements.
| Type | Repbase TEs length (bp) | Repbase TEs % in genome | TE proteins length (bp) | TE proteins % in genome | De novo length (bp) | De novo % in genome | Combined TEs length (bp) | Combined TEs % in genome |
| DNA | 228,529 | 0.28 | 2,138,557 | 2.63 | 6,682,133 | 8.23 | 6,950,958 | 8.56 |
| LINE | 63,656 | 0.08 | 117,208 | 0.14 | 1,210,955 | 1.49 | 1,342,261 | 1.65 |
| LTR | 2,255,161 | 2.78 | 7,687,608 | 9.47 | 20,671,304 | 25.47 | 21,425,837 | 26.40 |
| SINE | 3,153 | 0.004 | 0 | 0.00 | 0 | 0.00 | 3,153 | 0.004 |
| Other | 63 | 0.00 | 0 | 0.00 | 0 | 0.00 | 63 | 0.00 |
| Unknown | 0 | 0.00 | 0 | 0.00 | 16,608,826 | 20.46 | 16,608,826 | 20.46 |
| Total | 2,535,513 | 3.12 | 9,941,908 | 12.25 | 43,807,750 | 53.97 | 44,351,914 | 54.64 |
Note: Repbase TEs and TE proteins were identified with RepeatMasker and RepeatProteinMask, respectively, against the Repbase library. De novo repeats were identified with RepeatMasker against a de novo repeat library of Gp. lemaneiformis built with LTR-FINDER, Piler and RepeatScout. Combined TEs are the non-redundant integration of the results from the three methods.
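As a quick consistency check, the "% in genome" column can be recomputed from the masked lengths. This sketch assumes the denominator is the total scaffold length of 81,167,384 bp from the assembly statistics; the source does not state the denominator explicitly, and the names below are illustrative.

```python
# Assumed denominator: total scaffold length from the assembly statistics.
ASSEMBLY_BP = 81_167_384

# Combined TE lengths (bp) taken from the table above.
combined_te_bp = {
    "DNA": 6_950_958,
    "LINE": 1_342_261,
    "LTR": 21_425_837,
    "Unknown": 16_608_826,
    "Total": 44_351_914,
}


def percent_of_genome(length_bp, genome_bp=ASSEMBLY_BP):
    """Masked length as a percentage of the assembly, to two decimals."""
    return round(100 * length_bp / genome_bp, 2)


for te_class, length_bp in combined_te_bp.items():
    print(f"{te_class}: {percent_of_genome(length_bp):.2f}%")
# DNA: 8.56%
# LINE: 1.65%
# LTR: 26.40%
# Unknown: 20.46%
# Total: 54.64%
```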
General statistics of gene prediction for Gp. lemaneiformis.
| Method | Gene set | Number | Average transcript length (bp) | Average CDS length (bp) | Average exon per gene | Average exon length (bp) | Average intron length (bp) |
| De novo | Augustus | 3369 | 631.62 | 558.01 | 1.35 | 413.99 | 211.59 |
| | Genscan | 3363 | 2330.66 | 1688.73 | 2.68 | 631.02 | 382.97 |
| Homolog | | 1860 | 766.51 | 699.30 | 1.37 | 510.34 | 181.52 |
| | | 1858 | 773.80 | 702.59 | 1.36 | 515.61 | 196.36 |
| | | 1878 | 764.10 | 696.32 | 1.37 | 509.31 | 184.59 |
| | | 4222 | 879.14 | 843.62 | 1.32 | 638.08 | 110.27 |
| | | 4039 | 808.82 | 746.46 | 1.27 | 586.79 | 229.19 |
| | | 4393 | 910.23 | 802.41 | 1.34 | 597.55 | 314.52 |
| | | 3851 | 787.06 | 746.58 | 1.28 | 583.54 | 144.88 |
| GLEAN | | 3490 | 1429.57 | 1369.79 | 1.36 | 1008.36 | 191.13 |
Note: Gene length included the exon and intron regions but excluded UTRs.
Comparison of general genome characteristics from four red algae.
| Species | | | | |
| Average CDS length (bp) | 1370 | - | 1552 | 1247 |
| Average exon length (bp) | 1008 | 789 | 1540 | 755 |
| Average intron length (bp) | 191 | 123 | 248 | 300 |
| Introns per gene | 0.36 | 0.32 | 0.005 | 0.7 |
| Exons per gene | 1.36 | 1.32 | 1.005 | 1.7 |
| Intron-containing genes (%) | 28 | 12 | 0.6 | ∼40 |
Figure 3. GO category comparison among Gp. lemaneiformis, P. yezoensis, Cy. merolae and Ch. crispus.
Summary of matched sequences between Gracilariopsis lemaneiformis and Arabidopsis thaliana.
| Species | Match length (bp) | Total length (bp) | Coverage (%) |
| Arabidopsis thaliana | 1,671,544 | 119,146,348 | 1.40 |
| Gracilariopsis lemaneiformis | 900,618 | 81,167,384 | 1.11 |
Summary of matched sequences between Gracilariopsis lemaneiformis and Chlorella variabilis.
| Species | Match length (bp) | Total length (bp) | Coverage (%) |
| Chlorella variabilis | 762,232 | 46,159,512 | 1.65 |
| Gracilariopsis lemaneiformis | 345,067 | 81,167,384 | 0.43 |
Figure 4. Frequency of SSR types in the genome survey of Gp. lemaneiformis.
Figure 5. Percentage of different motifs in dinucleotide repeats in Gp. lemaneiformis.
Figure 6. Percentage of different motifs in trinucleotide repeats in Gp. lemaneiformis.
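To make the SSR categories in Figures 4 to 6 concrete, the following is a minimal regex-based sketch of how di- to hexanucleotide repeats could be detected and tallied by type. It is not the tool used in the study, which this record does not name. The minimum repeat counts in `MIN_REPEATS` are illustrative assumptions, and the example sequence is made up.

```python
import re
from collections import Counter

# Assumed thresholds (motif length -> minimum number of tandem copies).
# These are illustrative only; the source does not give its criteria.
MIN_REPEATS = {2: 6, 3: 5, 4: 5, 5: 5, 6: 5}


def is_primitive(motif):
    """True if the motif is not itself a repeat of a shorter unit (e.g. ATAT)."""
    n = len(motif)
    return not any(
        n % d == 0 and motif[:d] * (n // d) == motif for d in range(1, n)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, copies) tuples for perfect SSRs in seq."""
    seq = seq.upper()
    hits = []
    for size, min_rep in MIN_REPEATS.items():
        # One motif of `size` bases followed by at least min_rep - 1 more copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (size, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                hits.append((match.start(), motif, len(match.group(0)) // size))
    return sorted(hits)


def type_shares(hits):
    """Percentage of SSRs per motif length, as plotted in a type-frequency chart."""
    counts = Counter(len(motif) for _, motif, _ in hits)
    total = sum(counts.values())
    return {size: round(100 * n / total, 2) for size, n in counts.items()}


# Made-up example sequence: one (AT)7 and one (GAA)5 repeat.
example = "GG" + "AT" * 7 + "CCG" + "GAA" * 5 + "TC"
ssrs = find_ssrs(example)
print(ssrs)               # [(2, 'AT', 7), (19, 'GAA', 5)]
print(type_shares(ssrs))  # {2: 50.0, 3: 50.0}
```

The sketch handles only perfect repeats and does not resolve compound or overlapping SSRs, which dedicated SSR finders typically do.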