Yujiao Chen1,2,3,4,5, Yuqian Wu1,3,4,5,6, Li Liu3,5, Jianhua Feng3, Tiancheng Zhang3, Sheng Qin7, Xingyu Zhao7, Chaoxia Wang3, Dongmei Li3, Wei Han3, Minghui Shao3, Ping Zhao1, Jianfeng Xue3, Xiaomin Liu2,4, Hongjie Li2, Enwei Zhao2,4, Wen Zhao3, Xijie Guo7, Yongfeng Jin8, Yaming Cao9, Liwang Cui3,10, Zeqi Zhou11, Qingyou Xia6, Zihe Rao12, Yaozhou Zhang13,14,15,16,17,18.
Abstract
The complete genome of Cordyceps militaris was sequenced using single-molecule real-time (SMRT) sequencing technology at over 300× coverage. The genome size was 32.57 Mb, and 14 contigs ranging from 0.35 to 4.58 Mb with an N50 of 2.86 Mb were assembled, including 4 contigs with telomeric sequences on both ends and an additional 8 contigs with telomeric sequences on either the 5' or 3' end. A methylome database of the genome was constructed from the SMRT data; m4C and m6A methylated nucleotides and many unknown modification types were identified. The major m6A methylation motifs are GA and GGAG, and the major m4C methylation motifs are GC or CG/GC. Four types of methylated nucleotides in C. militaris genomic DNA were confirmed using high-resolution LCMS-IT-TOF. Using PacBio Iso-Seq, a total of 31,133 complete cDNA sequences were obtained from the fruiting body. The conserved domains of the nontranscribed regions of the genome include TATA boxes, which are the initial regions of genome replication. There were 406 structural variants between the HN and CM01 strains and 1,114 structural variants between the HN and ATCC strains.
Year: 2019 PMID: 30696919 PMCID: PMC6351555 DOI: 10.1038/s41598-018-38021-4
Source DB: PubMed Journal: Sci Rep ISSN: 2045-2322 Impact factor: 4.379
Figure 1. Characteristics of the de novo assembly genomic features in C. militaris (HN strain). (a) The 14 contigs; those representing complete chromosomes are labeled with red stars. (b) Plots of the m6A motif distributions. Orange represents m6A on the plus strand, brown represents m6A on the minus strand, and black marks the overlap of the two, where m6A is present on both strands. (c) Plots of the m4C motif distributions. Purple represents m4C on the plus strand and blue represents m4C on the minus strand. (d) The distribution of the GC content: red represents a GC content >50% and green represents a GC content <50%. (e) Density distribution of repeat elements. (f) Density distribution of genes. (g) Genome duplication: regions sharing >90% sequence similarity over 5 kb are connected by red lines; those with >90% similarity over 10 kb are connected by blue lines.
Assembly summary statistics of C. militaris HN compared with the ATCC 34164 and CM01 C. militaris genomes.
| Feature | HN | ATCC 34164 | CM01 |
|---|---|---|---|
| Assembly size (Mb) | 32.57 | 33.62 | 32.27 |
| Sequencing platform | PacBio RS II | PacBio RS II | Roche 454 |
| Coverage fold | 317.7× | 149.5× | 147× |
| Number of Scaffolds (Contigs) | 14 | 7 | 31/597 |
| N50 (Mb) Scaffold/Contigs | 2.86 | 5.78 | 4.55/0.11 |
| GC Content (%) | 51.55 | 50.92 | 51.41 |
| Repeat Content (%) | 7.72 | 9.41 | 8.10 |
| Predicted Genes | 10095 | 9287 | 9651 |
| Number of Exons | 29663 | 26026 | 28872 |
| Number of Introns | 19568 | 16739 | 19221 |
| Total Gene Length (Mb) | 18.9 | 16.1 | 16.8 |
| Mean Intergenic Region Length | 1432 | 1866 | 1596 |
| Gene Density (Genes/Mbp) | 309.9 | 278.9 | 301 |
| Mean Gene Length | 1872/1385 | 1739 | 1743 |
| Mean Exon Length | 471 | 550 | 507 |
| Mean Intron Length | 116 | 109 | 113 |
| Mean Introns Per Gene | 1.9 | 1.8 | 2.0 |
| NCBI Accession | SUB1679810 | PRJNA323705 | PRJNA225510 |
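N50 is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The following is a minimal sketch of that calculation, applied to the 14 HN contig lengths reported in the table of methylation distributions; the function name `n50` is illustrative and is not taken from the authors' pipeline.

```python
def n50(lengths):
    """Return the N50 of a list of contig lengths (in bp)."""
    total = sum(lengths)
    running = 0
    # Walk contigs from longest to shortest until half the assembly is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must not be empty")


# HN contig lengths (bp), contig1 through contig14, from the methylation table.
hn_contigs = [
    4576244, 4088212, 3973962, 2891972, 2861787, 2561359, 1944127,
    1886322, 1861270, 1812096, 1600700, 1519743, 643026, 352336,
]

print(sum(hn_contigs))  # 32573156
print(n50(hn_contigs))  # 2861787
```

The result, 2,861,787 bp, rounds to the 2.86 Mb reported for the HN assembly.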
Figure 2. Comparison of the SMRT assembly to a previous CM01 genome and sequencing depth distribution of the genome and transcriptome. (a) Dot plots comparing the SMRT assembly to a previous CM01 genome identified large genomic variants between the two strains. (b) The distribution of the sequencing depth of the genome and transcriptome. (c) Contigs in HN mapped to the previous CM01 genome.
Figure 3. Distribution of m4C and m6A methylated sites in 14 contigs in the HN genome.
Figure 4. Distribution of methylation in the HN genome. (a) The distributions of m4C and m6A in different parts of the genome. (b) Density distribution of coverage and quality in the m4C and m6A motifs. (c) SMRT sequencing identified motifs associated with m6A or m4C. NGCNC, GGCG and CNCCN are associated with m4C methylation. NGAGG is associated with m6A methylation. (d) Representative interpulse duration (IPD) ratios of SMRT sequencing data of the gene Cm10g008316. IPD ratio is defined as the change in the IPD distribution in the sample compared with the unmodified bases. Red, positive strand; blue, negative strand. (e) Circos plots of m4C, m6A and motif distributions; from outer ring to inner rings: the density distribution of m6A, the density distribution of m4C, the genome location of cordycepin pathway genes, the location of the ergosterol pathway genes in the genome, genomic location of the NGCNG motif in m4C, genomic location of the GGCGN motif in m4C, genomic location of the CNCCN motif in m4C, and genomic location of the NGAGG motif in m6A.
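The IPD ratio described in panel (d) can be illustrated as the mean interpulse duration observed at a position in the native sample divided by the mean for unmodified bases at the same position. The sketch below assumes that simple mean-over-mean form; the function name and the IPD values are hypothetical and are not data from this study.

```python
from statistics import mean


def ipd_ratio(native_ipds, control_ipds):
    """Mean IPD in the native sample divided by mean IPD for unmodified bases."""
    return mean(native_ipds) / mean(control_ipds)


# Hypothetical per-read IPD values (arbitrary units) at a single position.
native = [2.0, 4.0, 3.0]
control = [1.0, 1.0, 1.0]

# A ratio well above 1 suggests the polymerase paused at this base,
# which is the kinetic signature used to flag a candidate modification.
print(ipd_ratio(native, control))
```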
Distributions and methylation motifs in 14 contigs in HN.
| Contigs | Total length | Number of m6A (%) | Number of m4C (%) | Number of unknown (T or G) (%) |
|---|---|---|---|---|
| contig1 | 4576244 | 682 (0.0149) | 4067 (0.0889) | 98395 (2.1501) |
| contig2 | 4088212 | 694 (0.017) | 3658 (0.0895) | 87643 (2.1438) |
| contig3 | 3973962 | 524 (0.0132) | 2503 (0.063) | 57715 (1.4523) |
| contig4 | 2891972 | 508 (0.0176) | 2491 (0.0861) | 56909 (1.9678) |
| contig5 | 2861787 | 433 (0.0151) | 2778 (0.0971) | 65612 (2.2927) |
| contig6 | 2561359 | 464 (0.0181) | 2177 (0.085) | 50147 (1.9578) |
| contig7 | 1944127 | 392 (0.0202) | 1580 (0.0813) | 38851 (1.9984) |
| contig8 | 1886322 | 329 (0.0174) | 1380 (0.0732) | 34734 (1.8414) |
| contig9 | 1861270 | 311 (0.0167) | 1593 (0.0856) | 39143 (2.103) |
| contig10 | 1812096 | 278 (0.0153) | 1746 (0.0964) | 41142 (2.2704) |
| contig11 | 1600700 | 321 (0.0201) | 1475 (0.0921) | 32845 (2.0519) |
| contig12 | 1519743 | 197 (0.013) | 1361 (0.0896) | 32135 (2.1145) |
| contig13 | 643026 | 93 (0.0145) | 611 (0.095) | 13736 (2.1362) |
| contig14 | 352336 | 114 (0.0324) | 147 (0.0417) | 3021 (0.8574) |
| Total | 32573156 | 5340 (0.0164) | 27567 (0.0846) | 652028 (2.0017) |
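The percentages in parentheses appear to be the number of modified sites divided by the total contig length, not by the number of A or C bases. This reading is inferred from the reported values, not stated in the table. A minimal sketch that reproduces a few of the entries under that assumption (the function name is illustrative):

```python
def percent_of_bases(site_count, contig_length):
    """Modified sites as a percentage of all bases in the contig."""
    return round(100 * site_count / contig_length, 4)


# (total length, m6A, m4C, unknown) for contig1, contig14 and the genome total.
rows = {
    "contig1": (4576244, 682, 4067, 98395),
    "contig14": (352336, 114, 147, 3021),
    "Total": (32573156, 5340, 27567, 652028),
}

for name, (length, m6a, m4c, unknown) in rows.items():
    print(
        name,
        percent_of_bases(m6a, length),
        percent_of_bases(m4c, length),
        percent_of_bases(unknown, length),
    )
```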
Figure 5. Annotation of the methylated genes in the HN genome. (a) GO annotation information of the methylated genes and top 14 GO enrichment terms. (b) KEGG annotation information of the methylated genes.
Figure 6. (a) Four types of nucleotides and hypothetical methylated nucleotides in the genomic DNA separated by HPLC. (b) Methylated single nucleotides in the genomic DNA as detected by MS based on their molecular weights.
Figure 7. Complexity of the HN fruiting body transcriptome based on PacBio Iso-Seq. (a) Comparison of the length of the ToFU transcript set in this study, Illumina short-read assembly transcript set, C. militaris GenBank reference mRNAs and gene annotation in this study. (b) Comparison of the number of isoforms between the Illumina short-read assembly transcript set and the ToFU transcript set. (c) Length distributions of the coding and long noncoding ToFU transcript sequences. (d) Alignment of the reference annotated transcript (blue) of the Cm02g002286.1 gene with 35 distinct PacBio isoforms. (e) Visualization of the alternative splicing of the Cm01g001055.1 gene; the exon-skip (ES) and intron-retain (IR) AS are highlighted. (f) Visualization of the extended UTR of the Cm01g000354.1 gene.
Figure 8. Distribution of mRNA alternative splicing events in the HN strain detected in the PacBio Iso-Seq full-length transcripts.
Figure 9. Distributions of the transcriptome and nontranscriptome in the genome of the HN fruiting body. Top 5 motifs in the palindrome structure of the conserved sequence in the nontranscribed region. (a) HN fruiting body genomic nontranscribed region. (b) Conserved sequence.
Size distribution of the structural variants in the SMRT assembly relative to the CM01 genome.
| Comparing objects | Size range (bp) | Insertion: count | Insertion: total (bp) | Deletion: count | Deletion: total (bp) | Tandem_expansion: count | Tandem_expansion: total (bp) | Tandem_contraction: count | Tandem_contraction: total (bp) | Repeat_expansion: count | Repeat_expansion: total (bp) | Repeat_contraction: count | Repeat_contraction: total (bp) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| HN_vs_CM01 | 2–10 | 1551 | 3545 | 533 | 1628 | 0 | 0 | 0 | 0 | 12 | 67 | 18 | 86 |
| HN_vs_CM01 | 10–50 | 17 | 292 | 25 | 382 | 0 | 0 | 0 | 0 | 43 | 1159 | 36 | 1030 |
| HN_vs_CM01 | 50–500 | 191 | 18181 | 2 | 361 | 8 | 1384 | 19 | 3202 | 20 | 3517 | 114 | 19484 |
| HN_vs_CM01 | 500–10,000 | 2 | 1989 | 1 | 1587 | 0 | 0 | 0 | 0 | 2 | 1194 | 47 | 60505 |
| HN_vs_CM01 | Total | 1761 | 24007 | 561 | 3958 | 8 | 1384 | 19 | 3202 | 77 | 5937 | 215 | 81105 |
| HN_vs_ATCC | 2–10 | 18915 | 74099 | 18027 | 71902 | 0 | 0 | 0 | 0 | 14 | 91 | 18 | 90 |
| HN_vs_ATCC | 10–50 | 2937 | 48659 | 2830 | 46906 | 0 | 0 | 1 | 27 | 29 | 754 | 37 | 1091 |
| HN_vs_ATCC | 50–500 | 201 | 34403 | 197 | 26993 | 3 | 552 | 7 | 1534 | 120 | 22934 | 132 | 24858 |
| HN_vs_ATCC | 500–10,000 | 105 | 367150 | 76 | 266652 | 0 | 0 | 0 | 0 | 159 | 530194 | 114 | 395766 |
| HN_vs_ATCC | Total | 22158 | 524311 | 21130 | 412453 | 3 | 552 | 8 | 1561 | 322 | 553973 | 301 | 421805 |
| ATCC_vs_CM01 | 2–10 | 18624 | 74100 | 19135 | 75435 | 0 | 0 | 0 | 0 | 25 | 134 | 16 | 94 |
| ATCC_vs_CM01 | 10–50 | 2874 | 47367 | 2995 | 49499 | 1 | 26 | 0 | 0 | 62 | 1722 | 64 | 1715 |
| ATCC_vs_CM01 | 50–500 | 302 | 38634 | 189 | 32080 | 4 | 653 | 4 | 514 | 144 | 25866 | 176 | 32966 |
| ATCC_vs_CM01 | 500–10,000 | 85 | 284457 | 89 | 301499 | 0 | 0 | 0 | 0 | 105 | 349498 | 198 | 661564 |
| ATCC_vs_CM01 | Total | 21885 | 444558 | 22408 | 458513 | 5 | 679 | 4 | 514 | 336 | 377220 | 454 | 696339 |
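The structural variant counts quoted in the abstract (406 for HN versus CM01, 1,114 for HN versus ATCC) match the number of variants of 50 bp or larger in this table, summed over all six variant types. That 50 bp cutoff is an inference from the numbers, not a definition given in the text. A minimal sketch of the check, with counts copied from the table:

```python
# Variant counts per size bin, in the column order of the table:
# insertion, deletion, tandem expansion, tandem contraction,
# repeat expansion, repeat contraction.
counts = {
    "HN_vs_CM01": {
        "2-10": [1551, 533, 0, 0, 12, 18],
        "10-50": [17, 25, 0, 0, 43, 36],
        "50-500": [191, 2, 8, 19, 20, 114],
        "500-10000": [2, 1, 0, 0, 2, 47],
    },
    "HN_vs_ATCC": {
        "2-10": [18915, 18027, 0, 0, 14, 18],
        "10-50": [2937, 2830, 0, 1, 29, 37],
        "50-500": [201, 197, 3, 7, 120, 132],
        "500-10000": [105, 76, 0, 0, 159, 114],
    },
}

# Assumed cutoff: only bins starting at 50 bp count as structural variants.
LARGE_BINS = ("50-500", "500-10000")


def large_variant_count(comparison):
    """Total number of variants >= 50 bp across all six variant types."""
    return sum(sum(counts[comparison][size_bin]) for size_bin in LARGE_BINS)


print(large_variant_count("HN_vs_CM01"))  # 406
print(large_variant_count("HN_vs_ATCC"))  # 1114
```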
Figure 10. Most variations in the SMRT assembly relative to CM01 are small insertions. Variants ranging from 2 bp to 10 kb in size were called using Assemblytics. (a) Size distribution analysis of variants from 2 bp to 10 kb in size; the x-axis represents the variant size in base pairs and the y-axis represents the number of variants. Sub-panels: (a) variants from 2 to 10 bp; (b) variants from 10 to 50 bp; (c) variants from 50 to 500 bp; (d) variants from 500 bp to 10 kb. (b) Cumulative sequence length plot showing the nearly identical contiguity and total size of the SMRT assembly (query; in green) versus the reference (in blue). The length of each individual sequence is indicated on the y-axis and the cumulative sum of the sorted sequence lengths is indicated on the x-axis.