Jiaoyu He1,2,3, Shanfei Bao1,2,3, Junhang Deng1,2,3, Qiufu Li1,2,3, Shiyu Ma1,2,3, Yiran Liu1,2,3, Yanru Cui1,2,3, Yuqi Zhu1,2,3,4, Xia Wei1,2,3, Xianping Ding1,2,3, Kehui Ke5, Chaojie Chen5.
Abstract
Artocarpus nanchuanensis (Moraceae), which is naturally distributed in China, is a representative and extremely endangered tree species. In this study, we obtained a high-quality chromosome-scale genome assembly and annotation for A. nanchuanensis using an integrated approach that combined Illumina sequencing, Nanopore sequencing, and Hi-C. A total of 128.71 Gb of raw Nanopore reads were generated from 20-kb libraries, and 123.38 Gb of clean reads were obtained after filtration, with 160.34× coverage depth and a 17.48-kb average read length. The final assembled A. nanchuanensis genome was 769.44 Mb with a 2.09 Mb contig N50, and 99.62% (766.50 Mb) of the assembled data was assigned to 28 pseudochromosomes. In total, 39,596 genes (95.10%, 39,596/41,636) were successfully annotated, and 129 metabolic pathways were detected. Plant disease-resistance and insect-resistance genes, plant-pathogen interaction metabolic pathways, and abundant biosynthesis pathways of vitamins, flavonoids, and gingerol were detected. Unigenes reveal the basis of species-specific functions, and gene family contraction and expansion generally imply strong functional differences during evolution. Compared with other related species, a total of 512 unigenes, 309 contracted gene families, and 559 expanded gene families were detected in A. nanchuanensis. This A. nanchuanensis genome information provides an important resource to expand our understanding of the unique biological processes, nutritional and medicinal benefits, and evolutionary relationships of this species. The study of gene function and metabolic pathways in A. nanchuanensis may reveal the theoretical basis of its special traits and promote the study and utilization of its rare medicinal value.
Keywords: A. nanchuanensis; Hi-C; Illumina; Nanopore; gene annotation; gene family; genome assembly; sequencing
Year: 2022 PMID: 35701376 PMCID: PMC9197682 DOI: 10.1093/gigascience/giac042
Source DB: PubMed Journal: Gigascience ISSN: 2047-217X Impact factor: 7.658
Figure 1: The flowchart of A. nanchuanensis genome assembly and annotation process.
Figure 2: The A. nanchuanensis sample and genomic interaction analysis.
Sequence statistics of Artocarpus nanchuanensis
| Illumina | | Nanopore | | Hi-C | |
|---|---|---|---|---|---|
| Data* | 51.76 Gb | Data* | 123.38 Gb | Data* | 137.5 Gb |
| Depth/genome coverage | 68.01× | Depth/genome coverage | 160.34× | Depth | 62× |
| Total k-mer | 45,202,482,693 | MaxLen | 216,661 bp | Total Read Pairs | 458,907,479 |
| Genome | 761.07 Mb | SeqNum | 7,057,335 | Genome | 769.44 Mb |
| Heterozygosity | 0.93% | N50Len | 19,177 bp | Contig N50 | 1.78 Mb |
| Repeated | 55.80% | N90Len | 11,029 bp | Scaffold N50 | 25.15 Mb |
| Mapping rate | 99.41% | | | | |
Data* indicates clean data remaining after filtering. Depth/genome coverage is the sequencing depth. MaxLen is the length of the longest read. SeqNum is the total number of reads. N50Len is the N50 length of the reads. N90Len is the N90 length of the reads.
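The N50Len and N90Len statistics defined above can be illustrated with a minimal sketch. The function name `nx_length` and the toy read lengths are hypothetical and chosen only for illustration; they are not taken from the study's pipeline.

```python
def nx_length(lengths, percent=50):
    """Return the Nx length: the length L of the sequence at which the
    cumulative sum of lengths, taken from longest to shortest, first
    reaches `percent` percent of the total bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point edge cases.
        if running * 100 >= percent * total:
            return length


# Hypothetical read lengths (bp), for illustration only.
toy_reads = [2, 3, 4, 5, 7]
n50 = nx_length(toy_reads, 50)
n90 = nx_length(toy_reads, 90)
print(n50, n90)
```

Because N90 requires covering more of the total bases, it is never larger than N50 for the same set of sequences, which matches the ordering of the values in the table (19,177 bp versus 11,029 bp).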
Nanopore and Hi-C genome assembly statistics of Artocarpus nanchuanensis
| Nanopore assembly results | | Hi-C assembly results | |
|---|---|---|---|
| Contig number | 1,087 | Scaffold/contig number | 809/1,364 |
| Contig length | 769,440,982 bp | Scaffold/contig length (bp) | 769,496,482/769,440,982 |
| Contig N50 | 2,094,024 bp | Scaffold/contig N50 (bp) | 25,150,906/1,778,064 |
| Contig N90 | 402,757 bp | Scaffold/contig N90 (bp) | 20,179,149/200,000 |
| Contig max | 8,879,419 bp | Scaffold/contig max (bp) | 32,505,427/8,646,128 |
| Gap total length (bp) | 55,500 | | |
| GC content (%) | 32.34 | | |
Contig refers to contigs after error correction. Scaffold refers to scaffolds generated by joining contigs; only scaffolds longer than 1 kb are counted. Scaffold/contig number gives the number of scaffolds and the number of contigs within them. Scaffold/contig length gives the total length of the scaffolds and of the contigs within them. Scaffold/contig N50 gives the scaffold N50 and contig N50 lengths. Scaffold/contig N90 gives the scaffold N90 and contig N90 lengths. Scaffold/contig max gives the lengths of the longest scaffold and the longest contig. GC content is the GC percentage of the assembly.
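The Hi-C columns of this table can be cross-checked with simple arithmetic. The sketch below assumes that each join between two contigs adds one gap of 100 Ns, as suggested by the note under the Hi-C assembly statistics table; the variable names are illustrative only.

```python
# Values copied from the Hi-C assembly results columns.
scaffold_count, contig_count = 809, 1_364
scaffold_bp, contig_bp = 769_496_482, 769_440_982
reported_gap_bp = 55_500

# Assumption: every contig-to-contig join inserts a fixed run of 100 Ns.
ns_per_join = 100

# Joining N contigs into S scaffolds needs N - S joins.
joins = contig_count - scaffold_count
gap_bp_from_lengths = scaffold_bp - contig_bp
gap_bp_from_joins = joins * ns_per_join

print(joins, gap_bp_from_lengths, gap_bp_from_joins)
```

Under that assumption, both routes give the same gap length as the reported "Gap total length" row.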
Genome assembly quality comparison of A. nanchuanensis and its related Moraceae plants
| | | | A. nanchuanensis |
|---|---|---|---|
| Sequencing technology | Illumina HiSeq 2000 | Illumina, PacBio RS II, Hi-C | Illumina, Nanopore, Hi-C |
| Sequencing depth | 236.82× (78.34 Gb, Illumina) | 86.55× (36.87 Gb, PacBio) | 160.34× (123.38 Gb, Nanopore) |
| Contig/scaffold N50 | 34,476 bp/390,115 bp | 907,868 bp/none | 2.09 Mb/25.15 Mb |
| Contig/scaffold N90 | 2,231 bp/11,563 bp | 113,961 bp/none | 402.76 kb/20.18 Mb |
| Annotated genes | 29,338 | 29,416 | 41,636 |
| Repeat composition | 127.98 Mb | 198.23 Mb | 422.78 Mb |
| Unique k-mer in genome | 2,198,905 | 5,732,202 | 26,038,922 |
| k-mer in genome and reads | 303,168,905 | 425,981,208 | 769,420,329 |
| QV | 34.6019 | 31.9049 | 27.6449 |
| Error rate | 0.000346583 | 0.000644926 | 0.00171993 |
| Solid k-mer in genome | 210,228,161 | 250,755,719 | 520,815,448 |
| Total solid k-mer in reads | 220,698,486 | 317,069,158 | 756,078,105 |
| Complete (%) | 95.2558 | 79.0855 | 68.8838 |
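The QV, error rate, and completeness rows appear to follow the usual k-mer-based conventions: a Phred-scaled quality value, QV = -10 · log10(error rate), and completeness as the share of solid read k-mers that are also found in the assembly. The source does not state these formulas, so the sketch below is a consistency check under that assumption; the function names and the column keys are hypothetical.

```python
import math

# Values copied from the table, one entry per species column in table order.
columns = {
    "column_1": dict(qv=34.6019, error=0.000346583,
                     solid_genome=210_228_161, solid_reads=220_698_486,
                     complete=95.2558),
    "column_2": dict(qv=31.9049, error=0.000644926,
                     solid_genome=250_755_719, solid_reads=317_069_158,
                     complete=79.0855),
    "column_3": dict(qv=27.6449, error=0.00171993,
                     solid_genome=520_815_448, solid_reads=756_078_105,
                     complete=68.8838),
}


def phred_qv(error_rate):
    """Phred-scaled quality value for a per-base error rate (assumed formula)."""
    return -10 * math.log10(error_rate)


def completeness_pct(solid_in_genome, solid_in_reads):
    """Percentage of solid read k-mers recovered in the assembly (assumed formula)."""
    return 100 * solid_in_genome / solid_in_reads


for name, c in columns.items():
    print(name,
          round(phred_qv(c["error"]), 4),
          round(completeness_pct(c["solid_genome"], c["solid_reads"]), 4))
```

With these definitions, the recomputed values agree with the reported QV and Complete (%) rows to within rounding.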
Figure 3: The analysis of Hi-C library construction and heatmap.
The Hi-C assembly statistics table of A. nanchuanensis
| Group | Cluster number | Cluster length (bp) | Order number | Order length (bp) |
|---|---|---|---|---|
| LG01 | 46 | 26,514,107 | 24 | 24,676,255 |
| LG02 | 41 | 26,638,661 | 16 | 24,489,134 |
| LG03 | 30 | 24,254,703 | 16 | 23,044,270 |
| LG04 | 34 | 22,404,888 | 13 | 20,644,200 |
| LG05 | 33 | 21,646,681 | 16 | 20,177,649 |
| LG06 | 35 | 29,133,579 | 18 | 27,822,153 |
| LG07 | 69 | 32,924,820 | 27 | 29,467,719 |
| LG08 | 45 | 29,858,101 | 20 | 27,605,363 |
| LG09 | 77 | 29,556,483 | 29 | 25,185,028 |
| LG10 | 45 | 22,896,788 | 20 | 20,243,522 |
| LG11 | 67 | 25,833,105 | 20 | 21,750,724 |
| LG12 | 37 | 24,385,337 | 15 | 22,370,729 |
| LG13 | 47 | 23,481,896 | 24 | 21,098,278 |
| LG14 | 46 | 29,162,015 | 19 | 26,857,340 |
| LG15 | 61 | 28,431,484 | 30 | 25,341,045 |
| LG16 | 32 | 21,965,556 | 16 | 20,879,538 |
| LG17 | 41 | 25,915,114 | 19 | 24,032,910 |
| LG18 | 49 | 34,941,454 | 27 | 32,502,827 |
| LG19 | 54 | 29,520,137 | 21 | 25,685,935 |
| LG20 | 50 | 32,513,478 | 18 | 29,815,261 |
| LG21 | 50 | 28,639,915 | 21 | 25,613,043 |
| LG22 | 42 | 27,392,871 | 24 | 25,873,084 |
| LG23 | 42 | 28,655,389 | 16 | 26,447,344 |
| LG24 | 52 | 27,753,222 | 24 | 25,148,606 |
| LG25 | 46 | 23,720,417 | 16 | 21,152,151 |
| LG26 | 63 | 33,995,937 | 28 | 30,220,329 |
| LG27 | 58 | 28,458,315 | 24 | 25,577,647 |
| LG28 | 44 | 25,907,258 | 22 | 23,985,053 |
| Total (ratio %) | 1336 (97.95%) | 766,501,711 (99.62%) | 583 (43.64%) | 697,707,137 (91.02%) |
The statistics do not include 100 Ns added by artificially connected pseudochromosomes.
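The totals and ratios in the last row can be reproduced from the per-group counts and the assembly totals. The sketch below assumes the denominators are the 1,364 contigs and 769,440,982 bp reported in the Hi-C assembly results, and that the "Order" ratios are taken relative to the clustered totals; these denominators are inferred from the numbers rather than stated in the source.

```python
# Per-group contig counts copied from the table (LG01 to LG28).
cluster_counts = [46, 41, 30, 34, 33, 35, 69, 45, 77, 45, 67, 37, 47, 46,
                  61, 32, 41, 49, 54, 50, 50, 42, 42, 52, 46, 63, 58, 44]
order_counts = [24, 16, 16, 13, 16, 18, 27, 20, 29, 20, 20, 15, 24, 19,
                30, 16, 19, 27, 21, 18, 21, 24, 16, 24, 16, 28, 24, 22]

# Totals copied from the tables; the choice of denominators is an assumption.
total_contigs = 1_364
assembly_bp = 769_440_982
cluster_bp_total = 766_501_711
order_bp_total = 697_707_137


def pct(part, whole):
    """Percentage of `part` in `whole`, rounded to two decimals."""
    return round(100 * part / whole, 2)


clustered = sum(cluster_counts)
ordered = sum(order_counts)

print(clustered, pct(clustered, total_contigs))   # contigs assigned to groups
print(pct(cluster_bp_total, assembly_bp))         # bases assigned to groups
print(ordered, pct(ordered, clustered))           # contigs ordered and oriented
print(pct(order_bp_total, cluster_bp_total))      # bases ordered and oriented
```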
The prediction analysis of A. nanchuanensis coding gene
| Prediction style and proportion | | | |
|---|---|---|---|
| Gene number | 41,636 | Coding DNA sequence (CDS) length | 50,445,441 bp |
| Gene length | 158,114,419 bp | CDS average length | 1,211.58 bp |
| Gene average length | 3,797.54 bp | CDS number | 226,727 |
| Exon length | 62,835,343 bp | CDS average number | 5.45 |
| Exon average length | 1,509.16 bp | Intron length | 95,279,076 bp |
| Exon number | 233,559 | Intron average length | 2,288.38 bp |
| Exon average number | 5.61 | Intron number | 191,923 |
| Intron average number | 4.61 | ||
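The entries in this table are related by simple arithmetic, which the following sketch checks. It assumes that every "average" is a per-gene average (total divided by gene number), which is what the numbers suggest even though the table does not say so; the dictionary keys are illustrative.

```python
# Totals copied from the table.
gene_number = 41_636
totals = {
    "gene_length": 158_114_419,
    "exon_length": 62_835_343,
    "exon_number": 233_559,
    "intron_length": 95_279_076,
    "intron_number": 191_923,
    "cds_length": 50_445_441,
    "cds_number": 226_727,
}

# Reported averages copied from the table.
reported_avg = {
    "gene_length": 3_797.54,
    "exon_length": 1_509.16,
    "exon_number": 5.61,
    "intron_length": 2_288.38,
    "intron_number": 4.61,
    "cds_length": 1_211.58,
    "cds_number": 5.45,
}

# Assumption: each average is the total divided by the number of genes.
per_gene = {key: value / gene_number for key, value in totals.items()}

# A gene with n exons has n - 1 introns, so intron count = exons - genes,
# and intron bases = gene bases - exon bases.
intron_number_check = totals["exon_number"] - gene_number
intron_length_check = totals["gene_length"] - totals["exon_length"]

for key in totals:
    print(key, round(per_gene[key], 2), reported_avg[key])
```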
Figure 4: The Nr homologous species distribution of A. nanchuanensis.
Statistical classification of gene families
| Name | Total gene | Cluster | Total family | Unifamily |
|---|---|---|---|---|
| | 27,369 | 23,106 | 12,753 | 726 |
| | 16,986 | 15,058 | 11,147 | 254 |
| | 41,335 | 33,270 | 14,725 | 950 |
| | 39,040 | 25,888 | 12,648 | 1,327 |
| | 26,346 | 19,238 | 12,682 | 665 |
| | 26,965 | 20,423 | 14,794 | 524 |
| | 21,432 | 20,070 | 13,810 | 176 |
| A. nanchuanensis | 41,636 | 33,925 | 15,436 | 512 |
Total gene: the total number of genes. Cluster: the number of genes assigned to gene families. Total family: the number of gene families identified. Unifamily: the number of unique gene families.
Figure 5: The phylogenetic and gene family analysis of A. nanchuanensis and related species.
The annotation of protein gene family
| Gene family | Pfam | Function |
|---|---|---|
| GF_12673 | PF00646.28 | F-box domain |
| GF_10548 | PF00031.16 | Cystatin domain |
| GF_8 | PF00069.20 | Protein kinase domain |
| GF_13176 | PF13639.1 | Ring finger domain |
Gene family refers to the gene family cluster. Pfam refers to the ID of protein family alignment to the Pfam database. Function indicates the function of the protein family that can be aligned.
The rapidly evolving genes selected by CodeML
| GeneID | | Sites |
|---|---|---|
| EVM0035972.1 | 0.05 | 298,G,0.993** |
| EVM0031735.1 | 0.06 | 74,E,0.984* |
| EVM0026117.1 | 0.35 | 68,K,0.997** |
| EVM0015119.1 | 0.00 | 232,E,0.990** |
GeneID is the gene identifier. ω0 is the Ka/Ks for the studied species, ω1 is the average Ka/Ks for the other species, and ω2 is the Ka/Ks for the whole evolutionary tree. * indicates a posterior probability ≥0.95; ** indicates a posterior probability ≥0.99.
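The Sites column packs three fields into one string: position, amino acid, and posterior probability with its significance stars. As a minimal sketch of reading that notation, the hypothetical helpers below split each entry and confirm that the stars match the thresholds given in the note (≥0.95 for one star, ≥0.99 for two). The field order is inferred from the table, not documented in the source.

```python
# Site strings copied from the table, keyed by gene ID.
sites = {
    "EVM0035972.1": "298,G,0.993**",
    "EVM0031735.1": "74,E,0.984*",
    "EVM0026117.1": "68,K,0.997**",
    "EVM0015119.1": "232,E,0.990**",
}


def parse_site(entry):
    """Split 'position,residue,probability[*|**]' into typed fields."""
    position, residue, rest = entry.split(",")
    probability_text = rest.rstrip("*")
    stars = len(rest) - len(probability_text)
    return int(position), residue, float(probability_text), stars


def expected_stars(probability):
    """Number of stars implied by the thresholds in the table note."""
    if probability >= 0.99:
        return 2
    if probability >= 0.95:
        return 1
    return 0


for gene_id, entry in sites.items():
    position, residue, probability, stars = parse_site(entry)
    print(gene_id, position, residue, probability,
          stars == expected_stars(probability))
```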
Figure 6: The 4DTv distribution and LTR insertion time analysis among A. nanchuanensis and other related species.