Jing Fan1, Yan Chen2, MingHua Luo3, Zi Liang1, Xiang Nong1. 1. College of Life Sciences, Leshan Normal University, Leshan, China. 2. Ecological Security and Protection Key Laboratory of Sichuan Province, Mianyang Normal University, Mianyang, China. 3. School of Life Science and Biotechnology, Mianyang Normal University, Mianyang, China.
Abstract
Celtis is a large genus in the Cannabaceae family, with more than 70 species worldwide. However, intraspecific variability in morphological features makes some species difficult to distinguish on morphological characteristics alone. To supply chloroplast (cp) genome resources for species identification in Celtis, the plastome of Celtis sinensis Persoon 1805 was newly sequenced and subjected to comparative genomic analysis. The chloroplast genome was 159,085 bp in length and had a quadripartite structure consisting of two inverted repeats (IRs) separated by a small single copy (SSC) and a large single copy (LSC) region. A total of 133 genes were annotated, including 88 protein-coding genes, eight rRNA genes, and 37 tRNA genes. Among the protein-coding genes, leucine codons were the most frequent and cysteine codons the least frequent. Comparative genomic analysis showed that the IR regions were more conserved than the LSC and SSC regions, with most sequence variation located in the intergenic spacers rather than in the protein-coding regions. Moreover, sixteen highly divergent hotspots were identified. The ML phylogenetic tree showed that all included Celtis species clustered together, and the plastome reported in this paper has sufficient resolution to distinguish C. sinensis (Pers.) from other Celtis plants. This study provides useful genetic resources for the identification of C. sinensis (Pers.) and will support future phylogenetic studies of Celtis plants.
Celtis is a genus of Cannabaceae plants with about 70 species, mainly distributed in temperate and tropical regions (Hwang et al. 2003). These plants are rich in total fibers, proteins, vitamins, minerals, and phenols, and are commonly used as a source of industrial wood and for the extraction of medicinal substances with antioxidant and antibacterial properties (Ota et al. 2017; Shokrzadeh et al. 2018; Temiz et al. 2021). Although Celtis species have similar traditional medical functions, their phytochemical composition and efficacy vary by species (El-Alfy et al. 2011; Yıldırım et al. 2017). Therefore, accurate species identification can facilitate the effective management and development of Celtis plants. However, some species in the genus are difficult to identify because their morphological characteristics are highly variable. For example, leaves from individuals in different states may differ in shape even when the individuals share the same genotype, which makes identification very complicated (Whittemore 2008). With the development of DNA sequencing technologies, phylogenies generated from a single gene or a combination of several genes are being replaced by phylogenetic relationships constructed from whole genomes. The cp genome is more conserved in gene structure and composition than the mitochondrial and nuclear genomes and is therefore commonly used in population genetics and phylogenetic studies (Fan et al. 2020). Although several cp genomes of Celtis have been reported (Wang et al. 2019a, 2019b; Liu et al. 2021), the shortage of cp genome resources for Celtis still hinders our understanding of species identification and phylogeny in the genus.
Celtis sinensis Persoon 1805 is commonly known as Chinese hackberry. Active ingredients extracted from its leaves and bark have been shown to be effective in treating gastrointestinal diseases and lung abscesses (Wei and Guo 2020). It is also often used as a folk medicine to treat abdominal pain, urticaria, and eczema (Kim et al. 2005).
Here, the complete cp genome of C. sinensis (Pers.) was newly assembled and characterized. In addition, a plastome comparative analysis of C. sinensis (Pers.) and its related species was performed. This study provides important information for the molecular identification, phylogeny, and development of genetic markers of C. sinensis (Pers.).
Materials and methods
Plant materials, DNA extraction, and library construction
Fresh leaves of C. sinensis (Pers.) were collected from Chengdu City, Sichuan Province, China (103°59′2″E, 30°45′52″N). The few leaf samples used in this study all came from cultivated material. Permission was not required for sample collection because this tree is a common species that is not nationally protected and is abundantly available in the Chengdu region of China. The samples were collected without harming the trees or their habitat. The voucher specimen was deposited in the plant molecular laboratory of Leshan Normal University (http://www.lsnu.edu.cn/, contact person: Jing Fan, email: fanjing972001@126.com) under the voucher number CS201911. DNA was extracted by the CTAB method, and an Illumina library was then constructed from total DNA and sequenced on the Illumina NovaSeq 6000 platform.
Chloroplast genome assembly and annotation
The raw reads were quality-controlled with NGSQC Toolkit v2.3.3 (Patel and Jain 2012). Subsequently, a total of 19,862,546 high-quality PE150 clean reads were assembled into contigs with SPAdes 3.11.0, using a k-mer set of 93, 105, 117, and 121 (Bankevich et al. 2012). The genome was annotated with Plann and GeSeq (Huang and Cronk 2015; Tillich et al. 2017) and then submitted to the GenBank database. The annotation file in GenBank format was submitted to OGDRAW to draw the organelle genome map (Lohse et al. 2007).
Codon usage and comparative plastome analysis
Codon usage and amino acid frequency were analyzed using CodonW 1.4.2 (Peden 1999). Expansion and contraction of the IR regions were visualized using IRscope (Amiryousefi et al. 2018). The cp genomes of five Celtis species were compared in Shuffle-LAGAN mode and visualized with mVISTA (Frazer et al. 2004). Nucleotide diversity (Pi) was calculated with DnaSP V6.0, with a window length of 600 bp and a step size of 200 bp (Rozas et al. 2017).
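The sliding-window calculation can be illustrated with a minimal Python sketch. It is not the DnaSP implementation: it assumes the sequences are already aligned, computes Pi as the mean number of pairwise differences per site, and simply skips alignment columns that contain gaps or ambiguous bases. The function names `pi_per_site` and `sliding_pi` are hypothetical and used only for illustration.

```python
from itertools import combinations


def pi_per_site(seqs):
    """Mean pairwise differences per site for aligned sequences.

    Simplifying assumption: columns with gaps or ambiguous bases are
    skipped entirely (DnaSP's own gap handling may differ).
    """
    valid = set("ACGT")
    cols = [col for col in zip(*seqs) if all(base in valid for base in col)]
    if len(seqs) < 2 or not cols:
        return 0.0
    n_pairs = len(seqs) * (len(seqs) - 1) // 2
    diffs = sum(a != b for col in cols for a, b in combinations(col, 2))
    return diffs / (n_pairs * len(cols))


def sliding_pi(seqs, window=600, step=200):
    """Return (start, end, Pi) for each window along the alignment."""
    seqs = [s.upper() for s in seqs]
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be aligned to the same length")
    results = []
    for start in range(0, max(length - window, 0) + 1, step):
        end = min(start + window, length)
        chunk = [s[start:end] for s in seqs]
        results.append((start, end, pi_per_site(chunk)))
    return results


# Toy alignment with tiny windows so the result is easy to inspect.
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAAAA"]
toy_windows = sliding_pi(toy, window=5, step=5)
```

With the defaults (`window=600`, `step=200`) the function mirrors the window length and step size stated above; windows whose Pi exceeds a chosen threshold can then be flagged as candidate hotspots.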
Phylogenetic analysis
To determine the phylogenetic position of C. sinensis (Pers.), its cp genome was aligned with 46 plastomes from GenBank using the automatic alignment mode in MAFFT 7.037 (Katoh and Standley 2013). The aligned sequences were trimmed with Gblocks 0.91b, with the allowed gap positions parameter set to all (Castresana 2000). Subsequently, the ML phylogenetic tree was constructed in MEGA X with the following parameters: general time-reversible (GTR) model, gamma-distributed rates with invariant sites (G + I), partial deletion of gaps/missing data, and 1,000 bootstrap replicates (Kumar et al. 2018).
Results
Plastome features of C. sinensis Persoon
The cp genome of C. sinensis (Pers.) was a 159,085 bp circular quadripartite structure with a GC content of 36.32% (GenBank accession number MN877379), in which two inverted repeats (IRA and IRB, 26,894 bp each) were separated by a large single copy (LSC, 86,137 bp) and a small single copy (SSC, 19,160 bp) region. A total of 133 genes were annotated, including 88 protein-coding genes, 37 tRNA genes, and eight rRNA genes. Of these, eight protein-coding genes (ndhB, rpl2, rpl22, rpl23, rps7, rps19, ycf1, ycf2), four rRNA genes (rrn23, rrn16, rrn5, rrn4.5), and seven tRNA genes (trnA-UGC, trnI-CAU, trnI-GAU, trnL-CAA, trnN-GUU, trnR-ACG, trnV-GAC) were duplicated in the IR regions. The rps12 gene was trans-spliced.
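As a quick consistency check on the reported region lengths, the short Python sketch below adds the LSC, the SSC, and the two IR copies and compares the sum with the stated genome size. The variable names are illustrative only; the numbers are taken from the paragraph above.

```python
# Region lengths (bp) as reported for the C. sinensis plastome.
lsc = 86_137   # large single copy
ssc = 19_160   # small single copy
ir = 26_894    # each inverted repeat (IRA and IRB)

# A quadripartite plastome is LSC + IRB + SSC + IRA.
total = lsc + ssc + 2 * ir
ir_fraction = 2 * ir / total  # share of the genome occupied by the two IRs

print(f"total length: {total} bp")            # matches the reported 159,085 bp
print(f"IR share of genome: {ir_fraction:.1%}")
```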
Codon usage analysis
To determine the codon usage patterns in the plastome of C. sinensis (Pers.), the relative synonymous codon usage (RSCU) was calculated from the nucleotide sequences of the protein-coding genes. An analysis of 26,761 codons from 85 genes starting with ATG showed that leucine (Leu) codons were used most frequently, accounting for 10.68% of total codon usage. The synonymous codons of leucine are TTA, TTG, CTT, CTC, CTA, and CTG, among which TTA (RSCU = 1.92) was used more frequently than CTC (RSCU = 0.4). In contrast, cysteine codons were the least frequently used in the genome, accounting for only 1.17% of the total. All amino acids except methionine (Met) and tryptophan (Trp) had two or more codons. These results suggest that the cp genome of C. sinensis (Pers.) has a certain codon preference (Figure 1).
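For readers unfamiliar with RSCU, the following Python sketch shows the usual definition: the observed count of a codon divided by the count expected if all synonymous codons for that amino acid were used equally. It is an illustration, not the CodonW implementation used in this study. It assumes the standard codon-to-amino-acid assignments and excludes stop codons; the function name `rscu` and the toy sequence are hypothetical.

```python
from collections import Counter, defaultdict

# Standard codon assignments in TCAG order (assumed here for illustration).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def rscu(cds_list):
    """Return {codon: RSCU} for a list of in-frame coding sequences."""
    counts = Counter()
    for cds in cds_list:
        cds = cds.upper()
        for i in range(0, len(cds) - len(cds) % 3, 3):
            codon = cds[i:i + 3]
            # Ignore stop codons and codons containing ambiguous bases.
            if CODON_TABLE.get(codon, "*") != "*":
                counts[codon] += 1

    # Group synonymous codons by the amino acid they encode.
    families = defaultdict(list)
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            families[aa].append(codon)

    values = {}
    for codons in families.values():
        total = sum(counts[c] for c in codons)
        if total == 0:
            continue  # amino acid absent from the input
        expected = total / len(codons)  # count under equal usage
        for c in codons:
            values[c] = counts[c] / expected
    return values


# Toy CDS: ATG TTA TTA TTG CTC (one Met codon, four Leu codons).
toy_rscu = rscu(["ATGTTATTATTGCTC"])
```

An RSCU above 1 means the codon is used more often than expected under equal usage, and a value below 1 means it is used less often. Amino acids with a single codon, such as Met and Trp, always have an RSCU of 1.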
Figure 1.
Codon usages in the plastome of Celtis sinensis Persoon. Note: symbols on the abscissa represent 20 amino acids, RSCU values on the ordinate indicate the frequency of codon usage, stacked bars of different colors above the ordinate correspond to codons of the same color below the ordinate.
Contraction and expansion of IRs
To identify contraction and expansion of the IR regions, the cp genome of C. sinensis (Pers.) was compared with four previously reported Celtis plastomes. In all five Celtis plastomes, the LSC-IRb junction was located within the rps3 gene and the IRb-SSC junction within the ndhF gene, so that the two ends of the IRb region extended into rps3 and ndhF by 124–139 bp and 28 bp, respectively. The ycf1 gene spanned the SSC/IRa boundary, with the IRa region extending 1,093 bp into ycf1. The trnH gene was located near the IRa-LSC boundary. In general, the genes at the IR/SC junctions were relatively conserved and shared the same gene order, especially in C. sinensis (Pers.), C. tetrandra, and C. julianae. Comparative genome analysis showed that the five plastomes ranged in size from 159,001 bp (C. biondii) to 159,092 bp (C. sinensis isolate MS), with the length differences caused by single copy (SC) regions of different lengths (Figure 2).
Figure 2.
Comparison of the IR/SC boundaries among five Celtis plastomes.
Comparative plastome analysis of Celtis plants
To compare plastome divergence among Celtis species, a global comparison was performed with the mVISTA program, using the cp genome of C. sinensis (Pers.) as the reference. The results showed that most sequence variation was located in the LSC and SSC regions, indicating that the IR regions were more conserved than the single-copy regions. Further analysis found that the non-coding sequences held most of the variable sites (Figure 3). To identify highly divergent hotspots, nucleotide diversity (Pi) values were calculated with DnaSP V6.0. Sixteen divergent hotspots with Pi values ≥0.006 were discovered in the cp genomes of Celtis, including trnH-psbA, psbA-matK, rps16, rps16-trnQ, trnG-atpA, trnC-petN, trnE-trnT-psbD, psbC-psbZ, rps4-trnL, clpP, rps8-rpl16, rpl16-rps3, ndhF, ndhF-rpl32, rpl32-trnL and ycf1 (Figure 4).
Figure 3.
Sequence comparison of the cp genomes of five Celtis plants. Note: the annotation of Celtis sinensis Persoon was selected as the reference, gray arrows indicate the direction and location of each gene, colored regions represent exons, UTR and conservative non-coding regions, vertical scales display the identity ranging from 50 to 100%.
Figure 4.
Sliding window test of the cp genomes of five Celtis species. Note: Genes in the divergent hotspots where nucleotide diversity (Pi) exceeds the threshold of 0.006 are highlighted in red.
To understand the phylogenetic relationships of C. sinensis (Pers.), a maximum likelihood (ML) phylogenetic tree was constructed from the cp genomes of C. sinensis (Pers.) and 46 other species. The ML phylogenetic tree showed that the cp genome reported in this paper can distinguish C. sinensis (Pers.) from other plants at the genomic level. All plants were divided into three clades. The Celtis plants clustered together, and C. sinensis (Pers.), C. tetrandra, and C. julianae showed a closer relationship (Figure 5).
Figure 5.
The ML phylogenetic tree constructed from 47 chloroplast genomes. Note: Numbers near each node represent bootstrap percentage obtained by 1,000 bootstrap analyses, the symbols (I)–(III) indicate that the phylogenetic tree evolved toward three branches, the GenBank accession numbers are shown in parentheses.
Discussion
The genus Celtis is a large group in the Cannabaceae family (Spitaler et al. 2009). However, the chloroplast genome sequences of many Celtis members have not yet been reported, and this shortage of sequence resources hinders a comprehensive understanding of the phylogeny and species identification of Celtis plants. In this study, a new cp genome sequence was reported, and its gene structure was characterized and compared with those of related plastomes. Most angiosperms have a quadripartite plastome ranging in size from 115 to 165 kb (Cheng et al. 2020). The cp genome of C. sinensis (Pers.) reported in this paper is 159,085 bp in length and has a quadripartite structure with 133 genes. RSCU is a measure of the preference among synonymous codons in a species (Somaratne et al. 2019). The codon usage frequency of C. sinensis (Pers.) was calculated using the protein-coding genes with ATG as the start codon. The analysis showed that all amino acids except methionine and tryptophan were encoded by synonymous codons, among which leucine (Leu) codons had the highest frequency and cysteine (Cys) codons the lowest. Codon bias (RSCU > 1) was found for 30 codons in C. sinensis (Pers.), and most of them ended in A/T (Figure 1). A comparison of Celtis plastomes revealed that the gene arrangement at the IR/SC junctions was relatively conserved. This can be explained by the high conservation of the cp genome, which leads to similar gene distribution at the SC/IR junctions (Zhou et al. 2020). In addition, there were some differences in the location of genes, indicating slight expansion and contraction of the IR regions (Figure 2). Some DNA fragments may also show low similarity between closely related individuals, making it necessary to screen for loci with high mutation frequencies (Guo et al. 2020).
We found that variation in the LSC and SSC regions was greater than in the IR regions, and that variation in the intergenic spacers was more pronounced than in the protein-coding genes (Figure 3). A total of 16 hypervariable regions suitable for development as molecular markers were detected in this study (Figure 4). This is consistent with other reports that highly divergent regions of cp genomes are usually located in the intergenic regions (Zhou et al. 2020). Most of the protein-coding genes in the cp genome are quite conserved, but some protein-coding genes may be exceptions (Cheng et al. 2020). In our study, four protein-coding genes (rps16, clpP, ndhF, and ycf1) showed higher divergence among Celtis plants (Figure 4). Phylogenetic analysis indicated that C. sinensis (Pers.) was closely related to C. tetrandra and C. julianae, and that all five Celtis species clustered together (Figure 5). This paper provides new cp genome resources for C. sinensis (Pers.) and identifies potential molecular markers. This work will be valuable for future molecular identification, phylogenetic, and population genetic research on C. sinensis (Pers.).
Authors: Julio Rozas; Albert Ferrer-Mata; Juan Carlos Sánchez-DelBarrio; Sara Guirao-Rico; Pablo Librado; Sebastián E Ramos-Onsins; Alejandro Sánchez-Gracia Journal: Mol Biol Evol Date: 2017-12-01 Impact factor: 16.240
Authors: Kelly A Frazer; Lior Pachter; Alexander Poliakov; Edward M Rubin; Inna Dubchak Journal: Nucleic Acids Res Date: 2004-07-01 Impact factor: 16.971
Authors: Michael Tillich; Pascal Lehwark; Tommaso Pellizzer; Elena S Ulbricht-Jones; Axel Fischer; Ralph Bock; Stephan Greiner Journal: Nucleic Acids Res Date: 2017-07-03 Impact factor: 16.971