
The Complete Chloroplast Genome of Chinese Bayberry (Morella rubra, Myricaceae): Implications for Understanding the Evolution of Fagales.

Lu-Xian Liu¹,², Rui Li³, James R P Worth⁴, Xian Li⁵, Pan Li², Kenneth M Cameron⁶, Cheng-Xin Fu².

Abstract

Morella rubra (Myricaceae), also known as Chinese bayberry, is an economically important, subtropical, evergreen fruit tree. The phylogenetic placement of Myricaceae within Fagales and the origin of Chinese bayberry's domestication are still unresolved. In this study, we report the chloroplast (cp) genome of M. rubra and take advantage of several previously reported chloroplast genomes from related taxa to examine patterns of evolution in Fagales. The cp genomes of three M. rubra individuals were 159,478, 159,568, and 159,586 bp in length, respectively, comprising a pair of inverted repeat (IR) regions (26,014-26,069 bp) separated by a large single-copy (LSC) region (88,683-88,809 bp) and a small single-copy (SSC) region (18,676-18,767 bp). Each cp genome encodes the same 111 unique genes, consisting of 77 different protein-coding genes, 30 transfer RNA genes and four ribosomal RNA genes, with 18 duplicated in the IRs. Comparative analysis of chloroplast genomes from four representative Fagales families revealed the loss of infA and the pseudogenization of ycf15 in all analyzed species; rpl22 has been pseudogenized in M. rubra and Castanea mollissima, but not in Juglans regia or Ostrya rehderiana. The observed genome size variation is mainly due to differences in the length of intergenic spacers rather than to gene loss, gene pseudogenization, or IR expansion or contraction. The phylogenetic relationships yielded by the complete genome sequences strongly support the placement of Myricaceae as sister to Juglandaceae. Furthermore, seven cpDNA markers (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) with relatively high levels of variation and variable cpSSR loci were identified within M. rubra, which will be useful in future research characterizing the population genetics of M. rubra and investigating the origin of domesticated Chinese bayberry.


Keywords: Fagales; Morella rubra; chloroplast genome; genomic structure; phylogenomics

Year: 2017 PMID: 28713393 PMCID: PMC5492642 DOI: 10.3389/fpls.2017.00968

Source DB: PubMed Journal: Front Plant Sci ISSN: 1664-462X Impact factor: 5.753

Introduction

Chloroplasts (cp) are essential organelles in plant cells: they carry out photosynthesis and perform other functions, including the synthesis of starch, fatty acids, pigments and amino acids (Neuhaus and Emes, 2000). Typically, the size and gene arrangement of chloroplast genomes in angiosperms are highly conserved; the genome usually has a circular structure ranging from 120 to 160 kb, with two copies of an inverted repeat (IR) region separated by a large single-copy (LSC) region and a small single-copy (SSC) region (Palmer, 1991; Raubeson and Jansen, 2005). Chloroplast genomes generally contain 110–130 distinct genes in a highly conserved order. The majority of these genes (∼79) encode proteins that are mostly involved in photosynthesis, whereas the rest encode approximately 30 transfer RNAs (tRNAs) and four ribosomal RNAs (rRNAs) (Jansen et al., 2005). Compared with nuclear and mitochondrial genomes, chloroplast genomes are largely conserved in terms of gene content, organization and structure (Raubeson and Jansen, 2005), and the nucleotide substitution rate of chloroplast genes is higher than that of mitochondrial genes, but lower than that of nuclear genes (Wolfe et al., 1987; Drouin et al., 2008). However, evolutionary events such as mutations, duplications, losses and rearrangements of genes have been reported in a number of studies (Lee et al., 2007; Dong et al., 2013; Choi et al., 2016). Due to its relatively small size, simple structure and conserved gene content, the chloroplast genome has been used as an ideal research model for evolutionary and comparative genomic studies (Dong et al., 2013). In recent years, comparative studies of chloroplast genomes have been applied to a number of focal species (Young et al., 2011), genera (Greiner et al., 2008a,b), or plant families (Daniell et al., 2006).
At higher taxonomic levels, comparative analyses of chloroplast genomes are useful for phylogenetic studies (Moore et al., 2007; Moore et al., 2010), as well as for understanding genome evolution in terms of genome size variation, gene and intron losses, and nucleotide substitutions. Moreover, chloroplasts have their own independent genome encoding an array of specific proteins, and its non-recombinant, uniparental inheritance makes it a particularly useful tool in genomics and evolutionary research (Cho et al., 2015). Single nucleotide polymorphisms (SNPs) and indels, resulting from translocations, inversions, copy number variation of tandem repeats and rearrangements, are suitable for phylogeny reconstruction (De Las Rivas et al., 2002), DNA barcoding (Hollingsworth et al., 2011), and investigating the geographic origin of some important domesticated crops (Arroyo-Garcia et al., 2006; Londo et al., 2006; Delplancke et al., 2013). In this study, we analyzed the chloroplast genome of Morella rubra Lour. (Myricaceae), also known as Chinese bayberry, which is one of the most popular and valuable fruits in eastern China because of its appealing color, texture, delicious taste and nutritional value (Cheng et al., 2015). Within the family Myricaceae, M. rubra is the only species to have been domesticated as a fruit crop (Lu and Bornstein, 1999). Due to its long cultivation history (>2000 years) in China, as many as 305 accessions have been recorded, of which 268 have been named as cultivars (Zhang and Miao, 1999; Zhang et al., 2009). Wild populations of M. rubra, which are important germplasm resources for Chinese bayberry breeding, are distributed in the subtropical evergreen forests of China, Japan, South Korea and the Philippines. Despite the economic importance of Chinese bayberry, its population genetics and domestication origin are still unclear.
In fact, even the phylogenetic placement of Morella within Myricaceae, and of the family within the order Fagales, remains ambiguous. Fagales is one of the most economically and ecologically important flowering plant orders, since it contains a number of domesticated nut and timber species, as well as dominant forest tree species (e.g., chestnut, walnut, hickory, oak, southern beech, birch). Before 1990, Fagales was generally considered to contain only two families: Betulaceae and Fagaceae (Takhtajan, 1980; Cronquist, 1988). However, several large-scale phylogenetic analyses using DNA sequences (Chase et al., 1993; Soltis et al., 2000; Chen et al., 2016) and cpDNA restriction sites (Manos et al., 1993) have provided evidence for the monophyly of an expanded Fagales, which now comprises seven families: Nothofagaceae, Fagaceae, Myricaceae, Juglandaceae (including Rhoipteleaceae), Casuarinaceae, Ticodendraceae, and Betulaceae (APG III, 2009; APG IV, 2016). Most of the relationships within Fagales are well resolved, but the position of Myricaceae remains uncertain. For example, some studies placed Myricaceae as sister to (Casuarinaceae + (Ticodendraceae + Betulaceae)) (Manos and Steele, 1997, matK/matK + rbcL; Cook and Crisp, 2005; Sauquet et al., 2012; Xiang et al., 2014; Sun et al., 2016), whereas others supported a sister relationship between Myricaceae and Juglandaceae (Li et al., 2004; Soltis et al., 2007; Larson-Johnson, 2016). Still others found that Myricaceae is sister to all Fagales except Nothofagaceae and Fagaceae (Manos and Steele, 1997, rbcL; Li et al., 2002). Previous studies therefore appear to have been based on insufficient information and could not fully resolve the phylogenetic position of Myricaceae. Here, three individuals of M. rubra (Myricaceae) were selected for complete chloroplast genome sequencing.
By comparing these three chloroplast genomes to each other and to previously published chloroplast genomes from other taxa in Fagales, we aim to: (1) characterize and compare the cp genomes among select representatives of Fagales in order to gain insights into evolutionary patterns within the order; (2) resolve the phylogenetic position of Myricaceae; (3) screen and identify appropriate markers of the M. rubra genome for future studies on population genetics and domestication origin.

Materials and Methods

DNA Sequencing and Genome Assembly

Total genomic DNA was isolated from silica-dried leaves of three wild M. rubra plants collected in Guangdong (GZMZ), Fujian (FJZS), and Yunnan (YNML) using a modified CTAB method (Li et al., 2013). The high molecular weight DNA was sheared using a Covaris S220-DNA Sonicator (Covaris, Inc., Woburn, MA, United States), yielding fragments of ≤800 bp in length. The quality of fragmentation was checked on an Agilent Bioanalyzer 2100 (Agilent Technologies). Short-insert (500 bp) paired-end libraries were generated using the Genomic DNA Sample Prep Kit (Illumina) according to the manufacturer’s protocol and then sequenced on an Illumina HiSeq 2500 (Beijing Genomics Institute, Shenzhen, China). The resulting sequence fragments were screened to remove low-quality sequences (Phred score <30, i.e., an error probability above 0.001), and all remaining high-quality sequences were assembled into contigs using the CLC de novo assembler beta 4.06 (CLC Inc., Aarhus, Denmark) with parameters as follows: minimum contig length of 200, deletion and insertion costs of 3, mismatch cost of 2, bubble size of 98, length fraction, and similarity fraction of 0.9. We obtained the principal contigs representing the chloroplast genome from the total assembled contigs using a BLAST (NCBI BLAST v2.2.31) search with the cp genome sequence of J. regia (GenBank accession number: KT870116) as a reference sequence (Peng et al., 2015). The representative chloroplast sequence contigs were ordered and oriented according to the reference chloroplast genome, and the complete chloroplast sequence of M. rubra was constructed by connecting overlapping terminal sequences.
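The Phred threshold quoted above follows from the standard definition Q = -10 log10(P), where P is the base-call error probability, so Q = 30 corresponds to P = 0.001. The sketch below illustrates that conversion only. The function names are ours, and the read-level rule in `passes_quality_filter` (every base must reach the threshold) is a hypothetical illustration, not necessarily the rule applied in the study.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the base-call error probability P = 10^(-Q/10) for a Phred score Q."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the Phred score Q = -10 * log10(P) for an error probability P."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in (0, 1]")
    return -10 * math.log10(p)


def passes_quality_filter(qualities, min_q=30):
    """Hypothetical rule (an assumption): keep a read only if every base has Q >= min_q."""
    return all(q >= min_q for q in qualities)
```

For example, `phred_to_error_probability(30)` evaluates to 0.001 (up to floating-point rounding), matching the threshold stated in the text.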

Genome Annotation and Molecular Marker Identification

The cp genomes of M. rubra were annotated through the online program Dual Organellar Genome Annotator (DOGMA; Wyman et al., 2004). Initial annotation, putative starts, stops, and intron positions were determined according to comparisons with homologous genes of J. regia and Castanea mollissima (GenBank accession number: HQ336406) cp genomes using Geneious v9.0.5 software (Biomatters, Auckland, New Zealand). In addition, all of the identified tRNA genes were further verified by using the corresponding structures predicted by tRNAscan-SE version 1.21 (Schattner et al., 2005) with default settings. The cp genome map of M. rubra was constructed utilizing the OGDRAW program (Lohse et al., 2013). The three completed chloroplast genome sequences of M. rubra were aligned using MAFFT (Katoh et al., 2002). In order to screen various polymorphic regions among individuals of M. rubra (i.e., below the species level), the average number of nucleotide differences (K) and total number of mutations (Eta) were determined to analyze nucleotide diversity (Pi) using DnaSP v5.0 (Librado and Rozas, 2009).
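To make the diversity statistics concrete, the following sketch computes the average number of pairwise nucleotide differences (K) and nucleotide diversity (Pi = K divided by the number of sites compared) from a small alignment. It is an illustration rather than a reimplementation of DnaSP: the function name is ours, and dropping every alignment column that contains a gap is an assumption made for this sketch.

```python
from itertools import combinations


def diversity_stats(aligned):
    """Return (K, Pi) for a list of equal-length aligned sequences.

    K  = mean number of differing sites over all sequence pairs.
    Pi = K divided by the number of gap-free alignment columns.
    Assumption: columns containing '-' in any sequence are excluded.
    """
    if len(aligned) < 2:
        raise ValueError("need at least two sequences")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")

    # Keep only columns without gaps.
    columns = [col for col in zip(*aligned) if "-" not in col]
    if not columns:
        raise ValueError("no gap-free columns to compare")

    pairs = list(combinations(range(len(aligned)), 2))
    total_diffs = sum(
        sum(1 for col in columns if col[i] != col[j]) for i, j in pairs
    )
    k = total_diffs / len(pairs)
    pi = k / len(columns)
    return k, pi
```

With three sequences, as in this study, there are three pairwise comparisons, so K is the mean of three difference counts.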

Repeat Structure and Sequence Analysis

We used the online REPuter software to locate and visualize forward, palindromic, reverse and complement repeats with a minimum repeat size of 30 bp and a sequence identity greater than 90% (Kurtz and Schleiermacher, 1999). Microsatellite (mono-, di-, tri-, tetra-, penta-, and hexanucleotide repeat) detection was performed using Msatcommander v0.8.2 (Faircloth, 2008). We applied thresholds of nine, five, five, three, three, and three repeat units for mono-, di-, tri-, tetra-, penta-, and hexanucleotide SSRs, respectively.
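The repeat-unit thresholds above can be illustrated with a simple regular-expression scan for perfect SSRs. This is a minimal sketch under our own assumptions, not Msatcommander's algorithm: the names `MIN_REPEATS`, `is_primitive` and `find_ssrs` are hypothetical, positions are 0-based, and motifs that are themselves repeats of a shorter unit (e.g., "AA") are skipped so that a mononucleotide run is not also reported as a dinucleotide SSR.

```python
import re

# Minimum repeat units per motif length, taken from the thresholds in the text.
MIN_REPEATS = {1: 9, 2: 5, 3: 5, 4: 3, 5: 3, 6: 3}


def is_primitive(motif):
    """True if the motif is not a repetition of a shorter unit."""
    k = len(motif)
    return not any(
        k % d == 0 and motif == motif[:d] * (k // d) for d in range(1, k)
    )


def find_ssrs(seq):
    """Return sorted (start, motif, repeat_count) tuples for perfect SSRs."""
    seq = seq.upper()
    hits = []
    for unit_len, min_rep in MIN_REPEATS.items():
        # A motif of unit_len bases followed by at least min_rep - 1 copies.
        pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_rep - 1))
        for match in pattern.finditer(seq):
            motif = match.group(1)
            if is_primitive(motif):
                repeats = len(match.group(0)) // unit_len
                hits.append((match.start(), motif, repeats))
    return sorted(hits)
```

Because the scan reports the leftmost phase of each repeat, the motif of a dinucleotide SSR may come out as "TA" rather than "AT" depending on the flanking base; dedicated tools normalize such motifs, which this sketch does not attempt.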

Comparative Chloroplast Genomic Analysis

We downloaded Castanea mollissima, Juglans regia, and Ostrya rehderiana (GenBank accession number: KT454094) chloroplast genome sequences from GenBank, in order to compare the overall similarities among different chloroplast genomes in Fagales. Pairwise alignments among four Fagales cp genomes were implemented in the mVISTA program with LAGAN mode (Frazer et al., 2004) using the annotation of Cucumis sativus (Cucurbitaceae, Cucurbitales; GenBank accession number: DQ865976) as the reference.

Synonymous (KS) and Non-synonymous (KA) Substitution Rates Analysis

The DnaSP v5.0 software (Librado and Rozas, 2009) was employed to analyze the relative rates of sequence divergence in the four Fagales species and the reference sequence. In order to analyze synonymous (KS) and non-synonymous (KA) substitution rates, we extracted the same individual functional protein-coding exons and aligned them separately using Geneious v9.0.5. Genes with the same functions were grouped and analyses were carried out on (1) datasets corresponding to those with the same functions, i.e., for atp, pet, ndh, psa, psb, rpl, rpo, and rps; (2) datasets corresponding to singular genes, i.e., for cemA, matK, ccsA, clpP, rbcL, and ycf1; and (3) concatenated common protein-coding genes, except for pseudogenes or lost genes from any species.
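Once KA and KS have been estimated for a gene or gene group, their ratio is conventionally read as an indicator of selection: below 1 suggests purifying selection, above 1 suggests positive selection, and a value of 1 is consistent with neutrality. The sketch below shows only that final interpretive step; the function names and the example values are illustrative assumptions, not results or code from this study.

```python
def ka_ks_ratio(ka: float, ks: float) -> float:
    """Return KA/KS; KS must be positive for the ratio to be defined."""
    if ks <= 0:
        raise ValueError("KS must be positive")
    return ka / ks


def classify_selection(ratio: float) -> str:
    """Conventional reading of a KA/KS ratio."""
    if ratio < 1:
        return "purifying"
    if ratio > 1:
        return "positive"
    return "neutral"


# Illustrative values only (not taken from the paper's tables).
example = classify_selection(ka_ks_ratio(0.02, 0.10))
```

In practice a ratio close to 1 would be tested statistically rather than compared exactly, which this sketch leaves out.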

Phylogeny Inference

The complete chloroplast genome sequences of eight species from Fagales (10 accessions) were used for phylogenetic analysis, including representatives of five genera of Fagaceae, one genus of Betulaceae, one genus of Juglandaceae, and the three newly sequenced individuals of M. rubra used to represent Myricaceae (Supplementary Table S1). Two species from Cucurbitales (Corynocarpus laevigata and Cucumis sativus) were chosen as outgroup taxa to orient the Fagales tree. In order to investigate the utility of different regions, the phylogeny was inferred using two datasets: (1) the complete chloroplast genome sequences; and (2) a set of 69 protein-coding genes shared by the chloroplast genomes of the 12 accessions. All gaps were excluded after alignment in both analyses. All phylogenetic analyses were conducted using maximum-likelihood (ML) and Bayesian inference (BI) methods. ML analyses were implemented in RAxML-HPC v8.1.11 on the CIPRES cluster (Miller et al., 2010) using the best-fit nucleotide substitution model (GTR+I+G) determined with jModelTest v2.1.4 (Posada, 2008) for the cp genome dataset and a partitioned model for protein-coding regions. BI analyses were performed in MrBayes v3.2.3 (Ronquist and Huelsenbeck, 2003) using the same model selection criteria for both datasets. Two independent parallel runs of four Metropolis-coupled Markov chain Monte Carlo (MCMC) chains were run for five million generations, with trees sampled every 1000 generations.

Results and Discussion

Genome Content and Organization in M. rubra

We generated a total of 8.5 million paired-end (PE) reads (200 million nucleotides) for M. rubra-GZMZ, and then trimmed and assembled them using the CLC genome assembler pipeline (CLC Bio, Aarhus, Denmark). A total of 290,501 PE reads were concordantly mapped to the final assembly, and the mapped cp contigs were selected and merged to construct a complete M. rubra-GZMZ cp genome map using BLAST (NCBI BLAST v2.2.31). Four initial contigs (contigs 16, 39, 79, and 883) were selected to generate the M. rubra-GZMZ cp genome sequence with no gaps and no Ns. The cp genome sequence was deposited in GenBank under the accession number KY476637. The complete chloroplast genome of M. rubra-GZMZ is 159,478 bp in length and shares the common feature of comprising two copies of the IR (26,014 bp each) that divide the genome into two single-copy regions (LSC 88,683 bp; SSC 18,767 bp; Figure ). The overall GC content of the total length, LSC, SSC, and IR regions is 36.1, 33.8, 29.2, and 42.6%, respectively. Coding regions (91,795 bp), comprising protein-coding genes (79,949 bp), tRNA genes (2,798 bp) and rRNA genes (9,048 bp), account for 57.56% of the genome, whereas non-coding regions (67,683 bp), including intergenic spacers (49,558 bp) and introns (18,125 bp), account for the remaining 42.44% of the genome.
Within the chloroplast genome of M. rubra there are in total 111 genes, including 77 protein-coding genes, 30 tRNA genes, four rRNA genes and 18 duplicated genes (Figure and Tables ). Among the 111 unique genes, 15 contain one intron (six tRNA genes and nine protein-coding genes) and three (rps12, clpP, and ycf3) contain two introns. The 5′-end exon of the rps12 gene is located in the LSC region, and the intron and 3′-end exon of the gene are situated in the IR region. In addition to the GZMZ accession, we also sequenced the complete cp genomes of M. rubra-FJZS (GenBank accession number: KY476636) and M. rubra-YNML (GenBank accession number: KY476635). These are 159,568 and 159,586 bp in size, respectively, and their genome content and organization are nearly the same as those of the M. rubra-GZMZ cp genome (Figure and Table ).
Figure caption: Chloroplast genome map of Morella rubra (Myricaceae). The genes inside and outside of the circle are transcribed in the counterclockwise and clockwise directions, respectively. Genes belonging to different functional groups are shown in different colors. The thick lines indicate the extent of the inverted repeats (IRA and IRB) that separate the genomes into small single-copy (SSC) and large single-copy (LSC) regions.
Table caption: Comparative analysis of the chloroplast genomes among four families of Fagales, including three different accessions of Morella rubra (Myricaceae) sequenced for this study.
Table caption: List of genes present in the M. rubra chloroplast genome.
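The region lengths and coding fractions reported for M. rubra-GZMZ are internally consistent, which the following short check makes explicit. All numbers are copied from the text above; the variable names are ours.

```python
# Region lengths (bp) reported for M. rubra-GZMZ.
lsc, ssc, ir = 88_683, 18_767, 26_014
genome_length = lsc + ssc + 2 * ir  # the IR is present in two copies

# Coding and non-coding components (bp) reported in the text.
coding = 79_949 + 2_798 + 9_048   # protein-coding + tRNA + rRNA genes
non_coding = 49_558 + 18_125      # intergenic spacers + introns

coding_pct = round(100 * coding / genome_length, 2)
non_coding_pct = round(100 * non_coding / genome_length, 2)

print(genome_length, coding, non_coding, coding_pct, non_coding_pct)
```

The quadripartite structure (LSC + SSC + 2 × IR) reproduces the stated genome length of 159,478 bp, and the coding and non-coding totals reproduce the stated 57.56% and 42.44%.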

Genome Organization of Fagales

The chloroplast genome organization is rather conserved within Fagales (Figure ). We did not detect either translocations or inversions among any of the compared genomes. The IR region in these species is more conserved than the LSC and SSC regions, consistent with other angiosperms (Dong et al., 2013; Lu R. et al., 2016). Variation was detected in the following features: genome size, gene losses, the pseudogenization of protein-coding genes, and IR expansion and contraction.
Figure caption: Identity plot comparing the chloroplast genomes of four Fagales families using Cucumis sativus as the reference sequence. The vertical scale indicates the percentage of identity, ranging from 50 to 100%. The horizontal axis indicates the coordinates within the chloroplast genome. Genome regions are color coded as protein-coding, rRNA, tRNA, intron, and conserved non-coding sequences (CNS).

Genome Size

Among the representative Fagales species, O. rehderiana has the smallest chloroplast genome of the four compared. The genome of Castanea mollissima (160,799 bp) is approximately 1.45 kb larger than that of O. rehderiana, 1.32 kb larger than that of M. rubra, 0.26 kb larger than that of J. regia, and 5.28 kb larger than that of Cucumis sativus, an outgroup species. The detected sequence length difference is predominantly attributable to variation in the length of the non-coding regions, especially intergenic spacer size (Table ). The M. rubra-GZMZ genome exhibits the smallest non-coding region among the six analyzed chloroplast genomes.

Gene Loss

A single gene, infA, has been lost from all four analyzed chloroplast genomes. Comparison with the chloroplast genomes of other Fagales species shows that this gene also appears to be missing in Castanea pumila (GenBank accession number: KM360048) and Trigonobalanus doichangensis (GenBank accession number: NC_023959), although it is present in Quercus edithiae (GenBank accession number: KU382355), Q. rubra (GenBank accession number: NC_020152), Castanopsis echinocarpa (GenBank accession number: NC_023801), Lithocarpus balansae (GenBank accession number: NC_026577), Q. aliena (Lu S. et al., 2016), Q. spinosa (GenBank accession number: NC_026907), Q. aquifolioides (GenBank accession number: NC_026913), and Q. baronii (GenBank accession number: NC_029490). The infA gene is thought to function as a translation initiation factor, which assists in the assembly of the translation initiation complex (Wicke et al., 2011). This gene has possibly been transferred to the nucleus, and its loss appears to have occurred independently multiple times during the evolution of land plants (Millen et al., 2001). Dong et al. (2013) reported that two genes, infA and rpl32, had been lost from the chloroplast genome of Paeonia obovata. Therefore, the loss of infA is not unique to these species of Fagales.

Gene Pseudogenization

ycf15 has been pseudogenized in all four representatives of Fagales, and rpl22 has been pseudogenized in M. rubra and Castanea mollissima but not in J. regia or O. rehderiana. The ycf15 gene, whose function has received considerable attention from previous workers (Raubeson et al., 2007; Shi et al., 2013), is located immediately downstream of the ycf2 gene (Dong et al., 2013). Some studies have shown that the ycf15 gene is potentially functional (Shinozaki et al., 1986), but the validity of ycf15 as a protein-coding gene in angiosperms has long been questioned (Tangphatsornruang et al., 2011). ycf15 is a pseudogene in all sequenced chloroplast genomes of Fagales except that of Q. rubra. In Fagales, rpl22 appears as a pseudogene in Myricaceae and Fagaceae because its coding region contains internal stop codons, whereas it has not been pseudogenized in Juglandaceae and Betulaceae. Jansen et al. (2011) reported that rpl22 has been transferred to the nucleus in Fagaceae; whether the rpl22 gene has also been transferred to the nucleus in Myricaceae remains to be investigated.

IR Expansion and Contraction

The expansions and contractions of the IR regions and the single-copy (SC) boundary regions often result in genome size variations among plant lineages (Wang et al., 2008), and may reflect phylogenetic history. For this reason, we paid careful attention to the exact IR/SC border positions and their adjacent genes among the four Fagales chloroplast genomes that we studied in detail (Figure ). The ycf1 gene spans the SSC/IRA border, and the pseudogene fragment ψycf1 varies from 1058 to 1158 bp. The ndhF gene is separated from ψycf1 by a spacer (53 bp in M. rubra, 104 bp in J. regia and 165 bp in O. rehderiana), except in Castanea mollissima, which has no spacer; in our outgroup taxon, Cucumis sativus, ndhF shares some nucleotides (6 bp) with the ycf1 pseudogene. The trnH-GUG gene is generally located downstream of the IRA/LSC border, and this gene is separated from the IRB/LSC border by a spacer that varies from 8 to 47 bp. However, the rps19 gene does not extend into the IR region among the sampled representatives of Fagales. Thus, the rps19 pseudogene is not observed in Fagales. Although expansions and/or contractions of the IR regions were detected among the sampled representatives of Fagales, they contribute little to the overall size variation in the chloroplast genomes of these plants.
Figure caption: Comparison of junction positions between the single copy and IR regions among four Fagales genomes and Cucumis sativus.

Repeat Sequence Analysis and Molecular Marker Identification

Repeat motifs are thought to play an important role in phylogenetic studies and are very useful in the analysis of genome rearrangement (Cavalier-Smith, 2002; Nie et al., 2012). In the chloroplast genome of M. rubra-GZMZ, 39 pairs of repeats (30 bp or longer), comprising 22 palindromic repeats, 15 forward repeats, one complement repeat and one reverse repeat, were detected using the program REPuter (Kurtz and Schleiermacher, 1999) (Figure ). Among these repeats, 33 are 30–40 bp long, four are 41 bp long, one is 44 bp long and one is 57 bp long (Figure ). Most of these repeats (53.8%) are distributed in non-coding regions (Table ), whereas some are found in genes such as ycf1, ycf2, ycf3, psaB, and psaA. Further information about the repeat motifs of M. rubra-FJZS and M. rubra-YNML can be found in Supplementary Tables S2, S3.
Figure caption: Analysis of repeated sequences in the three M. rubra chloroplast genomes. (A) Frequency of repeat types. (B) Frequency of repeats by length.
Table caption: Repeated sequences in the M. rubra-GZMZ chloroplast genome.
Simple sequence repeats (SSRs), also known as microsatellites, are widely distributed over the genome (Chen et al., 2015) and have a high degree of polymorphism (Weber, 1990). As a result, SSRs are widely used as molecular markers for breeding (Rafalski and Tingey, 1993), population genetics (Perdereau et al., 2014), genetic linkage map construction, and gene mapping (Pugh et al., 2004). In the current study, the distribution, type and presence of microsatellites were studied among the cp genomes of three M. rubra accessions. We did this, in part, because we are interested in developing markers that may be useful in future studies that will address intraspecific variation among natural populations and cultivars of M. rubra across East Asia. A total of 155 perfect microsatellites were identified in the M. rubra-GZMZ cp genome.
Among them, 118 were located in the LSC region, whereas 16 and 21 were found in the IR and SSC regions, respectively (Figure ). In addition, 22 SSRs were found in the protein-coding regions, 16 were in the introns and 117 were in intergenic spacers of the M. rubra-GZMZ cp genome (Figure ). The distribution and type of microsatellites of M. rubra-FJZS and M. rubra-YNML are shown in Supplementary Figure . Among these SSRs, 131 are mononucleotides, 18 are dinucleotides, five are tetranucleotides, and one is a pentanucleotide (Figure ). Trinucleotide SSRs are not found in M. rubra-GZMZ or M. rubra-YNML but were detected in M. rubra-FJZS. A majority of the mononucleotides (98.47%) are composed of A/T and most of the dinucleotides (88.89%) are composed of AT/TA (Figure ). These results are consistent with the contention that cp SSRs are generally composed of short polyA or polyT repeats (Kuang et al., 2011; Chen et al., 2015). The higher A/T content in cp SSRs also contributes to a bias in base composition, resulting in A/T enrichment (63.9%) in the M. rubra-GZMZ cp genome.
Figure caption: The distribution, type, and presence of simple sequence repeats (SSRs) in the cp genome of M. rubra. (A) Presence of SSRs in the LSC, SSC, and IR regions (M. rubra-GZMZ). (B) Presence of SSRs in the protein-coding regions, intergenic spacers and introns of LSC, SSC, and IR regions (M. rubra-GZMZ). (C) Presence of polymers in the cp genome of M. rubra.
The coding genes, non-coding regions and intron regions were compared among the three individuals of M. rubra to identify divergence hotspots. We extracted 90 loci (28 coding genes, 52 intergenic spacers, and 10 intron regions) of more than 200 bp in length from the three M. rubra individuals and calculated nucleotide variability (Pi) values with the DnaSP v5.0 software. The Pi values obtained from the three individuals (M. rubra-GZMZ, M. rubra-FJZS, and M. rubra-YNML) ranged from 0.00029 (ycf2 gene) to 0.01867 (psbA-trnK region) (Figure ).
The IR region is much more conserved than the LSC and SSC regions, and the lower sequence divergence observed in the IRs compared to the SSC or LSC regions for Morella species and other angiosperms is likely due to copy correction between IR sequences by gene conversion (Khakhlova and Bock, 2006; Lu R. et al., 2016). Seven of the variable loci (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) showed high levels of variation. Five of them (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, and petD-rpoA) are located in the LSC, whereas two (ndhE-ndhG and ndhA intron) are in the SSC region (Figure ). All seven loci show great potential as highly informative phylogenetic markers in M. rubra. The results presented here will be helpful for future studies on the domestication origin of Chinese bayberry.
Figure caption: Comparative analysis of the nucleotide variability (Pi) values among three M. rubra individuals.
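The SSR tallies reported for M. rubra-GZMZ earlier in this section can be cross-checked with a few lines of arithmetic. The counts by region, by genomic feature and by motif class are copied from the text; the counts of 129 A/T mononucleotides and 16 AT/TA dinucleotides are back-calculated here from the stated percentages and are therefore an assumption, not figures given by the authors.

```python
# SSR counts reported for M. rubra-GZMZ, grouped three different ways.
by_region = {"LSC": 118, "IR": 16, "SSC": 21}
by_feature = {"protein_coding": 22, "intron": 16, "intergenic": 117}
by_motif = {"mono": 131, "di": 18, "tetra": 5, "penta": 1}

totals = {
    "region": sum(by_region.values()),
    "feature": sum(by_feature.values()),
    "motif": sum(by_motif.values()),
}

# Back-calculated counts (assumption): 129 of 131 and 16 of 18.
at_mono_pct = round(100 * 129 / by_motif["mono"], 2)
at_di_pct = round(100 * 16 / by_motif["di"], 2)

print(totals, at_mono_pct, at_di_pct)
```

Each grouping sums to the 155 perfect microsatellites reported, and the back-calculated counts reproduce the stated 98.47% and 88.89%.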

Synonymous (KS) and Non-synonymous (KA) Substitution Rate Analysis

Non-synonymous (KA) and synonymous (KS) substitution rates and their ratio (KA/KS) are important indicators of the rate of evolution and of natural selection (Yang and Nielsen, 2000). Synonymous nucleotide substitutions occur more frequently than non-synonymous substitutions, and the KA/KS value is usually less than one in most protein-coding regions (Makalowski and Boguski, 1998). In this study, these parameters were compared among the protein-coding chloroplast genes of the four representative species of Fagales to investigate genome evolution, with the cp genome of Cucumis sativus as a reference (Table ). The KA values of the four representative species ranged from 0.0879 to 0.0962, and the KS values ranged from 0.01489 to 0.01605. Both the KA and KS values consistently indicated that Castanea mollissima has evolved slightly more rapidly than the other three species in Fagales. The KA/KS values of these Fagales species are less than 1, providing evidence of purifying selection on the chloroplast protein-coding genes of Fagales species.
Table caption: Substitution rates of 75 protein-coding genes in four Fagales chloroplast genomes.
Variations in evolutionary rates can be related to the function of genes and to genome structure (Chang et al., 2006; Jansen et al., 2007; Dong et al., 2013). In Fagales, the structures of the four sampled genomes are quite conserved, with no remarkable restructuring detected. In comparisons with the outgroup Cucumis sativus, the KA (F = 293.17, P < 0.001) and KS (F = 245.86, P < 0.001) values differ significantly among gene groups classified according to gene function (Figure ). The psb, pet, and rbcL genes show the lowest KA values, while the ycf1 gene exhibits the highest KA values. Moreover, the psa genes show the lowest KS values, whereas the ccsA gene exhibits the highest KS values. According to the KA/KS values, we found that the psa, rpo, atp, clpP, and ycf1 genes are under positive selection in Fagales.
Figure caption: Non-synonymous substitution (KA), synonymous substitution (KS), and KA/KS values for individual Fagales genes and groups of genes. The rpl22 gene is not included in the rpl group due to its pseudogenization in some species.

Phylogenetic Analysis

Relationships within Fagales are fairly well resolved in previously published studies, but the position of Myricaceae remains somewhat uncertain (Manos and Steele, 1997; Cook and Crisp, 2005; Li et al., 2016). Most of these earlier studies used sequences from only one or a few chloroplast loci. In the present study, we explored two datasets for phylogenetic analysis: the complete chloroplast genome and a restricted matrix of 69 commonly shared protein-coding genes. For the analysis with the complete chloroplast genome data, the tree topologies from the ML and the Bayesian analyses were consistent with each other (Figure ). All the analyzed families within Fagales have MLBS = 100%. Fagaceae are sister to the remaining Fagales (MLBS = 100%), followed by Betulaceae, which are in turn sister to the remainder of the Fagales, with full support (MLBS = 100%). The remaining two families, Juglandaceae and Myricaceae, form one clade with MLBS = 100%, and the three Myricaceae individuals also form one clade with MLBS = 100%. The relationships among them are identical to those in the classification proposed by APG III (APG III, 2009).
Figure caption: Phylogenetic tree reconstruction of Fagales using maximum likelihood (ML) based on whole chloroplast genome sequences. Relative branch lengths are indicated. Numbers above the lines represent ML bootstrap values / BI posterior probability. The hyphen indicates an ML bootstrap value <50%. A phylogenetic tree resulting from analysis of 69 protein-coding genes was fully congruent with this topology.
Most phylogenomic studies have not used entire plastome sequences, but rather have used a subset of common protein-coding genes (Jansen et al., 2007; Moore et al., 2010; Xi et al., 2012).
In this study, the tree topologies inferred from ML and BI using the restricted cp gene matrix were consistent with the trees inferred from the whole cp genome data (Supplementary Figure ), but the support values for some nodes were lower. Our results indicate that complete chloroplast DNA sequences were more effective than common protein-coding genes for the phylogenetic reconstruction of Fagales, as judged by higher bootstrap values and posterior probabilities. Therefore, we suggest that complete chloroplast genomes be used more regularly for inferring backbone relationships among other ordinal clades of angiosperms, as well as for resolving the phylogenetic position of questionable lineages.

Conclusion

The complete chloroplast genome sequence of M. rubra was determined using Illumina next-generation DNA sequencing technology. This is the first chloroplast genome sequenced in the Myricaceae family. The chloroplast genome of M. rubra is very similar in size and organization to those of other sequenced angiosperms. The chloroplast genomes of Fagales species have evolved at the gene level rather than the genome level, because no significant structural changes were detected among their genomes. In addition, the examined genomes differ in size, and these size variations are mainly due to the length of intergenic spacers rather than gene loss, gene pseudogenization, or IR expansion or contraction. Phylogenetic relationships inferred from the complete genome sequences of representatives of Fagales strongly support the placement of Myricaceae as sister to Juglandaceae. Furthermore, seven variable regions (trnH-psbA, psbA-trnK, rps2-rpoC2, ycf4-cemA, petD-rpoA, ndhE-ndhG, and ndhA intron) and variable cpSSR loci identified among multiple individuals of M. rubra will be useful in future studies characterizing the population genetics of this species and investigating the domestication origin of Chinese bayberry.

Author Contributions

LL, PL, CF, and XL conceived the ideas; LL and JW contributed to the sampling; LL performed the experiment; LL and RL analyzed the data. The manuscript was written by LL, PL, and KC.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Table 1

Comparative analysis of the chloroplast genomes among four families of Fagales, including three different accessions of Morella rubra (Myricaceae) sequenced for this study.

	M. rubra-GZMZ	M. rubra-FJZS	M. rubra-YNML	Juglans regia	Castanea mollissima	Ostrya rehderiana
Total cpDNA size	159,478	159,568	159,586	160,537	160,799	159,347
Length of large single copy (LSC) region	88,683	88,809	88,772	90,059	90,432	88,177
Length of inverted repeat (IR) region	26,014	26,015	26,069	26,033	25,686	26,131
Length of small single copy (SSC) region	18,767	18,706	18,676	18,412	18,995	18,908
Coding size	91,795	91,239	91,818	90,810	90,465	91,041
Intron size	20,647	20,667	20,705	20,712	19,957	20,640
Spacer size	47,036	47,662	47,063	49,015	50,377	47,666
Total GC content (%)	36.10	36.10	36.10	36.20	36.80	36.50
LSC	33.80	33.80	33.80	33.60	34.60	34.30
IR	42.60	42.60	42.60	42.60	42.80	42.50
SSC	29.20	29.20	29.20	29.80	30.80	29.80
Total number of genes	111	111	111	113	111	112
Protein encoding	77	77	77	80	77	78
tRNA	30	30	30	30	30	30
rRNA	4	4	4	4	4	4
Number of genes duplicated in IR	18	18	18	17	16	17
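
Because the chloroplast genome is quadripartite (one LSC, one SSC, and two IR copies), the region lengths in Table 1 should add up to the total size as LSC + SSC + 2 × IR. The sketch below runs that arithmetic check on the values as printed in Table 1; the dictionary layout and function names are illustrative assumptions.

```python
# Lengths in bp copied from Table 1: (total, LSC, IR, SSC).
TABLE1 = {
    "M. rubra-GZMZ": (159478, 88683, 26014, 18767),
    "M. rubra-FJZS": (159568, 88809, 26015, 18706),
    "M. rubra-YNML": (159586, 88772, 26069, 18676),
    "Juglans regia": (160537, 90059, 26033, 18412),
    "Castanea mollissima": (160799, 90432, 25686, 18995),
    "Ostrya rehderiana": (159347, 88177, 26131, 18908),
}


def quadripartite_length(lsc: int, ir: int, ssc: int) -> int:
    """Genome length implied by one LSC, one SSC and two IR copies."""
    return lsc + ssc + 2 * ir


def size_discrepancies(table: dict) -> dict:
    """Reported total minus summed regions, per genome (0 means consistent)."""
    return {
        name: total - quadripartite_length(lsc, ir, ssc)
        for name, (total, lsc, ir, ssc) in table.items()
    }


for name, diff in size_discrepancies(TABLE1).items():
    print(f"{name}: reported - summed = {diff} bp")
```

With the values as printed, five of the six genomes sum exactly; the M. rubra-FJZS column differs by 23 bp, which cannot be resolved from the table alone.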

Table 2

List of genes present in the M. rubra chloroplast genome.

Category	Gene group	Gene name
Self-replication	Ribosomal RNA genes	rrn4.5^a	rrn5^a	rrn16^a	rrn23^a
	Transfer RNA genes	trnA-UGC^a,b trnF-GAA trnH-GUG trnL-CAA^a trnN-GUU^a trnR-UCU trnT-GGU trnW-CCA	trnC-GCA trnfM-CAU trnI-CAU^a trnL-UAA^b trnP-UGG trnS-GCU trnT-UGU trnY-GUA	trnD-GUC trnG-GCC trnI-GAU^a,b trnL-UAG trnQ-UUG trnS-GGA trnV-GAC^a	trnE-UUC trnG-UCC^b trnK-UUU^b trnM-CAU trnR-ACG^a trnS-UGA trnV-UAC^b
	Small subunit of ribosome	rps2 rps8 rps15	rps3 rps11 rps16^b	rps4 rps12^a,c,d rps18	rps7^a rps14 rps19
	Large subunit of ribosome	rpl2^a,b rpl23^a	rpl14 rpl32	rpl16^b rpl33	rpl20 rpl36
	DNA-dependent RNA polymerase	rpoA	rpoB	rpoC1^b	rpoC2
Photosynthesis	Subunits of photosystem I	psaA psaJ	psaB ycf3^c	psaC ycf4	psaI
	Subunits of photosystem II	psbA psbE psbJ psbN	psbB psbF psbK psbT	psbC psbH psbL psbZ	psbD psbI psbM
	Subunits of cytochrome	petA petL	petB^b petN	petD^b	petG
	Subunits of ATP synthase	atpA atpH	atpB atpI	atpE	atpF^b
	Large subunit of Rubisco	rbcL
	Subunits of NADH dehydrogenase	ndhA^b ndhE ndhI	ndhB^a,b ndhF ndhJ	ndhC ndhG ndhK	ndhD ndhH
Other genes	Maturase	matK
	Envelope membrane protein	cemA
	Subunit of acetyl-CoA carboxylase	accD
	C-type cytochrome synthesis gene	ccsA
	Protease	clpP^c
	Proteins of unknown function	ycf1^a	ycf2^a
Pseudogenes		ycf15	rpl22

Table 3

Repeated sequences in the M. rubra-GZMZ chloroplast genome.

Repeat no.	Repeat size (bp)	Repeat start 1	Repeat start 2	Type	Location of repeat 1	Location of repeat 2
1	30	136188	136219	F	rrn5S/rrn4.5S^∗	rrn5S/rrn4.5S^∗
2	30	133974	133974	P	ycf1	ycf1
3	30	114157	114157	P	ycf1	ycf1
4	30	114157	133974	F	ycf1	ycf1
5	30	111943	136219	P	rrn4.5S/rrn5S^∗	rrn5S/rrn4.5S^∗
6	30	111912	111943	F	rrn4.5S/rrn5S^∗	rrn4.5S/rrn5S^∗
7	30	111912	136188	P	rrn4.5S/rrn5S^∗	rrn5S/rrn4.5S^∗
8	30	47244	102983	F	ycf3	rps12/trnV-GAC^∗
9	30	47244	145148	P	ycf3	trnV-GAC/rps12^∗
10	30	42178	44402	F	psaB	psaA
11	30	39386	39402	F	psbZ/trnG-GCC^∗	psbZ/trnG-GCC^∗
12	30	38558	49028	P	trnS-UGA	trnS-GGA
13	30	34469	34469	P	trnT-GGU/psbD^∗	trnT-GGU/psbD^∗
14	30	9267	49028	P	trnS-GCU	trnS-GGA
15	31	126455	126455	P	ndhA (intron)	ndhA (intron)
16	31	117134	117134	P	ndhF/rpl32^∗	ndhF/rpl32^∗
17	31	34994	118170	C	trnT-GGU/psbD^∗	rpl32/trnL-UAG^∗
18	31	34994	118161	R	trnT-GGU/psbD^∗	rpl32/trnL-UAG^∗
19	31	9263	38554	F	trnS-GCU	psbC/trnS-UGA^∗
20	32	154596	154617	F	ycf2	ycf2
21	32	132804	132804	P	ycf1	ycf1
22	32	93533	154617	P	ycf2	ycf2
23	32	93512	93533	F	ycf2	ycf2
24	32	93512	154596	P	ycf2	ycf2
25	33	125664	125664	P	ndhA (intron)	ndhA (intron)
26	34	125872	125878	P	ndhA (intron)	ndhA (intron)
27	34	124227	124227	P	ndhG/ndhI^∗	ndhG/ndhI^∗
28	34	16216	16228	F	atpH/atpI^∗	atpH/atpI^∗
29	37	47232	125426	F	ycf3	ndhA (intron)
30	39	125424	145153	P	ndhA (intron)	trnV-GAC/rps12^∗
31	39	102969	125424	F	rps12/trnV-GAC^∗	ndhA (intron)
32	39	47232	102971	F	ycf3	rps12/trnV-GAC^∗
33	39	47232	145151	P	ycf3	trnV-GAC/rps12^∗
34	41	152170	152188	F	ycf2	ycf2
35	41	95950	152188	P	ycf2	ycf2
36	41	95932	95950	F	ycf2	ycf2
37	41	95932	152170	P	ycf2	ycf2
38	44	78648	78648	P	psbT/psbN^∗	psbT/psbN^∗
39	57	6935	6935	P	rps16/trnQ-UUG^∗	rps16/trnQ-UUG^∗
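
The Type column in Table 3 is not expanded in this section. Assuming it follows the common repeat-finder convention (F = forward, R = reverse, C = complement, P = palindromic, i.e. reverse complement), the relationship between two repeat copies can be sketched as below. The function name and the toy sequences are hypothetical and are not taken from the M. rubra genome.

```python
from typing import Optional

# Translation table for the DNA complement (A<->T, C<->G).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def repeat_type(first: str, second: str) -> Optional[str]:
    """Classify how `second` relates to `first` under the assumed convention.

    F: identical sequence in the same orientation
    R: `second` is `first` read backwards
    C: `second` is the base-by-base complement of `first`
    P: `second` is the reverse complement of `first`
    Returns None if none of these relations holds exactly.
    """
    a, b = first.upper(), second.upper()
    if a == b:
        return "F"
    if a[::-1] == b:
        return "R"
    complement = a.translate(_COMPLEMENT)
    if complement == b:
        return "C"
    if complement[::-1] == b:
        return "P"
    return None


# Toy sequences, for illustration only.
for other in ("AAGC", "CGAA", "TTCG", "GCTT", "AAAA"):
    print("AAGC", other, repeat_type("AAGC", other))
```

This sketch requires exact matches; repeat-detection tools typically also allow a small number of mismatches between the two copies.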

Table 4

Substitution rates of 75 protein-coding genes in four Fagales chloroplast genomes.

Taxa	Nonsynonymous (K_A)	Synonymous (K_S)	K_A/K_S
Morella rubra	0.0901 ± 0.0196	0.1547 ± 0.0258	0.7561
Juglans regia	0.0889 ± 0.0205	0.1489 ± 0.0234	0.7442
Castanea mollissima	0.0962 ± 0.0217	0.1605 ± 0.0248	0.7859
Ostrya rehderiana	0.0879 ± 0.0188	0.1556 ± 0.0239	0.7248
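
As a worked check on Table 4, the sketch below divides each tabulated mean K_A by the corresponding mean K_S and prints the result next to the reported K_A/K_S. The function name is an illustrative assumption, and how the reported ratios were aggregated is not stated in this section.

```python
# Mean K_A, mean K_S and reported K_A/K_S copied from Table 4.
TABLE4 = {
    "Morella rubra": (0.0901, 0.1547, 0.7561),
    "Juglans regia": (0.0889, 0.1489, 0.7442),
    "Castanea mollissima": (0.0962, 0.1605, 0.7859),
    "Ostrya rehderiana": (0.0879, 0.1556, 0.7248),
}


def ratio_of_means(ka_mean: float, ks_mean: float) -> float:
    """Quotient of the two tabulated means (not a mean of per-gene ratios)."""
    return ka_mean / ks_mean


for taxon, (ka, ks, reported) in TABLE4.items():
    quotient = ratio_of_means(ka, ks)
    print(f"{taxon}: mean K_A / mean K_S = {quotient:.4f}, reported = {reported:.4f}")
```

On these numbers the quotient of the means falls between roughly 0.56 and 0.60 for every taxon, below the reported 0.72 to 0.79. A mean of per-gene ratios generally differs from the ratio of two means, which may account for the gap, but that reading is an assumption. Either way, all values remain below 1.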

58 in total

1. Automatic annotation of organellar genomes with DOGMA.

Authors: Stacia K Wyman; Robert K Jansen; Jeffrey L Boore
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

2. Using plastid genome-scale data to resolve enigmatic relationships among basal angiosperms.

Authors: Michael J Moore; Charles D Bell; Pamela S Soltis; Douglas E Soltis
Journal: Proc Natl Acad Sci U S A Date: 2007-11-28 Impact factor: 11.205

3. The chloroplast genome of Phalaenopsis aphrodite (Orchidaceae): comparative analysis of evolutionary rate with that of grasses and its phylogenetic implications.

Authors: Ching-Chun Chang; Hsien-Chia Lin; I-Pin Lin; Teh-Yuan Chow; Hong-Hwa Chen; Wen-Huei Chen; Chia-Hsiung Cheng; Chung-Yen Lin; Shu-Mei Liu; Chien-Chang Chang; Shu-Miaw Chaw
Journal: Mol Biol Evol Date: 2005-10-05 Impact factor: 16.240

4. Complete plastid genome sequences of three Rosids (Castanea, Prunus, Theobroma): evidence for at least two independent transfers of rpl22 to the nucleus.

Authors: Robert K Jansen; Christopher Saski; Seung-Bum Lee; Anne K Hansen; Henry Daniell
Journal: Mol Biol Evol Date: 2010-10-08 Impact factor: 16.240

5. Characterization of the complete chloroplast genome of Hevea brasiliensis reveals genome rearrangement, RNA editing sites and phylogenetic relationships.

Authors: Sithichoke Tangphatsornruang; Pichahpuk Uthaipaisanwong; Duangjai Sangsrakru; Juntima Chanprasert; Thippawan Yoocha; Nukoon Jomchai; Somvong Tragoonrung
Journal: Gene Date: 2011-01-15 Impact factor: 3.688

6. Human DNA polymorphisms and methods of analysis.

Authors: J L Weber
Journal: Curr Opin Biotechnol Date: 1990-12 Impact factor: 9.740

7. Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns.

Authors: Robert K Jansen; Zhengqiu Cai; Linda A Raubeson; Henry Daniell; Claude W Depamphilis; James Leebens-Mack; Kai F Müller; Mary Guisinger-Bellian; Rosemarie C Haberle; Anne K Hansen; Timothy W Chumley; Seung-Bum Lee; Rhiannon Peery; Joel R McNeal; Jennifer V Kuehl; Jeffrey L Boore
Journal: Proc Natl Acad Sci U S A Date: 2007-11-28 Impact factor: 11.205

8. High levels of gene flow and genetic diversity in Irish populations of Salix caprea L. inferred from chloroplast and nuclear SSR markers.

Authors: Aude C Perdereau; Colin T Kelleher; Gerry C Douglas; Trevor R Hodkinson
Journal: BMC Plant Biol Date: 2014-08-07 Impact factor: 4.215

9. OrganellarGenomeDRAW--a suite of tools for generating physical maps of plastid and mitochondrial genomes and visualizing expression data sets.

Authors: Marc Lohse; Oliver Drechsel; Sabine Kahlau; Ralph Bock
Journal: Nucleic Acids Res Date: 2013-04-22 Impact factor: 16.971

10. Comparative chloroplast genomics: analyses including new sequences from the angiosperms Nuphar advena and Ranunculus macranthus.

Authors: Linda A Raubeson; Rhiannon Peery; Timothy W Chumley; Chris Dziubek; H Matthew Fourcade; Jeffrey L Boore; Robert K Jansen
Journal: BMC Genomics Date: 2007-06-15 Impact factor: 3.969
