Literature DB >> 25522147

Complete chloroplast genome of Macadamia integrifolia confirms the position of the Gondwanan early-diverging eudicot family Proteaceae.

Catherine J Nock, Abdul Baten, Graham J King.   

Abstract

BACKGROUND: Sequence data from the chloroplast genome have played a central role in elucidating the evolutionary history of flowering plants, Angiospermae. In the past decade, the number of complete chloroplast genomes has burgeoned, leading to well-supported angiosperm phylogenies. However, some relationships, particulary among early-diverging lineages, remain unresolved. The diverse Southern Hemisphere plant family Proteaceae arose on the ancient supercontinent Gondwana early in angiosperm history and is a model group for adaptive radiation in response to changing climatic conditions. Genomic resources for the family are limited, and until now it is one of the few early-diverging 'basal eudicot' lineages not represented in chloroplast phylogenomic analyses.
RESULTS: The chloroplast genome of the Australian nut crop tree Macadamia integrifolia was assembled de novo from Illumina paired-end sequence reads. Three contigs, corresponding to a collapsed inverted repeat, a large and a small single copy region were identified, and used for genome reconstruction. The complete genome is 159,714 bp in length and was assembled at deep coverage (3.29 million reads; ~2000 x). Phylogenetic analyses based on 83-gene and inverted repeat region alignments, the largest sequence-rich datasets to include the basal eudicot family Proteaceae, provide strong support for a Proteales clade that includes Macadamia, Platanus and Nelumbo. Genome structure and content followed the ancestral angiosperm pattern and were highly conserved in the Proteales, whilst size differences were largely explained by the relative contraction of the single copy regions and expansion of the inverted repeats in Macadamia.
CONCLUSIONS: The Macadamia chloroplast genome presented here is the first in the Proteaceae, and confirms the placement of this family with the morphologically divergent Plantanaceae (plane tree family) and Nelumbonaceae (sacred lotus family) in the basal eudicot order Proteales. It provides a high-quality reference genome for future evolutionary studies and will be of benefit for taxon-rich phylogenomic analyses aimed at resolving relationships among early-diverging angiosperms, and more broadly across the plant tree of life.

Entities:  

Mesh:

Substances:

Year:  2014        PMID: 25522147      PMCID: PMC4290595          DOI: 10.1186/1471-2164-15-S9-S13

Source DB:  PubMed          Journal:  BMC Genomics        ISSN: 1471-2164            Impact factor:   3.969


Background

Chloroplasts are the plastid organelles responsible for photosynthesis, and their genomes have proven to be a valuable resource for plant phylogenetics, population genetics, species identification and genetic engineering. High-throughput next generation sequencing (NGS) technologies have led to a rapid growth in the number of available chloroplast (cp) genomes, including representatives of most major lineages of green plants, Viridiplantae [1]. The quadripartite structure of the plant cp genome is highly conserved, with an inverted repeat region separating the small and large single repeat regions in most species [2]. Molecular phylogenomic studies utilising cp genome sequence data from the genes and slowly-evolving inverted repeat regions have been applied to unravel the deep-level evolutionary relationships of plant taxa [1-5], producing robust phylogenies that are corroborated by sequence data from mitochondrial and nuclear genomes [6]. Although cp genome phylogenies have been enormously important in resolving relationships among the flowering plants Angiospermae, the position of some lineages remains unresolved. Relationships among early-diverging lineages, including basal angiosperms, Magnoliidae (magnoliids), Monocotyledoneae (monocots) and basal Eudicotyledoneae (eudicots) have been among the most problematic due to rapid diversification early in the history of flowering plants [7]. Increased taxon sampling, particularly for taxa representing deep-level divergences, may provide resolution. [8,9]. The basal eudicot order Proteales contains the families Nelumbonaceae, Platanaceae and Proteaceae [10]. Fossil evidence and fossil-calibrated molecular dating indicate family-level divergence within the order by the early Cretaceous, over 110 million years ago [11]. There is evidence for long-term morphological and molecular stasis in the Nelumbonaceae and Platanaceae, and the only extant genera Nelumbo and Platanus are both regarded as 'living fossils' [12]. By contrast, the Southern Hemisphere family Proteaceae is morphologically and ecologically diverse. Approximately 79 genera and 1700 species are recognised, including the Australian Banksia and Macadamia and African Protea. Current distribution is the result of both vicariance during Gondwanan breakup and long-distance dispersal [13]. The Proteaceae exhibits remarkably variable levels of endemism and species-richness, notably in the Mediterranean climate biodiversity hotspots of Southwest Australia and the Cape Floristic Region [14,15]. It is, therefore, a family of great interest for studies of speciation, diversification, biogeography and evolution [16-18]. However, genomic resources for the Proteaceae are limited and little is known of the composition and organisation of the genomes and their evolution. Here, as part of an ongoing effort to establish a comprehensive understanding of the macadamia genomes, we present the complete and annotated DNA sequence for the chloroplast from Macadamia integrifolia, to our knowledge the first in the Proteaceae. Given that the closest reference sequences of Platanus and Nelumbo are over 100 million years divergent, the Macadamia cp genome was assembled de novo at deep coverage.

Results

De novo genome assembly

After trimming for low quality bases and adapter sequences, there were 1.54 × 108 reads with an average read length of 105 base pairs (bp). De novo assembly produced 540,582 contiguous sequences (contigs) with an N50 of 2,540. The maximum and average contig lengths were 300,523 and 1,032 respectively. Three chloroplast contigs were identified, with greatest similarity to Platanus occidentalis based on total alignment score and percentage sequence identity. These contigs totalled 133,617 bp in length and corresponded to the large single copy (88,300 bp), small single copy (18,888 bp) and a double-coverage, collapsed consensus of the inverted repeat regions (26,429). They were aligned to the Platanus cp genome using MUMmer as a starting point to order and assemble the draft genome (Fig. S1 in Additional File 1). The single collapsed inverted repeat (IR) contig was separated into two repeat regions. Assembly of the two IR and the large single copy (LSC) and small single copy (SSC) contigs covered the complete sequence without gaps. Iterations of assembly, realignment and editing using BWA, MUMmer and Gap5 were performed to complete the genome assembly. Sanger sequences spanning the inverted repeat and de novo contig junctions confirmed those in the final assembly. Reference mapping of paired-end reads was used to determine quality and coverage of the finished Macadamia cp genome. Following re-assembly of reads, the 26,429 nucleotide positions of each inverted repeat region were examined for differences and found to be identical. In total, 3.29 million reads (2.12%) were mapped. Median coverage was 1,999 times and the minimum coverage of any position was 600.

Chloroplast genome of Macadamia integrifolia and comparative analyses

The cp genome of M. integrifolia is 159,714 bp in length with a typical quadripartite structure [Genbank:KF862711, Figure 1]. The LSC, SSC and IR regions are 88,093, 18,813 and 26,404 bp respectively and GC content is 38.1%. Gene content and order is identical in Macadamia, Platanus and Nelumbo with each sharing 79 protein-coding, 30 tRNA and 4 rRNA genes. Size differences among Proteales cp genomes are primarily due to expansion of the IR and corresponding reductions in the LSC and SSC regions in Macadamia relative to Platanus and Nelumbo. Indels are located primarily in noncoding regions with the largest a 1,749 bp deletion in Macadamia relative to Platanus in the ndhC to trnV-UAC inter-genic spacer (Table 1, Figure 2). Based on internal stop codons, ycf68 in the Macadamia cp genome is a pseudogene, as in Platanus and Nelumbo and many other angiosperms. In Macadamia ycf15 is intact, with an amino acid sequence identical to many other angiosperms including the magnoliid Calycanthus floridus. In Platanus, ycf15 is a pseudogene [19], likely due to a 597 bp deletion in the ycf15 coding region relative to Macadamia (Figure 2). The rps19 gene is located at the 3'-end of the IR regions in Macadamia and Nelumbo, and spans the IRA-LSC and LSC-IRB junction in Platanus only. The presence of ACG start codons in ndhD, psbL and rpl2 suggests that RNA editing is required for translation of these genes in Macadamia and Platanus.
Figure 1

Chloroplast genome map of . Organisation of the chloroplast genome of Macadamia integrifolia (Proteaceae) showing annotated genes. The inner circle shows the location of the large and small single copy (LSC and SSC) and inverted repeat regions (IRA and IRB).

Table 1

Characteristics of Proteales chloroplast genomes and primary noncoding indels contributing to length differences, in base pairs

Region Macadamia Platanus Nelumbo
Chloroplast genome159714161791163307
Inverted Repeats, IRA and IRB264042506626065
ycf2 to trnL-CAA spacer10144241022
trnI-GAU intron946942760
Small Single Copy, SSC188131950919330
ndhG to ndhI spacer362380467
ndhA intron219421871915
rpl32 to trnL-UAG spacer11496471070
Large Single Copy, LSC880939215091847
rps16 to trnQ-UUG spacer178412992051
ndhC to trnV-UAC spacer51722662156
ycf3 to trnS-GGA spacer308889926
petA to psbJ spacer8419841105
trnT-UGU to trnL-UAA spacer64914191031
IRB-SSC Junction
trnN-GUU to ndhF spacer15557461668
SSC-IRA Junction
ycf 1553857485520
Figure 2

Sequence identity plot comparing the chloroplast genome of . Pairwise comparisons between and chloroplast genomes from the Proteales genera (top) and (bottom) using mVISTA. The -axis represents % identity ranging from 50-100%. Coding and non-coding regions are marked in purple and pink respectively. Also shown are the genes discussed in the text (red), large single copy (LSC, green), small single copy (SSC, blue) and inverted repeat (IR, grey) regions.

Chloroplast genome map of . Organisation of the chloroplast genome of Macadamia integrifolia (Proteaceae) showing annotated genes. The inner circle shows the location of the large and small single copy (LSC and SSC) and inverted repeat regions (IRA and IRB). Characteristics of Proteales chloroplast genomes and primary noncoding indels contributing to length differences, in base pairs Sequence identity plot comparing the chloroplast genome of . Pairwise comparisons between and chloroplast genomes from the Proteales genera (top) and (bottom) using mVISTA. The -axis represents % identity ranging from 50-100%. Coding and non-coding regions are marked in purple and pink respectively. Also shown are the genes discussed in the text (red), large single copy (LSC, green), small single copy (SSC, blue) and inverted repeat (IR, grey) regions.

Characterisation of cpSSR loci

In total, 59 chloroplast simple sequence repeat (cpSSR) regions were identified in Macadamia. Of these, 57 were mononucleotide (A/T) and two were dinucleotide (AT/TA) repeats. The majority (79%), were located in noncoding sections of the LSC region. However, 14 cpSSR are located in exons including two in ycf1 replicated in the inverted repeat regions. No tri- or tetranucleotide repeats over 15 bp in length were found. Of particular interest for population genetics studies are regions of the clpP intron (660bp) and trnK to rps16 intergenic spacer (818bp) containing multiple SSRs as they are co-located in short sections amenable to PCR amplification and Sanger sequencing. The 13 cpSSRs in noncoding regions shared with Platanus are also of interest as they are likely to be present and may be variable in other Proteaceae species (Table 2).
Table 2

Distribution of Macadamia integrifolia chloroplast simple sequence repeat (cpSSR) regions

cpSSRrepeat motifLength bpstartendRegioncpSSRrepeat motifLength bpstartendRegion
1A15273288LSCtrnH-psbA32A106761767627LSCpsbF exon
2A1018451855LSCtrnK-matK33A136888768900LSCpsbE-petL
3A1344624475LSCtrnK-rps1634T127098370995LSCpsaJ-rpl33
4 bT1152695280LSCtrnK-rps1635A107291472924LSCrpl20-rps12
5T1569977012LSCrps16-trnQ36A167294372959LSCrpl20-rps12
6A1071417151LSCrps16-trnQ37 a,bT107437974389LSCclpP intron
7A1299359947LSCtrnS-trnG38T107463374643LSCclpP intron
8 bT141134511359LSCtrnG-trnR39A117502875039LSCclpP intron
9A171151411531LSCtrnR-atpA40 aA148169181705LSCpetD-rpoA
10 aA101318613196LSCatpA-atpF41 aT138384283855LSCinfa-rps8
11 bA111434914360LSCatpF intron42T178494184958LSCrpl14-rpl16
12T111549715508LSCatpH-atpI43T128642186433LSCrps16-rps3
13T101783917849LSCrps2 exon44AT1647824798LSCtrnK-rps16
14A101810518115LSCrps2-rpoC245AT163606836084LSCtrnT-psbD
15 bT102018520195LSCrpoC2 exon
16 a,bT112031620327LSCrpoC2 exon46T118810088111IRBrps19-rpl2
17T122293422946LSCrpoC1 exon47T10114378114388IRBycf1 exon
18A113032530336LSCtrnC-petN48T10114477114487IRBycf1 exon
19T103231232322LSCpsbM-trnD
20 aT103436534375LSCtrnE-trnT49 bT18117406117424SSCndhF-rpl32
21A123594435956LSCtrnT-psbD50A10125592125602SSCndhA intron
22 aT103908239092LSCpsbC-trnS51A14125805125819SSCndhA intron
23A163992939945LSCpsbZ-trnG52T10128234128244SSCndhH-rps15
24 aA154800748022LSCycf3 intron53T12129915129927SSCycf1 exon
25A124827148283LSCycf3-trnS54A15130143130158SSCycf1 exon
26T144989849912LSCtrnT-trnL55 aA11132293132304SSCycf1 exon
27T145105751071LSCtrnL-trnF56T10132660132670SSCycf1 exon
28T135197251985LSCtrnF-ndhJ
29T105260052610LSCndhJ-ndhK57A10133320133330IRAycf1 exon
30 a,bT105533555345LSCtrnM-atpE58A10133419133429IRAycf1 exon
31T146362963643LSCycf4-cemA59A11159696159707IRArpl2-trnH
cpSSR regions also present in Platanus (a), Nelumbo (b)
Distribution of Macadamia integrifolia chloroplast simple sequence repeat (cpSSR) regions

Phylogenetic analyses

Maximum likelihood (ML) analyses were performed on 87-taxa chloroplast gene and 160-taxa IR alignments in order to determine the position of Macadamia within Angiospermae, and the consequence of its inclusion on inferring phylogenetic relationships among basal eudicots.

Chloroplast Gene Phylogeny

The final 87-taxa and 83-gene alignment used for analyses was 66,738 bp in length. The proportion of gaps and undetermined characters was 4.04 %, and GC content was 38.4%. The optimal partitioning scheme identified under the Bayesian information criteria (BIC) using relaxed clustering analysis in PartitionFinder (lnL = -1081518.0; BIC 2170800.8) contained 49 partitions. Maximum likelihood analyses under the 49-partition and single partition (hereafter unpartitioned) strategies and the GTR+Γ model produced identical topologies. The ML 'best' tree with the highest likelihood score (lnL = -1087999.5) produced by the partitioned ML analysis (Figure 3; Fig. S2 in Additional File 2) shared the same topology as the best tree from unpartitioned analysis (lnL= -1110496.6).
Figure 3

Chloroplast gene . Phylogram of the best tree determined by RAxML for the 83-gene, 87-taxa, 49-partition data set. Numbers associated with branches are maximum likelihood percentage bootstrap support values for partitioned and unpartitioned (in italics) analyses. Unnumbered branches had 100% support. Collapsed monophyletic clades and number of taxa in brackets are monocots (11), superasterids (26), superosids (27), magnoliids (3) and gymnosperm outgroup (3). Scale represents substitutions per site.

Chloroplast gene . Phylogram of the best tree determined by RAxML for the 83-gene, 87-taxa, 49-partition data set. Numbers associated with branches are maximum likelihood percentage bootstrap support values for partitioned and unpartitioned (in italics) analyses. Unnumbered branches had 100% support. Collapsed monophyletic clades and number of taxa in brackets are monocots (11), superasterids (26), superosids (27), magnoliids (3) and gymnosperm outgroup (3). Scale represents substitutions per site.

Inverted repeat region phylogeny

The final IR alignment used for analyses was 24,693 bp in length, including 10,781 bp (43.7%) of non-coding sequence from spacers and introns. The proportion of gaps and undetermined characters was 13.2% and GC content was 42.5%. The optimal partitioning scheme in PartitionFinder (lnL = -140178.1; BIC 296837.1) contained 5 partitions. Maximum likelihood analyses under the 5-partition and unpartitioned strategies with the GTR+Γ model produced identical topologies. The ML 'best' tree (lnL = -261860.8) produced by the partitioned analysis (Figure 4; Fig. S3 in Additional File 3) shared the same topology as the best tree from unpartitioned analysis (lnL= -288975.8).
Figure 4

Inverted repeat . Phylogram of the best tree determined by RAxML for the inverted repeat, 160-taxa, 5-partition data set. Numbers associated with branches are maximum likelihood percentage bootstrap support values for partitioned and unpartitioned (in italics) analyses. Unnumbered branches had 100% support. Collapsed monophyletic clades and number of taxa in brackets are superasterids (59), superosids (82) and Dilleniaceae (7). Scale represents substitutions per site.

Inverted repeat . Phylogram of the best tree determined by RAxML for the inverted repeat, 160-taxa, 5-partition data set. Numbers associated with branches are maximum likelihood percentage bootstrap support values for partitioned and unpartitioned (in italics) analyses. Unnumbered branches had 100% support. Collapsed monophyletic clades and number of taxa in brackets are superasterids (59), superosids (82) and Dilleniaceae (7). Scale represents substitutions per site. Phylogenetic analyses based on both chloroplast genes and inverted repeat regions provided maximum bootstrap (BS) support for a sister relationship between Macadamia and Platanus, and for a Proteales clade also containing Nelumbo (BS 100%). Sabiaceae (Meliosma) was sister to the Proteales in all analyses, however, the level of support for this clade was lower in the IR (BS 53%) compared to the 83-gene (BS 70%) partitioned analyses (Figure 3, Figure 4). The 83-gene and IR phylogenies were highly congruent, with the only differences among basal eudicot taxa in the position of Buxus and Trochodendron. In the 83-gene phylogeny, support for a Buxus versus Trochondendron sister relationship to the core eudicots was marginal (BS 50%), whereas Trochodendron was sister to the core eudicots (= Gunneridae) in the partitioned (BS 90%) and unpartitioned (BS 86%) IR analyses respectively. The main topological difference among core eudicots was in the position of the three major clades: superrosids, superasterids and Dilleniaceae. In cp-gene phylogenies, Dillenia was sister to the superrosids (BS 90%) and in IR phylogenies Dillenaceae was sister to the superosids+superasterids (BS 80%).

Discussion

Characteristics of the Macadamia cp genome and comparison to other angiosperms

The chloroplast genome of Macadamia integrifolia cultivar HAES 741 was sequenced at deep coverage (~2000x) and assembled de novo using Illumina NGS reads. Structure, gene content and order appear to be highly conserved in the basal eudicot order Proteales, and in comparision to the inferred ancestral cp genome organization of Nicotiana tabacum and many other angiosperms [20]. Consistently high levels of conservation within Angiospermae are indicative of evolutionary constraints on the cp genome of photosynthetic plants [21]. Major differences among angiosperm cp genomes are due to gene losses, inversions and expansion/contraction of inverted repeat regions. Gene loss in parasitic plants can lead to markedly reduced cp genome size. For example, the cp genome of the underground orchid Rhizanthella gardneri is only 59 kb [22] and there may have been a complete loss of the plastid genome in Rafflesia lagascae [23]. Gene loss can also be due to the transfer of cp genes to the nuclear genome [24]. Gene order is largely conserved among angiosperms including Macadamia and other basal eudicots, however, large inversions altering gene order have been reported in some core eudicot species [e.g. 25,26]. The main effects of expansion and contraction of the IR regions at the LSC and SSC junctions are the formation of pseudogenes, and changes in genome size and evolutionary rate [24]. The smaller cp genome of Macadamia compared to Platanus and Nelumbo is primarily due to relative reduction of the single copy regions, with most deletions in intergenic spacers and introns. The Macadamia IR at 26.4 kb is the largest yet reported in the Proteales but is considerably smaller than those of the basal eudicot Trochondendron (30.7 kb) and the core eudicot Pelargonium x hortorum, (76 kb) [25,27]. Proteales cp genomes also differ in the complement of pseudogenes. The intact ycf15 gene of Macadamia is, in Platanus a pseudogene due to a large deletion. The function and validity of ycf15 are uncertain, and there is no evidence of chloroplast-nuclear gene transfer in angiosperms with intact or disabled ycf15 genes [28]. The rps19 gene spans the IRA-LSC junction causing an pseudogene in the IRB of Platanus.

Phylogenetic implications and the position of Proteaceae

Australia is the origin and centre of diversity of the Proteaceae, and this morphologically distinct and diverse family is distributed across remnant landmasses of the southern supercontinent Gondwana [15]. The order Proteales inclusive of Proteaceae, Platanaceae and Nelumbonaceae was established relatively recently, on the basis of molecular data, and morphological synapomorphies for the order are yet to be identified [29-31]. The 83 cp gene and IR region alignments used in this study are the largest sequence-rich datasets to include the basal eudicot family Proteaceae. Maximum likelihood phylogenies confirmed the position of Macadamia within the Proteales, and were congruent and largely concordant with recent phlyogenomic studies [1,3-6]. There was maximum support for a sister relationship between Proteaceae and Platanaceae, and for a Proteales clade containing these families and Nelumbonaceae. A 640-taxa angiosperm phylogeny using 17 genes from all three plant genomes included the Proteaceae taxa Petrophile and Roupala [6], and a 57-taxa, 17 kb alignment of chloroplast introns, spacers and genes included Embothrium and Grevillea [32]. Both studies confirmed inclusion of Proteaceae in the Proteales (BS 100%), in accordance with the Angiosperm Phylogeny Group III system [10]. In this study, a clade containing Sabiaceae (Meliosma) and Proteales was recovered from both the 83-gene and IR analyses with moderate support (Figure 3, Figure 4). These results are consistent with those from previous phylogenomic studies with support values ranging from 43-88%, although a Proteales-Sabiaceae clade was not recovered in all analyses [1-6,32]. The inclusion of Macadamia in taxon-rich chloroplast gene and inverted repeat alignments produced largely congruent and well-supported ML phylogenies. The main topological differences were in the positions of taxa representing lineages that are unplaced in the APG III system including Dillenia, Trochodendron and Buxus [10]. Within core eudicots, there was conflicting strong support for a sister relationship between (1) Dillenia and superrosids, and (2) Dillenaceae, represented by 7 genera, and superosids+superasterids, in cp-gene and IR analyses respectively. There was strong support for Trochodendron as sister to core eudicots in IR analyses, whilst the core eudicot sister was undetermined between Trochodendron and Buxus in cp-gene analyses (BS 50%). Interestingly, previous studies provided strong support for a Buxaceae-core eudicot clade based on data from the cp, mitochondrial and nuclear genes [6] and for an alternative Trochodendron-core eudicot clade using the cp IR region. Efforts to resolve relationships among unplaced angiosperm lineages, are hampered by short internal branch lengths due to rapid divergence of major lineages in the Cretaceous [7]. Full resolution of relationships among basal eudicots may require denser sampling of both taxa and genes.

Utility of the Macadamia chloroplast genome

Problems in identfying a single locus DNA barcode for plants, and advances in sequencing technologies have led to suggestions that the cp genome could have utility in species identification [33,34]. Possible obstacles include the cost and complexity of assembly [35]. However, the advantages of using a NGS approach to chloroplast DNA barcoding include the potential to eliminate PCR and hence reliance on 'universal' primers. Given the widely reported transfer of chloroplast sequence to the nuclear genome [36,37] avoidance of PCR further eliminates the risk of amplifying paralogous nuclear plastid-like sequences (NUPTs). The availability of high quality cp genomes for representatives of each of the 413 recognised angiosperm families should facilitate species identification. This can be achieved through rapid identification of cp sequences by reference mapping of low coverage NGS reads at multiple locations, without the requirement for complete genome assembly. Continual improvements in sequencing technologies, including increased read lengths and decreasing cost, in addition to new methods to optimise recovery of chloroplast sequences from plant DNA [38,39] are bringing cp genome-wide barcoding closer to reality. Whole chloroplast genome sequencing enables identification of intraspecific variation for phylogeographic studies, even in genetically depauperate species [40] and cpSSR regions have been widely used in population genetics [41,42]. The 59 Macadamia cpSSR identified in this study may provide markers with broad utility across Proteaceae species. The Macadamia cp genome is currently providing a reference sequence for inferring the domestication and evolutionary histories of Macadamia (unpublished results). Furthermore, it will be of benefit for taxon-rich phylogenomic studies and understanding of the evolution and adaptations underlying the remarkable diversity of this large Southern Hemisphere plant family [31].

Conclusions

The complete chloroplast genome of Macadamia integrifolia was assembled de novo from Illumina NGS reads, and provides the first reference genome sequence for the Gondwanan plant family, Proteaceae. Despite sequencing at deep coverage (~2000x) the genome was recovered in three contigs, one of which corresponded to a collapsed copy of the inverted repeat regions. Although genome assembly from these contigs was straightforward, this provides an illustration of the problems that large repeat regions present to de novo genome assembly from NGS short read sequence data. Phylogenetic analyses of both 83-gene and inverted repeat region alignments confirmed the position of Proteaceae in the order Proteales, with maximum support for a sister relationship between Platanaceae (Platanus) and Proteaceae (Macadamia). The Macadamia chloroplast genome provides a high-quality reference for future evolutionary studies within the Proteaceae and will be of benefit for taxon-rich phylogenomic analyses aiming to resolve relationships among early-diverging angiosperms and more broadly across the plant tree of life.

Methods

Sample Collection

Fresh leaf material was collected from a single Macadamia integrifolia, cultivar 741 'Mauka' individual from the Macadamia Varietal Trial plantation M2 at Clunes, New South Wales and stored at -80ºC prior to DNA extraction. A voucher specimen was deposited in the Southern Cross University herbarium [accession PHARM-13-0813].

DNA extraction, library preparation and Illumina sequencing

Leaf tissue was frozen in liquid nitrogen and ground using a tissue lyser (MM200, Retsch, Haan, Germany). Total genomic DNA was extracted using a DNeasy Plant Maxi kit (Qiagen Inc., Valencia, CA, USA) and quantified using a Qubit dsDNA BR assay (Life Technologies, Carlsbad, CA, USA). Genomic DNA was sheared using a Covaris S220 focused-ultrasonication device (Covaris Inc., Woburn USA) to a mean fragment size of 500 bp. A DNA library was prepared using Illumina TruSeq DNA Sample Preparation kit v2 following manufacturer's instructions (Illumina, San Diego, USA). Fragment size distribution and concentration were determined using a DNA 1000 chip on a Bioanalyser 2100 instrument (Agilent Technologies, Santa Clara, USA). Approximately 4 pmol of the library was paired-end sequenced (150 × 2 cycles) on an Illumina GA IIx instrument. Base calling was performed with Illumina Pipeline version 1.7.

De novo assembly of chloroplast genome

Paired-end sequence reads were trimmed to remove low quality bases (Q<20, 0.01 probabilility error) and adapter sequences in CLC Genomics Workbench, version 4.9 (CLC Bio, Aarhus, Denmark; http://www.clcbio.com). CLC de novo assembler, which utilises de Bruijn graphs for the assembly, was used for the assembly with the option to map reads back to contigs and the following optimised parameters: k-mer = 35; bubble size = 50; indel cost =3; length fraction = 0.5; similarity index = 0.8. In order to identify contigs of cp origin, assembled sequences were aligned to a local database containing angiosperm complete cp genome sequences from NCBI using BLASTN [43]. Contigs with significant alignment were selected for further analysis. Alignments were visualised using mummer dotplots [44] to estimate the proportion of the genome covered and to order and connect contigs. Quality trimmed reads were mapped back to contigs using Burrows-Wheeler Aligner [45] and contigs were extended and joined using Gap5 [46]. Coverage and quality of the draft genome sequence were assessed by reference mapping of trimmed paired-end reads using CLC Genomics Workbench. To confirm accuracy, sequences spanning contig and repeat region junction regions were PCR amplified using custom primers and Sanger sequenced. The finished genome was annotated using DOGMA (Dual Organellar Genome Annotator) [47] and deposited in Genbank [Genbank:KF862711].

Phylogenetic and Comparative Analyses

To compare structure and gene content within the order Proteales, and to identify variable regions the annotated cp genomes of Macadamia integrifolia, Platanus occidentalis [DQ923116] and Nelumbo nucifera [FJ754270] were aligned using MAFFT version 7.017 [48]. A visual representation of the alignment and regions of interspecific variation was generated in mVISTA [49]. Chloroplast SSR regions were identified using Msatcommander version 1.8.2 specifying minimum lengths of 10, 16, and 24 bp for mono-, di-, and tri- and tetranucleotides respectively [50]. To examine the position of Proteaceae within Angiospermae, genes were extracted from the Macadamia cp genome and added to the 83-gene, 86-taxa alignment of Moore et al. [4] in Geneious Pro, version 7.1.5 (Biomatters Ltd., Auckland, New Zealand). The gymnosperm outgroup taxa were Pinus, Cycas and Ginkgo. To further examine relationships among basal eudicot taxa, the slowly-evolving IR region of the Macadamia cp genome was aligned to a 159-taxa eudicot subset of the inverted repeat alignment of Moore et al. [5] in Geneious Pro with the outgroup taxon Ceratophyllum. Regions of alignment ambiguous sequence data in both matrices, and insertions present in one or few taxa were excluded. For each alignment, PartitionFinder version 1.1.1 was used to select the best-fit partitioning scheme under the Bayesian information criterion with the relaxed clustering algorithm, default 10% [51]. Data blocks analysed in the 83-gene alignment included first, second and third codon positions for 79 protein-coding genes, and 4 ribosomal RNA genes. Data blocks analysed in the IR alignment included codon positions for 7 protein-coding genes, and 4 ribosomal RNA genes, 7 transfer RNA genes, intergenic spacers and introns. Maximum likelihood analyses on partitioned and unpartitioned datasets were conducted using Randomised Accelerated Maximum Likelihood, RaxML version 7.4.2 with 100 bootstrap replicates and 10 subsequent thorough ML searches under the general time reversable (GTR) substition model and the gamma (Γ ) model of among site rate heterogeneity [52]. Bootstrap proportions were drawn on the tree with highest likelihood score from the 10 independent searches. Trees were visualised in FigTree version1.4.0.

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

CN and GK conceived and designed the study. CN performed de novo assembly, genome annotation, phylogenetic and other analyses and drafted the manuscript. AB participated in bioinformatics, genome assembly and phylogenetic analyses. All authors read and approved the final manuscript.

Additional File 1

Figure S1: Dot plot analysis of . Dotplot showing identify of three de novo assembled chloroplast contigs in comparison to the chloroplast genome of Click here for file

Additional File 2

Figure S2: Phylogram of the best ML tree determined by RaxML (lnL = -1087999.5) for the 83-gene, 87-taxa and 49-partition data set. Numbers associated with branches are ML percentage bootstrap support values. Click here for file

Additional File 3

Figure S3: Phylogram of the best ML tree determined by RAxML (lnL = -261860.8) for the inverted repeat region, 160-taxa and 5-partition data set. Numbers associated with branches are ML percentage bootstrap support values. Click here for file
  44 in total

1.  A greedy algorithm for aligning DNA sequences.

Authors:  Z Zhang; S Schwartz; L Wagner; W Miller
Journal:  J Comput Biol       Date:  2000 Feb-Apr       Impact factor: 1.479

2.  Automatic annotation of organellar genomes with DOGMA.

Authors:  Stacia K Wyman; Robert K Jansen; Jeffrey L Boore
Journal:  Bioinformatics       Date:  2004-06-04       Impact factor: 6.937

3.  Partitionfinder: combined selection of partitioning schemes and substitution models for phylogenetic analyses.

Authors:  Robert Lanfear; Brett Calcott; Simon Y W Ho; Stephane Guindon
Journal:  Mol Biol Evol       Date:  2012-01-20       Impact factor: 16.240

4.  Chloroplast genome sequences from total DNA for plant identification.

Authors:  Catherine J Nock; Daniel L E Waters; Mark A Edwards; Stirling G Bowen; Nicole Rice; Giovanni M Cordeiro; Robert J Henry
Journal:  Plant Biotechnol J       Date:  2010-08-27       Impact factor: 9.803

5.  RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models.

Authors:  Alexandros Stamatakis
Journal:  Bioinformatics       Date:  2006-08-23       Impact factor: 6.937

6.  Analysis of 81 genes from 64 plastid genomes resolves relationships in angiosperms and identifies genome-scale evolutionary patterns.

Authors:  Robert K Jansen; Zhengqiu Cai; Linda A Raubeson; Henry Daniell; Claude W Depamphilis; James Leebens-Mack; Kai F Müller; Mary Guisinger-Bellian; Rosemarie C Haberle; Anne K Hansen; Timothy W Chumley; Seung-Bum Lee; Rhiannon Peery; Joel R McNeal; Jennifer V Kuehl; Jeffrey L Boore
Journal:  Proc Natl Acad Sci U S A       Date:  2007-11-28       Impact factor: 11.205

7.  Correlation between nuclear plastid DNA abundance and plastid number supports the limited transfer window hypothesis.

Authors:  David Roy Smith; Kate Crosby; Robert W Lee
Journal:  Genome Biol Evol       Date:  2011-02-03       Impact factor: 3.416

8.  Contradiction between plastid gene transcription and function due to complex posttranscriptional splicing: an exemplary study of ycf15 function and evolution in angiosperms.

Authors:  Chao Shi; Yuan Liu; Hui Huang; En-Hua Xia; Hai-Bin Zhang; Li-Zhi Gao
Journal:  PLoS One       Date:  2013-03-18       Impact factor: 3.240

9.  An evaluation of the PacBio RS platform for sequencing and de novo assembly of a chloroplast genome.

Authors:  Marco Ferrarini; Marco Moretto; Judson A Ward; Nada Šurbanovski; Vladimir Stevanović; Lara Giongo; Roberto Viola; Duccio Cavalieri; Riccardo Velasco; Alessandro Cestaro; Daniel J Sargent
Journal:  BMC Genomics       Date:  2013-10-01       Impact factor: 3.969

10.  Insights from the complete chloroplast genome into the evolution of Sesamum indicum L.

Authors:  Haiyang Zhang; Chun Li; Hongmei Miao; Songjin Xiong
Journal:  PLoS One       Date:  2013-11-26       Impact factor: 3.240

View more
  11 in total

1.  Assembly and comparative analysis of the complete mitochondrial genome of three Macadamia species (M. integrifolia, M. ternifolia and M. tetraphylla).

Authors:  Yingfeng Niu; Yongjie Lu; Weicai Song; Xiyong He; Ziyan Liu; Cheng Zheng; Shuo Wang; Chao Shi; Jin Liu
Journal:  PLoS One       Date:  2022-05-03       Impact factor: 3.752

2.  InCoB2014: mining biological data from genomics for transforming industry and health.

Authors:  Christian Schönbach; Tin Tan; Shoba Ranganathan
Journal:  BMC Genomics       Date:  2014-12-08       Impact factor: 3.969

3.  Evolution of short inverted repeat in cupressophytes, transfer of accD to nucleus in Sciadopitys verticillata and phylogenetic position of Sciadopityaceae.

Authors:  Jia Li; Lei Gao; Shanshan Chen; Ke Tao; Yingjuan Su; Ting Wang
Journal:  Sci Rep       Date:  2016-02-11       Impact factor: 4.379

4.  Genome and transcriptome sequencing characterises the gene space of Macadamia integrifolia (Proteaceae).

Authors:  Catherine J Nock; Abdul Baten; Bronwyn J Barkla; Agnelo Furtado; Robert J Henry; Graham J King
Journal:  BMC Genomics       Date:  2016-11-17       Impact factor: 3.969

5.  Complete plastome sequencing of both living species of Circaeasteraceae (Ranunculales) reveals unusual rearrangements and the loss of the ndh gene family.

Authors:  Yanxia Sun; Michael J Moore; Nan Lin; Kole F Adelalu; Aiping Meng; Shuguang Jian; Linsen Yang; Jianqiang Li; Hengchang Wang
Journal:  BMC Genomics       Date:  2017-08-09       Impact factor: 3.969

6.  Complete chloroplast genome sequence of Dryopteris fragrans (L.) Schott and the repeat structures against the thermal environment.

Authors:  Rui Gao; Wenzhong Wang; Qingyang Huang; Ruifeng Fan; Xu Wang; Peng Feng; Guangming Zhao; Shuang Bian; Hongli Ren; Ying Chang
Journal:  Sci Rep       Date:  2018-11-09       Impact factor: 4.379

7.  Integrated analysis of three newly sequenced fern chloroplast genomes: Genome structure and comparative analysis.

Authors:  Ruifeng Fan; Wei Ma; Shilei Liu; Qingyang Huang
Journal:  Ecol Evol       Date:  2021-03-18       Impact factor: 2.912

8.  The complete chloroplast genomes of three Hamamelidaceae species: Comparative and phylogenetic analyses.

Authors:  NingJie Wang; ShuiFei Chen; Lei Xie; Lu Wang; YueYao Feng; Ting Lv; YanMing Fang; Hui Ding
Journal:  Ecol Evol       Date:  2022-02-16       Impact factor: 2.912

9.  The Chromosome-Scale Reference Genome of Macadamia tetraphylla Provides Insights Into Fatty Acid Biosynthesis.

Authors:  Yingfeng Niu; Guohua Li; Shubang Ni; Xiyong He; Cheng Zheng; Ziyan Liu; Lidan Gong; Guanghong Kong; Wei Li; Jin Liu
Journal:  Front Genet       Date:  2022-02-23       Impact factor: 4.599

10.  Chromosome-Scale Assembly and Annotation of the Macadamia Genome (Macadamia integrifolia HAES 741).

Authors:  Catherine J Nock; Abdul Baten; Ramil Mauleon; Kirsty S Langdon; Bruce Topp; Craig Hardner; Agnelo Furtado; Robert J Henry; Graham J King
Journal:  G3 (Bethesda)       Date:  2020-10-05       Impact factor: 3.154

View more

北京卡尤迪生物科技股份有限公司 © 2022-2023.