Organellar Genomes of White Spruce (Picea glauca): Assembly and Annotation.

Shaun D Jackman¹, René L Warren¹, Ewan A Gibb¹, Benjamin P Vandervalk¹, Hamid Mohamadi¹, Justin Chu¹, Anthony Raymond¹, Stephen Pleasance¹, Robin Coope¹, Mark R Wildung², Carol E Ritland³, Jean Bousquet⁴, Steven J M Jones⁵, Joerg Bohlmann⁶, Inanç Birol⁷.

Abstract

The genome sequences of the plastid and mitochondrion of white spruce (Picea glauca) were assembled from whole-genome shotgun sequencing data using ABySS. The sequencing data contained reads from both the nuclear and organellar genomes, and reads of the organellar genomes were abundant in the data because each cell harbors hundreds of mitochondria and plastids. Hence, assembly of the 123-kb plastid and 5.9-Mb mitochondrial genomes was accomplished by analyzing data sets primarily representing low coverage of the nuclear genome. The assembled organellar genomes were annotated for their coding genes, ribosomal RNA, and transfer RNA. Transcript abundances of the mitochondrial genes were quantified in three developmental tissues and five mature tissues using data from RNA-seq experiments. C-to-U RNA editing was observed in the majority of mitochondrial genes, and in four genes, editing events were noted to modify ACG codons to create cryptic AUG start codons. The informatics methodology presented in this study should prove useful to assemble organellar genomes of other plant species using whole-genome shotgun sequencing data.

Keywords: ABySS; genome assembly; gymnosperms; organelle; sequencing; white spruce

Year: 2015 PMID: 26645680 PMCID: PMC4758241 DOI: 10.1093/gbe/evv244

Source DB: PubMed Journal: Genome Biol Evol ISSN: 1759-6653 Impact factor: 3.416

Introduction

Most plant cells contain two types of organelles that harbor their own genomes: mitochondria and plastids. In Pinaceae, mitochondrial genomes are inherited maternally, and plastid genomes are inherited paternally (Whittle and Johnston 2002). Complete plastid genomes of the gymnosperms Norway spruce (Picea abies) (Nystedt et al. 2013), Podocarpus lambertii (Vieira Ldo, Faoro, Rogalski, et al. 2014), Taxus chinensis var. mairei (Zhang et al. 2014), and four Juniperus species (Guo et al. 2014) have recently been published in National Center for Biotechnology Information (NCBI) GenBank (Benson et al. 2014). These projects used a variety of strategies for isolating plastid DNA (cpDNA), either physically separating it in the laboratory or computationally separating cpDNA sequences from nuclear sequences. They also used different approaches for sequencing and assembly (table 1).

Table 1

Methods of cpDNA Separation, Sequencing, and Assembly of Complete Plastid Genomes of Gymnosperms Published

Species	cpDNA Separation	Sequencing	Sequence Assembler Software Tool
Picea abies	BLAST in silico	454 GS FLX Titanium^a	Newbler
Podocarpus lambertii	Saline Percoll gradient	Illumina MiSeq	Newbler
Juniperus bermudiana	Long-range PCR	Illumina GAII^a	Geneious
Other Juniperus	Unspecified	Illumina MiSeq	Velvet
Taxus chinensis	BLAT in silico	Illumina HiSeq 2000	SOAPdenovo

^a Finished with PCR and Sanger sequencing.

The P. abies project used 454 GS FLX Titanium sequencing and Sanger sequencing of polymerase chain reaction (PCR) amplicons for finishing, Basic Local Alignment Search Tool (BLAST) (Altschul et al. 1990) to isolate the cpDNA reads, and the software Newbler to assemble the reads. The Po. lambertii project isolated the cpDNA with a saline Percoll gradient protocol (Vieira Ldo, Faoro, Fraga, et al. 2014), used Illumina MiSeq sequencing data, and assembled the reads using the Newbler software. The Juniperus bermudiana project used long-range PCR to amplify the plastid DNA, a combination of Illumina GAII and Sanger sequencing, and the software Geneious to assemble the reads using Camellia japonica as a reference genome. The other three Juniperus projects used Illumina MiSeq sequencing and the software Velvet (Zerbino and Birney 2008) to assemble the reads. The T. chinensis project used whole-genome Illumina HiSeq 2000 sequencing, BLAT (Kent 2002) to isolate the cpDNA reads, and SOAPdenovo (Luo et al. 2012) to assemble the isolated cpDNA reads. All of these projects used DOGMA (Wyman et al. 2004) to annotate their assemblies.

Only one complete mitochondrial genome of a gymnosperm has been published so far (Cycas taitungensis [Chaw et al. 2008]), whereas complete mitochondrial genome sequences of the angiosperms Brassica maritima (Grewe et al. 2014), Brassica oleracea (Grewe et al. 2014), Capsicum annuum (Jo et al. 2014), Eruca sativa (Wang et al. 2014), Helianthus tuberosus (Bock et al. 2014), Raphanus sativus (Jeong et al. 2014), Rhazya stricta (Park et al. 2014), and Vaccinium macrocarpon (Fajardo et al. 2014) have been deposited in NCBI GenBank. Six of these projects gave details of the sample preparation, sequencing, assembly, and annotation strategy. Three projects enriched organellar DNA using various laboratory methods (Kim et al. 2007; Keren et al. 2009; Chen et al. 2011), and the remainder used total genomic DNA. Three projects used Illumina HiSeq 2000 sequencing and Velvet for assembly, and three projects used Roche 454 GS-FLX sequencing and Newbler for assembly. Most projects used an aligner such as BLAST (Altschul et al. 1990) to isolate sequences with similarity to known mitochondrial sequence, either before or after assembly. Two projects used Mitofy (Alverson et al. 2010) to annotate the genome, and the remainder used a collection of tools such as BLAST, tRNAscan-SE (Lowe and Eddy 1997), and ORF Finder to annotate genes.

Plant mitochondrial genomes can vary substantially in size, with some of the largest mitochondrial genomes reported for the basal angiosperm Amborella trichopoda (3.9 Mb) (Rice et al. 2013) and the two Silene species S. noctiflora and S. conica (6.7 and 11.3 Mb, respectively) (Sloan et al. 2012). The mitochondrial genome of the gymnosperm Cycas is comparatively small, with a length of 415 kb (Chaw et al. 2008).

The SMarTForests project (www.smartforests.ca) has recently published a set of stepwise improved assemblies of the 20-Gb white spruce (Picea glauca) genome (Birol et al. 2013; Warren, Keeling, et al. 2015; Warren, Yang, et al. 2015), a gymnosperm genome seven times the size of the human genome, sequenced using the Illumina HiSeq and MiSeq sequencing platforms. The whole-genome sequencing data contained reads originating from both the nuclear and organellar genomes. Although one copy of the diploid nuclear genome is found in each cell, hundreds of organelles are present, and thus hundreds of copies of the organellar genomes. This abundance results in an overrepresentation of the organellar genomes in whole-genome sequencing data. Assembling low-coverage white spruce whole-genome shotgun (WGS) sequencing data using the software ABySS (Simpson et al. 2009) yielded assemblies composed mainly of organellar sequences and nuclear repeat elements. The assembled sequences that originate from the organellar genomes were separated from those of nuclear origin by classifying the sequences using their length, depth of coverage, and GC content.

The plastid genome of white spruce was compared with that of Norway spruce (P. abies) (Nystedt et al. 2013), and the mitochondrial genome of white spruce was compared with that of prince sago palm (C. taitungensis) (Chaw et al. 2008). Notably, white spruce and Norway spruce belong to phylogenetically remote spruce lineages (Bouillé et al. 2011) and their split occurred at least 10 Ma (Bouillé and Bousquet 2005), resulting in a nuclear genome sequence divergence of approximately 3% (Warren, Keeling, et al. 2015).

Materials and Methods

DNA, RNA, and Software Materials

Genomic DNA was collected from the apical shoot tissues of a single interior white spruce tree, clone PG29 from the British Columbia Ministry of Forests, Lands and Natural Resource Operations, and sequencing libraries were constructed as described previously (Birol et al. 2013). Because the original intention of this sequencing project was to assemble the nuclear genome, an organelle exclusion method was used to preferentially extract nuclear DNA. However, sequencing reads from both organellar genomes were present in sufficient depth to assemble their genomes. RNA was extracted from eight samples, comprising three developmental stages and five mature tissues: megagametophyte, embryo, seedling, young bud, xylem, mature needle, flushing bud, and bark, as described earlier (Warren, Keeling, et al. 2015). These samples were sequenced with the Illumina HiSeq 2000 (Warren, Keeling, et al. 2015). The RNA-seq data were used to quantify the transcript abundance of the annotated mitochondrial genes using the software Salmon (Patro et al. 2014). The software used in this analysis and their versions are listed in supplementary table S1, Supplementary Material online. All software tools were installed using Homebrew (http://brew.sh, last accessed December 17, 2015).

Plastid Genome Assembly

A single lane of Illumina MiSeq paired-end sequencing (SRR525215) was used to assemble the plastid genome. Paired-end sequencing usually leaves a gap of unsequenced nucleotides in the middle of the DNA fragment. Because 300-bp paired-end reads were sequenced from a library of 500-bp DNA fragments, the reads are expected to overlap by 100 bp. These overlapping paired-end reads were merged using ABySS-mergepairs, a component of the software ABySS (Simpson et al. 2009). These merged reads were assembled using ABySS. Contigs that were putatively derived from the plastid genome were separated by length and depth of coverage using thresholds chosen by inspection of a scatter plot (see supplementary fig. S1, Supplementary Material online). These putative plastid contigs were assembled into scaffolds using ABySS-scaffold. We ran the gap-filling application Sealer (Paulino et al. 2015) (options -v -j 12 -b 30G -B 300 -F 700 with -k from 18 to 108 with step size 6) on the ABySS assembly of the plastid genome, closing five of the remaining seven gaps, with a resulting assembly consisting of two large (∼50 and ∼70 kb) scaftigs. Given the small size of the plastid genome, we opted to manually finish the assembly using the software Consed 20.0 (Gordon and Green 2013). We loaded the resulting gap-filled assembly into Consed and imported Pacific Biosciences (PacBio) sequencing data (SRR2148116 and SRR2148117), 9,204 reads 500 bp and larger, into the assembly and aligned them to the plastid genome using cross_match from within Consed. For each scaftig end, six PacBio reads were pulled out and assembled using the mini-assembly feature in Consed. Cross_match alignments of the resulting contigs to the plastid assembly were used to merge the two scaftigs and confirm that the complete circular genome sequence was obtained. In a subsequent step, 7,742 Illumina HiSeq reads were imported and aligned to the assembly using Consed. 
These reads were selected from the library of 133 million reads used to assemble the mitochondrial genome (see below) on the basis of alignment to our draft plastid genome using BWA 0.7.5a (Li 2013), focusing on regions that would benefit from read import by restricting our search to regions with ambiguity and regions covered by PacBio reads exclusively. The subset of Illumina reads was selected using samtools 0.1.18 and mini-assembled with Phrap (Gordon and Green 2013), and the resulting contigs were remerged to correct bases in gaps filled only by PacBio, namely one gap and the sequence at the edges confirming the circular topology. The starting base was chosen using the Norway spruce plastid genome sequence (NC_021456) (Nystedt et al. 2013). Our assembly was further polished using the Genome Analysis Toolkit (GATK) 2.8-1-g932cd3a FastaAlternateReferenceMaker (McKenna et al. 2010). The assembled plastid genome was initially annotated using DOGMA (Wyman et al. 2004). Because DOGMA is an interactive web application, it is not convenient for automated annotation. We therefore used the software MAKER (Campbell et al. 2014) to annotate the white spruce plastid using the Norway spruce plastid genome for both protein-coding and noncoding gene homology evidence. The parameters of MAKER are shown in supplementary table S2, Supplementary Material online. The inverted repeat was identified using MUMmer (Kurtz et al. 2004) and is shown in supplementary figure S3, Supplementary Material online. The assembled plastid genome was aligned to the Norway spruce plastid using BWA-MEM (Li 2013). The two genomes were compared using QUAST (Gurevich et al. 2013) to confirm the presence and position of the annotated genes of the Norway spruce plastid in the white spruce plastid.
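The read-overlap arithmetic and the Sealer k-mer sweep described in this section can be checked with a few lines of Python. This is only an illustrative sketch: the function name `expected_overlap` is hypothetical, and the order in which Sealer visited the k values is not stated in the text, so the list is simply given in ascending order.

```python
def expected_overlap(read_length: int, fragment_length: int) -> int:
    """Bases shared by the two reads of a pair.

    A negative value would mean an unsequenced gap between the reads.
    """
    return 2 * read_length - fragment_length


# Values stated in the text: 2 x 300 bp reads from a library of 500-bp fragments.
overlap = expected_overlap(read_length=300, fragment_length=500)

# k values for the Sealer sweep: "from 18 to 108 with step size 6".
sealer_k_values = list(range(18, 108 + 1, 6))

print(overlap)               # 100
print(sealer_k_values[:3])   # [18, 24, 30]
print(len(sealer_k_values))  # 16
```

A positive overlap is what makes merging with ABySS-mergepairs possible; for the mitochondrial data set (2 × 150 bp reads), the gap between reads was instead filled with Konnector.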

Mitochondrial Genome Assembly

Konnector (Vandervalk et al. 2015) was used to fill the gap between the paired-end reads of a single lane of Illumina HiSeq 2000 paired-end sequencing (SRR525196). These connected paired-end reads were assembled using ABySS. Putative mitochondrial sequences were separated from nuclear sequences by their length, depth of coverage and GC content using k-means clustering in R (see supplementary fig. S2, Supplementary Material online). The putative mitochondrial contigs were then assembled into scaffolds using ABySS-scaffold with a single lane of Illumina HiSeq sequencing of a mate-pair library. The ABySS assembly of the white spruce mitochondrial genome resulted in 71 scaffolds. We ran the gap-filling application Sealer attempting to close the gaps between every combination of two scaffolds. This approach closed 10 gaps and yielded 61 scaffolds, which we used as input to the LINKS scaffolder 1.1 (Warren, Yang, et al. 2015) (options -k 15 -t 1 -l 3 -r 0.4, 19 iterations with -d from 500 to 6,000 with step size 250) in conjunction with long PacBio reads, further decreasing the number of scaffolds to 58. The Konnector pseudoreads were aligned to the 58 LINKS scaffolds with BWA 0.7.5a (bwa mem -a multimap), and we created links between two scaffolds when reads aligned within 1,000 bp of the edges of any two scaffolds. We modified LINKS to read the resulting SAM alignment file and link scaffolds satisfying this criterion (options LINKS-sam -e 0.9 -a 0.5), bringing the final number of scaffolds to 38. We confirmed the merges using mate-pair reads. The white spruce mate-pair libraries used for confirmation were presented earlier (Birol et al. 2013), and are available from DNAnexus (http://sra.dnanexus.com/studies/SRP014489 last accessed 17 Dec 2015). In brief, mate-pair reads from three fragment size libraries (5, 8, and 12 kb) were aligned to the 38-scaffold assembly with BWA-MEM 0.7.10-r789 and the resulting alignments parsed with a PERL script. 
A summary of this validation is presented in supplementary table S4, Supplementary Material online. Automated gap-closing was performed with Sealer 1.0 (options -j 12 -B 1000 -F 700 -P10 -k96 -k80) using Bloom filters built from the entire white spruce PG29 read data set (Warren, Keeling, et al. 2015) and closed 55 of the 182 total gaps (30.2%). We polished the gap-filled assembly using GATK, as described for the plastid genome. The assembled scaffolds were aligned to the NCBI nucleotide (nt) database using BLAST to check for hits to published mitochondrial genomes, and to screen for contamination. The mitochondrial genome was annotated using MAKER (parameters shown in supplementary table S3, Supplementary Material online) and Prokka (Seemann 2014), and the two sets of annotations were merged using BEDTools (Quinlan and Hall 2010) and GenomeTools (Gremme et al. 2013), selecting the MAKER annotation when the two tools had overlapping annotations. The proteins of all green plants (Viridiplantae) with complete mitochondrial genome sequences in NCBI GenBank (Benson et al. 2014), 142 species, were used for protein homology evidence, the most closely related of which is the prince sago palm (C. taitungensis; NC_010303) (Chaw et al. 2008), being the only gymnosperm with a complete mitochondrial genome. Transfer RNA (tRNA) were annotated using ARAGORN (Laslett and Canback 2004). Ribosomal RNA (rRNA) were annotated using RNAmmer (Lagesen et al. 2007). Prokka uses Prodigal (Hyatt et al. 2010) to annotate open-reading frames (ORFs). Repeats were identified using RepeatMasker and RepeatModeler (Smit et al. 1996). The RNA-seq reads were aligned to the annotated mitochondrial genes using BWA-MEM and variants were called using samtools and bcftools requiring a minimum genotype quality of 50 to identify possible sites of C-to-U RNA editing. A gene with an abundance of at least ten transcripts per million as quantified by Salmon (Patro et al. 2014) was considered expressed.
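The separation of putative mitochondrial contigs by length, depth of coverage, and GC content was done with k-means clustering in R. The Python sketch below illustrates the same idea under stated assumptions: the contig values are hypothetical, the features are log-transformed and standardized (a choice not described in the text), and the deterministic farthest-point initialization is used only to make the example reproducible.

```python
import math


def standardize(rows):
    """Scale each feature column to zero mean and unit standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / len(c)) or 1.0
           for c, m in zip(cols, means)]
    return [tuple((x - m) / s for x, m, s in zip(r, means, sds)) for r in rows]


def kmeans(points, k, iterations=50):
    """Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < k:
        # Next centre: the point farthest from its nearest existing centre.
        centers.append(max(points,
                           key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j]))
                  for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            new_centers.append(
                tuple(sum(x) / len(members) for x in zip(*members))
                if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical contigs: (length in bp, depth of coverage, GC fraction).
contigs = [
    (50_000, 30, 0.45), (80_000, 28, 0.44), (120_000, 33, 0.46),  # long, moderate depth
    (600, 400, 0.38), (450, 900, 0.36), (800, 300, 0.39),         # short, high depth
]
features = standardize(
    [(math.log10(length), math.log10(depth), gc) for length, depth, gc in contigs])
labels, _ = kmeans(features, k=2)
print(labels)
```

In practice the number of clusters and the choice of which cluster is treated as mitochondrial would be guided by inspecting a plot such as supplementary figure S2.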

Results

The White Spruce Plastid Genome

The assembly and annotation metrics for the white spruce plastid and mitochondrial genomes are summarized in table 2. The plastid genome was assembled into a single circular contig of 123,266 bp containing 114 identified genes: 74 protein-coding (mRNA) genes, 36 tRNA genes, and 4 rRNA genes (fig. 1).

Table 2

Sequencing, Assembly, and Annotation Metrics of the White Spruce Organellar Genomes

Metric	Plastid	Mitochondrion
Number of lanes	1 MiSeq lane	1 HiSeq lane
Number of read pairs	4.9 million	133 million
Read length	2 × 300 bp	2 × 150 bp
Number of merged reads	3.0 million	1.4 million
Median merged read length	492 bp	465 bp
Number of assembled reads	21,000	377,000
Proportion of organellar reads	1/140 or 0.7%	1/350 or 0.3%
Depth of coverage	80×	30×
Assembled genome size	123,266 bp	5.94 Mb
Number of contigs	1 contig	130 contigs
Contig N50	123 kb	102 kb
Number of scaffolds	1 scaffold	36 scaffolds
Scaffold N50	123 kb	369 kb
Largest scaffold	123 kb	1,222 kb
GC content	38.8%	44.7%
Number of genes without ORFs	114 (108)	143 (74)
Protein-coding genes (mRNA)	74 (72)	106 (51)
rRNA genes	4 (4)	8 (3)
tRNA genes	36 (32)	29 (20)
ORFs ≥ 300 bp	Not available	1,065
Coding genes containing introns	8	5
Introns in coding genes	9	7
tRNA genes containing introns	6	0

Note: The number of distinct genes is shown in parentheses.
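As a rough consistency check on table 2, the depth of coverage can be estimated from the number of assembled reads, the read length, and the genome size. The sketch below assumes that depth is approximately reads × read length / genome size and uses the median merged read length as a stand-in for the mean; both are simplifications not stated in the text.

```python
def depth_of_coverage(n_reads: int, read_length_bp: float, genome_size_bp: int) -> float:
    """Approximate mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length_bp / genome_size_bp


# Values taken from table 2.
plastid_depth = depth_of_coverage(21_000, 492, 123_266)
mito_depth = depth_of_coverage(377_000, 465, 5_940_000)

print(round(plastid_depth))  # 84
print(round(mito_depth))     # 30
```

Both estimates are close to the 80× and 30× reported in the table.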

The complete plastid genome of white spruce. The PG29 white spruce chloroplast genome was annotated using MAKER and plotted using OrganellarGenomeDRAW (Lohse et al. 2007). The inner gray track depicts the G+C content of the genome.

The majority of the protein-coding genes and tRNAs were present in single copies, with the exception of the coding genes psbI and ycf12, which were found to be duplicated. Likewise, the tRNAs trnH-GUG, trnI-CAU, trnS-GCU, and trnT-GGU were found to have two copies each. All the rRNA genes were found to be single copy.

Most of the plastid protein-coding genes had no introns. Nevertheless, as in other conifer plastid genomes, introns were found in several genes of the white spruce plastid genome. The protein-coding genes atpF, petB, petD, rpl2, rpl16, rpoC1, and rps12 each contained a single intron, whereas ycf3 contained two introns. Like the protein-coding genes, six tRNA genes (trnA-UGC, trnG-GCC, trnI-GAU, trnK-UUU, trnL-UAA, and trnV-UAC) were each split by an intron. In total we observed 15 intron insertions, 11 of which had a group II intron signature as determined by the online software RNAweasel (Lang et al. 2007). Group II introns are mobile, self-splicing ribozymes found inserted in bacterial and organellar genomes (Lambowitz and Zimmerly 2011).

Interestingly, some protein-coding genes were difficult to annotate using MAKER due to particularly small initial exons. The smallest exons were observed in petB, petD, and rpl16, at 6, 8, and 9 bp, respectively. These genes, likely to belong to polycistronic transcripts (Barkan 1988), were annotated manually. Another gene we annotated manually was rps12 (Hildebrand et al. 1988), as this gene is typically trans-spliced, that is, exons of two different primary transcripts are ligated together, which makes it difficult to annotate using MAKER. In the spruce plastid, rps12 is composed of three exons and one cis-spliced intron.

The plastid genome was mostly free of repeat elements, with the exception of one class of inverted repeats (IR), present in two copies. Each copy of the IR was 445 bp in size, much smaller than in most plants but typical of Pinaceae (Lin et al. 2010). Atypically, however, the two copies differed by a single base. In both copies, the IR contained a single gene, the tRNA trnI-CAU.

Having completely annotated the white spruce plastid genome, we sought to determine the similarity of this genome to other conifer plastids. Alignment of the white spruce plastid genome to that of the Norway spruce resulted in a sequence coverage of 99.7%, and the sequence identity in aligned regions was 99.2%. Consistent with this similarity, we found that all 114 genes of the Norway spruce plastid genome (Nystedt et al. 2013) were present in the white spruce plastid genome in perfect synteny and order. Altogether, these observations indicate that the congeneric spruce plastid genomes have not diverged significantly over evolutionary time. These data are in contrast to the level of nuclear genome conservation, which shows 3% sequence divergence between white spruce and Norway spruce (Warren, Keeling, et al. 2015). Higher sequence divergence is also frequently observed among congeneric plastid genomes in the angiosperms (Yang et al. 2013; Huang et al. 2014).

The White Spruce Mitochondrial Genome

The white spruce mitochondrial genome was assembled into 38 scaffolds (132 contigs) with a scaffold N50 of 369 kb (contig N50 of 102 kb). The largest scaffold was 1,222 kb (table 2). The scaffolds were aligned to the NCBI nucleotide (nt) database using BLAST. Of the 38 scaffolds, 26 scaffolds aligned to mitochondrial genomes, 3 small scaffolds (<10 kb) aligned to white spruce mRNA clones and BAC sequences, 7 small scaffolds (<10 kb) had no significant hits, and 2 small scaffolds (<5 kb) aligned to cloning vectors. These last two scaffolds were removed from the assembly.

The mitochondrial genome was rich with both annotated and putative protein-coding genes. A total of 106 protein-coding genes (51 distinct genes) were identified, which comprised 75 kb (1.3%) of the genome. In addition to these 106 protein-coding genes, we found 6,265 ORFs of at least 90 bp (or 30 amino acids), which occupied 1.4 Mb (24%) of genome sequence, including 1,065 ORFs of at least 300 bp (100 amino acids), covering 413 kb (7%). We could not identify similar genes in the Viridiplantae mitochondrial genomes (Benson et al. 2014) to annotate these ORFs. As such, these putative coding genes may represent novel proteins unique to the spruce or conifer mitochondrial genomes.

In addition to protein-coding genes, we identified 29 tRNA genes and 8 rRNA genes. As with the spruce plastid, these ncRNAs showed variable copy number, with the majority being present in a single copy. We found three copies of trnD-GUC, seven copies of trnM-CAU, and two copies of trnY-GUA. Five of the seven trnM-CAU genes share sequence similarity with the plastid translation initiator trnfM-CAU. The significance of the duplication of the remaining tRNA genes is unknown. Like the tRNA genes, the rRNA genes varied in copy number, with rrn5 present in four copies, rrn18 in three copies, and rrn26 in a single copy. The relative order of the genes on the scaffolds and gene size is shown in figure 2. The size of each gene family is shown in figure 3. The precise position of each gene on its scaffold is shown in supplementary figure S4, Supplementary Material online.
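For readers unfamiliar with the N50 statistic quoted above, the following Python sketch shows the usual definition: the length of the sequence at which the cumulative length, counted from the longest sequence down, first reaches half of the total. The scaffold lengths in the example are hypothetical and are not taken from the white spruce assembly.

```python
def n50(lengths):
    """Length L such that sequences of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("n50 requires at least one sequence length")


# Hypothetical scaffold lengths (arbitrary units).
example_lengths = [10, 8, 5, 3, 2, 2]
print(n50(example_lengths))  # 8
```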

Relative order and size of genes on the scaffolds of the white spruce mitochondrial genome. Each box is proportional to the size of the gene including introns, except that genes smaller than 200 bp are shown as 200 bp. The space between genes is not to scale. An asterisk indicates that the gene name is truncated. Only scaffolds that harbor annotated genes are shown.

Gene content of the white spruce mitochondrial genome, grouped by gene family. Each box is proportional to the size of the gene including introns. The color of each gene is unique within its gene family.

As we observed for the plastid genome, introns were not particularly abundant and were found in only a handful of genes. A total of seven intron insertions were found distributed among five protein-coding genes: nad2, nad5, and nad7 each contained one intron, and nad4 and rps3 each contained two introns. All introns were determined to be group II introns using RNAweasel (Lang et al. 2007). Unexpectedly, repeat elements comprised only 390 kb (6.6%) of the mitochondrial genome (fig. 4). The most commonly represented repeat elements were simple repeats and the LTR elements Copia, ERV1, and Gypsy.

Repetitive sequence content of the white spruce mitochondrial genome, annotated using RepeatMasker and RepeatModeler.

We compared the spruce mitochondrial genome with that of its closest sequenced relative, the gymnosperm C. taitungensis. The C. taitungensis mitochondrial genome contains 39 protein-coding genes and 3 rRNA genes, all of which were also identified in white spruce. Of the 22 tRNA genes of C. taitungensis, we found 13 in white spruce, but the spruce mitochondrial genome also had an additional eight tRNA genes that were not observed in C. taitungensis.

Transfer of organellar DNA to the nucleus is common in plants (Kleine et al. 2009). To gather evidence for DNA transfer between the organellar and nuclear genomes, we aligned the two organellar genomes to the WGS assembly of white spruce (Warren, Keeling, et al. 2015). As the WGS assembly may contain fragments of assembled organellar DNA, alignments with perfect identity were excluded. For aligned segments larger than 500 bp, we observed that 98% of the plastid genome and 54% of the mitochondrial genome were represented in the WGS assembly, with 7% and 4% mean DNA sequence divergence, respectively, suggesting that nearly all of the plastid genome and over half of the mitochondrial genome are represented in the white spruce nuclear genome. In comparison, 16% of the plastid genome of Arabidopsis thaliana is found in its nuclear genome, and 83% of the plastid genome of Oryza sativa is found in its nuclear genome (Shahmuradov et al. 2003). Although these results are intriguing, it is unclear whether the high sequence identity reflects true gene transfer events between the organellar and nuclear genomes of white spruce. Further investigation is warranted in future work.

The White Spruce Mitochondrial Transcriptome

Having assembled and annotated the mitochondrial genome of the white spruce, we next sought to identify the gene expression patterns of this organellar genome in the developing spruce and in the adult tissues. First, we profiled the expression patterns of the 106 coding genes with known function across three developmental tissues and five mature tissues (fig. 5). Here, we found that at least 29 genes were expressed in at least one of the developing tissues, but not in a mature tissue, suggesting some mitochondrial transcripts may be expressed during spruce development. In contrast, we found 60 genes to be expressed in at least one of the mature tissues and a total of 17 genes were not expressed. The developing spruce megagametophyte and embryo were the most transcriptionally active tissues and clustered together using an unsupervised clustering algorithm.

Heatmap of the transcript abundance of mitochondrial protein-coding genes of white spruce. Each column is a tissue sample. Each row is a gene. Each cell represents the transcript abundance of one gene in one sample. The color scale is log10(TPM+1), where TPM is transcripts per million as measured by Salmon (Patro et al. 2014).

Conversely, although 2,809/6,265 (45%) of the ORFs of at least 90 bp were expressed in at least one developing tissue but in no mature tissue, only 427/6,265 (7%) were expressed in at least one mature tissue (table 3 and fig. 6). Nearly half (3,029/6,265; 48%) did not have detectable expression in any tissue sampled.

Table 3

Number of Expressed Protein-Coding Genes and ORFs of the White Spruce Mitochondrial Transcriptome Tabulated by Developmental Stage

	Both	Mature Only	Developing Only	Neither	Sum
CDS	60	0	29	17	106
ORF	411	16	2,809	3,029	6,265
Sum	471	16	2,838	3,046	6,371
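The columns of table 3 follow from the expression threshold given in the Methods (at least ten transcripts per million). The Python sketch below shows one way such a classification could be written. The assignment of megagametophyte, embryo, and seedling to the developing group is an assumption based on the order in which the samples are listed, and the example abundances are hypothetical.

```python
# Assumed grouping: the text lists eight samples without stating explicitly
# which three are developmental; the first three listed are assumed here.
DEVELOPING = ("megagametophyte", "embryo", "seedling")
MATURE = ("young bud", "xylem", "mature needle", "flushing bud", "bark")
TPM_THRESHOLD = 10.0  # "at least ten transcripts per million" (Methods)


def expression_class(tpm_by_tissue):
    """Return the table 3 column for one gene, given its TPM in each tissue."""
    def expressed_in(tissues):
        return any(tpm_by_tissue.get(t, 0.0) >= TPM_THRESHOLD for t in tissues)

    developing, mature = expressed_in(DEVELOPING), expressed_in(MATURE)
    if developing and mature:
        return "Both"
    if mature:
        return "Mature Only"
    if developing:
        return "Developing Only"
    return "Neither"


# Hypothetical abundances, for illustration only.
example_gene = {"embryo": 42.0, "seedling": 3.1, "xylem": 0.4}
print(expression_class(example_gene))  # Developing Only

# Consistency check of the ORF row of table 3.
orf_row = {"Both": 411, "Mature Only": 16, "Developing Only": 2809, "Neither": 3029}
print(sum(orf_row.values()))  # 6265
```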

Heatmap of the transcript abundance of mitochondrial protein-coding genes of white spruce, including ORFs. Each column is a tissue sample. Each row is a gene. Each cell represents the transcript abundance of one gene in one sample. The color scale is log10(TPM+1), where TPM is transcripts per million as measured by Salmon (Patro et al. 2014).

Many conifer organellar transcripts are subject to C-to-U RNA editing (Chateigner-Boutin and Small 2010). To explore this possibility for the mitochondrial transcriptome of white spruce, we analyzed RNA-seq read alignments against the reconstructed mitochondrial genome using a custom pipeline (supplementary listing S1, Supplementary Material online). RNA editing events were predicted as nucleotide positions where the genomic sequence shows a C residue but the RNA-sequencing data suggest a T (supplementary table S5, Supplementary Material online). Correctly calling RNA editing events can be confounded by genomic single nucleotide variants and by false positives originating from misaligned reads. However, we see an enrichment of C-to-T variants in 91% (1,601 of 1,751) of positions (supplementary table S6, Supplementary Material online), suggesting that a large fraction of these are true C-to-U RNA edits. We find that C-to-U RNA editing events occurred in 68 of the 106 coding genes (supplementary table S5, Supplementary Material online), with the most highly edited gene, nad3, edited at a rate of nine edits per 100 bp.

Uniquely, C-to-U RNA editing can generate new start and stop codons, but it is unable to destroy existing start and stop codons. In organellar genomes, editing of the ACG (Thr) codon to AUG (Met), which creates a novel start codon, is frequently observed (Neckermann et al. 1994). In the white spruce mitochondrial genome, we observed four such editing events for the genes mttB, nad1, rps3, and rps4.
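The statement that C-to-U editing can create but not destroy start and stop codons can be verified by enumerating all 64 codons. The Python sketch below assumes the standard start codon (AUG) and stop codons (UAA, UAG, UGA) and considers only single C-to-U edits; the function name is hypothetical.

```python
START = {"AUG"}
STOP = {"UAA", "UAG", "UGA"}
BASES = "ACGU"


def single_c_to_u_edits(codon):
    """All codons reachable from `codon` by editing exactly one C to U."""
    return [codon[:i] + "U" + codon[i + 1:]
            for i, base in enumerate(codon) if base == "C"]


all_codons = [a + b + c for a in BASES for b in BASES for c in BASES]

creates_start = sorted(c for c in all_codons
                       if any(e in START for e in single_c_to_u_edits(c)))
creates_stop = sorted(c for c in all_codons
                      if any(e in STOP for e in single_c_to_u_edits(c)))
# A start or stop codon could only be destroyed by C-to-U editing if it contained a C.
destroyable = sorted(c for c in START | STOP if "C" in c)

print(creates_start)  # ['ACG']
print(creates_stop)   # ['CAA', 'CAG', 'CGA']
print(destroyable)    # []
```

ACG is the only codon that a single edit converts to AUG, which matches the ACG-to-AUG events reported for mttB, nad1, rps3, and rps4.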

Discussion

In this work, we outline an informatics methodology to assemble organellar genomes of plant species using WGS sequencing data. Usually plant genomics projects generate such data sets to reconstruct and study the nuclear genomes of their target species. Here we demonstrate that the same data sets can be mined for the organellar genomes, providing added value to those projects with no additional cost to experimental budgets. One lane of MiSeq sequencing of whole-genome DNA (4.9 million read pairs) was sufficient to assemble the 123-kb plastid genome, and one lane of HiSeq sequencing of whole-genome DNA (133 million read pairs) was sufficient to assemble the 5.9-Mb mitochondrial genome of white spruce. Additional Illumina and PacBio sequencing was used to improve scaffold contiguity and to close scaffold gaps, after which the plastid genome was assembled in a single contig, and the largest mitochondrial scaffold was 1.2 Mb. In previous studies, analysis of cpDNA was useful in reconstructing phylogenies of plants (Wu et al. 2007), in determining the origin of an expanding population (Aizawa et al. 2012), and in determining when distinct lineages of a species resulted from multiple colonization events (Jardon-Barbolla et al. 2011). The contrasting inheritance schemes of plastids and mitochondria can be useful in the characterization of species expanding their range. In the case of two previously allopatric species now found in sympatry and hybridizing, the mitochondrial DNA (mtDNA) would be contributed by the resident species, whereas introgression of the plastid genome into the expanding species would usually be limited, as pollen would be more readily dispersed than seeds (Du et al. 2011). Differential gene flow of cpDNA and mtDNA due to different modes of inheritance and dispersion would result in new assemblages of organellar genomes and an increase of genetic diversity after expansion from a refugium (Gerardi et al. 2010). 
The white spruce plastid genome showed no structural rearrangements when compared with that of Norway spruce despite the divergence of these two species, estimated at more than 10 Ma (Bouillé and Bousquet 2005). All genes of the Norway spruce plastid were present and collinear in the white spruce plastid, and no new plastid genes were found in white spruce. The remarkable level of sequence conservation between these spruces for the plastid genome was also in contrast to the nuclear genome, suggesting strong evolutionary pressure to maintain a functional plastid genome. The plastids of the angiosperms demonstrate frequent rearrangements (Palmer and Herbon 1988; Knox 2014) or higher sequence divergence (Yang et al. 2013; Huang et al. 2014). Likewise, comparative genomics among the five extant gymnosperm groups shows that Pinaceae and non-Pinaceae conifers (cupressophytes) have each lost a different copy of the IR (Wu et al. 2011; Yi et al. 2013; Wu and Chaw 2014), which may explain the reduced diversity of cpDNA organizations in Pinaceae. It will be interesting to see whether high sequence conservation is observed for plastid genomes within the Pinaceae family, and particularly within the genus Picea.

All genes of the prince sago palm (C. taitungensis) mitochondrial genome were also present in white spruce, but mitochondrial ORFs were found that were unique to white spruce. The protein-coding gene content of the white spruce mitochondrial genome was quite sparse, with 106 protein-coding genes contained in 5.9 Mb, in comparison to the plastid genome with 74 protein-coding genes in 123 kb. The mitochondrial genome of white spruce appears quite large for a gymnosperm, compared with the 415-kb mitochondrial genome of Cycas (Chaw et al. 2008) or the estimate of 750 kb–1 Mb obtained from Southern blot analysis for the conifer Larix (Kumar et al. 1995). Nearly 7% of the white spruce mitochondrial genome was composed of repeats, and roughly 1% was composed of coding genes. Thus, a significant portion of the unusually large white spruce mitochondrial genome remains to be characterized.

Sequencing and annotation of organellar genomes in spruce trees offer a significant advance in our understanding of conifer biology and evolution while providing a reference for further research. For instance, the remarkable level of structure and sequence conservation of the plastid genome between the distantly related white spruce and Norway spruce indicates strong selective pressures not only on genes but also on intergenic regions and overall genome structure, which should facilitate comparative evolutionary studies in the genus. Further investigations involving several other Pinaceae genera appear necessary to assess the extent of this trend at a larger phylogenetic scale and how it differs from the trends seen in other plant groups. The present reference genomes should also be helpful in resequencing projects to better identify islands of intraspecific sequence variation and how they vary among conifer taxa. Overall, these new reference genomes should help develop applications for the management and conservation of natural genetic diversity in this group of ecologically and economically important trees.
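The sparseness of the mitochondrial gene content noted in the Discussion can be made concrete with a short calculation. The sketch below uses only the rounded gene counts and genome sizes quoted in the text (106 genes in 5.9 Mb, 74 genes in 123 kb); the function name is hypothetical, and the results are approximate because the inputs are rounded.

```python
def genes_per_mb(n_genes, genome_size_bp):
    """Protein-coding gene density in genes per megabase."""
    return n_genes / (genome_size_bp / 1e6)


# Rounded values as quoted in the Discussion (assumption: exact sizes differ slightly).
mito_density = genes_per_mb(106, 5.9e6)     # mitochondrial genome
plastid_density = genes_per_mb(74, 123e3)   # plastid genome
fold_difference = plastid_density / mito_density

print(f"mitochondrion: {mito_density:.1f} genes/Mb")
print(f"plastid: {plastid_density:.1f} genes/Mb")
print(f"plastid is about {fold_difference:.0f}-fold denser")
```

With these rounded inputs, the mitochondrial genome carries roughly 18 protein-coding genes per Mb compared with roughly 600 per Mb in the plastid, a difference of about 33-fold.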

Supplementary Material

Supplementary listing S1, figures S1–S3, and tables S1–S6 are available at Genome Biology and Evolution online (http://www.gbe.oxfordjournals.org/).

69 in total

1. BLAT--the BLAST-like alignment tool.

Authors: W James Kent
Journal: Genome Res Date: 2002-04 Impact factor: 9.043

2. Automatic annotation of organellar genomes with DOGMA.

Authors: Stacia K Wyman; Robert K Jansen; Jeffrey L Boore
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

3. Group II introns: mobile ribozymes that invade DNA. (Review)

Authors: Alan M Lambowitz; Steven Zimmerly
Journal: Cold Spring Harb Perspect Biol Date: 2011-08-01 Impact factor: 10.005

4. Chloroplast genome (cpDNA) of Cycas taitungensis and 56 cp protein-coding genes of Gnetum parvifolium: insights into cpDNA evolution and phylogeny of extant seed plants.

Authors: Chung-Shien Wu; Ya-Nan Wang; Shu-Mei Liu; Shu-Miaw Chaw
Journal: Mol Biol Evol Date: 2007-03-22 Impact factor: 16.240

5. GenomeTools: a comprehensive software library for efficient processing of structured genome annotations.

Authors: Gordon Gremme; Sascha Steinbiss; Stefan Kurtz
Journal: IEEE/ACM Trans Comput Biol Bioinform Date: 2013 May-Jun Impact factor: 3.710

6. Horizontal transfer of entire genomes via mitochondrial fusion in the angiosperm Amborella.

Authors: Danny W Rice; Andrew J Alverson; Aaron O Richardson; Gregory J Young; M Virginia Sanchez-Puerta; Jérôme Munzinger; Kerrie Barry; Jeffrey L Boore; Yan Zhang; Claude W dePamphilis; Eric B Knox; Jeffrey D Palmer
Journal: Science Date: 2013-12-20 Impact factor: 47.728

7. MAKER-P: a tool kit for the rapid creation, management, and quality control of plant genome annotations.

Authors: Michael S Campbell; MeiYee Law; Carson Holt; Joshua C Stein; Gaurav D Moghe; David E Hufnagel; Jikai Lei; Rujira Achawanantakun; Dian Jiao; Carolyn J Lawrence; Doreen Ware; Shin-Han Shiu; Kevin L Childs; Yanni Sun; Ning Jiang; Mark Yandell
Journal: Plant Physiol Date: 2013-12-04 Impact factor: 8.340

8. Whole plastome sequencing reveals deep plastid divergence and cytonuclear discordance between closely related balsam poplars, Populus balsamifera and P. trichocarpa (Salicaceae).

Authors: Daisie I Huang; Charles A Hefer; Natalia Kolosova; Carl J Douglas; Quentin C B Cronk
Journal: New Phytol Date: 2014-07-31 Impact factor: 10.151

9. Proteins encoded by a complex chloroplast transcription unit are each translated from both monocistronic and polycistronic mRNAs.

Authors: A Barkan
Journal: EMBO J Date: 1988-09 Impact factor: 11.598

10. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler.

Authors: Ruibang Luo; Binghang Liu; Yinlong Xie; Zhenyu Li; Weihua Huang; Jianying Yuan; Guangzhu He; Yanxiang Chen; Qi Pan; Yunjie Liu; Jingbo Tang; Gengxiong Wu; Hao Zhang; Yujian Shi; Yong Liu; Chang Yu; Bo Wang; Yao Lu; Changlei Han; David W Cheung; Siu-Ming Yiu; Shaoliang Peng; Zhu Xiaoqian; Guangming Liu; Xiangke Liao; Yingrui Li; Huanming Yang; Jian Wang; Tak-Wah Lam; Jun Wang
Journal: Gigascience Date: 2012-12-27 Impact factor: 6.524
