Are differences in genomic data sets due to true biological variants or errors in genome assembly: an example from two chloroplast genomes.

Zhiqiang Wu¹, Luke R Tembrock², Song Ge³.

Abstract

DNA sequencing has been revolutionized by the development of high-throughput sequencing technologies. Plummeting costs and the massive throughput capacities of second and third generation sequencing platforms have transformed many fields of biological research. Concurrently, new data processing pipelines have made rapid de novo genome assemblies possible. However, high quality data are critically important for all investigations in the genomic era. We used chloroplast genomes of one Oryza species (O. australiensis) to compare differences in sequence quality: one genome (GU592209) was obtained through Illumina sequencing and reference-guided assembly and the other genome (KJ830774) was obtained via target enrichment libraries and shotgun sequencing. Based on the whole genome alignment, GU592209 was more similar to the reference genome (O. sativa: AY522330), with 99.2% sequence identity (SI value), than the KJ830774 genome, with 98.8%; whereas the opposite result was obtained when the SI values in coding and noncoding regions of GU592209 and KJ830774 were compared. Additionally, the junctions between the two single-copy regions and the repeat copies in the chloroplast genome exhibited differences. Phylogenetic analyses were conducted using these sequences, and the different data sets yielded dissimilar topologies: the phylogenetic placements of the two individuals differed remarkably depending on whether whole genome sequence or SNP data, or insertion and deletion (indel) data, were used. Thus, we concluded that the genomic composition of GU592209 was heterogeneous in coding and noncoding regions. These findings should impel biologists to carefully consider the quality of sequencing and assembly when working with next-generation data.

Year: 2015; PMID: 25658309; PMCID: PMC4320078; DOI: 10.1371/journal.pone.0118019

Source DB: PubMed; Journal: PLoS One; ISSN: 1932-6203; Impact factor: 3.240

Introduction

High-throughput sequencing, or next-generation sequencing (NGS), technologies have transformed many fields of biological research, including genetics, phylogenetics, population biology and comparative genomics, by delivering tens of thousands of genome and transcriptome sequences within a short time and at low cost [1, 2]. For example, Illumina announced in 2014 that it could sequence full-coverage human genomes for only $1,000 within a few days. At the same time, a diverse array of algorithms was developed to assemble reads from different NGS platforms [3-6]. Despite the advancements brought by NGS technology, biologists remain concerned with obtaining high-quality and high-fidelity data instead of simply acquiring copious quantities of nucleotides. The errors associated with different sequencing platforms and bioinformatic analyses (e.g., reference-guided assemblies) need to be differentiated from true biological variants, such as nucleotide substitutions, insertions or deletions, and large-scale translocations. Errors in sequencing and assembly have caused incorrect inferences in annotation and other downstream genomic analyses [7-10]. For example, Alkan et al. [1] found that a de novo assembly of a human genome of Han Chinese origin was 16.2% shorter than the reference genome and that 99.1% of the validated duplicated sequences were lost in the comparison to the reference genome. These differences may appear inconsequential; however, they translate into more than 2,377 coding exons completely missing from the Han genome. High-quality sequences must be emphasized in combination with high-throughput sequencing, as actively requested by researchers in comparative and evolutionary genomics. Zook et al. [11] recently showed that existing sequencing methods and algorithms produced substantial discordance between different bioinformatic pipelines and thus advocated for caution in producing such data sets.

Hence, for NGS genome assemblies and downstream comparative analyses, it is paramount to critically assess and compare sequence data to differentiate errors and artifacts from true variants. Microstructural changes, including insertions and deletions (indels), which frequently occur in intronic and intergenic regions, are just some of the problems biologists face during assembly and mapping of short high-throughput reads [12-14]. Diverse algorithms were developed to tackle the challenge posed in assembling from NGS data sets [14, 16]. Indels are an important class of mutations that not only provide a basis for analytical procedures (i.e., synapomorphies in phylogenetic analyses) but are also linked to genetic diseases [17]. For example, cystic fibrosis, one of the most common genetic diseases in humans, is frequently caused by a single amino acid deletion within the CFTR gene [18]. Indels are often treated as a “fifth base” and occasionally contain a valuable evolutionary signal. In the angiosperms, indels were successfully used to resolve phylogenetic relationships among basal lineages [19] and among closely related taxa [20, 21]. In both crop breeding and population genetics studies, indels have served as useful molecular markers for the accurate and efficient identification of individuals and populations [22, 23]. Ultimately, the documentation and verification of indels depends on the quality of the assembled genome sequence.

Compared with the gigantic nuclear genome, chloroplast genomes (plastomes) are relatively small, and thus sequencing can be conducted more quickly and at a lower cost. Typically, plastomes exhibit a conserved circular double-stranded DNA arrangement, with sizes that range from 115 to 165 kb [24, 25], and the gene content and gene order [26] are highly conserved in land plants. These features and high-throughput sequencing technologies led to an increase in the number of completed plastomes. Complete plastome sequences from more than 400 species are currently stored in the NCBI database (http://www.ncbi.nlm.nih.gov/genomes/; S1 Table). Publicly available plastome sequences such as those stored at NCBI provide a valuable genetic resource for several different types of biological research. First, plastome sequences are a primary source for plant molecular systematic studies [27-31]. The increasing number of complete plant plastome sequences, which possess low rates of nucleotide substitutions and structural changes, are well suited to resolve the relationships among different plant lineages [30, 32-34]. Second, plastomes of plants are an important resource for DNA barcoding, which uses sequences from a short and standardized DNA region to identify species [35, 36]. The loci matK, rbcL, atpF-atpH, trnH-psbA, and psbK-psbI were used successfully in barcoding efforts to identify species [37-39]. Third, compared with transformations of the nuclear genome in biotechnology, chloroplast transformations function more effectively [40-42]. The configuration of the transformation vector is primarily based on a similar sequence from the plastome [43, 44]. These applications are all dependent on high quality plastome sequences.

In this study, we examined whether sequence differences were real variants or rather the result of sequencing or assembly errors. The comparisons were conducted between two published plastomes from two individuals of Oryza australiensis (Domin & C.E. Hubb). One plastome (O. australiensis: GU592209) was obtained through Illumina sequencing and reference-guided assembly [45] and the other plastome (O. australiensis: KJ830774) was completed through the construction of target enrichment libraries and shotgun Sanger sequencing [46]. These two different sequencing and assembly strategies provided the basis for the comparisons. O. australiensis is a diploid species from the E-genome group of the rice genus and is an important wild relative of domesticated rice [47-49]. We systematically compared these two plastomes by whole genome alignment, including examination of the sequence identity in both the coding and noncoding regions and the variation in the junctions between the single-copy and repeat regions of the plastome. Additionally, phylogenetic analyses were conducted based on the whole plastome sequence, single nucleotide polymorphism (SNP) data and indel data. We found that the quality of sequences and assemblies from high-throughput genome sequencing deserves special attention.

Materials and Methods

Plastome annotation

All eight published plastomes from the Oryza genus and an out-group plastome sequence from the species Leersia tisserantii (A. Chev.) Launert (the closest relative in the same tribe, Oryzeae) were downloaded from the NCBI database (Table 1). To compare the plastome annotations fully and consistently, DOGMA (Dual Organellar GenoMe Annotator [50]) was employed for genome annotation, which included the protein-coding genes, transfer RNAs (tRNAs), and ribosomal RNAs (rRNAs). To confirm the start and stop codons and the exon-intron boundaries of genes, the draft annotation was subsequently inspected and adjusted manually based on the published plastomes from the database. Additionally, both tRNA and rRNA genes were identified by BLASTN searches against the same database of plastomes. tRNAscan-SE 1.21 [51] was also used to further verify the tRNA genes.

Table 1

Comparison of the major features of nine chloroplast genomes from the rice tribe (Oryzeae).

Species	Total length (bp)	Total GC (%)	LSC length (bp)	LSC GC (%)	IR length (bp)	IR GC (%)	SSC length (bp)	SSC GC (%)	GenBank accession no.	Ambiguous bases (N)	Reference
Oryza sativa ssp. Indica	134,496	39.00	80,553	37.11	20,798	44.35	12,347	33.32	NC_008155	-	[56]
Oryza nivara	134,494	39.01	80,544	37.12	20,802	44.35	12,346	33.33	NC_005973	-	[56]
Oryza sativa ssp. Japonica	134,551	39.00	80,604	37.11	20,802	44.35	12,343	33.37	AY522330	-	[53]
Oryza rufipogon	134,557	39.01	80,604	37.11	20,803	44.35	12,347	33.36	NC_022668	-	[57]
Oryza rufipogon	134,544	39.00	80,594	37.11	20,802 ^a	44.35	12,347	33.33	NC_017835	-	[57]
Oryza meridionalis	134,551	39.01	80,606	37.11	20,802	44.35	12,343	33.36	GU592208	-	[45]
Oryza australiensis	134,549	38.93	80,614	37.07	20,796	44.36	12,343	33.25	GU592209	177 bp	[45]
Oryza australiensis	135,224	38.95	81,074	37.07	20,840	44.33	12,470	33.18	KJ830774	-	[46]
Leersia tisserantii	136,551	38.88	81,865	37.01	21,329	44.05	12,027	33.23	JN415112	-	[30]

a: Two IR regions have one base pair difference in this species.

Differences from comparative chloroplast genomic analysis

To fully compare the complete plastomes of O. australiensis isolate 86524 (KJ830774, [46]) and O. australiensis isolate 300136 (GU592209, [45]), the mVISTA program was employed in the Shuffle-LAGAN mode [52] to detect whole genome variation. The plastome of O. sativa ssp. Japonica (AY522330, [53]) was used as a reference. To assess the sequence identity (SI) values of the coding and noncoding regions of the two plastomes (KJ830774 and GU592209), the nucleotide sequences of all protein-coding genes, RNA genes and noncoding sequences were aligned to the reference genome (O. sativa ssp. Japonica, AY522330) using ClustalX [54] and adjusted manually, and the SI values were calculated using BioEdit [55]. The final alignments are shown in the S2 Table.
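As an illustration of what an SI value measures, the following minimal Python sketch computes percent identity between two rows of an alignment. It is not the BioEdit implementation: the function name, the treatment of gaps and the skipping of ambiguous bases (N) are assumptions made for the example.

```python
def sequence_identity(row_a: str, row_b: str, count_gaps: bool = True) -> float:
    """Percent identity between two rows taken from the same alignment.

    Assumptions (not stated in the text): columns that are gaps in both rows
    and columns containing an ambiguous base (N) are skipped; a gap opposite
    a base counts as a mismatch unless count_gaps is False.
    """
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    matches = compared = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == "-" and b == "-":
            continue  # column is empty in both rows
        if a == "N" or b == "N":
            continue  # ambiguous base: no call either way
        if not count_gaps and (a == "-" or b == "-"):
            continue  # optionally ignore indel columns
        compared += 1
        matches += a == b
    if compared == 0:
        raise ValueError("no comparable columns")
    return 100.0 * matches / compared


# Toy example: one indel column out of six.
print(round(sequence_identity("ACGT-A", "ACGTTA"), 1))          # 83.3
print(sequence_identity("ACGT-A", "ACGTTA", count_gaps=False))  # 100.0
```

The two calls show how strongly the gap convention can affect SI in indel-rich noncoding regions, which is why the convention should be reported alongside the values.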

Differences from phylogenetic reconstructions using different data sets

To construct and compare the phylogenetic relationships of different data sets, nine published plastomes from the rice tribe (Oryzeae) were downloaded from the NCBI database for use in the analyses (Table 1). In the first phylogenetic analysis, the whole plastome sequence data were used. Based on the conserved structure and gene order of chloroplast genomes [26], the sequence alignments were made in the BioEdit software [55] with the coding gene positions manually inspected (S2 Table). Four methods were employed to construct the phylogenetic trees: maximum parsimony (MP) implemented with PAUP 4.0b10 [58], maximum likelihood (ML) and neighbor-joining (NJ) with MEGA6 [59], and Bayesian inference (BI) with MrBayes 3.1.2 [60]. The MP analysis used a heuristic search with 1000 random addition sequence replicates and tree-bisection-reconnection (TBR) branch swapping. The ML analysis started from a BIONJ tree and used 1000 bootstrap replicates under the Kimura 2-parameter model with gamma-distributed rate variation and invariant sites. The NJ settings employed 1000 bootstrap replicates using the p-distance model with uniform rates. For the estimation of Bayesian posterior probabilities (PP) in the BI analyses, the MCMC algorithm was run for 1,000,000 generations with 4 incrementally heated chains, starting from random trees and sampling one out of every 100 generations. When the log-likelihood scores stabilized, a consensus tree was calculated after discarding the first 25% of the trees as burn-in. In the second phylogenetic analysis, only single nucleotide polymorphism (SNP) data were used. The SNP matrix was extracted using the DAMBE software [61] from the aligned whole genome data set used previously (S2 Table). Three SNP matrices were built, containing sites from the whole plastome, the coding regions or the noncoding regions.
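The extraction of a SNP matrix and the p-distance used for the NJ and UPGMA trees can be sketched as follows. This is a hedged Python illustration, not the DAMBE or MEGA6 procedure: the function names are hypothetical, and the rule of keeping only gap-free, unambiguous columns with more than one state is an assumption.

```python
def variable_sites(alignment: dict[str, str]) -> dict[str, str]:
    """Keep alignment columns that contain only A/C/G/T and more than one state.

    Assumption: columns with gaps or ambiguous bases are dropped entirely.
    """
    names = list(alignment)
    rows = [alignment[name].upper() for name in names]
    if not rows or len({len(row) for row in rows}) != 1:
        raise ValueError("need a non-empty alignment with rows of equal length")
    keep = []
    for i in range(len(rows[0])):
        states = {row[i] for row in rows}
        if states <= set("ACGT") and len(states) > 1:
            keep.append(i)
    return {name: "".join(row[i] for i in keep) for name, row in zip(names, rows)}


def p_distance(row_a: str, row_b: str) -> float:
    """Proportion of compared sites at which two rows differ."""
    if len(row_a) != len(row_b) or not row_a:
        raise ValueError("rows must be non-empty and of equal length")
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)


# Toy alignment: columns 1 and 2 vary; column 5 contains a gap and is dropped.
toy = {"s1": "ACGTA-", "s2": "ACCTAG", "s3": "ATCTAG"}
snps = variable_sites(toy)
print(snps)                                # {'s1': 'CG', 's2': 'CC', 's3': 'TC'}
print(p_distance(snps["s1"], snps["s2"]))  # 0.5
```

Because indel columns are excluded here, a matrix built this way carries only substitution signal, which is the distinction the SNP and indel analyses are meant to separate.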
The neighbor-joining (NJ) and unweighted pair group method with arithmetic mean (UPGMA) methods were used to construct the phylogenetic trees in MEGA6 [59]. Both methods were run using 1000 bootstrap replicates and the p-distance model with uniform rate variation. In the third analysis, only the indel matrix from noncoding regions was extracted to construct the phylogenetic trees. Microstructural changes such as indels have been widely used for resolving phylogenetic relationships [19-21]. The software DnaSP5 [62] was employed to extract the indel polymorphisms from the aligned data described above. The indel data were checked manually to confirm their reliability. All 527 indel sites (S3 Table) were used in the phylogenetic analysis. The indel sites were coded as zero (nongap variant) or one (gap variant). The settings for the MP and BI analyses were identical to those used in the whole genome work described above. The neighbor-joining (NJ) tree was resolved in R with the ‘phangorn’ package [63] with 1000 bootstrap replicates.
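The zero/one coding of indel sites can be illustrated with a short Python sketch. It is a simplification under stated assumptions, not the DnaSP5 procedure: each distinct gap interval (same start and end column) is treated as one character, and the function names are hypothetical.

```python
import re


def gap_intervals(row: str) -> list[tuple[int, int]]:
    """Half-open (start, end) column intervals of each run of '-' in a row."""
    return [(m.start(), m.end()) for m in re.finditer(r"-+", row)]


def code_indels(alignment: dict[str, str]):
    """Code every distinct gap interval as one binary character.

    1 = the row has a gap spanning exactly that interval (gap variant),
    0 = it does not (nongap variant).
    """
    gaps_by_row = {name: set(gap_intervals(row)) for name, row in alignment.items()}
    events = sorted(set().union(*gaps_by_row.values()))
    matrix = {
        name: [1 if event in gaps else 0 for event in events]
        for name, gaps in gaps_by_row.items()
    }
    return events, matrix


# Toy alignment with two indel events.
toy = {"a": "AC--GTA", "b": "ACTTGTA", "c": "AC--G-A"}
events, matrix = code_indels(toy)
print(events)  # [(2, 4), (5, 6)]
print(matrix)  # {'a': [1, 0], 'b': [0, 0], 'c': [1, 1]}
```

The resulting matrix has one column per indel event and can be passed to parsimony, Bayesian or distance methods as a binary character set.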

Results and Discussion

Overview of plastome sequencing

Since the plastomes of the first two species (Marchantia polymorpha L. and Nicotiana tabacum L.) were sequenced [64, 65], over 400 chloroplast genomes of land plants (Fig. 1 and S1 Table) have been published (as of February 2014). Of the over 400 complete plastome sequences, angiosperms accounted for 72.07% of the data set, gymnosperms 10.81%, ferns 11.71%, and bryophytes 5.41% (Fig. 1A). Angiosperm species dominated the data set (Fig. 1A) because the plastomes of most angiosperms are highly conserved in genome size, gene content and gene order [26].

Fig 1

Information from the published chloroplast genomes in land plants, as of February 1, 2014.

A. The list of plastomes was acquired from the NCBI Organelle Genome Resources (http://www.ncbi.nlm.nih.gov/genomes/) and related published reports. B. Number of plastomes published since 1986. The year of each genome sequence is according to the release date of its upload to GenBank.

The rapid increase in the number of complete plastome sequences is attributable to advances in sequencing technologies. Before 2005, approximately two dozen plastomes were sequenced. At that time, the chemical method (Gilbert) and the dideoxy nucleotide procedure (Sanger) were the major techniques used to sequence plastomes. These methods for sequencing a complete plastome were expensive, slow and laborious [66]. Because of the limitations associated with pre-NGS sequencing techniques, only model species were targeted for complete plastome sequencing. Since the development of next-generation sequencing (NGS) platforms, the rate and number of sequenced plastomes have increased rapidly, and more nonmodel species have been sequenced (Fig. 1B). For example, Park et al. [67] were able to fully sequence 36 species in Pinaceae in a single study using the Illumina-Solexa platform. Similarly, Bayly et al. [68] used the Illumina platform to sequence 39 species in the eucalypt group. The unprecedented power of NGS undoubtedly increased the number of finished plastomes. However, the quality and accuracy of plastomes generated with these methods should be viewed with caution. For example, ambiguous bases still remained in the finished genomes, and some inverted repeat regions were of varying lengths (S1 Table). Of 424 plastomes, 51 (12.03%) contained ambiguous bases regardless of which methods were used to sequence them. Hence, it is imperative to carefully execute quality control on NGS sequence reads as the technology becomes ubiquitous in the biological and medical fields [1, 12].

Differences from plastome junction boundary

Two inverted repeats (IRs) and two unequal single-copy regions characterize the typical quadripartite structure of plastomes from most land plants [25, 69]. Previous studies (e.g., [25]) showed that the extension or contraction of IR regions is one of the major mechanisms causing variation in plastome size. Wang et al. [70] uncovered the dynamics and evolution of the border regions between the two IR regions and the single-copy regions among monocot lineages. Four junctions (JLA, JLB, JSA, and JSB) lie between the two IRs (IRA and IRB) and the two single-copy (LSC and SSC) regions (Fig. 2) [70]. We carefully compared the exact IR border positions and the adjacent genes among the eight in-group Oryza plastomes and the one out-group (L. tisserantii) plastome [30] (Fig. 2). JLA was located between rps19 and psbA. The distance between rps19 and JLA varied from 40 bp to 49 bp, whereas the distance between psbA and JLA was consistent at 81 bp; the exception was O. australiensis (GU592209), with distances of 38 bp and 85 bp, respectively. For JLB, the distance between rpl22 and JLB varied from 24 bp to 30 bp. Compared with JLA and JLB, however, the border regions for JSA and JSB were more conserved. The ndhH gene spanned the SSC and IRA regions, with approximately 163 bp located in the IR region for all eight Oryza species. The ndhF gene was located in the SSC region, and its 41 bp distance to the border was also conserved for all eight Oryza species. The distance for the rps15 gene (301 bp) was likewise the same across these species. However, when the out-group species was considered, the main variation was located in the border regions of the SSC and IR. For the ndhH gene, approximately 625 bp were integrated into the IRA region. This 625 bp extension also contributed to the overall size differences between the out-group and the Oryza species plastomes [25].

Fig 2

Comparisons of border distances between adjacent genes and junctions of LSC, SSC, and two IR regions among nine rice tribe chloroplast genomes.

Boxes above or below the main line indicate the adjacent border genes. The figure is not to scale with sequence length and only shows relative changes at or near the IR/SC borders.

Comparative differences between the two plastomes

We compared the plastome (O. australiensis: GU592209) that was sequenced via Illumina and reference-guided assembly [45] with a plastome (O. australiensis: KJ830774) that was completed with target enrichment libraries and shotgun Sanger sequencing [46]. The two published plastomes of O. australiensis represent two different sequencing and assembly strategies and provided an opportunity to compare the sequence quality of the two methods. Handling repetitive regions is one of the intractable bottlenecks in the practical assembly of next-generation short reads [71], and the same problem arises in the reference-guided assembly of O. australiensis (GU592209). This might cause some variation in the two inverted repeats and their junction regions. For the plastome of O. australiensis (KJ830774), Fosmid libraries were constructed, followed by shearing, cloning, and sequencing. This method was labor-intensive but was shown to be an effective approach for obtaining high quality sequence data [72]. First, the mVISTA program [52] was used to display the whole genome variation, with O. sativa ssp. Japonica (AY522330) as the reference for comparison with the two plastomes (Fig. 3). As a whole, the organization of the plastome was rather conserved between the two individuals, and no translocations or inversions were detected in the architecture of the two genomes. The two IR regions were more conserved than the LSC and SSC regions. However, we found more local variations in O. australiensis (KJ830774) than in O. australiensis (GU592209). For example, two variations in the rpoC2 gene were found in KJ830774 but not in GU592209. Variations in many intergenic regions (ndhC-trnV, rbcL-psaI and others) were found in KJ830774, but no such variations were found in GU592209. The results indicated that the full sequence of GU592209 was more similar to AY522330 and that KJ830774 was more divergent from it than GU592209 was.

Fig 3

Identity plot that compares the chloroplast genomes of the two O. australiensis data sets used in this study with O. sativa ssp. Japonica (AY522330) as the reference sequence.

The vertical scale indicates the percentage of identity, ranging from 50% to 100%. The horizontal axis indicates the coordinates within the chloroplast genome. Genome regions are color coded as protein-coding, rRNA, tRNA, intron, and conserved noncoding sequences (CNS).

Second, to further examine the differences between the two individual plastomes, we divided the plastome into individual genes (coding) and intergenic regions (noncoding). For all nine species, 111 genes were annotated, which was the same as in other published species [30]. Of these genes, 103 (92.8%) were found with 100% sequence identity (SI) between KJ830774 and GU592209. Fifty-two genes were found with 100% SI between GU592209 and AY522330. However, of these 52 genes, 51 shared 100% SI among AY522330, GU592209 and KJ830774. Only two genes (rpl32 and rpoC2) were found to have the same level of SI between GU592209 and AY522330 compared with KJ830774. From these coding sequence SI results, KJ830774 was more similar to GU592209. However, the intergenic sequences (noncoding regions, IGS) exhibited different trends (Fig. 4). Among 149 IGS, 30 demonstrated higher SI (1% to 6.6% difference) in GU592209-KJ830774 compared with AY522330-GU592209, and 27 IGS displayed higher SI (1.2% to 28.5% difference) in AY522330-GU592209 compared with GU592209-KJ830774. For the remaining IGS, 43 had no SI difference and 49 showed less than 1% SI difference. From examination of the noncoding regions, GU592209 was more similar to the reference genome (AY522330). We also compared the whole genome SI values and found that GU592209 and AY522330 had 99.2% sequence similarity, whereas the similarity was 98.8% for KJ830774 and AY522330. Although GU592209 was published as an unfinished genome (177 ambiguous bases (N)), those ambiguous bases were distributed in 18 different regions with lengths ranging from 1 bp to 45 bp (S3 Table). When we excluded them from the analysis, the results were the same as above. Integrating this evidence, GU592209 contained heterogeneity in coding and noncoding regions, and therefore, the assembled plastome for GU592209 might be inaccurate.
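The grouping of intergenic regions by SI difference can be expressed as a simple rule. The following Python sketch is illustrative only: the function name, the category labels and the input values are hypothetical, and the 1% cut-off is taken from the description of the IGS comparison.

```python
from collections import Counter


def classify_igs(si_ref_gu: float, si_gu_kj: float, threshold: float = 1.0) -> str:
    """Assign one intergenic region to a category from its two SI values (%).

    si_ref_gu: SI between AY522330 and GU592209
    si_gu_kj:  SI between GU592209 and KJ830774
    """
    diff = si_gu_kj - si_ref_gu
    if diff == 0:
        return "no difference"
    if abs(diff) < threshold:
        return "difference below threshold"
    return "GU-KJ higher" if diff > 0 else "REF-GU higher"


# Hypothetical SI pairs for four regions, not values from the study.
pairs = [(98.0, 99.5), (99.0, 99.0), (99.5, 99.0), (100.0, 71.5)]
print(Counter(classify_igs(ref_gu, gu_kj) for ref_gu, gu_kj in pairs))
```

Applied to all 149 IGS, a tally of this kind would yield the four groups reported for the noncoding comparison.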

Fig 4

Sequence identity (SI) variations are presented for 149 intergenic sequences (IGS) between O. sativa ssp. Japonica (AY522330) and O. australiensis (GU592209) versus between O. sativa ssp. Japonica (AY522330) and O. australiensis (KJ830774).

A. 30 IGS regions with SI values GU592209-KJ830774 larger than AY522330-GU592209 values. B. 27 IGS regions with SI values AY522330-GU592209 larger than GU592209-KJ830774 values. The 43 IGS regions with no differences and the 49 IGS regions with less than 1% difference for SI values are not shown.

Phylogenetic reconstruction from different data sets

From the results described above, we concluded that coding and noncoding regions of O. australiensis (KJ830774) and O. australiensis (GU592209) might contain different phylogenetic signals. Therefore, the plastome data were divided into 1) the whole genome sequence, 2) three SNPs matrices (extracting all polymorphic sites using the DAMBE software) from the whole plastome, coding or noncoding regions, and 3) indels from noncoding regions to examine our deduction. Different methods were used to construct the phylogenetic trees (Fig. 5).

Fig 5

Phylogenetic trees were constructed for nine species from the rice tribe using different methods, and two Bayesian trees are shown for the whole genome sequence and the insertion-deletion data.

A. The whole genome sequence data were used with four different methods: Bayesian inference (BI), maximum parsimony (MP), maximum likelihood (ML) and neighbor-joining (NJ). Numbers above the branches are the posterior probabilities for BI and bootstrap values of MP, ML and NJ, respectively. B. The coded insertion and deletion (indel) data were used with Bayesian inference (BI), maximum parsimony (MP), and two neighbor-joining (NJ) analyses for two different sets of coded data. Numbers above the branches are the posterior probabilities for BI and bootstrap values of MP and NJ. Branch length is proportional to the number of substitutions, as indicated by the scale bar. Stars represent the different positions of O. australiensis (GU592209) in the two trees.

The whole plastome sequence (S2 Table) and SNP (from whole plastome, coding or noncoding regions) data generated the same phylogenetic tree (Fig. 5A). In the phylogenetic trees from these two types of data sets, O. australiensis (KJ830774) and O. australiensis (GU592209) formed a single clade with high BI and bootstrap support under the four different methods. Moreover, the tree topology corroborated the relationships inferred from the phylogenetic work conducted by Zou et al. [48]. The other six Oryza individuals formed one well-supported branch and were from the A-genome group, whereas O. australiensis was in the E-genome group of the rice genus [47, 48], which evolved in the middle Miocene [49]. The two cultivated and two wild rice individuals formed a well-supported clade; however, individual relationships within this clade could not be fully resolved. This result concerning the wild and cultivated lineages of rice was similar to that of Waters et al. [57]. However, when we applied our methods for phylogenetic reconstruction to the indel-only data set, the two O. australiensis individuals were resolved on different branches (Fig. 5B). From the indel data, O. australiensis (GU592209) was sister to O. sativa ssp. Japonica (AY522330) with high BI and bootstrap support, whereas O. australiensis (KJ830774) was resolved as sister to all other Oryza species (which formed an AA-genome clade) in all analyses. From this analysis, the two O. australiensis individuals were placed in two different clades. The position of O. australiensis (GU592209) did not conform to previously published phylogenies for the group [47, 48], nor was it resolved as sister to the other Oryza individuals. However, O. australiensis (KJ830774) remained sister to the remaining Oryza species, as was found in previous studies [47, 48]. When using the phylogenetic analyses to test for differences between sequencing and alignment methods, we found that O. australiensis (GU592209) was heterogeneous in the assembled sequences for coding and noncoding regions.

Conclusions

With the development of next-generation sequencing technologies, it is now possible to sequence whole genomes of any species, including chloroplast genomes. However, it is urgent that we consider the sequencing quality of NGS data. In this study, we used plastomes to carefully compare the quality of chloroplast genomes generated with two different sequencing strategies. Two plastome sequences from O. australiensis individuals were compared. O. australiensis (GU592209) was sequenced using NGS and assembled with a reference genome, whereas O. australiensis (KJ830774) was constructed using Fosmid libraries and sequenced with clone sequencing. For the whole genome alignment, O. australiensis (GU592209), with 99.2% sequence identity, was more similar to the reference than O. australiensis (KJ830774), with 98.8% sequence identity. From the sequence analysis, the coding regions of the two individuals contained no differences from the reference genome; however, for the intergenic regions, O. australiensis (GU592209) was more similar to the reference than O. australiensis (KJ830774). The phylogenetic analyses also found that coding and noncoding regions generated two different topologies regarding the placement of O. australiensis (GU592209). From all the analyses, we concluded that the plastome of O. australiensis (GU592209) obtained via NGS might be less accurate than the O. australiensis (KJ830774) plastome that was generated via Sanger sequencing. Thus, our findings demonstrate the need for careful quality control as NGS methods become more prevalent in biological studies.

Supporting Information

S1 Table. 424 chloroplast genomes downloaded from the NCBI database. (XLSX)

S2 Table. The whole genome alignment of plastid genomes from nine species. (NEX)

S3 Table. Indel code matrix from nine species and 18 regions with ambiguous (N) bases from GU592209. (XLSX)
