Michael Gruenstaeudl, Nico Gerschler, Thomas Borsch.
Abstract
The sequencing and comparison of plastid genomes are becoming a standard method in plant genomics, and many researchers are using this approach to infer plant phylogenetic relationships. Due to the widespread availability of next-generation sequencing, plastid genome sequences are being generated at a breakneck pace. This trend towards massive sequencing of plastid genomes highlights the need for standardized bioinformatic workflows. In particular, documentation and dissemination of the details of genome assembly, annotation, alignment and phylogenetic tree inference are needed, as these processes are highly sensitive to the choice of software and the precise settings used. Here, we present the procedure and results of sequencing, assembling, annotating and quality-checking three complete plastid genomes of the aquatic plant genus Cabomba, as well as subsequent gene alignment and phylogenetic tree inference. We accompany our findings with a detailed description of the bioinformatic workflow employed. Importantly, we share a total of eleven software scripts for these bioinformatic processes, enabling other researchers to evaluate and replicate our analyses step by step. The results of our analyses illustrate that the plastid genomes of Cabomba are highly conserved in both structure and gene content.
Keywords: Cabomba; bioinformatics; genome assembly; phylogenomics; plastid genome; standardization; workflow
Year: 2018 PMID: 29933597 PMCID: PMC6160935 DOI: 10.3390/life8030025
Source DB: PubMed Journal: Life (Basel) ISSN: 2075-1729
List of species names, herbarium vouchers and GenBank accession number of each taxon analyzed. The taxonomic authority of each described species is given after the specific epithet. Standard herbarium abbreviations are given in parentheses.
| Species Name | Publication of Genome Sequence | GenBank Accession Number | Sample ID | Herbarium Voucher |
|---|---|---|---|---|
| Cabomba aquatica | this study | MG720559 | NY684 | Gartenherbarbeleg Cubr 50791 (B) |
| | Gruenstaeudl et al. (2017) | KT705317 | NY112 | J.C. Ludwig s.n. (VPI) |
| | this study | MG720558 | NY690 | Gartenherbarbeleg Cubr 50793 (B) |
| | this study | MG967470 | NY691 | Gartenherbarbeleg Cubr 50792 (B) |
Figure 1. Circular plastid genome map of Cabomba aquatica (MG720559). Genes depicted as facing inward from the outer circle are transcribed clockwise, those facing outward are transcribed counter-clockwise. The inner circle indicates the GC content of each nucleotide position (dark gray).
Overview of the bioinformatic workflow applied, including the use of each custom software script, the names and version numbers of third-party software tools employed or initiated by the scripts, and the computation time required to perform each automated analysis step. Computation time was measured on a machine with an i5-2500K 3.3 GHz Intel Quad-Core processor (Santa Clara, CA, USA), 24 GB of RAM and a Linux 4.14.1 kernel and is given as the average computation time for each of the three newly sequenced plastid genomes. Abbreviations: min = minutes; n.a. = not applicable; s = seconds.
| Analysis Step | Details of Analysis | Custom Software Script | Third-Party Software Tool Employed 1 | Computation Time (Mean) |
|---|---|---|---|---|
| Quality control of reads | Generating ordered intersection of R1/R2 reads | Script 01 | bioawk v.20110810 | 6 min 29 s |
| | Filtering of reads by quality score | Script 02 | FASTX Toolkit v.0.0.14 | 14 min 03 s |
| Genome assembly | Assembling reads to contigs | Script 03 | IOGA v.20160910 | 17 min 12 s |
| | Stitching contigs to complete genomes | Manual step | Geneious v.10.2.3 | |
| Evaluation of assembly | Confirming the IR boundaries | Script 04 | BLAST+ v.2.4.0 | <1 s |
| | Extracting reads that map to final assembly | Script 05 | bowtie2 v.2.3.2 | 3 min 37 s |
| | Generating assembly statistics | Script 06 | bowtie2 v.2.3.2 | 7 min 20 s |
| Genome annotation | Generating raw annotations | Manual step | DOGMA | |
| | Converting annotations of DOGMA to GFF format | Script 07 | n.a. | <1 s |
| | Combining annotations of DOGMA and cpGAVAS | Script 08 | Python v.2.7.14 | <1 s |
| Evaluation of annotations | Confirming the validity of annotations | Manual step | Geneious v.10.2.3 | |
| Sequence alignment | Extraction and alignment of coding regions | Script 09 | Python v.2.7.14 | 39 s |
| | Removal of gap positions | Script 10 | statAl v.1.4.rev22 | <1 s |
| Phylogenetic analysis | Phylogenetic tree inference under ML, including bootstrapping | Script 11 | R v.3.4.4 | 17 s |
1 Software tools such as sed, grep and awk, which are part of most Unix command shells, are not listed separately. Similarly, individual R, Python or Perl dependencies are not listed separately.
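
The first quality-control step in the table, generating the ordered intersection of R1/R2 reads, can be illustrated with a minimal Python sketch. This is not Script 01, which the table lists as relying on bioawk. The function names `base_id` and `ordered_intersection`, the in-memory `(header, sequence, quality)` tuples and the assumption that mates share an identifier once a trailing `/1` or `/2` is removed are all illustrative choices made here.

```python
def base_id(header):
    """Return the read identifier without '@', description, or /1 and /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith("/1") or name.endswith("/2"):
        name = name[:-2]
    return name


def ordered_intersection(r1_records, r2_records):
    """Keep only reads whose mate is present, using the R1 order for both outputs.

    Each record is a (header, sequence, quality) tuple. This layout is an
    assumption of the sketch, not the format used by the original script.
    """
    r2_by_id = {base_id(rec[0]): rec for rec in r2_records}
    kept_r1, kept_r2 = [], []
    for rec in r1_records:
        mate = r2_by_id.get(base_id(rec[0]))
        if mate is not None:  # orphaned R1 reads are dropped
            kept_r1.append(rec)
            kept_r2.append(mate)
    return kept_r1, kept_r2


# Hypothetical toy reads: read2 has no mate in R2, and R2 is out of order.
r1 = [("@read1/1", "ACGT", "IIII"), ("@read2/1", "GGCC", "IIII"), ("@read3/1", "TTAA", "IIII")]
r2 = [("@read3/2", "TTAA", "IIII"), ("@read1/2", "ACGT", "IIII")]
paired_r1, paired_r2 = ordered_intersection(r1, r2)
print([rec[0] for rec in paired_r1])
print([rec[0] for rec in paired_r2])
```

Holding one file in a dictionary is convenient for a sketch but may be memory-hungry for millions of reads; a streaming tool is likely preferable at that scale.
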
Overview of the number of read pairs, mean coverage depth, contig number, contig length and other assembly and contig statistics. Only paired reads were counted for read statistics; orphaned reads were discarded prior to quality filtering. Contig statistics are based on contigs of a size equal to or greater than 1000 bp. Information in square brackets indicates the unit that the values are presented in. Abbreviations: N = number; P = percentage.
| Statistic | | | |
|---|---|---|---|
| N of read pairs after quality filtering | 1,896,979 | 4,005,075 | 4,132,180 |
| N of read pairs that mapped to the reference genomes (P of quality-filtered pairs) 1 | 73,452 (3.87%) | 55,254 (1.37%) | 46,681 (1.12%) |
| Mean read length [bp] 2,3 | 593.58 | 595.09 | 594.30 |
| Mean coverage depth [fold] 3 | 349.36 | 266.01 | 224.87 |
| P of bases with coverage depth greater than 20-fold, 50-fold, and 100-fold | 99.89% / 99.70% / 98.51% | 99.97% / 99.93% / 90.40% | 99.87% / 99.76% / 89.72% |
| N of contigs after automatic assembly | 3 | 4 | 3 |
| Size of largest contig [bp] | 89,168 | 80,536 | 90,298 |
| Total length of contigs [bp] | 134,685 | 135,444 | 135,367 |
| N50 [bp] | 89,168 | 80,536 | 90,298 |
| L50 | 1 | 1 | 1 |
1 All quality-filtered reads that mapped concordantly one or more times to the complete plastid genomes of Cabomba caroliniana and Brasenia schreberi were counted. Read pairs located in the IR usually map to a reference genome more than once. 2 Calculated as mean length of (R1 plus R2). 3 Calculated from all quality-filtered reads that mapped to the final assembly.
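
The N50 and L50 values in the table follow from the standard definitions: contigs are sorted from longest to shortest, and N50 is the length of the contig at which the running total first reaches half of the total assembly length, while L50 is the number of contigs needed to get there. The sketch below is a minimal illustration rather than Script 06. The function name `n50_l50` is hypothetical, and only the largest contig (89,168 bp) and the total length (134,685 bp) of the first column come from the table; the split of the remaining 45,517 bp across two contigs is invented for the example.

```python
def n50_l50(contig_lengths, min_length=1000):
    """Return (N50, L50) for contigs of at least min_length bp.

    The 1000 bp threshold mirrors the table caption; the function itself
    is an illustrative sketch, not the authors' implementation.
    """
    lengths = sorted((n for n in contig_lengths if n >= min_length), reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for count, length in enumerate(lengths, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise ValueError("no contigs at or above the minimum length")


# Largest contig and total length as tabulated for the first genome;
# the two smaller lengths are hypothetical and merely sum to the remainder.
contigs = [89168, 26000, 19517]
n50, l50 = n50_l50(contigs)
print(sum(contigs), n50, l50)
```

Because the largest contig alone exceeds half of the total length in each of the three assemblies, N50 equals the size of the largest contig and L50 equals 1, as the table reports.
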
Comparison of genome structure, IR length and gene content of the complete plastid genomes of Cabomba. Abbreviations: N = number.
| Name of Organism | | | | |
|---|---|---|---|---|
| Genome size (bp) | 159,487 | 164,057 | 160,177 | 160,271 |
| LSC length (bp) | 89,433 | 82,090 | 89,835 | 90,037 |
| SSC length (bp) | 19,114 | 18,827 | 19,392 | 19,384 |
| IR length (bp) | 25,470 | 31,570 | 25,475 | 25,425 |
| N of genes | 116 | 116 | 116 | 116 |
| N of protein-coding genes (duplicated in IR) | 82 (9) | 82 (19) | 82 (9) | 82 (9) |
| N of tRNA genes (duplicated in IR) | 30 (7) | 30 (7) | 30 (7) | 30 (7) |
| N of rRNA genes (duplicated in IR) | 4 (4) | 4 (4) | 4 (4) | 4 (4) |
| Proportion of coding to non-coding regions | 0.69 | 0.69 | 0.69 | 0.69 |
| Average gene density (genes/kb) | 0.85 | 0.89 | 0.85 | 0.85 |
| GC content (%) | 38.0 | 38.3 | 38.0 | 38.1 |
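
The tabulated lengths can be cross-checked against the quadripartite structure of a plastid genome, in which the total size equals the LSC plus the SSC plus two copies of the IR. The Python sketch below performs that check for each column. The dictionary keys and the column labels are hypothetical. The gene-density line rests on an assumption not stated in the table: that IR-duplicated genes are counted twice. That reading happens to reproduce the tabulated densities.

```python
# Values copied from the table, one dictionary per column (left to right).
# "dup" is the sum of protein-coding, tRNA and rRNA genes duplicated in the IR.
genomes = [
    {"label": "column 1", "size": 159487, "lsc": 89433, "ssc": 19114, "ir": 25470, "dup": 9 + 7 + 4},
    {"label": "column 2", "size": 164057, "lsc": 82090, "ssc": 18827, "ir": 31570, "dup": 19 + 7 + 4},
    {"label": "column 3", "size": 160177, "lsc": 89835, "ssc": 19392, "ir": 25475, "dup": 9 + 7 + 4},
    {"label": "column 4", "size": 160271, "lsc": 90037, "ssc": 19384, "ir": 25425, "dup": 9 + 7 + 4},
]
UNIQUE_GENES = 82 + 30 + 4  # protein-coding + tRNA + rRNA genes, 116 in every column


def expected_size(genome):
    """Genome size implied by the quadripartite structure: LSC + SSC + 2 x IR."""
    return genome["lsc"] + genome["ssc"] + 2 * genome["ir"]


def gene_density(genome):
    """Genes per kb, assuming (not confirmed by the source) that IR copies count twice."""
    return round((UNIQUE_GENES + genome["dup"]) / (genome["size"] / 1000), 2)


sizes_match = [expected_size(g) == g["size"] for g in genomes]
densities = [gene_density(g) for g in genomes]
for g, ok, d in zip(genomes, sizes_match, densities):
    print(g["label"], expected_size(g), ok, d)
```

All four columns satisfy the size identity, and the second column stands out with an IR roughly 6 kb longer and a correspondingly shorter LSC.
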
Figure 2. The best phylogenetic tree inferred under the ML criterion from the gap-free alignment, visualized as an unrooted phylogram. Bootstrap support for the inferred nodes is given as branch labels. A branch length scale, the log-likelihood value, the gamma distribution value and the value for the invariant sites parameter of the best ML tree are given below the tree.
Overview of alignment statistics calculated on the concatenation of aligned coding regions. Abbreviations: N = number; nucl. pos. = nucleotide positions; P = percentage.
| Statistic | Before Gap Removal | After Gap Removal |
|---|---|---|
| Total alignment length (bp) | 68,922 | 68,451 |
| Average pairwise sequence identity across | 0.9852 (0.9986) | 0.9891 (0.9998) |
1 Calculated across all pairs of the concatenated coding regions.
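
The two operations behind this table, removing gap positions and averaging sequence identity over all pairs, can be sketched in a few lines of Python. This is an illustration only: the workflow table lists statAl for gap removal, and the exact identity formula used by the authors is not given here. The function names, the toy sequences and the choice to define identity as matching positions divided by alignment length are assumptions.

```python
from itertools import combinations


def remove_gap_columns(aligned, gap_chars="-"):
    """Drop every alignment column in which at least one sequence has a gap."""
    keep = [
        i for i in range(len(aligned[0]))
        if all(seq[i] not in gap_chars for seq in aligned)
    ]
    return ["".join(seq[i] for i in keep) for seq in aligned]


def mean_pairwise_identity(aligned):
    """Average, over all sequence pairs, of the fraction of identical positions.

    Assumed definition for this sketch; other conventions exist.
    """
    identities = [
        sum(x == y for x, y in zip(a, b)) / len(a)
        for a, b in combinations(aligned, 2)
    ]
    return sum(identities) / len(identities)


# Hypothetical toy alignment of three equally long sequences.
alignment = ["ACGT-A", "ACGTTA", "AC-TTG"]
gap_free = remove_gap_columns(alignment)
identity = mean_pairwise_identity(gap_free)
print(gap_free, round(identity, 4))
```

In the table, gap removal shortens the concatenated alignment from 68,922 to 68,451 positions, that is, by 471 columns.
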