
Verdant: automated annotation, alignment and phylogenetic analysis of whole chloroplast genomes.

Michael R McKain¹, Ryan H Hartsock¹, Molly M Wohl¹, Elizabeth A Kellogg¹.

Abstract

MOTIVATION: Chloroplast genomes are now produced in the hundreds for angiosperm phylogenetics projects, but current methods for annotation, alignment and tree estimation still require some manual intervention, which reduces throughput and increases analysis time for large chloroplast systematics projects.
RESULTS: Verdant is a web-based software suite and database built to take advantage of a novel annotation program, annoBTD. Using annoBTD, Verdant provides accurate annotation of chloroplast genomes without manual intervention. Subsequent alignment and tree estimation can incorporate newly annotated and publicly available plastomes and can accommodate a large number of taxa. Verdant sharply reduces the time required for analysis of assembled chloroplast genomes and removes the need for pipelines and software on personal hardware.
AVAILABILITY AND IMPLEMENTATION: Verdant is available at http://verdant.iplantcollaborative.org/plastidDB/. It is implemented in PHP, Perl, MySQL, JavaScript, HTML and CSS, with all major browsers supported.
CONTACT: mrmckain@gmail.com
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.

Year: 2016 PMID: 27634949 PMCID: PMC5408774 DOI: 10.1093/bioinformatics/btw583

Source DB: PubMed Journal: Bioinformatics ISSN: 1367-4803 Impact factor: 6.937

1 Introduction

Chloroplast genomes, or plastomes, are a valuable tool for phylogenetics in angiosperms as they provide a less complex, albeit partial, history relative to nuclear loci. Plastomes are also easily obtainable from low-coverage genome sequencing (Soltis et al.; Steele et al., 2012), making them a desirable by-product of many sequencing projects. A major hurdle in the scalable use of these data is quick and accurate annotation of plastomes and subsequent alignment and phylogenetic estimation. The time, computational resources and skill required to complete these tasks may act as a barrier to novel research in underrepresented flowering plant groups. Here, we present Verdant, a taxonomically structured, database-driven suite of tools for annotation, alignment and tree estimation of chloroplast genomes in a web-based platform. An exhaustive tutorial is provided, but much of Verdant’s interface is designed to be as intuitive as possible.

Verdant provides a number of key features designed for usability, including:

Automated annotation of both whole and partial plastomes for protein coding genes, tRNAs and rRNAs using our novel software, annoBTD.
Orientation- and orthology-focused alignments of annotated genes, rRNAs, tRNAs, introns and intergenic regions using MAFFT (Katoh and Standley, 2013).
Phylogeny estimation using RAxML (Stamatakis, 2014).
Annotation visualization using Circos (Krzywinski et al., 2009) and JBrowse (Skinner et al., 2009).
Downloadable datasets of aligned and unaligned plastome regions, both individual gene and concatenated plastome trees, and project metadata including full plastome size, large single copy (LSC), small single copy (SSC) and inverted repeat (IRA or IRB) sizes and locations, and the total number of annotated features present.

These functions are enabled by an underlying database consisting of high-quality plastomes downloaded from GenBank, together with newly annotated, secure, user-populated databases for individual projects. Users can release their data to the public database at their discretion.

2 Implementation

Verdant is divided into two primary workflows (see Fig. 1). The first, which runs automatically upon upload, involves the annotation of the plastome sequence(s) using our novel software, annoBTD, the population of the user’s personal and secure data structure, and the creation of Circos and JBrowse visualizations for each plastome. The second workflow is completely user-driven and includes project creation, taxon selection, feature selection, alignment and phylogenetic tree reconstruction.

Fig. 1. Verdant workflow. Workflow diagram depicting the automated (grey box) and user-driven (white box) steps and options available in Verdant. Parallelograms at the bottom of the diagram represent downloadable files available.

2.1 Annotation with annoBTD

Our novel annotation software, annoBTD, is designed to remove the need for manual annotation, although manual intervention may occasionally be necessary for aberrant plastomes. Annotation of a full plastome sequence takes approximately 10–30 min, and annotations can be downloaded in GFF3 format. AnnoBTD is an alternative to the current standard web-based program DOGMA (Wyman et al., 2004), which is effective and easy to use but requires manual intervention to produce final, accurate annotations, thus limiting throughput.
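As a brief illustration of working with a downloaded annotation, the following minimal sketch tallies feature types in a GFF3 file. It relies only on the standard nine-column, tab-separated GFF3 layout; the function name `count_feature_types` and the example records are hypothetical and are not taken from Verdant's output.

```python
from collections import Counter


def count_feature_types(gff3_text):
    """Count GFF3 records by their feature type (column 3).

    Comment/directive lines (starting with '#') and blank lines are skipped,
    as are lines that do not have the nine tab-separated GFF3 columns.
    """
    counts = Counter()
    for line in gff3_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.split("\t")
        if len(columns) != 9:
            continue
        counts[columns[2]] += 1
    return counts


# Hypothetical example records (illustrative only).
example = "\n".join([
    "##gff-version 3",
    "cp1\tannotation\tgene\t100\t1500\t.\t+\t.\tID=gene1;Name=rbcL",
    "cp1\tannotation\tCDS\t100\t1500\t.\t+\t0\tID=cds1;Parent=gene1",
    "cp1\tannotation\ttRNA\t2000\t2072\t.\t-\t.\tID=trna1;Name=trnH",
])
feature_counts = count_feature_types(example)
```

Such a tally gives a quick check that the number of annotated features is in the expected range for a plastome before moving on to alignment.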

2.1.1 Protein coding genes

Details of annoBTD are found in the Supplementary Information. A novel feature is de novo ORF identification; each ORF is then assigned a putative identity by comparison with reference annotations from the five most closely related species available in the database. Once putative identity is established for an ORF, its position in the plastome informs the final annotation decision. Overlap of two different genes, such as psbD and psbC, is allowed. Start and stop codons for each gene, as well as intron–exon boundaries, are estimated from the sequence by methods that do not require canonical start codons or exact boundary matches. AnnoBTD also finds very small exons that may be missed by other annotation programs.
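To make the idea of de novo ORF identification without canonical start codons concrete, the following is a minimal sketch that reports stop-free stretches in all six reading frames. It is not annoBTD's algorithm, which is described in the Supplementary Information; the function names, the `min_codons` threshold and the treatment of the sequence as linear are assumptions made for illustration.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def _keep(orfs, start, end, strand, seq_len, min_codons):
    """Append the stretch if it is long enough, in forward-strand coordinates."""
    if (end - start) // 3 < min_codons:
        return
    if strand == "-":
        start, end = seq_len - end, seq_len - start
    orfs.append((start, end, strand))


def stop_to_stop_orfs(seq, min_codons=30):
    """List stop-free stretches in all six frames as (start, end, strand).

    Coordinates are 0-based, half-open, on the forward strand, and exclude
    the terminating stop codon. No start codon is required (an assumption
    of this sketch), and the sequence is treated as linear.
    """
    seq = seq.upper()
    n = len(seq)
    orfs = []
    for strand, strand_seq in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            open_start = frame
            # End of the last complete codon in this frame.
            frame_end = frame + ((n - frame) // 3) * 3
            for pos in range(frame, frame_end, 3):
                if strand_seq[pos:pos + 3] in STOP_CODONS:
                    _keep(orfs, open_start, pos, strand, n, min_codons)
                    open_start = pos + 3
            _keep(orfs, open_start, frame_end, strand, n, min_codons)
    return orfs


toy = "ATG" + "GCT" * 5 + "TAA"
toy_orfs = stop_to_stop_orfs(toy, min_codons=6)
```

Candidate stretches found this way would still need to be compared against reference genes and trimmed to plausible start positions, which is the part of the process that the reference database supports.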

2.1.2 rRNAs and tRNAs

Because they are conserved in chloroplasts, rRNAs and tRNAs are detected by an optimized blastn and annotated via position and length in the plastome sequence.
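One way to picture annotation by position and length is to filter blastn hits in the standard 12-column tabular output (`-outfmt 6`) by how much of the reference gene they cover. The sketch below assumes reference rRNA/tRNA sequences were used as queries against the plastome; the function name, the `min_cover` threshold and the example hits are hypothetical, and the actual annoBTD parameters are not given here.

```python
def rna_hits_by_gene(blast_tab, ref_lengths, min_cover=0.9):
    """Group near full-length blastn hits by query gene.

    blast_tab   : text in BLAST tabular format (12 standard columns)
    ref_lengths : dict mapping query gene name -> reference length
    min_cover   : minimum alignment length as a fraction of the reference
    Returns {gene: [(start, end, strand), ...]} with 1-based plastome
    coordinates, sorted by start. All passing hits are kept because genes
    in the inverted repeat are expected to occur twice.
    """
    hits = {}
    for line in blast_tab.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        gene = fields[0]
        aln_length = int(fields[3])
        sstart, send = int(fields[8]), int(fields[9])
        if aln_length < min_cover * ref_lengths[gene]:
            continue  # too short to be a full copy of the gene
        strand = "+" if sstart <= send else "-"
        hits.setdefault(gene, []).append((min(sstart, send), max(sstart, send), strand))
    return {gene: sorted(found) for gene, found in hits.items()}


# Hypothetical hits: two full-length copies and one short spurious match.
example_tab = "\n".join([
    "rrn16\tcp1\t99.6\t1491\t6\t0\t1\t1491\t102000\t103490\t0.0\t2700",
    "rrn16\tcp1\t99.6\t1491\t6\t0\t1\t1491\t140000\t138510\t0.0\t2700",
    "rrn16\tcp1\t95.0\t200\t10\t0\t1\t200\t5000\t5199\t1e-80\t300",
])
rna_hits = rna_hits_by_gene(example_tab, {"rrn16": 1491})
```

Subject coordinates are reported in reverse order for minus-strand hits in this format, which is why the strand is inferred from the order of the subject start and end.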

2.1.3 LSC, SSC and IR

The LSC, SSC and IR regions of the plastome, if the full sequence is given, are estimated by identifying the repetitive IR sequences and assigning LSC and SSC by size.
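The logic of this step can be sketched as follows: locate the longest pair of non-overlapping segments that are reverse complements of each other, then label the two remaining regions by size. This is a toy illustration under stated assumptions, not annoBTD's implementation; the quadratic dynamic programme is far too slow for a real 150 kb plastome, exact matching is assumed, and the function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_inverted_repeat(seq):
    """Return ((a_start, a_end), (b_start, b_end)) for the longest exact
    inverted repeat whose two copies do not overlap, or None.

    Coordinates are 0-based, half-open. Uses a longest-common-substring
    table between the sequence and its reverse complement (O(n^2) time).
    """
    seq = seq.upper()
    n = len(seq)
    rc = reverse_complement(seq)
    best_len, best = 0, None
    prev = [0] * (n + 1)
    for i in range(1, n + 1):
        cur = [0] * (n + 1)
        for j in range(1, n + 1):
            if seq[i - 1] == rc[j - 1]:
                cur[j] = prev[j - 1] + 1
                # rc[j-L:j] corresponds to seq[n-j:n-j+L]; requiring
                # i <= n - j keeps the first copy entirely before the second.
                if i <= n - j and cur[j] > best_len:
                    best_len = cur[j]
                    best = ((i - best_len, i), (n - j, n - j + best_len))
        prev = cur
    return best


def assign_single_copy(seq_len, ir_a, ir_b):
    """Label the two regions outside the IR copies as LSC and SSC by size.

    The second region wraps around the origin of the circular genome.
    """
    between = {"start": ir_a[1], "end": ir_b[0], "length": ir_b[0] - ir_a[1]}
    wrapped = {"start": ir_b[1], "end": ir_a[0],
               "length": (seq_len - ir_b[1]) + ir_a[0]}
    if between["length"] >= wrapped["length"]:
        return {"LSC": between, "SSC": wrapped}
    return {"LSC": wrapped, "SSC": between}


toy = "AAAA" + "GATTACA" + "CCCCCCCC" + reverse_complement("GATTACA") + "GG"
ir_a, ir_b = find_inverted_repeat(toy)
regions = assign_single_copy(len(toy), ir_a, ir_b)
```

Because the assignment depends on both single-copy regions being delimited by the two IR copies, it only makes sense when the full plastome sequence is available, as noted above.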

2.2 Analyses in Verdant

Verdant’s project management system allows users to create multiple projects, adding their own plastome data or publicly available data from the database. Users then choose single genes, ranges of genes, or whole plastomes to include in their analyses. Unaligned or aligned sequences may be downloaded. For alignments, each region of the genome (an annotated feature or a region between annotated features) is aligned separately, and the regions are then concatenated into a single alignment. The MAFFT nucleotide direction option is used to keep all regions in the same orientation relative to each other, maintaining alignment accuracy across inversion events. Where a taxon lacks a specific feature, the region is left as an indel for that taxon in the alignment. Both individual and concatenated alignments are provided to the user for download. Phylogenies are estimated from both the individual region alignments and the concatenated alignment, with all RAxML files available for download.
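The concatenation step, including gap-filling for taxa that lack a region, can be sketched as below. This is a minimal illustration rather than Verdant's code: the function name, the input layout and the example sequences are hypothetical. The per-region alignments are assumed to have been produced beforehand, for example with MAFFT's `--adjustdirection` option, which is presumably the nucleotide direction option referred to above.

```python
def concatenate_alignments(regions, taxa):
    """Concatenate per-region alignments into one supermatrix.

    regions : list of (region_name, {taxon: aligned_sequence}) in genome order
    taxa    : list of all taxon names to include
    Returns (supermatrix, partitions) where supermatrix maps taxon -> sequence
    and partitions lists (region_name, first_column, last_column), 1-based.
    A taxon missing from a region receives gaps ('-') across that region.
    """
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    offset = 0
    for name, alignment in regions:
        widths = {len(sequence) for sequence in alignment.values()}
        if len(widths) != 1:
            raise ValueError(f"region {name!r} is empty or not aligned")
        width = widths.pop()
        for taxon in taxa:
            pieces[taxon].append(alignment.get(taxon, "-" * width))
        partitions.append((name, offset + 1, offset + width))
        offset += width
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical two-region example; taxon "B" lacks the second region.
example_regions = [
    ("rbcL", {"A": "ATG-", "B": "ATGC"}),
    ("trnK", {"A": "GG"}),
]
matrix, parts = concatenate_alignments(example_regions, ["A", "B"])
```

Keeping the column range of each region alongside the supermatrix makes it straightforward to analyse regions individually or as partitions of the concatenated alignment.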

3 Conclusion

The annotation and project development features of Verdant provide a high-throughput method for conducting phylogenetic analyses using whole chloroplast sequences, a much-needed utility given the glut of plastome data now available. Because of its focus on phylogenetic applications, Verdant complements other tools developed for functional studies of plastome biology. Future additions to Verdant will include more evolutionary analyses of plastome structure and function and, ultimately, user-created modules.

References

1. Automatic annotation of organellar genomes with DOGMA.

Authors: Stacia K Wyman; Robert K Jansen; Jeffrey L Boore
Journal: Bioinformatics Date: 2004-06-04 Impact factor: 6.937

2. Quality and quantity of data recovered from massively parallel sequencing: Examples in Asparagales and Poaceae.

Authors: P Roxanne Steele; Kate L Hertweck; Dustin Mayfield; Michael R McKain; James Leebens-Mack; J Chris Pires
Journal: Am J Bot Date: 2012-01-30 Impact factor: 3.844

3. Circos: an information aesthetic for comparative genomics.

Authors: Martin Krzywinski; Jacqueline Schein; Inanç Birol; Joseph Connors; Randy Gascoyne; Doug Horsman; Steven J Jones; Marco A Marra
Journal: Genome Res Date: 2009-06-18 Impact factor: 9.043

4. JBrowse: a next-generation genome browser.

Authors: Mitchell E Skinner; Andrew V Uzilov; Lincoln D Stein; Christopher J Mungall; Ian H Holmes
Journal: Genome Res Date: 2009-07-01 Impact factor: 9.043

5. MAFFT multiple sequence alignment software version 7: improvements in performance and usability.

Authors: Kazutaka Katoh; Daron M Standley
Journal: Mol Biol Evol Date: 2013-01-16 Impact factor: 16.240

6. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies.

Authors: Alexandros Stamatakis
Journal: Bioinformatics Date: 2014-01-21 Impact factor: 6.937
