Motivation: The vast number of available sequenced bacterial genomes occasionally exceeds the facilities of comparative genomic methods or is dominated by a single outbreak strain, and thus a diverse and representative subset is required. Generation of the reduced subset currently requires a priori supervised clustering and sequence-only selection of medoid genomic sequences, independent of any additional genome metrics or strain attributes. Results: The Gaussian Genome Representative Selector with Prioritization (GGRaSP) R-package described below generates a reduced subset of genomes that prioritizes maintaining genomes of interest to the user as well as minimizing the loss of genetic variation. The package also allows for unsupervised clustering by modeling the genomic relationships using a Gaussian mixture model to select an appropriate cluster threshold. We demonstrate the capabilities of GGRaSP by generating a reduced list of 315 genomes from a genomic dataset of 4600 Escherichia coli genomes, prioritizing selection by type strain and by genome completeness. Availability and implementation: GGRaSP is available at https://github.com/JCVenterInstitute/ggrasp/. Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Next-generation sequencing technologies have resulted in a large number of publicly available microbial genome sequences. The number of genomes available for comparative genomic analysis can exceed what can be feasibly visualized or analyzed (Chan; Chavda; Zaslavsky). Additionally, sequencing of clonal or nearly clonal bacterial pathogens involved in disease outbreaks (e.g. Acinetobacter baumannii, Escherichia coli and Klebsiella pneumoniae) can skew the analyses; therefore, a reduction in genome redundancy to maximize diversity is necessary (Chan). One common method to reduce sequence redundancy while minimizing information loss is to cluster genomes by their nucleotide distance metrics and to select one genome from each cluster as a representative, often a medoid (the genome with the minimal combined distance to the other genomes in the cluster) (Chan; Moreno-Hagelsieb). However, these methods require the user to specify a priori either the number of clusters or a distance cutoff, and they do not allow the user to use the highest quality (i.e. most complete) representative genome for each cluster. Likewise, no dedicated program exists for loading and selecting these genomes.
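The medoid definition above can be made concrete with a minimal sketch. The genome names and distances below are hypothetical illustration values, and the function is not taken from any existing tool; it only shows the selection rule (smallest summed distance to the other cluster members).

```python
def medoid(cluster, dist):
    """Return the cluster member with the smallest summed distance to the others."""
    return min(cluster, key=lambda g: sum(dist[g][other] for other in cluster))


# Hypothetical symmetric pairwise distances for four genomes in one cluster.
dist = {
    "gA": {"gA": 0.00, "gB": 0.02, "gC": 0.03, "gD": 0.05},
    "gB": {"gA": 0.02, "gB": 0.00, "gC": 0.01, "gD": 0.03},
    "gC": {"gA": 0.03, "gB": 0.01, "gC": 0.00, "gD": 0.04},
    "gD": {"gA": 0.05, "gB": 0.03, "gC": 0.04, "gD": 0.00},
}

# gB has the smallest summed distance (0.06), so it is the medoid.
representative = medoid(["gA", "gB", "gC", "gD"], dist)
```

Note that the rule looks only at sequence distance, which is why a medoid can be a fragmented draft assembly even when a complete genome sits in the same cluster.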
2 Materials and methods
Here, we introduce GGRaSP (Gaussian Genome Representative Selector with Prioritization), an R package and associated executable Rscript program that generates a list of prioritized representative genomes from either supervised or unsupervised clustering of related genomes. GGRaSP supports three forms of input to describe the relationship between the genomes: (i) a phylogeny in Newick format; (ii) a distance or similarity matrix; or (iii) an aligned multiple FASTA file. GGRaSP uses hierarchical clustering in the hclust R function or the APE R package to create phylogenies from (ii) and (iii) (Paradis). By default, GGRaSP prioritizes medoids as representative genomes in order to minimize the loss of information, but this can result in the removal of genomes that contain regions of interest (e.g. plasmids, antibiotic resistance islands, pathogenicity islands and prophage), have a more complete assembly, or are from a given project. Users can therefore specify criteria for selecting genomes as representatives by generating a text file containing tiered ranks of the genomes.
GGRaSP can cluster genomes using supervised methods, including specifying the number of clusters or the cluster cut-off distance, but it also allows for unsupervised clustering by using Gaussian mixture models (GMMs) to identify a cut-off value that separates the most closely related genomes from the more diverse genomes. GMMs of sequence distances have previously been used to model the evolutionary relationship between multiple genomes in metagenomes (e.g. Alneberg; Ji), and to model homologs descending from distinct ancient large-scale duplications in various eukaryotic organisms (e.g. Cui; Schwager).
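As a rough illustration of supervised clustering at a distance cut-off followed by prioritized selection, the sketch below is written in Python rather than R and does not use the GGRaSP API. Several parts are assumptions made for the example: single-linkage grouping (which may differ from the linkage GGRaSP uses), the `rank` mapping in which a lower number means higher priority, the medoid tie-break among equally ranked genomes, and all genome names and distances.

```python
def cluster_by_cutoff(genomes, dist, cutoff):
    """Group genomes linked by any pairwise distance <= cutoff (single linkage)."""
    parent = {g: g for g in genomes}

    def find(g):
        # Follow parent links to the cluster root, shortening the path on the way.
        while parent[g] != g:
            parent[g] = parent[parent[g]]
            g = parent[g]
        return g

    for i, a in enumerate(genomes):
        for b in genomes[i + 1:]:
            if dist[a][b] <= cutoff:
                parent[find(a)] = find(b)

    clusters = {}
    for g in genomes:
        clusters.setdefault(find(g), []).append(g)
    return list(clusters.values())


def pick_representative(cluster, dist, rank):
    """Prefer the best-ranked genome; among equal ranks, prefer the medoid."""
    def key(g):
        summed = sum(dist[g][other] for other in cluster)
        return (rank.get(g, float("inf")), summed)  # unranked genomes come last
    return min(cluster, key=key)


# Hypothetical data: two tight groups separated by a distance of 0.20.
genomes = ["gA", "gB", "gC", "gD", "gE"]
close = {("gA", "gB"): 0.01, ("gA", "gC"): 0.02,
         ("gB", "gC"): 0.01, ("gD", "gE"): 0.01}
dist = {a: {b: 0.0 if a == b else close.get((a, b), close.get((b, a), 0.20))
            for b in genomes} for a in genomes}

clusters = cluster_by_cutoff(genomes, dist, cutoff=0.05)
medoid_reps = [pick_representative(c, dist, rank={}) for c in clusters]
ranked_reps = [pick_representative(c, dist, rank={"gC": 1, "gE": 1}) for c in clusters]
```

With an empty ranking the selection falls back to the medoids; supplying ranks swaps in the prioritized genomes without changing the clusters themselves.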
The GMM could be biased or limited by collections of genomes which contain a single branch of highly related genomes (for which GGRaSP will select a cutoff that will only cluster that single branch) or a set of genomes that can be best modeled by a single Gaussian peak (in which case GGRaSP cannot find a cutoff).
In GGRaSP, GMMs are calculated using expectation maximization via mixtools or bgmm (Benaglia; Biecek). Multiple Gaussian distributions are tested incrementally until the addition of a distribution is not significant by the likelihood ratio test or exceeds the user-defined limit. After the GMM is cleaned by removing overlapping and low-count distributions, the inflection point between the first two distributions is used as the cut-off to generate the clusters (see dotted vertical line, Fig. 1). The default pipeline behavior is described above, but many of the parameters for the GMM-based threshold calculation are user-modifiable for the cases where the GMM varies from the default model.
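The idea of deriving a cut-off from a mixture model can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the GGRaSP implementation: it fits exactly two components with a hand-written EM loop (no incremental model selection, likelihood ratio test or clean-up step), it runs on synthetic distances, and it takes the cut-off to be the point between the two means where the weighted densities are equal, as a stand-in for the inflection point described above.

```python
import math
import random


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(data, iterations=200):
    """Fit a two-component 1D Gaussian mixture by expectation maximization."""
    mean = sum(data) / len(data)
    spread = math.sqrt(sum((x - mean) ** 2 for x in data) / len(data))
    mus, sigmas, weights = [min(data), max(data)], [spread, spread], [0.5, 0.5]
    for _ in range(iterations):
        # E-step: responsibility of component 0 for every point.
        resp = []
        for x in data:
            p0 = weights[0] * normal_pdf(x, mus[0], sigmas[0])
            p1 = weights[1] * normal_pdf(x, mus[1], sigmas[1])
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0 else 0.5)
        # M-step: re-estimate weight, mean and standard deviation per component.
        for k in (0, 1):
            r = resp if k == 0 else [1.0 - v for v in resp]
            n_k = sum(r)
            mus[k] = sum(ri * x for ri, x in zip(r, data)) / n_k
            var = sum(ri * (x - mus[k]) ** 2 for ri, x in zip(r, data)) / n_k
            sigmas[k] = max(math.sqrt(var), 1e-6)  # guard against collapse
            weights[k] = n_k / len(data)
    return weights, mus, sigmas


def crossing_point(weights, mus, sigmas, steps=60):
    """Bisect between the means for the point where the weighted densities are equal.

    Assumes the lower component dominates at its own mean and the upper one at its own.
    """
    a, b = sorted((0, 1), key=lambda k: mus[k])
    lo, hi = mus[a], mus[b]
    for _ in range(steps):
        mid = (lo + hi) / 2.0
        lower = weights[a] * normal_pdf(mid, mus[a], sigmas[a])
        upper = weights[b] * normal_pdf(mid, mus[b], sigmas[b])
        if lower > upper:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


# Synthetic distances: a tight "closely related" peak and a broader "diverse" peak.
rng = random.Random(0)
data = ([rng.gauss(0.02, 0.003) for _ in range(300)]
        + [rng.gauss(0.08, 0.008) for _ in range(300)])
weights, mus, sigmas = fit_two_gaussians(data)
cutoff = crossing_point(weights, mus, sigmas)
```

A data set that is best described by a single peak would give this sketch nothing meaningful to bisect, which mirrors the limitation noted above for genome sets modeled by one Gaussian.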
Fig. 1.
GGRaSP-based reduction of 4600 E. coli genomes. 4600 E. coli genomes were downloaded from NCBI RefSeq (A), clustered using a cut-off (shown as dotted line) determined by GMM (B), and reduced to 315 representative genomes (C). The colored branches in (A) denote branches reduced to a single node in (C).
GGRaSP can output multiple supporting files, as described in detail on the R help pages, including: tab-delimited files with information on the clusters; ggplot2-based images showing the GMM and the initial or final phylogenies (Wickham, 2009), with colorspace used to determine the hues of the GMM and phylogeny shading (Ihaka); the Newick files for any phylogeny used in GGRaSP; and the iTOL-formatted text files showing the clusters on the phylogenies (Letunic and Bork, 2016). An Rscript version of GGRaSP that runs on the command line to facilitate high-throughput analyses is also provided.
3 Usage scenario
To demonstrate the usefulness of GGRaSP, we downloaded 4600 Escherichia genomes from NCBI RefSeq on 2/2/2017 using the downloader script in the LOCUST package (Brinkac). A whole genome-based Average Nucleotide Identity (gANI) matrix was calculated with Mash (Ondov). The genomes were ranked, in order, by whether each was a type strain, whether it was circular, and whether it was complete. The remaining genomes were ranked by the number of contigs and genes according to the LOCUST downloader output. The similarity matrix and the ranking file were input to GGRaSP, which computed a cut-off of 1.09% identity after modeling 9 Gaussian distributions (10 before clean-up), leading to a selection of 315 representative genomes in 98 min and 2 s (Fig. 1, Supplementary Fig. S1). When using the a priori cutoff of 96.5% gANI suggested for species (Varghese), only nine clusters were generated, with the largest cluster containing 98.9% of the genomes. Ranking the genomes as described above increased the completeness of the retained genomes compared with selecting the representatives from an unranked set, both in the proportion of complete genomes (from 6.7 to 25.4%) and in mean N50 (from 205 to 556 kb). All input and output files for these runs and the a priori cutoffs are available on the GitHub repository.
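The tiered ranking described above amounts to a lexicographic sort, which the following Python sketch illustrates. The records and field names are hypothetical, and the direction of the last two criteria (fewer contigs first, then more genes) is an assumption, since the text gives only the criteria and not their direction.

```python
def priority_key(genome):
    """Sort key: type strain first, then circular, then complete, then assembly stats."""
    return (
        not genome["type_strain"],  # False sorts before True, so negate the flags
        not genome["circular"],
        not genome["complete"],
        genome["contigs"],          # assumption: fewer contigs ranks higher
        -genome["genes"],           # assumption: more genes ranks higher
    )


# Hypothetical genome records.
genomes = [
    {"name": "draft_many_contigs", "type_strain": False, "circular": False,
     "complete": False, "contigs": 250, "genes": 4800},
    {"name": "complete_circular", "type_strain": False, "circular": True,
     "complete": True, "contigs": 1, "genes": 5000},
    {"name": "type_strain_draft", "type_strain": True, "circular": False,
     "complete": False, "contigs": 90, "genes": 4700},
    {"name": "draft_few_contigs", "type_strain": False, "circular": False,
     "complete": False, "contigs": 40, "genes": 4900},
]

ranked_names = [g["name"] for g in sorted(genomes, key=priority_key)]
```

Because the key is a tuple, a later criterion only matters when all earlier ones tie, which is what makes the ranking tiered rather than a weighted score.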
4 Conclusion
As the number of sequenced genomes available for comparative genomic analysis continues to expand, the need to generate robust representative genomic subsets will increase. Building on the statistical, bioinformatic, and graphical capabilities of R, GGRaSP and the accompanying Rscript provide a single, customizable platform to run multiple analyses to generate a subset of representative genomes. The user can specify clustering parameters and levels of importance for ranking the genomes, thus allowing for both generalizable high-throughput and more dataset-specific use.
Authors: Liying Cui; P Kerr Wall; James H Leebens-Mack; Bruce G Lindsay; Douglas E Soltis; Jeff J Doyle; Pamela S Soltis; John E Carlson; Kathiravetpilla Arumuganathan; Abdelali Barakat; Victor A Albert; Hong Ma; Claude W dePamphilis Journal: Genome Res Date: 2006-05-15 Impact factor: 9.043
Authors: Brian D Ondov; Todd J Treangen; Páll Melsted; Adam B Mallonee; Nicholas H Bergman; Sergey Koren; Adam M Phillippy Journal: Genome Biol Date: 2016-06-20 Impact factor: 13.583
Authors: Evelyn E Schwager; Prashant P Sharma; Thomas Clarke; Daniel J Leite; Torsten Wierschin; Matthias Pechmann; Yasuko Akiyama-Oda; Lauren Esposito; Jesper Bechsgaard; Trine Bilde; Alexandra D Buffry; Hsu Chao; Huyen Dinh; HarshaVardhan Doddapaneni; Shannon Dugan; Cornelius Eibner; Cassandra G Extavour; Peter Funch; Jessica Garb; Luis B Gonzalez; Vanessa L Gonzalez; Sam Griffiths-Jones; Yi Han; Cheryl Hayashi; Maarten Hilbrant; Daniel S T Hughes; Ralf Janssen; Sandra L Lee; Ignacio Maeso; Shwetha C Murali; Donna M Muzny; Rodrigo Nunes da Fonseca; Christian L B Paese; Jiaxin Qu; Matthew Ronshaugen; Christoph Schomburg; Anna Schönauer; Angelika Stollewerk; Montserrat Torres-Oliva; Natascha Turetzek; Bram Vanthournout; John H Werren; Carsten Wolff; Kim C Worley; Gregor Bucher; Richard A Gibbs; Jonathan Coddington; Hiroki Oda; Mario Stanke; Nadia A Ayoub; Nikola-Michael Prpic; Jean-François Flot; Nico Posnien; Stephen Richards; Alistair P McGregor Journal: BMC Biol Date: 2017-07-31 Impact factor: 7.431
Authors: Kalyan D Chavda; Liang Chen; Derrick E Fouts; Granger Sutton; Lauren Brinkac; Stephen G Jenkins; Robert A Bonomo; Mark D Adams; Barry N Kreiswirth Journal: MBio Date: 2016-12-13 Impact factor: 7.867